Corfu/Corfu-nonconfig status is not in UP state (it is either in DOWN/DEGRADED/UNKNOWN state)



Article ID: 386799


Products

VMware NSX

Issue/Introduction

Corfu/Corfu-nonconfig node status is in DOWN/DEGRADED state in the NSX Manager UI

Corfu/Corfu-nonconfig node status is in UNKNOWN state in the NSX Manager UI

Environment

VMware NSX
VCF 9.0 and above

Resolution

1. Perform a rolling reboot of the NSX Manager nodes. If the status still shows DOWN afterward, check the health report on the managers as described below.
2. If Corfu/Corfu-nonconfig is in a DOWN state, check the complete health report.
Note: You need to access the NSX Manager node as the root user via SSH or via the console of the virtual machine.

For Corfu: grep -i "health report" -A 75 /var/log/corfu/corfu.9000.log
For Corfu-nonconfig: grep -i "health report" -A 75 /var/log/corfu-nonconfig/corfu.9040.log
You will see a health report's status and reason fields similar to the following:
```
{
  "status": "DOWN",
  "reason": "Some of the services are not initialized",
  ...
```
Following this, you will see an initialization list:
```
"init": [
    {
      "name": "Layout Server",
      "status": "UP",
      "reason": "Initialization successful"
    },
    {
      "name": "Sequencer",
      "status": "UP",
      "reason": "Initialization successful"
    },
    {
      "name": "Clustering Orchestrator",
      "status": "UP",
      "reason": "Initialization successful"
    },
    ...
```
This list will contain six different components: Layout Server, Sequencer, Clustering Orchestrator, Log Unit, Compactor, Failure Detector.
If any of these components fails to start, its entry in the init list shows the status and reason, and the overall Corfu status and reason reflect it as well.

The following is an example of a health report where Corfu comes up but the sequencer server is not starting correctly:
```
{
  "status": "DOWN",
  "reason": "Some of the services are not initialized",
  "init": [
    {
      "name": "Log Unit",
      "status": "UP",
      "reason": "Initialization successful"
    },
    {
      "name": "Layout Server",
      "status": "UP",
      "reason": "Initialization successful"
    },
    {
      "name": "Clustering Orchestrator",
      "status": "UP",
      "reason": "Initialization successful"
    },
    {
      "name": "Failure Detector",
      "status": "UP",
      "reason": "Initialization successful"
    },
    {
      "name": "Sequencer",
      "status": "DOWN",
      "reason": "Service is not initialized"
    }
  ],
```
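A report like the one above can be scanned for components that are not UP instead of reading the list by eye. The following is a minimal sketch, not a supported tool: the function name `not_up_components` is hypothetical, and it assumes the report is printed one field per line, with each "name" line followed by its "status" line, as in the excerpts in this article.

```shell
# Hypothetical helper: reads a health report on stdin and prints every
# component whose status is not UP. Assumes each "name" line is followed by
# the matching "status" line, as in the excerpts in this article.
not_up_components() {
  awk -F'"' '
    /"name":/   { name = $4 }   # remember the component name
    /"status":/ {
      # The top-level status has no preceding name and is skipped.
      if (name != "" && $4 != "UP") print name ": " $4
      name = ""
    }
  '
}

# Example: feed it the output of the grep command shown earlier.
# grep -i "health report" -A 75 /var/log/corfu/corfu.9000.log | not_up_components
```

The same filter applies to a "runtime" list, because it only looks at the name and status fields.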
1. If Log Unit is DOWN:
  
  Check for data corruption issues:
  
  For Corfu: grep -i "DataCorruptionException" /var/log/corfu/corfu.9000.log
  For Corfu-nonconfig: grep -i "DataCorruptionException" /var/log/corfu-nonconfig/corfu.9040.log
  If these commands return results, use one of the following KB articles to resolve the issue:
  NSX-T Manager service corfu-nonconfig-server is not running
  Corfu data file corruption seen in corfu.9000.log: "Checksum mismatch detected while trying to read file" or "Can't parse metadata"
  If you don't see these issues or require further assistance, open a case with Broadcom Support.
2. If Layout Server is DOWN:
  
  Check that the /config/corfu directory (for Corfu) and the /nonconfig/corfu directory (for Corfu-nonconfig) are writable and have sufficient free disk space.
  
  If either directory is not writable or is low on space, open a case with Broadcom Support.
3. If Clustering Orchestrator, Failure Detector, Sequencer, or Compactor is DOWN, open a case with Broadcom Support.
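
The writability and free-space check from step 2 above can be sketched as follows. This is only an illustration: the function name `check_corfu_dir` is hypothetical, and the minimum free space passed as the second argument is an arbitrary value chosen by the caller, not a threshold documented in this article.

```shell
# Hypothetical helper: reports whether a directory is writable and how much
# free space its filesystem has. $1 = directory, $2 = minimum free KB
# (an arbitrary threshold chosen by the caller, not a value from this article).
check_corfu_dir() {
  dir=$1
  min_kb=$2
  if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
    echo "$dir: missing or not writable"
    return 1
  fi
  # df -P prints one POSIX-format line per filesystem; with -k, column 4 is
  # the available space in KB.
  avail_kb=$(df -P -k "$dir" | awk 'NR == 2 { print $4 }')
  echo "$dir: writable, ${avail_kb} KB free"
  # Succeed only if the free space meets the caller's threshold.
  [ "$avail_kb" -ge "$min_kb" ]
}

# Example usage with the two directories named above:
# check_corfu_dir /config/corfu 1048576
# check_corfu_dir /nonconfig/corfu 1048576
```

A non-zero return value means the directory is missing, not writable, or below the chosen threshold, which is the point at which the article advises opening a support case.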

3. If Corfu-nonconfig is in a DOWN state, check for the following log signatures.

If you observe the following log signatures in your environment, the root cause is an underlying infrastructure issue (such as storage latency, network drops, or CPU starvation) rather than a software defect. These infrastructure bottlenecks push the system into a degraded state, and service recovers only temporarily, when the JVMs run out of memory and restart.

a. The status of every group except Corfu-nonconfig is up; only the CORFU_NONCONFIG group reports UNAVAILABLE:

grep "group_status" ./nsx_manager_#######/clustering.json

"group_status": "UNAVAILABLE"

"group_status": "UNAVAILABLE",
"group_type": "CORFU_NONCONFIG",
"leaders": [],

c. Storage I/O Errors and Read-Only Filesystems: The kernel logs are flooded with I/O errors, forcing the filesystem to remount as read-only.

kernel - - - [3533505.781733] EXT4-fs (dm-8): Remounting filesystem read-only
kernel - - - [3533505.782452] EXT4-fs error (device dm-8): ext4_journal_check_start:83: comm chown: Detected aborted journal

d. CPU Starvation and Soft Lockups:

watchdog: BUG: soft lockup - CPU#9 stuck for 103s! [failAfter-0:4101]
site-manager Wrapper Process has not received any CPU time for 150 seconds. Extending timeouts.

e. JVM Out of Memory (OOM) Restarts: The system may appear to recover when the JVM runs out of memory and restarts.

INFO | jvm 2 | java.lang.OutOfMemoryError: Compressed class space
INFO | jvm 3 | Welcome to CORFU SERVER

f. Network Packet Loss:

3 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2032ms
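
The kernel, JVM, and ping signatures listed above can be counted in one pass over a log file. The following is a rough sketch under stated assumptions: the function name `count_infra_signatures` is hypothetical, and the log file path is left to the caller because this article does not say which file each signature appears in.

```shell
# Hypothetical helper: counts lines in a log file that match the
# infrastructure signatures listed above. $1 is the log file to scan; the
# path is supplied by the caller because it depends on where the logs are.
count_infra_signatures() {
  file=$1
  for pattern in \
    "Remounting filesystem read-only" \
    "Detected aborted journal" \
    "soft lockup" \
    "has not received any CPU time" \
    "java.lang.OutOfMemoryError" \
    "100% packet loss"
  do
    # -F treats the pattern as a fixed string; -c prints the number of
    # matching lines (0 when nothing matches).
    count=$(grep -c -F -- "$pattern" "$file")
    printf '%s\t%s\n' "$count" "$pattern"
  done
}
```

Each output line is a count, a tab, and the signature text, so non-zero counts point to the symptom to raise in the support case.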

Action Plan

As these symptoms indicate severe underlying infrastructure constraints across the nodes, perform the following actions:

Perform a rolling reboot of the nodes.
If the issue persists after the reboot, open a support case with Broadcom for further investigation.

4. If Corfu/Corfu-nonconfig is in a DEGRADED state, check the complete health report.
Note: You need to access the NSX Manager node as the root user via SSH or via the console of the virtual machine.

For Corfu: grep -i "health report" -A 75 /var/log/corfu/corfu.9000.log
For Corfu-nonconfig: grep -i "health report" -A 75 /var/log/corfu-nonconfig/corfu.9040.log
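
If a log contains more than one health report, the grep commands above print every one of them. A minimal sketch for showing only the most recent report is given below; the function name `latest_health_report` is hypothetical, and the 75-line window simply mirrors the -A 75 option used above.

```shell
# Hypothetical helper: prints only the most recent health report in a log.
# $1 is the log file, for example /var/log/corfu/corfu.9000.log.
latest_health_report() {
  log=$1
  # Line number of the last line that mentions "health report".
  start=$(grep -n -i "health report" "$log" | tail -n 1 | cut -d: -f1)
  [ -n "$start" ] || { echo "no health report found in $log" >&2; return 1; }
  # Print that line plus the 75 lines after it, matching grep -A 75.
  sed -n "${start},$((start + 75))p" "$log"
}

# Example usage:
# latest_health_report /var/log/corfu-nonconfig/corfu.9040.log
```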

You will see a health report's status and reason fields similar to the following:

{
  "status": "FAILURE",
  "reason": "Some of the services experience runtime health issues",
  ...

Following this, you will see a runtime list:

"runtime": [
    {
      "name": "Log Unit",
      "status": "UP",
      "reason": "Up and running"
    },
    {
      "name": "Layout Server",
      "status": "UP",
      "reason": "Up and running"
    },
    {
      "name": "Clustering Orchestrator",
      "status": "UP",
      "reason": "Up and running"
    },
    ...

This list will contain six different components: Layout Server, Sequencer, Clustering Orchestrator, Log Unit, Compactor, Failure Detector.
If any of these components encounters an issue at runtime, its entry in the runtime list shows the status and reason, and the overall Corfu status and reason reflect it as well.

The following is an example of a health report where the sequencer server is experiencing runtime health issues:

{
  "status": "FAILURE",
  "reason": "Some of the services experience runtime health issues",
  ...
  "runtime": [
    {
      "name": "Log Unit",
      "status": "UP",
      "reason": "Up and running"
    },
    {
      "name": "Layout Server",
      "status": "UP",
      "reason": "Up and running"
    },
    {
      "name": "Clustering Orchestrator",
      "status": "UP",
      "reason": "Up and running"
    },
    {
      "name": "Failure Detector",
      "status": "UP",
      "reason": "Up and running"
    },
    {
      "name": "Sequencer",
      "status": "FAILURE",
      "reason": "Sequencer requires bootstrap"
    }

Note: Layout Server cannot enter a DEGRADED state

1. If Log Unit is in FAILURE, you will see the reason "Quota exceeded":
  This error indicates that the Corfu servers are running out of disk space. While it is present, only high-priority writes go through; all other writes are aborted with QuotaExceededException.
  If the health monitor logs show the compactor up and running, the disk space will eventually be reclaimed and the error will clear.
  If the compactor has not been running for more than 30 minutes (reason: "Last compaction cycle failed"), follow the instructions in the following articles:
  
  NSX Manager cluster degraded and UI inaccessible/Compactor running Out Of Memory
  NSX Manager cluster intermittently degraded due to Proton or Compactor running Out Of Memory
  If you require further assistance, open a case with Broadcom Support.
2. If Sequencer is in FAILURE, you will see the reason "Sequencer requires bootstrap".
  This is typically a temporary condition. Wait for 10 minutes or restart the Corfu servers.
  
  For Corfu: /etc/init.d/corfu-server restart
  For Corfu-nonconfig: /etc/init.d/corfu-nonconfig-server restart
  If the issue persists, open a case with Broadcom Support.
3. If Clustering Orchestrator or Failure Detector is in FAILURE, wait for 10 minutes or restart the Corfu servers.
  
  For Corfu: /etc/init.d/corfu-server restart
  For Corfu-nonconfig: /etc/init.d/corfu-nonconfig-server restart
  If the issue persists, open a case with Broadcom Support.
4. If Compactor is in FAILURE, you will see a reason "Last compaction cycle failed".
  Follow the instructions in the following articles:
  
  NSX Manager cluster degraded and UI inaccessible/Compactor running Out Of Memory
  NSX Manager cluster intermittently degraded due to Proton or Compactor running Out Of Memory
  If the issue persists, open a case with Broadcom Support.
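
The triage in the list above can be summarized as a simple lookup from the failing component to the next step. This is only a sketch for illustration: the function name `degraded_next_step` is hypothetical, and the messages paraphrase the steps in this article rather than quoting any tool output.

```shell
# Hypothetical helper: maps a component reported as FAILURE to the next step
# described in the list above. $1 is the component name from the runtime list.
degraded_next_step() {
  case $1 in
    "Log Unit")
      echo "Check whether the Compactor is running; if not, follow the Compactor Out Of Memory articles" ;;
    "Sequencer" | "Clustering Orchestrator" | "Failure Detector")
      echo "Wait 10 minutes or restart the Corfu server" ;;
    "Compactor")
      echo "Follow the Compactor Out Of Memory articles" ;;
    *)
      # Anything not covered by the list goes to support.
      echo "Open a case with Broadcom Support" ;;
  esac
}

# Example usage:
# degraded_next_step "Sequencer"
```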

Additional Information

If you encounter any of the above scenarios, gather the following information before engaging Broadcom Support (for more information, see Creating and managing Broadcom support cases):

NSX Manager support bundles
ESXi host support bundles for hosts that are failing to configure as transport nodes
Text of any error messages seen in the NSX GUI or on the command line that are pertinent to the investigation

See Collect Support Bundles and Uploading files to cases on the Broadcom Support Portal for more information.
