NSX Manager nodes fail to load, reporting "error code 101." and/or "Some appliance components are not functioning properly"

Article ID: 416144

Products

VMware NSX

Issue/Introduction

In an NSX environment, the NSX Manager cluster may become unstable with most or all system services showing as Unavailable. Administrators may observe that general commands executed on the NSX Managers (for example, get service) either hang or return with no output.
The NSX Manager UI may also be inaccessible or highly unresponsive.

Additionally, communication failures between NSX Managers and Edge nodes may occur, impacting control plane connectivity and management operations.

Discrepancies may be seen in the layout epoch numbers and synchronization state among NSX Manager nodes when checking the following file on each node:

# less /config/corfu/LAYOUT_CURRENT.ds
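The comparison above can be sketched as a small script. This is a minimal illustration, not a supported tool. It assumes that a copy of LAYOUT_CURRENT.ds has been collected from each node onto a workstation, and that the file is JSON with a top-level "epoch" field (an assumption about the layout format that this article does not confirm). The function names and node labels are hypothetical.

```python
import json
from pathlib import Path


def read_epoch(path):
    """Return the 'epoch' value from a copied layout file, or None if unreadable.

    Assumption: the layout file is JSON with a top-level "epoch" key.
    """
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # Missing file, unreadable file, or content that is not valid JSON.
        return None
    return data.get("epoch") if isinstance(data, dict) else None


def compare_epochs(files_by_node):
    """Map each node label to its epoch and report whether all nodes agree.

    files_by_node: dict of hypothetical node label -> path to that node's copy.
    Returns (epochs, consistent); consistent is False if any epoch is missing.
    """
    epochs = {node: read_epoch(path) for node, path in files_by_node.items()}
    values = list(epochs.values())
    consistent = None not in values and len(set(values)) == 1
    return epochs, consistent
```

For example, calling compare_epochs({"mgr-1": "mgr1.ds", "mgr-2": "mgr2.ds", "mgr-3": "mgr3.ds"}) returns the epoch found in each copy and a flag that is False when the values differ or any file could not be parsed. The script only reads the copies and does not modify anything on the NSX Managers.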

The Corfu log file (/var/log/corfu/corfu.9000.log) on one or more NSX Managers may also report data corruption errors similar to:

 
<Time-stamp> | ERROR | WrapperSimpleAppMain | ###### | CorfuServer: Server exiting due to unrecoverable error: org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file
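To check several nodes for this error quickly, the log can be scanned for the exception name. The following is a minimal sketch that assumes a copy of corfu.9000.log is readable from wherever the script runs; the function names are hypothetical, and the only string it matches is the exception name shown in the sample above.

```python
# Exception name taken from the sample log line in this article.
MARKER = "DataCorruptionException"


def find_corruption_lines(lines):
    """Return (line_number, text) pairs for lines that mention the exception."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MARKER in line
    ]


def scan_log(path="/var/log/corfu/corfu.9000.log"):
    """Scan a Corfu log file (or a copy of it) for corruption errors."""
    # errors="replace" keeps the scan going if the log contains invalid bytes.
    with open(path, errors="replace") as handle:
        return find_corruption_lines(handle)
```

An empty result from scan_log() for a given node means the exception name does not appear in that file; it does not by itself prove the node is healthy, since older entries may have been rotated out of the current log.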

Environment

VMware NSX 4.x

Cause

Corruption within the Corfu data store can lead to synchronization failures and instability across the NSX Manager cluster. This corruption may occur due to underlying storage or network-related issues that impact data consistency and communication between cluster nodes.

Resolution

Scenario 1:

If only one NSX Manager node shows signs of Corfu data corruption, refer to the KB article "Corfu data file corruption seen in corfu.9000.log" for detailed recovery steps. In such cases, the resolution typically involves detaching the corrupted node from the cluster, then shutting down and deleting the affected node.

Scenario 2:

If only one of the three NSX Manager nodes remains healthy, run the 'deactivate cluster' command from the healthy node. Ensure that all corrupted nodes are powered off before running this command. Refer to the KB article "NSX Manager cluster is DOWN or UNAVAILABLE if all nodes part of the NSX Manager cluster is down or majority nodes are down" for step-by-step guidance on this process.

Scenario 3:

Verify whether the issue matches any known problems, such as the JDK bug described in the KB article "NSX is Impacted by JDK-8330017".