Datastore is Down on one of the nodes in the NSX manager cluster

Article ID: 387930


Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

- An NSX manager node has the DATASTORE service DOWN.

- Running "get cluster status" from the NSX CLI shows the DATASTORE service as DOWN for a particular manager node:

nsx-mgr> get cluster status
Wed Jan 21 2025 UTC 07:06:27.343
Cluster Id: #######-########-#######
Overall Status: DEGRADED

Group Type: DATASTORE
Group Status: DEGRADED

Members:
   UUID FQDN IP STATUS
   #######-########-####### nsx-mgr1 x.x.x.x DOWN
   #######-########-####### nsx-mgr2 x.x.x.x UP
   #######-########-####### nsx-mgr3 x.x.x.x UP

- Check the /config partition utilization with "df -h" (see the example below). If the managers show /config usage in the low single digits (percent), the partition itself is in good shape.
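For reference, a minimal sketch of this check from the root shell of a manager node; the hostname, device name, and sizes shown are placeholders and will vary per deployment:

root@nsx-mgr1:~# df -h /config
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdX#        25G  1.1G   23G   5% /config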

- Next, verify the LAYOUT_CURRENT.ds file by running "cat /config/corfu/LAYOUT_CURRENT.ds" on the managers that are in a good state. In the output below, the 3rd manager node appears in the "unresponsiveServers" section on port 9000, while the "segments" section lists only the 2 healthy managers under "logServers" with a start of 0 and an end of -1. This means those 2 nodes have full segment visibility across the entire sequence address space (where the storage units are mapped), and the 3rd node does not.

root@nsx-mngr-01:~# cat /config/corfu/LAYOUT_CURRENT.ds
{
  "layoutServers": [
    "#.#.#.1:9000",
    "#.#.#.2:9000",
    "#.#.#.3:9000"
  ],
  "sequencers": [
    "#.#.#.1:9000",
    "#.#.#.2:9000",
    "#.#.#.3:9000"
  ],
  "segments": [
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 0,
      "end": -1,
      "stripes": [
        {
          "logServers": [
            "#.#.#.1:9000",
            "#.#.#.2:9000"
          ]
        }
      ]
    }
  ],
  "unresponsiveServers": [
    "#.#.#.3:9000"
  ],
  "epoch": 2365,
  "clusterId": "#######-#######-############"
}

- Any node listed in the "unresponsiveServers" section should have the corfu-server service checked to make sure it is running: service corfu-server status

If the corfu-server service is stopped, start it with "service corfu-server start".

Watch corfu.9000.log with "tail -F /var/log/corfu/corfu.9000.log" and wait for the service to initialize.

It may take several moments for the service to start and the logging to scroll. Look for stack traces in the log if the corfu-server service crashes or fails to start.
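Putting those checks together, a short sketch of the sequence on the affected node, using only the commands referenced above; the hostname is a placeholder:

root@nsx-mgr3:~# service corfu-server status
root@nsx-mgr3:~# service corfu-server start
root@nsx-mgr3:~# tail -F /var/log/corfu/corfu.9000.log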

Environment

VMware NSX

VMware NSX-T

Cause

Data corruption on a manager node can cause the DATASTORE group to report DOWN for that node. This can be caused by various factors, including underlying storage issues or file system errors on that VM.

Resolution

Workaround:

1. First, reboot the NSX manager node that has the datastore issue (see the example below).
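A minimal sketch, assuming the reboot is issued from the NSX CLI of the affected node; the hostname is a placeholder, and the appliance can equally be restarted from vCenter:

nsx-mgr3> reboot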

2. If the status is still the same after the reboot, check the file system on that manager node and make sure it is clean by performing the file system corrections in this KB: https://knowledge.broadcom.com/external/article?articleNumber=320303

3. If the DATASTORE status is still down, verify that the corfu-server service is up and running with "service corfu-server status"; if it is not, start it with "service corfu-server start" and verify again.

4. If the issue remains, verify the LAYOUT_CURRENT.ds file on all managers that have all services up and running: cat /config/corfu/LAYOUT_CURRENT.ds. If the "segments" section lists only the 2 healthy managers under "logServers" with a start of 0 and an end of -1, those 2 nodes have full segment visibility across the entire sequence address space (where the storage units are mapped), and the 3rd node does not:

"segments": [
{
"replicationMode": "CHAIN_REPLICATION",
"start": 0,
"end": -1,
"stripes": [
{
"logServers": [
"#.#.#.1:9000",
"#.#.#.3:9000"
]
}
]
}
],
"unresponsiveServers": [
"#.#.#.3:9000"
],

5. The 3rd node does not have full segment visibility, as shown above, and needs to be removed from the cluster.

6. To remove a single manager node from the CorfuDB cluster: 

Run "get cluster config" or "get nodes" to identify the uuid of the 3 Manager nodes

From a previously identified good node, detach the bad manager:

NSX_MGR01> detach node <node_UUID>

Confirm it has been removed from the cluster using "get cluster status" or "get cluster config" (see the sketch after this step).

Power off and delete the VM.

If the "detach node" command fails, it usually means there is an issue with the cluster boot manager (CBM). Check /var/log/cbm/cbm.log for errors around the string "detach".
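Putting the removal workflow together, a hedged end-to-end sketch run from the CLI of a healthy manager, followed by the log check from its root shell if the detach fails; node names and the UUID are placeholders:

nsx-mgr1> get cluster config
nsx-mgr1> detach node <node_UUID>
nsx-mgr1> get cluster status

root@nsx-mgr1:~# grep -i detach /var/log/cbm/cbm.log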

7. Deploy a new 3rd manager node from the NSX Manager UI using this documentation: https://techdocs.broadcom.com/us/en/vmware-cis/nsx/vmware-nsx/4-1/installation-guide/installing-nsx-manager-cluster-on-vsphere/install-nsx-manager-and-available-appliances/deploy-nsx-manager-nodes-to-form-a-cluster-using-ui.html

 

Additional Information

If the suggested workaround steps do not resolve the issue, please consider opening a support case with Broadcom. Include the error screenshot or details, along with log bundles from all NSX manager nodes, for further assistance.