Datastore is Down on one of the nodes in the NSX manager cluster

Article ID: 387930


Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

- An NSX manager node has the DATASTORE service DOWN.

- Running "get cluster status" from the NSX CLI shows the DATASTORE service as DOWN for a particular manager node:

nsx-mgr> get cluster status
Wed Jan 21 2025 UTC 07:06:27.343
Cluster Id: #######-########-#######
Overall Status: DEGRADED

Group Type: DATASTORE
Group Status: DEGRADED

Members:
   UUID FQDN IP STATUS
   #######-########-####### nsx-mgr1 x.x.x.x DOWN
   #######-########-####### nsx-mgr2 x.x.x.x UP
   #######-########-####### nsx-mgr3 x.x.x.x UP

- Check the /config partition utilization with "df -h" (see the example below). If the managers show /config usage in the low single digits (percent), the partition itself is in good shape.
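For reference, a minimal sketch of this check from the root shell of a manager node; the hostname, device name, and sizes shown are placeholders and will vary per deployment:

root@nsx-mgr1:~# df -h /config
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdX#        25G  1.1G   23G   5% /config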

- Next, verify the LAYOUT_CURRENT.ds file by running "cat /config/corfu/LAYOUT_CURRENT.ds" on the managers that are in a good state. In the output below, the 3rd manager node appears in the "unresponsiveServers" section on port 9000, while the "segments" section lists only the 2 healthy managers under "logServers" with a start of 0 and an end of -1. This means those 2 nodes have full segment visibility across the entire sequence address space (where the storage units are mapped), and the 3rd node does not.

root@nsx-mngr-01:~# cat /config/corfu/LAYOUT_CURRENT.ds
{
  "layoutServers": [
    "#.#.#.1:9000",
    "#.#.#.2:9000",
    "#.#.#.3:9000"
  ],
  "sequencers": [
    "#.#.#.1:9000",
    "#.#.#.2:9000",
    "#.#.#.3:9000"
  ],
  "segments": [
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 0,
      "end": -1,
      "stripes": [
        {
          "logServers": [
            "#.#.#.1:9000",
            "#.#.#.2:9000"
          ]
        }
      ]
    }
  ],
  "unresponsiveServers": [
    "#.#.#.3:9000"
  ],
  "epoch": 2365,
  "clusterId": "#######-#######-############"
}

- Any node listed in the "unresponsiveServers" section should have the corfu-server service checked to make sure it is running: service corfu-server status

If the corfu-server service is stopped, start it with "service corfu-server start".

Watch corfu.9000.log with "tail -F /var/log/corfu/corfu.9000.log" and wait for the service to initialize.

It may take several moments for the service to start and the logging to scroll. Look for stack traces in the log if the corfu-server service crashes or fails to start.
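Putting those checks together, a short sketch of the sequence on the affected node, using only the commands referenced above; the hostname is a placeholder:

root@nsx-mgr3:~# service corfu-server status
root@nsx-mgr3:~# service corfu-server start
root@nsx-mgr3:~# tail -F /var/log/corfu/corfu.9000.log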

Environment

VMware NSX

VMware NSX-T

Cause

Data corruption on a manager node can cause the DATASTORE group to report DOWN for that node. This can be caused by various factors, including underlying storage issues or file system errors on that VM.

Resolution

Workaround:

1. First, reboot the NSX manager node that has the datastore issue (see the example below).
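A minimal sketch, assuming the reboot is issued from the NSX CLI of the affected node; the hostname is a placeholder, and the appliance can equally be restarted from vCenter:

nsx-mgr3> reboot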

2. If the status is still the same after the reboot, check the file system on that manager node and make sure it is clean by performing the file system corrections in this KB: https://knowledge.broadcom.com/external/article?articleNumber=320303

3. If the DATASTORE status is still down, verify that the corfu-server service is up and running with "service corfu-server status"; if it is not, start it with "service corfu-server start" and verify again.

4. If the issue remains, verify the LAYOUT_CURRENT.ds file on all managers that have all services up and running: cat /config/corfu/LAYOUT_CURRENT.ds. If the "segments" section lists only the 2 healthy managers under "logServers" with a start of 0 and an end of -1, those 2 nodes have full segment visibility across the entire sequence address space (where the storage units are mapped), and the 3rd node does not:

"segments": [
{
"replicationMode": "CHAIN_REPLICATION",
"start": 0,
"end": -1,
"stripes": [
{
"logServers": [
"#.#.#.1:9000",
"#.#.#.3:9000"
]
}
]
}
],
"unresponsiveServers": [
"#.#.#.3:9000"
],

5. The 3rd node does not have full segment visibility, as shown above, and needs to be removed from the cluster.

6. To remove a single manager node from the CorfuDB cluster: 

Run "get cluster config" or "get nodes" to identify the uuid of the 3 Manager nodes

From a previously identified good node, detach the bad manager:

NSX_MGR01> detach node <node_UUID>

Confirm it has been removed from the cluster using "get cluster status" or "get cluster config" (see the sketch after this step).

Power off and delete the VM.

If the "detach node" command fails, it usually means there is an issue with the cluster boot manager (CBM). Check /var/log/cbm/cbm.log for errors around the string "detach".
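Putting the removal workflow together, a hedged end-to-end sketch run from the CLI of a healthy manager, followed by the log check from its root shell if the detach fails; node names and the UUID are placeholders:

nsx-mgr1> get cluster config
nsx-mgr1> detach node <node_UUID>
nsx-mgr1> get cluster status

root@nsx-mgr1:~# grep -i detach /var/log/cbm/cbm.log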

7. Deploy a new 3rd manager node from the NSX Manager UI using this documentation: https://techdocs.broadcom.com/us/en/vmware-cis/nsx/vmware-nsx/4-1/installation-guide/installing-nsx-manager-cluster-on-vsphere/install-nsx-manager-and-available-appliances/deploy-nsx-manager-nodes-to-form-a-cluster-using-ui.html

 

Additional Information

If the suggested workaround steps do not resolve the issue, please consider opening a support case with Broadcom. Include the error screenshot or details, along with log bundles from all NSX manager nodes, for further assistance.