- NSX manager node has DATASTORE service DOWN
- Running "get cluster status" from the NSX CLI shows the DATASTORE service as DOWN for a particular manager node:
nsx-mgr> get cluster status
Wed Jan 21 2025 UTC 07:06:27.343
Cluster Id: #######-########-#######
Overall Status: DEGRADED
Group Type: DATASTORE
Group Status: DEGRADED
Members:
UUID                        FQDN        IP          STATUS
#######-########-#######    nsx-mgr1    x.x.x.x     DOWN
#######-########-#######    nsx-mgr2    x.x.x.x     UP
#######-########-#######    nsx-mgr3    x.x.x.x     UP
- Check the /config partition utilization with "df -h". If the managers show /config usage in the low single digits (percent), disk space is not the cause; an illustrative example follows.
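For example, a healthy manager could show output like the following (the device name and sizes are illustrative only; the "Use%" value for /config is what matters):
root@nsx-mngr-01:~# df -h /config
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda#        25G  1.1G   22G   5% /config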
- Next, verify the LAYOUT_CURRENT.ds file by running "cat /config/corfu/LAYOUT_CURRENT.ds" on all managers that are in a good state. In the output below, the 3rd manager node appears in the "unresponsiveServers" section on port 9000, while the "segments" section shows 2 managers inside the "logServers" brackets with a start of 0 and an end of -1. This means those 2 nodes have full segment visibility across all sequence address spaces (where the storage units are mapped), and the 3rd node does not.
root@nsx-mngr-01:~# cat /config/corfu/LAYOUT_CURRENT.ds
{
  "layoutServers": [
    "#.#.#.1:9000",
    "#.#.#.2:9000",
    "#.#.#.3:9000"
  ],
  "sequencers": [
    "#.#.#.1:9000",
    "#.#.#.2:9000",
    "#.#.#.3:9000"
  ],
  "segments": [
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 0,
      "end": -1,
      "stripes": [
        {
          "logServers": [
            "#.#.#.1:9000",
            "#.#.#.3:9000"
          ]
        }
      ]
    }
  ],
  "unresponsiveServers": [
    "#.#.#.3:9000"
  ],
  "epoch": 2365,
  "clusterId": "#######-#######-############"
}
- Any node(s) showing up in the "unresponsiveServers" section should have the corfu-server service checked to make sure it is running: service corfu-server status
If the corfu-server service is stopped, start it with "service corfu-server start".
Watch the log with "tail -F /var/log/corfu/corfu.9000.log" and wait for the service to initialize.
It may take several moments for the service to start and the logging to scroll. If the corfu-server service crashes or fails to start, look for stack traces in the log. A consolidated example sequence is shown below.
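The commands above can be run as a quick sequence on the affected node (prompt shown for illustration):
root@nsx-mngr-01:~# service corfu-server status
root@nsx-mngr-01:~# service corfu-server start        # only if the previous command shows the service stopped
root@nsx-mngr-01:~# tail -F /var/log/corfu/corfu.9000.log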
VMware NSX
VMware NSX-T
Data corruption on a manager node can cause these datastore issues. It can be triggered by various factors, such as underlying storage issues or file system errors on that VM.
Workaround:
1. First, reboot the NSX Manager node that has the datastore issue.
2. If the issue persists after the reboot, check the file system on that manager node and make sure it is clean by performing the file system corrections described in this KB: https://knowledge.broadcom.com/external/article?articleNumber=320303
3. If the DATASTORE status is still down, verify that the corfu-server service is up and running with "service corfu-server status". If it is not, start it with "service corfu-server start" and verify again.
4. If the issue still remains, verify LAYOUT_CURRENT.ds on all managers that have all services up and running, and check whether the faulty node holds the complete DB: cat /config/corfu/LAYOUT_CURRENT.ds. The "segments" section below shows 2 managers inside the "logServers" brackets with a start of 0 and an end of -1, which means those 2 nodes have full segment visibility across all sequence address spaces (where the storage units are mapped), and the 3rd node does not.
"segments": [
{
"replicationMode": "CHAIN_REPLICATION",
"start": 0,
"end": -1,
"stripes": [
{
"logServers": [
"#.#.#.1:9000",
"#.#.#.3:9000"
]
}
]
}
],
"unresponsiveServers": [
"#.#.#.3:9000"
],
5. As shown above, the 3rd node does not have full cluster visibility and needs to be removed from the cluster.
6. To remove a single manager node from the CorfuDB cluster:
Run "get cluster config" or "get nodes" to identify the uuid of the 3 Manager nodes
From a previously identified good node detach the bad Manager
NSX_MGR01> detach node <node_UUID>
Confirm it has been removed from the cluster using "get cluster status/config".
Power off and delete the VM.
If the detach node command fails, it usually means there is an issue with the cluster boot manager. Check /var/log/cbm/cbm.log for errors around the string "detach". An example command sequence is shown below.
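For illustration, a detach session run from a healthy manager could look like the following, where <node_UUID> is a placeholder for the UUID identified in the previous step:
NSX_MGR01> get cluster config
NSX_MGR01> detach node <node_UUID>
NSX_MGR01> get cluster status
NSX_MGR01> get cluster config
If the detach fails, the cluster boot manager log can be searched for related errors, for example with: grep -i detach /var/log/cbm/cbm.log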
7. Deploy a new 3rd manager node from the NSX Manager UI using this doc: https://techdocs.broadcom.com/us/en/vmware-cis/nsx/vmware-nsx/4-1/installation-guide/installing-nsx-manager-cluster-on-vsphere/install-nsx-manager-and-available-appliances/deploy-nsx-manager-nodes-to-form-a-cluster-using-ui.html
If the suggested workaround steps do not resolve the issue, please consider submitting a support case to Broadcom. Include the error screenshot or details, along with the log bundles from all NSX Manager nodes, for further assistance.