NSX-T cluster is unavailable after a datastore outage and 'deactivate cluster'

Article ID: 317152


Products

VMware NSX

Issue/Introduction

You have had a datastore outage on the ESXi hosts where the NSX-T managers reside.

  • To recover the cluster you used the 'deactivate cluster' command, and you now have a single-node cluster.
  • The command 'get cluster status' shows everything up except CORFU_NONCONFIG, which is in an UNKNOWN state (a consolidated set of these checks is sketched after the log excerpts below):
Group Type: CORFU_NONCONFIG
Group Status: UNAVAILABLE
Members:
    UUID FQDN IP STATUS
    206dea88-####-####-####-###########de nsx-03.example.local 192.168.1.203 UNKNOWN
  • The layout file /nonconfig/corfu/corfu/LAYOUT_CURRENT.ds shows the two removed nodes are still present:
  "layoutServers": [
    "192.168.1.201:9040", ##########this node was removed
    "192.168.1.202:9040", ##########this node was removed
    "192.168.1.203:9040"
  ],
  "sequencers": [
    "192.168.1.201:9040", ##########this node was removed
    "192.168.1.202:9040", ##########this node was removed
    "192.168.1.203:9040"
  ],
  "segments": [
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 0,
      "end": -1,
      "stripes": [
        {
          "logServers": [
            "192.168.1.202:9040", ##########this node was removed
            "192.168.1.201:9040" ##########this node was removed
          ]
        }
      ]
    }
  ],
  "unresponsiveServers": [
    "192.168.1.203:9040"
  ],
  "epoch": 1825,
  "clusterId": "ff44209a-####-####-####-##########86"
  • When you run the command 'get cluster config' on the remaining node, it correctly shows only the single remaining node.
  • In the following log, /var/log/corfu-nonconfig/corfu.9040.log, we can see a loop looking for the now-detached nodes:
2022-05-31T14:59:51.264Z | DEBUG | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Layout server 192.168.1.203:9040 responded with layout Layout(layoutServers=[192.168.1.201:9040, 192.168.1.202:9040, 192.168.1.203:9040], sequencers=[192.168.1.202:9040, 192.168.1.201:9040, 192.168.1.203:9040], segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.1.202:9040, 192.168.1.201:9040])])], unresponsiveServers=[192.168.1.203:9040], epoch=1725, clusterId=ff44209a-####-####-####-##########86)
2022-05-31T14:59:51.264Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery layout epoch:1725, Cluster epoch: 1725
2022-05-31T14:59:51.264Z | ERROR | initializationTaskThread | o.c.i.ManagementAgent | initializationTask: Recovery failed 1364 times. Retrying in PT1Ss.
2022-05-31T14:59:52.262Z | DEBUG | client-5 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:52.265Z | DEBUG | initializationTaskThread | o.c.runtime.view.RuntimeLayout | Requested move of servers to new epoch 1726 servers are [192.168.1.203:9040, 192.168.1.202:9040, 192.168.1.201:9040]
2022-05-31T14:59:52.265Z | INFO | initializationTaskThread | o.c.runtime.clients.BaseClient | sealRemoteServer: send SEAL from me(clientId=null) to new epoch 1726
...
2022-05-31T14:59:52.464Z | DEBUG | client-6 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.201:9040]: Channel connection failed, reconnecting...
...
2022-05-31T14:59:53.265Z | DEBUG | initializationTaskThread | o.c.r.v.QuorumFuturesFactory | QuorumGet: Exception TimeoutException
2022-05-31T14:59:53.265Z | ERROR | initializationTaskThread | o.c.r.v.LayoutManagementView | Error: recovery: {}
org.corfudb.runtime.exceptions.QuorumUnreachableException: Couldn't reach quorum, reachable=1, required=2
        at 
...
2022-05-31T14:59:53.265Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery reconfiguration attempt result: false
2022-05-31T14:59:53.763Z | DEBUG | client-7 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:53.766Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 192.168.1.202:9040 but failed by timeout
  • In the following log, /var/log/corfu-nonconfig/nonconfig-corfu-compactor-audit.log, we also see requests to the now-detached nodes timing out:
2022-05-31T14:11:10.601Z WARN CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.201:9040 but failed by timeout
2022-05-31T14:11:11.102Z WARN CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.202:9040 but failed by timeout
2022-05-31T14:11:31.203Z ERROR main UfoCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP2" level="ERROR" subcomp="corfu-compactor"] UFO: Trim failed for ufo data in namespace ufo
org.corfudb.runtime.exceptions.UnreachableClusterException: Cluster is unavailable
        at com.vmware.nsx.platform.ufo.CorfuRuntimeHelper$1.run(CorfuRuntimeHelper.java:43) ...
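
For reference, the checks above can be run together from the remaining manager node. The following is a minimal sketch, assuming an NSX CLI session as admin plus root shell access on the appliance; hostnames and output will vary by environment.

From the NSX CLI (admin):
get cluster status     # CORFU_NONCONFIG group reports UNAVAILABLE, member status UNKNOWN
get cluster config     # correctly lists only the single remaining node

From a root shell on the appliance:
cat /nonconfig/corfu/corfu/LAYOUT_CURRENT.ds      # removed nodes still appear in the layout
tail -f /var/log/corfu-nonconfig/corfu.9040.log   # shows the recovery retry loop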

Environment

VMware NSX-T Data Center

Cause

This issue happens when an underlying datastore outage is followed by the 'deactivate cluster' command being issued in an attempt to recover the cluster, and the node on which the command is executed is itself unresponsive at the time.

Because this node was already unhealthy when 'deactivate cluster' was issued, it must cure itself before it can form a working cluster. To cure itself, however, it needs information from the other two nodes. As those nodes no longer exist, the remaining node can never cure itself, and therefore the cluster is down. Corfu requires a majority quorum (two of the original three nodes) to reseal its layout, so with only the local node reachable every recovery attempt fails, as seen in the QuorumUnreachableException above (reachable=1, required=2).

The loop in the logs is this remaining unhealthy node continually trying to reconnect to the other two nodes in order to cure itself.

Best practice, when a datastore issue impacts all three managers, is to restore from backup to a point in time before the datastore issue occurred.
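
As a preventative check, confirm that scheduled backups are configured before they are needed. The call below is a minimal sketch against the NSX-T REST API (the backup configuration endpoint as documented in the NSX-T Data Center API guide; verify the path for your version and substitute your own manager FQDN and credentials):

curl -k -u 'admin' https://nsx-03.example.local/api/v1/cluster/backups/config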

Resolution

This is a known issue impacting NSX-T Data Center when all three manager nodes reside on the same datastore.

Workaround:
Restore the NSX-T cluster from backup to a point before the datastore outage occurred: see the Backup restore guide.
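
After the restore completes, verify cluster health from the NSX CLI on each manager node. This sketch reuses the commands referenced above; the expected statuses assume a healthy three-node cluster:

get cluster status            # all groups, including CORFU_NONCONFIG, should report STABLE
get cluster status verbose    # per-member detail; all members should show UP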