You have had a datastore outage on the ESXi hosts where the NSX-T Manager nodes reside.
- To recover the cluster, you ran the 'deactivate cluster' command and now have a single-node cluster.
- The command 'get cluster status' shows every group as up except CORFU_NONCONFIG, which is UNAVAILABLE and whose only member is in an UNKNOWN state:
Group Type: CORFU_NONCONFIG
Group Status: UNAVAILABLE
Members:
UUID FQDN IP STATUS
206dea88-####-####-####-###########de nsx-03.example.local 192.168.1.203 UNKNOWN
- The file /nonconfig/corfu/corfu/LAYOUT_CURRENT.ds shows that the other two nodes are still present in the layout:
"layoutServers": [
"192.168.1.201:9040", ##########this node was removed
"192.168.1.202:9040", ##########this node was removed
"192.168.1.203:9040"
],
"sequencers": [
"192.168.1.201:9040", ##########this node was removed
"192.168.1.202:9040", ##########this node was removed
"192.168.1.203:9040"
],
"segments": [
{
"replicationMode": "CHAIN_REPLICATION",
"start": 0,
"end": -1,
"stripes": [
{
"logServers": [
"192.168.1.202:9040", ##########this node was removed
"192.168.1.201:9040" ##########this node was removed
]
}
]
}
],
"unresponsiveServers": [
"192.168.1.203:9040"
],
"epoch": 1825,
"clusterId": "ff44209a-####-####-####-##########86"
- When you run the command 'get cluster config' on the remaining node, it correctly shows only the single remaining node.
- The log /var/log/corfu-nonconfig/corfu.9040.log shows a recovery loop that keeps trying to reach the now-detached nodes:
2022-05-31T14:59:51.264Z | DEBUG | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Layout server 192.168.1.203:9040 responded with layout Layout(layoutServers=[192.168.1.201:9040, 192.168.1.202:9040, 192.168.1.203:9040], sequencers=[192.168.1.202:9040, 192.168.1.201:9040, 192.168.1.203:9040], segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.1.202:9040, 192.168.1.201:9040])])], unresponsiveServers=[192.168.1.203:9040], epoch=1725, clusterId=ff44209a-####-####-####-##########86)
2022-05-31T14:59:51.264Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery layout epoch:1725, Cluster epoch: 1725
2022-05-31T14:59:51.264Z | ERROR | initializationTaskThread | o.c.i.ManagementAgent | initializationTask: Recovery failed 1364 times. Retrying in PT1Ss.
2022-05-31T14:59:52.262Z | DEBUG | client-5 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:52.265Z | DEBUG | initializationTaskThread | o.c.runtime.view.RuntimeLayout | Requested move of servers to new epoch 1726 servers are [192.168.1.203:9040, 192.168.1.202:9040, 192.168.1.201:9040]
2022-05-31T14:59:52.265Z | INFO | initializationTaskThread | o.c.runtime.clients.BaseClient | sealRemoteServer: send SEAL from me(clientId=null) to new epoch 1726
...
2022-05-31T14:59:52.464Z | DEBUG | client-6 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.201:9040]: Channel connection failed, reconnecting...
...
2022-05-31T14:59:53.265Z | DEBUG | initializationTaskThread | o.c.r.v.QuorumFuturesFactory | QuorumGet: Exception TimeoutException
2022-05-31T14:59:53.265Z | ERROR | initializationTaskThread | o.c.r.v.LayoutManagementView | Error: recovery: {}
org.corfudb.runtime.exceptions.QuorumUnreachableException: Couldn't reach quorum, reachable=1, required=2
at
...
2022-05-31T14:59:53.265Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery reconfiguration attempt result: false
2022-05-31T14:59:53.763Z | DEBUG | client-7 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:53.766Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 192.168.1.202:9040 but failed by timeout
- The log /var/log/corfu-nonconfig/nonconfig-corfu-compactor-audit.log also shows requests to the now-detached nodes timing out:
2022-05-31T14:11:10.601Z WARN CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.201:9040 but failed by timeout
2022-05-31T14:11:11.102Z WARN CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.202:9040 but failed by timeout
2022-05-31T14:11:31.203Z ERROR main UfoCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP2" level="ERROR" subcomp="corfu-compactor"] UFO: Trim failed for ufo data in namespace ufo
org.corfudb.runtime.exceptions.UnreachableClusterException: Cluster is unavailable
at com.vmware.nsx.platform.ufo.CorfuRuntimeHelper$1.run(CorfuRuntimeHelper.java:43) ...
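The symptoms above fit together arithmetically: the layout still lists three layout servers, only one of them answers, and the log reports "reachable=1, required=2". The following is a minimal Python sketch of that check, not part of any NSX-T or Corfu tooling. It assumes a simple majority rule (n // 2 + 1), which matches the numbers in the log but is not taken from the Corfu source. The function names majority_quorum and diagnose_layout are hypothetical, and the layout is typed in from the excerpt above rather than read from the file on disk.

```python
def majority_quorum(member_count: int) -> int:
    """Smallest member count that forms a strict majority (assumed rule)."""
    return member_count // 2 + 1


def diagnose_layout(layout: dict, reachable: set) -> dict:
    """Compare the servers named in a layout with the nodes that still answer."""
    layout_servers = layout["layoutServers"]
    # Servers still listed in the layout that no longer respond.
    stale = [s for s in layout_servers if s not in reachable]
    reached = len(layout_servers) - len(stale)
    required = majority_quorum(len(layout_servers))
    # Flatten every log server named in every stripe of every segment.
    log_servers = [
        server
        for segment in layout["segments"]
        for stripe in segment["stripes"]
        for server in stripe["logServers"]
    ]
    return {
        "stale": stale,
        "reachable": reached,
        "required": required,
        "quorum_ok": reached >= required,
        "reachable_log_servers": [s for s in log_servers if s in reachable],
    }


# Values copied from the LAYOUT_CURRENT.ds excerpt in this article.
layout = {
    "layoutServers": [
        "192.168.1.201:9040",
        "192.168.1.202:9040",
        "192.168.1.203:9040",
    ],
    "segments": [
        {"stripes": [{"logServers": ["192.168.1.202:9040", "192.168.1.201:9040"]}]}
    ],
}

# Only the surviving manager answers on port 9040.
reachable = {"192.168.1.203:9040"}

result = diagnose_layout(layout, reachable)
print(result)
```

With these inputs the two removed nodes come back as stale, quorum is not met, and none of the log servers named in the segment is reachable, which is consistent with the recovery task retrying indefinitely.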