NSX Transport nodes are intermittently shown as down in NSX Management

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

The NSX Management plane intermittently reports some NSX Transport nodes as down.
ESXi and/or Edge Transport nodes are affected.
The connection status recovers without any action.
On NSX Managers, log lines similar to the following appear in /var/log/cloudnet/nsx-ccp.log:

WARN netty-3 ClientResponseHandler 1597 Server threw exception for SERVER_ERROR with request_id: 171726533
WARN org.corfudb.runtime.collections.streaming.StreamPollingScheduler AbstractView 1597 Got a wrong epoch exception, updating epoch to 2240 and invalidate view
INFO org.corfudb.runtime.collections.streaming.StreamPollingScheduler AbstractView 1597 layoutHelper: Retried 0 times, SystemDownHandlerTriggerLimit = 20
[...]
WARN org.corfudb.runtime.collections.streaming.StreamPollingScheduler AbstractView 1597 Server still not ready. Waiting for server to start accepting requests.
INFO org.corfudb.runtime.collections.streaming.StreamPollingScheduler AbstractView 1597 layoutHelper: Retried 19 times, SystemDownHandlerTriggerLimit = 20
INFO org.corfudb.runtime.collections.streaming.StreamPollingScheduler AbstractView 1597 layoutHelper: Invoking the systemDownHandler.
WARN org.corfudb.runtime.collections.streaming.StreamPollingScheduler CorfuDbConnector 1597 - [nsx@6876 comp="nsx-controller" level="WARNING" subcomp="corfu-service"] Restart CCP after Corfu system down
INFO nsx-rpc:CCP-###-###-###-###-###:user-executor-0 LrpDadStateReplicator 4063308 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="transport-node-adapter"] Transport node ###-###-###-###-### connection is down
INFO nsx-rpc:CCP-###-###-###-###-###:user-executor-0 LrpDadStateReplicator 4063308 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="transport-node-adapter"] Transport node ###-###-###-###-### connection is down
EVENT WrapperSimpleAppMain Main 1326362 - [nsx@6876 comp="nsx-controller" level="EVENT" subcomp="main"] CCP process started
INFO Owl-worker-20 TnDisconnectionAlarmReporterImpl 3963984 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="tn-disconnection-alarm"] Transport node ###-###-###-###-### with type EDGE_NODE is connected. Cancel all pending timers and reset connection statuses.
INFO Owl-worker-20 TnDisconnectionAlarmReporterImpl 3963984 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="tn-disconnection-alarm"] Transport node ###-###-###-###-### with type ESXI is connected. Cancel all pending timers and reset connection statuses.

On ESXi Transport nodes, log lines similar to the following appear in /var/run/log/nsx-syslog.log:

cfgAgent[2102399]: NSX 2102399 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="4AB1A700" level="info"] DVSPropConvertor::OnDaemonHealthStateUpdate: CCPSessionState changed to down
[...]
cfgAgent[2102399]: NSX 2102399 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="4AB1A700" level="info"] DVSPropConvertor::OnDaemonHealthStateUpdate: CCPSessionState changed to up

On NSX Edge Transport nodes, log lines similar to the following appear in /var/log/syslog:

NSX 4025 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" tid="4025" level="INFO"] Write ccp session message to nestdb ccp_id {#012 ###-###-###-###-####012}#012ip {#012 ipv4: #####012}#012server_port: 1235#012fqdn: #####012state: DISCONNECTED#012master: true#012failure_reason: CONNECTION_REFUSED

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
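
To check a collected log file for the signatures listed above, a small script can count how often each one appears. The following is a minimal sketch and not a supported tool: the marker strings are copied from the excerpts in this article, while the function names and the marker labels (for example ccp_restart) are hypothetical names chosen for illustration.

```python
# Marker substrings copied from the log excerpts in this article.
# The dictionary keys are hypothetical labels, not NSX terminology.
MARKERS = {
    "corfu_system_down": "Invoking the systemDownHandler",
    "ccp_restart": "Restart CCP after Corfu system down",
    "ccp_started": "CCP process started",
    "tn_down": "connection is down",
    "esxi_session_down": "CCPSessionState changed to down",
    "esxi_session_up": "CCPSessionState changed to up",
    "edge_disconnected": "state: DISCONNECTED",
}


def classify(line):
    """Return the labels of all markers whose text occurs in the line."""
    return [label for label, text in MARKERS.items() if text in line]


def summarize(lines):
    """Count marker occurrences over an iterable of log lines."""
    counts = {label: 0 for label in MARKERS}
    for line in lines:
        for label in classify(line):
            counts[label] += 1
    return counts
```

To use it, open a copy of the relevant log (for example nsx-ccp.log from an NSX Manager) with open(path, errors="replace") and pass the file object to summarize(). A non-zero ccp_restart count on a Manager together with matching esxi_session_down and esxi_session_up counts on the hosts is consistent with the behavior described here, but it does not confirm the cause on its own.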

Environment

VMware NSX-T Data Center 3.x
VMware NSX

Cause

This issue may occur on one or more NSX Managers when there are intermittent infrastructure issues.
In NSX versions 3.2.x and 4.x up to 4.2.1, the CCP (Central Control Plane) tolerates a loss of connectivity to the Corfu database for 20 seconds.
If intermittent infrastructure issues between the CCP and the Corfu DB last longer than this 20-second limit, the Transport Nodes attached to that NSX Manager node are re-registered with other available NSX Managers and re-sharding is performed automatically.
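
The version range named in this section can be checked programmatically, for example when auditing several NSX deployments. The following is a minimal sketch: the function names are hypothetical, and it assumes that version strings are dot-separated integers and that only the first three components matter, so a build such as 4.2.1.1 is treated like 4.2.1. It follows the 3.2.x and 4.x to 4.2.1 range stated in this section.

```python
def parse_version(text):
    """Turn '4.2.1' or '4.2.1.0' into a tuple of ints, e.g. (4, 2, 1, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def in_affected_range(version_text):
    """Return True if the version falls in 3.2.x or in 4.0.0 through 4.2.1.

    Assumption: only major.minor.patch is compared; missing components
    are padded with zeros and extra components are ignored.
    """
    version = parse_version(version_text)
    major_minor_patch = (version + (0, 0, 0))[:3]
    if major_minor_patch[:2] == (3, 2):
        return True
    return (4, 0, 0) <= major_minor_patch <= (4, 2, 1)
```

For example, in_affected_range("4.1.0.2") is True, while in_affected_range("4.2.2") is False.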

Resolution

This issue is resolved in VMware NSX 4.2.2 available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

There is no lasting impact to the environment: the Controller service may restart automatically after losing its connection to the Corfu DB for 20 seconds (due to an appliance infrastructure issue). The connection between Transport nodes and the controller may be temporarily impacted and will recover automatically.