NSX transport nodes show controller connectivity down while manager connectivity remains up

Products

VMware NSX

Issue/Introduction

A subset of ESXi transport nodes report Controller Connectivity Down in the NSX Manager UI while Manager Connectivity stays Up and Configuration State shows Success. The state persists after the NSX services on the affected hosts are restarted.

Running the following command on an affected host shows the assigned controller in a disconnected state with no failure reason:

nsxcli -c "get controllers"

The output shows the controller marked as Physical Master with Status disconnected and Session State down, while the remaining controllers show as not used. The Failure Reason field is empty (NA), so the command provides no indication of why the session is down.

Manager connectivity from the same host is healthy. Running the following command shows all managers connected:

nsxcli -c "get managers"

TCP port 1235 is reachable from the affected host to all NSX Manager nodes, and the TLS handshake to the controller completes. The condition affects only the transport nodes whose assigned controller is one specific NSX Manager node. Transport nodes assigned to the other NSX Manager nodes are unaffected.

Environment

VMware Cloud Foundation 9.
VMware NSX 9.x

Cause

One NSX Manager node in a three-node cluster holds a sharding state that does not advance to ready for the affected transport nodes. The control plane on that node accepts the transport node connection, trusts the host certificate, and verifies versions, but the shard lookup does not complete.

On the affected NSX Manager node, the central control plane log records the following for each affected transport node, repeating continuously:

/var/log/cloudnet/nsx-ccp.log
Sharding table is not ready to check for TN <transport-node-id>
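To see which transport nodes the message refers to, and how often it repeats for each, the log can be summarized by transport node ID. The following is a minimal sketch, not a supported tool. The function name `summarize_sharding_tns` is hypothetical, and the sketch assumes the ID is the first whitespace-delimited token after `TN ` in each message.

```shell
#!/bin/sh
# Sketch: list the transport node IDs named in the "not ready" messages,
# with a count of how many times each one appears (most frequent first).
# Assumption: the ID is the first whitespace-delimited token after "TN ".

summarize_sharding_tns() {
    # Default to the log path named in this article; another path can be passed.
    log_file="${1:-/var/log/cloudnet/nsx-ccp.log}"
    grep "Sharding table is not ready to check for TN" "$log_file" \
        | sed 's/.*Sharding table is not ready to check for TN \([^ ]*\).*/\1/' \
        | sort | uniq -c | sort -rn
}
```

Calling `summarize_sharding_tns` with no argument on an NSX Manager node reads the default log path. A node that is not affected is expected to print nothing.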

This state can follow rapid restarts of control plane services on that NSX Manager node, where a brief down-and-up transition is not propagated to the other cluster members as a membership change. Because the cluster leader does not observe the member leaving and rejoining, it does not generate a new sharding update after the node rejoins. The affected node is left without a current sharding table, and its handshake component rejects the transport node sessions while the other two nodes remain healthy.

Resolution

Reboot the affected NSX Manager node to clear the stale sharding state and trigger a fresh sharding update once the node rejoins the cluster.

Identify the NSX Manager node logging the sharding messages. Search the central control plane log on each manager node:
```
grep "Sharding table is not ready to check for TN" /var/log/cloudnet/nsx-ccp.log
```
The node returning matches is the affected node. The other nodes return no matches.
Confirm cluster health from the NSX CLI on a healthy manager node before proceeding:
```
get cluster status
```
Reboot the affected NSX Manager node. The other two manager nodes stay online and maintain quorum during the reboot.
After the node boots, confirm all services report stable and the cluster shows all three nodes up:
```
get cluster status
```
Verify the affected hosts re-establish controller connectivity:
```
nsxcli -c "get controllers"
```
The previously affected controller now shows Status connected and Session State up. The hosts recover without any further action on the hosts themselves.
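When several hosts need to be checked, the verification can be scripted. The following is a minimal sketch under stated assumptions. The function name `controller_session_up` is hypothetical, and it assumes that the row for the active controller in the `get controllers` output contains the standalone words `connected` and `up`. It compares whole fields so that `disconnected` does not count as a match.

```shell
#!/bin/sh
# Sketch: decide from `get controllers` output whether a controller session is up.
# Assumption: the active controller's row contains the standalone fields
# "connected" and "up"; "disconnected" and "down" must not match.

controller_session_up() {
    # Reads command output on stdin; returns 0 if a connected/up row exists.
    awk '
        {
            has_conn = 0; has_up = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "connected") has_conn = 1
                if ($i == "up") has_up = 1
            }
            if (has_conn && has_up) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

On a host, the function would be fed from the command in the previous step, for example `nsxcli -c "get controllers" | controller_session_up && echo "controller session up"`. If the output format differs from the assumption, adjust the field checks before relying on the result.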

If the affected hosts do not recover after the manager node reboots and the cluster reports healthy, collect a support bundle from all NSX Manager nodes and the affected hosts while the issue is present, then open a support request with Broadcom. Capturing the bundles before any further service restarts preserves the central control plane and cluster membership logs needed to determine the underlying trigger.