NSX-Edge Failure Domain down alarm seen on NSX manager

Article ID: 401122

Products

VMware NSX

Issue/Introduction

Failure domain down, NSX Edge health, and communication alarms are seen in the NSX GUI. The NSX Edge syslog (/var/log/syslog) shows the following entries, which indicate a failure to connect to the controllers:

####-##-##T##:##:##.###Z <Edge-hostname> NSX 3503 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="CRITICAL" eventFeatureName="edge_health" eventType="failure_domain_down" eventSev="critical" eventState="On" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"] All members of failure domain xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx are down.

####-##-##T##:##:##.###Z <Edge-hostname> NSX 3173 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="nsx-net" tid="5045" level="WARNING"] StreamConnection[11205 Connecting to ssl://<NSX-Manager IP>:1235 sid:11205] Couldn't connect to 'ssl://<NSX-Manager IP>:1235' (error: 110-Connection timed out)

####-##-##T##:##:##.###Z <Edge-hostname> NSX 3173 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="nsx-net" tid="5045" level="WARNING"] StreamConnection[11205 Error to ssl://<NSX-Manager IP>:1235 sid:-1] Error 110-Connection timed out

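On the NSX Edge, the output of "get controllers" shows the status as "disconnected" and the session state as "down".

On the NSX Manager, the logs show the connection from the Edge transport node being accepted and then closed because the master controller for that transport node is a different node:
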
2025-08-06T04:01:47.624Z NSX-MGR NSX 82332 - [nsx@6876 comp="nsx-controller" level="EVENT" subcomp="handshake-server"] Accepts incoming connection from TN 1b01####-####-400d-####-eeb9####2c21
2025-08-06T04:01:47.624Z NSX-MGR NSX 82332 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="com.vmware.nsx.handshake.HandshakeState"] [1b01####-####-400d-####-eeb9####2c21, INIT, sId=4215####-####-4aa5-####-e7df####2c6c, nodeType=COMMON, , displayNodeType=Edge, , nodeName=Edge, false]: Moving to VERSION_CHECK_OK for TN 11b01####-####-400d-####-eeb9####2c21
2025-08-06T04:01:47.628Z NSX-MGR NSX 82332 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="handshake-server"] Close mastership for transport node 1b01####-####-400d-####-eeb9####2c21, controller is other node aaa0####-####-49df-####-25a6####9ded
2025-08-06T04:01:47.628Z NSX-MGR NSX 82332 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Endpoint 1b01####-####-400d-####-eeb9####2c21 de-registered
2025-08-06T04:01:47.629Z NSX-MGR NSX 82332 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="com.vmware.nsx.handshake.HandshakeState"] [1b01####-####-400d-####-eeb9####2c21, VERSION_CHECK_OK, sId=4215####-####-4aa5-####-e7df####2c6c, nodeType=COMMON, , displayNodeType=Edge, , nodeName=Edge, false]: Moving to INIT for TN 1b01####-####-400d-####-eeb9####2c21
2025-08-06T04:01:47.631Z NSX-MGR NSX 82332 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Connection closed received NettyConnection(NettyChannel(local=##.##.##.4:1235, remote=##.##.##.8:45980), active=false)
2025-08-06T04:01:47.631Z NSX-MGR NSX 82332 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] tcp:CCP-36de####-####-4fad-####-b55f####9888: Unregistering accepted NettyConnection(NettyChannel(local=##.##.##.4:1235, remote=##.##.##.8:45980), active=false) from its transport

Environment

VMware NSX

Cause

The NSX Edge cannot open the TCP session to the NSX Controller on port 1235 that it needs for its controller connection. The TCP connection may be in the CLOSE_WAIT state or show as NA.

Resolution

Restart the NSX Manager that manages the NSX Edge, and make sure that the master controller is reachable from the transport node.

Additional Information

To troubleshoot, run the commands below as root on the NSX Edge to identify the master CCP and to verify that a TCP connection to the NSX Manager on port 1235 succeeds.

# less /var/log/syslog* | grep -i "Master CCP is"

2025-08-05T18:48:14.023Z NSX-EDGE NSX 5205 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" tid="5205" level="INFO"] Master CCP is - aaa0####-####-49df-####-25a6####9ded. Attempting get new stub.

# nc -v <NSX manager IP> 1235 
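
The lookup of the master CCP can be scripted. The following is a minimal sketch, assuming the syslog line format shown in the sample above ("Master CCP is - <node ID>."). The function name latest_master_ccp is hypothetical and is not an NSX command.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX): read syslog lines on standard input
# and print the node ID from the most recent "Master CCP is - <id>." entry.
latest_master_ccp() {
    # Capture the text between "Master CCP is - " and the period that follows
    # it, then keep only the last match (the newest line in the input).
    sed -n 's/.*Master CCP is - \([^. ]*\)\..*/\1/p' | tail -n 1
}
```

For example, run "latest_master_ccp < /var/log/syslog" on the NSX Edge. Only the current syslog file is read here, because rotated files are not necessarily concatenated in chronological order. The printed ID identifies the manager node that the nc check above should be able to reach.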

On the NSX Manager, as root, run the command below to display the state of the TCP connections with the NSX Edge on port 1235:

# netstat -an | grep 1235 | grep <NSX Edge IP> 
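
When many sockets are listed, a per-state count makes a CLOSE_WAIT condition easier to spot. The sketch below assumes the usual Linux "netstat -an" column layout (protocol, receive queue, send queue, local address, foreign address, state). The function name ccp_socket_states is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -an` output on standard input and count
# TCP states for sockets whose local port is 1235 and whose peer is the Edge.
# $1 = NSX Edge IP address (supplied by the operator).
ccp_socket_states() {
    awk -v ip="$1" '
        # Assumed columns: proto recv-q send-q local-addr foreign-addr state
        $1 ~ /^tcp/ && $4 ~ /:1235$/ {
            addr = $5
            sub(/:[0-9]+$/, "", addr)          # drop the peer port
            # Accept plain IPv4 and IPv4-mapped IPv6 notation for the peer.
            if (addr == ip || addr == ("::ffff:" ip)) count[$6]++
        }
        END { for (s in count) print s, count[s] }
    ' | sort
}
```

For example, run "netstat -an | ccp_socket_states <NSX Edge IP>" on the NSX Manager. Each output line is a TCP state followed by the number of matching sockets. Empty output means that no socket between port 1235 and that Edge IP was found.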

After the NSX Manager comes back up following the reboot, run the "get controllers" command on the NSX Edge to check whether it can establish the controller connection with the NSX Managers. The status should show "connected" and the session state should show "up".

If the NSX Edge "Failure Domain Down" alarm appears together with Edge password expiration alarms, refer to KB: https://knowledge.broadcom.com/external/article/316121/

====
Note:

The NSX Controller-to-transport-node (TN) connection model requires each TN to have working network connectivity to all three manager nodes. TN sharding is determined on the NSX Manager side, on the assumption that there are no networking issues.

Changing the master controller node for a transport node that can reach only specific controller nodes has never been supported. When the Edge tries to connect to the other controller nodes, all they can do is forward the Edge to its master node, which it is unable to reach. This is expected behavior from the controller in this disconnected state.
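
Because the transport node needs connectivity to all three manager nodes, the nc check from the troubleshooting steps can be repeated for each of them. The following is a minimal sketch, assuming that the nc binary on the Edge supports the common -z (connect only) and -w (timeout) options. The function name check_ccp_port and the 5-second timeout are illustrative choices.

```shell
#!/bin/sh
# Hypothetical helper: test TCP port 1235 on every manager node given as an
# argument and print one result line per node. Returns non-zero if at least
# one node could not be reached.
check_ccp_port() {
    rc=0
    for mgr in "$@"; do
        # -z: connect without sending data; -w 5: give up after 5 seconds.
        if nc -z -w 5 "$mgr" 1235 >/dev/null 2>&1; then
            echo "$mgr:1235 reachable"
        else
            echo "$mgr:1235 UNREACHABLE"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, run "check_ccp_port <manager 1 IP> <manager 2 IP> <manager 3 IP>" as root on the NSX Edge. Any UNREACHABLE line points to a network path that must be fixed before the Edge can follow a redirect to its master controller.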