"All members of failure domain are down" alarm gets triggered and resolved by itself



Article ID: 381000


Updated On:

Products

VMware NSX

Issue/Introduction

The following alarm is raised repeatedly in the NSX UI and resolves on its own shortly afterward. The Edge node syslog shows the alarm turning on and then off:

2024-10-20T13:33:51.159Z edge.local NSX 5829 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="CRITICAL" eventFeatureName="edge_health" eventType="failure_domain_down" eventSev="critical" eventState="On" entId="2211cef2-xxxx-xxxx-xxxx-7d98603929e2"] All members of failure domain 2211cef2-xxxx-xxxx-xxxx-7d98603929e2 are down.
2024-10-20T13:34:52.644Z edge.local NSX 5829 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="CRITICAL" eventFeatureName="edge_health" eventType="failure_domain_down" eventSev="critical" eventState="Off" entId="2211cef2-xxxx-xxxx-xxxx-7d98603929e2"] All members of failure domain 2211cef2-xxxx-xxxx-xxxx-7d98603929e2 are reachable.
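To see how often the alarm flaps and how long each occurrence lasts, the On/Off events can be paired up from an exported syslog. The following is a minimal Python sketch, not an NSX tool. It assumes the log lines follow the format of the two samples above. The function names (`parse_ts`, `alarm_flaps`) are hypothetical and chosen for illustration only.

```python
import re
from datetime import datetime

# Matches the timestamp, eventState and entId of failure_domain_down events,
# assuming the field order seen in the sample log lines.
EVENT_RE = re.compile(
    r'^(?P<ts>\S+)\s.*eventType="failure_domain_down".*'
    r'eventState="(?P<state>On|Off)".*entId="(?P<ent>[^"]+)"'
)


def parse_ts(ts):
    # Sample timestamps look like 2024-10-20T13:33:51.159Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def alarm_flaps(lines):
    """Return (entId, raised_at, cleared_at, seconds) for each On -> Off pair."""
    open_alarms = {}  # entId -> time the alarm was raised
    flaps = []
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts = parse_ts(m.group("ts"))
        ent = m.group("ent")
        if m.group("state") == "On":
            open_alarms[ent] = ts
        elif ent in open_alarms:
            raised = open_alarms.pop(ent)
            flaps.append((ent, raised, ts, (ts - raised).total_seconds()))
    return flaps


if __name__ == "__main__":
    import sys

    # Usage: python alarm_flaps.py < exported_syslog.txt
    for ent, raised, cleared, seconds in alarm_flaps(sys.stdin):
        print(f"{ent}: raised {raised}, cleared {cleared}, lasted {seconds:.1f}s")
```

Many short-lived On/Off pairs for the same failure domain are consistent with the intermittent connectivity described in this article, rather than with a sustained Edge outage.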

 

Environment

VMware NSX 4.x

Cause

The alarm is raised because the Edge node repeatedly loses its network connection to the NSX Manager and then reconnects.

 

NSX manager controller logs:
------------------------
2024-10-22T16:31:58.515Z  INFO CCP-5ae7a2dc-xxxx-xxxx-xxxx-d88c3d6747e0:worker-2 NettyConnection 1297606 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Connection closed received NettyConnection(NettyChannel(local=xx.xx.xx.xx:1235, remote=xx.xx.xx.xx:58231), active=false)

2024-10-22T16:31:58.516Z  INFO nsx-rpc:CCP-5ae7a2dc-xxxx-xxxx-xxxx-d88c3d6747e0:user-executor-3 VersionMastershipServiceImpl 1297606 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="handshake-server"] closeStream id b0de2395-xxxx-xxxx-xxxx-526aa3e9defe status Status(code=COMMUNICATION_ERROR, msg=null)
------------------------


Edge node logs:
------------------------
Connection to MP through 1234:

2024-10-22T14:07:35.919Z gldc13-edge1.danfoss.local NSX 5211 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="5967" level="INFO"] RpcConnection[4455 Closed to ssl://xx.xx.xx.xx:1234 0] Notifying channels on connection down (network error)

Connection to CCP through 1235:
2024-10-22T15:01:17.993Z gldc13-edge1.danfoss.local NSX 5211 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="5967" level="INFO"] RpcConnection[4572 Closed to ssl://xx.xx.xx.xx:1235 0] Notifying channels on connection down (network error)
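To correlate the alarm with connection drops, the Edge node log can be scanned for the "connection down" messages shown above and grouped by port. The following is a minimal Python sketch that assumes the message format of the two sample lines. The port-to-role mapping (1234 for MP, 1235 for CCP) is taken from this article; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches nsx-proxy "connection down" messages in the format shown above.
DROP_RE = re.compile(
    r'^(?P<ts>\S+)\s.*RpcConnection\[\d+ Closed to '
    r'ssl://(?P<host>[^:\s]+):(?P<port>\d+) \d+\] '
    r'Notifying channels on connection down \((?P<reason>[^)]*)\)'
)

# Port roles as described in this article.
PORT_ROLE = {"1234": "MP", "1235": "CCP"}


def connection_drops(lines):
    """Return (timestamp, role, port, reason) for each connection-down message."""
    drops = []
    for line in lines:
        m = DROP_RE.search(line)
        if m:
            role = PORT_ROLE.get(m.group("port"), "unknown")
            drops.append((m.group("ts"), role, m.group("port"), m.group("reason")))
    return drops


def drops_per_role(lines):
    """Count connection-down messages per role (MP, CCP, unknown)."""
    return Counter(role for _, role, _, _ in connection_drops(lines))


if __name__ == "__main__":
    import sys

    # Usage: python connection_drops.py < edge_syslog.txt
    for role, count in drops_per_role(sys.stdin).items():
        print(f"{role}: {count} connection drop(s)")
```

Drops on both ports with the reason "network error" suggest a problem on the path between the Edge node and the NSX Manager rather than in a single NSX service.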

Resolution

This is not an NSX issue. Stabilize the underlying infrastructure network between the Edge node and the NSX Manager.
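While troubleshooting the network path, it can help to probe the two ports seen in the logs above (1234 and 1235) from a host on the same management network as the Edge node. The following is a minimal Python sketch using only the standard library. The manager address is a placeholder, the function names are hypothetical, and a plain TCP connect only shows that the port is reachable at that moment. Because the problem is intermittent, the probe needs to be repeated over time to be meaningful.

```python
import socket

# Ports and roles as they appear in the Edge node logs in this article.
NSX_PORTS = {1234: "MP", 1235: "CCP"}


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_manager(host, ports=NSX_PORTS, timeout=3.0):
    """Return {port: reachable} for each port in ports."""
    return {port: port_reachable(host, port, timeout) for port in ports}


if __name__ == "__main__":
    import sys
    import time

    # Usage: python check_ports.py <manager-address>
    manager = sys.argv[1]
    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        for port, ok in check_manager(manager).items():
            state = "reachable" if ok else "UNREACHABLE"
            print(f"{stamp} {manager}:{port} ({NSX_PORTS[port]}) {state}")
        time.sleep(10)
```

Timestamps of failed probes can then be compared with the alarm and connection-down timestamps to confirm that they coincide.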