The standby Edge node transitioned to an unknown state, causing a redundancy failure for the NSX native load balancer hosted on the T1 gateway.
The affected T1 gateway lost redundancy and currently operates as a single point of failure.
Controller connectivity consistently shows as connected, as in the output below:
[root@esxi:~] nsxcli -c get controllers
Wed Sep 18 2024 UTC 06:10:42.545
Controller IP    Port  SSL      Status     Is Physical Master  Session State  Controller FQDN
###.###.###.42   1235  enabled  not used   false               null           NA
###.###.###.40   1235  enabled  connected  true                up             NA
###.###.###.41   1235  enabled  not used   false               null           NA
{
  "node_uuid" : "########-####-####-####-####",
  "node_display_name" : "transport_node.example.com",
  "status" : "UNKNOWN",
  "pnic_status" : {
    "status" : "UNKNOWN",
    "up_count" : 0,
    "down_count" : 0,
    "degraded_count" : 0
  },
  "mgmt_connection_status" : "UP",
  "control_connection_status" : {
    "status" : "UNKNOWN",
    "up_count" : 0,
    "down_count" : 0,
    "degraded_count" : 0
  },
  "tunnel_status" : {
    "status" : "UNKNOWN",
    "up_count" : 0,
    "down_count" : 0
  },
  "node_status" : {
    "last_heartbeat_timestamp" : ###########,
    "mpa_connectivity_status" : "UP",
    "mpa_connectivity_status_details" : "Client is responding to heartbeats",
    "lcp_connectivity_status" : "UNKNOWN",
    "lcp_connectivity_status_details" : [ ],
nsxapi.log:[TIMESTAMP] INFO UfoIndexer-search_manager-0 AggTnStatusQueriesImpl 4305 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] node ########-####-####-####-#### heartbeat timeout, current [EPOCH_TIME], ccp [EPOCH_TIME], interval 360000 in milliseconds, isExpired:true
nsxapi.log:[TIMESTAMP] INFO http-nio-127.0.0.1-7440-exec-982 AggTnStatusQueriesImpl 4305 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-####" subcomp="manager" username="UC"] node ########-####-####-####-#### heartbeat timeout, current [EPOCH_TIME], ccp [EPOCH_TIME], interval 360000 in milliseconds, isExpired:true
VMware NSX
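The status payload and log lines above can be checked programmatically. The following is a minimal Python sketch, not an official tool: the function names `matches_symptom` and `heartbeat_expired` are hypothetical, the sample values are placeholders, and the expiry rule (current time minus the last CCP heartbeat exceeding the logged interval) is an assumption inferred from the log wording rather than a documented formula.

```python
import json

# Interval reported in the nsxapi.log lines: 360000 ms, i.e. 6 minutes.
HEARTBEAT_INTERVAL_MS = 360000


def matches_symptom(status: dict) -> bool:
    """Return True when a transport node status payload matches the pattern
    shown above: overall and LCP status UNKNOWN while the management and
    MPA connections still report UP."""
    node = status.get("node_status", {})
    return (
        status.get("status") == "UNKNOWN"
        and status.get("mgmt_connection_status") == "UP"
        and node.get("mpa_connectivity_status") == "UP"
        and node.get("lcp_connectivity_status") == "UNKNOWN"
    )


def heartbeat_expired(current_ms: int, ccp_ms: int,
                      interval_ms: int = HEARTBEAT_INTERVAL_MS) -> bool:
    """Assumed rule: the heartbeat is expired when the last CCP heartbeat
    is older than the interval. Both timestamps are epoch milliseconds."""
    return current_ms - ccp_ms > interval_ms


# Placeholder payload containing only the fields the check reads.
sample = json.loads("""
{
  "status": "UNKNOWN",
  "mgmt_connection_status": "UP",
  "node_status": {
    "mpa_connectivity_status": "UP",
    "lcp_connectivity_status": "UNKNOWN"
  }
}
""")

if __name__ == "__main__":
    print("matches symptom:", matches_symptom(sample))
    # Hypothetical timestamps 400000 ms apart, which exceeds the interval.
    print("heartbeat expired:", heartbeat_expired(1_000_000, 600_000))
```

A payload that matches this pattern while the ESXi host still reports a connected controller points at the Edge node itself rather than at controller reachability.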
It is highly recommended to enable vSphere HA so that the NSX Edge node can be restarted on a healthy ESXi host in the cluster, which avoids Edge failures.
vSphere HA should always be enabled for Edges running services, even in conjunction with an Active/Active Tier-0 gateway, because it allows the lost capacity to be recovered and the standby SRs to be redeployed before the standby relocation timeout kicks in.
Design Guide: Page 386, Section 7.6.3.1.3
Recover the ESXi host that is currently in an unreachable state.
Validate that vSphere High Availability (HA) is enabled on the underlying compute cluster to ensure the Edge virtual machine restarts automatically during an ESXi host failure.
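The HA validation step can be scripted. The sketch below assumes a cluster object shaped like the one pyVmomi returns for a vSphere cluster, where `configurationEx.dasConfig.enabled` carries the HA setting. The helper name `ha_enabled` is hypothetical, and the `SimpleNamespace` objects are stand-ins so the example runs without a vCenter connection.

```python
from types import SimpleNamespace


def ha_enabled(cluster) -> bool:
    """Return True when vSphere HA is enabled on the given cluster object.

    Assumes the object exposes configurationEx.dasConfig.enabled, as a
    pyVmomi cluster object does. A missing or unset value is treated as
    'not enabled'."""
    config = getattr(cluster, "configurationEx", None)
    das = getattr(config, "dasConfig", None)
    return bool(getattr(das, "enabled", False))


# Stand-in objects for illustration only; a real check would pass the
# cluster object retrieved from vCenter.
cluster_with_ha = SimpleNamespace(
    configurationEx=SimpleNamespace(dasConfig=SimpleNamespace(enabled=True))
)
cluster_without_ha = SimpleNamespace(
    configurationEx=SimpleNamespace(dasConfig=SimpleNamespace(enabled=None))
)

if __name__ == "__main__":
    for name, cluster in (("with HA", cluster_with_ha),
                          ("without HA", cluster_without_ha)):
        print(name, "->", ha_enabled(cluster))
```

Treating an unset value as disabled keeps the check conservative: a cluster is reported as protected only when HA is explicitly turned on.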