NSX Edge node in Unknown state due to ESXi host failure

Products

VMware NSX

Issue/Introduction

The standby Edge node transitioned to an Unknown state, causing a redundancy failure for the NSX native load balancer hosted on the T1 gateway.
The affected T1 gateway has lost redundancy and currently operates as a single point of failure.
The ESXi host running the Edge VM has transitioned to a Not Responding state due to a hardware failure.
vSphere High Availability (HA) is turned off on the underlying compute cluster.

Controller connectivity still shows as connected, as shown below:

[root@esxi:~] nsxcli -c get controllers
Wed Sep 18 2024 UTC 06:10:42.545
 Controller IP    Port     SSL         Status       Is Physical Master   Session State  Controller FQDN
  ###.###.###.42     1235   enabled      not used            false              null              NA
  ###.###.###.40     1235   enabled     connected             true               up               NA
  ###.###.###.41     1235   enabled      not used            false              null              NA

Connectivity on the required ports between the Edge and the Manager, and between the transport node and the Manager, is working, based on the output of the nc -z command.
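
For environments where nc is not available, the same reachability test can be approximated with a short script. This is a minimal sketch, not a supported tool: the function names are hypothetical, and the host and port list must be supplied by the operator (the controller output above shows port 1235; any other ports are assumptions to be confirmed against the NSX ports documentation).

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Roughly equivalent to `nc -z host port`: the connection is opened
    and immediately closed, no data is sent.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host: str, ports, timeout: float = 3.0) -> dict:
    """Map each port in `ports` to its reachability result."""
    return {port: port_open(host, port, timeout) for port in ports}
```

A call such as check_ports("<nsx-mgr>", [1235]) returns a dictionary of port to True/False. A successful result only proves TCP reachability; it does not prove that the heartbeat between the node and the Manager is healthy.
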
The API response (GET https://<nsx-mgr>/api/v1/transport-nodes/status) also reports the status of the affected nodes as UNKNOWN, as shown below:

API for a specific transport node (TN): GET https://localhost/api/v1/transport-nodes/<TN_UUID>/status

{
  "node_uuid" : "########-####-####-####-####",
  "node_display_name" : "transport_node.example.com",
  "status" : "UNKNOWN",
  "pnic_status" : {
    "status" : "UNKNOWN",
    "up_count" : 0,
    "down_count" : 0,
    "degraded_count" : 0
  },
  "mgmt_connection_status" : "UP",
  "control_connection_status" : {
    "status" : "UNKNOWN",
    "up_count" : 0,
    "down_count" : 0,
    "degraded_count" : 0
  },
  "tunnel_status" : {
  "status" : "UNKNOWN",
    "up_count" : 0,
    "down_count" : 0
  },

  "node_status" : {
    "last_heartbeat_timestamp" : ###########,
    "mpa_connectivity_status" : "UP",
    "mpa_connectivity_status_details" : "Client is responding to heartbeats",
    "lcp_connectivity_status" : "UNKNOWN",
    "lcp_connectivity_status_details" : [ ],

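The pattern in the response above (overall status UNKNOWN while the management connection is still UP) can be checked programmatically once the JSON has been retrieved. The following is a rough sketch under stated assumptions: the function names are hypothetical, only fields visible in the excerpt above are read, and the payload is assumed to have already been fetched and decoded into a Python dictionary.

```python
def summarize_status(payload: dict) -> dict:
    """Pick out the fields from a transport node status payload
    that are relevant to this issue."""
    node_status = payload.get("node_status", {})
    return {
        "node": payload.get("node_display_name"),
        "status": payload.get("status"),
        "mgmt": payload.get("mgmt_connection_status"),
        "mpa": node_status.get("mpa_connectivity_status"),
        "lcp": node_status.get("lcp_connectivity_status"),
    }


def matches_symptom(payload: dict) -> bool:
    """True when the payload shows the combination seen in this article:
    overall and LCP status UNKNOWN while the management connection is UP."""
    s = summarize_status(payload)
    return (
        s["status"] == "UNKNOWN"
        and s["mgmt"] == "UP"
        and s["lcp"] == "UNKNOWN"
    )
```

A match is only an indicator that the node resembles the case described here; the state of the underlying ESXi host still needs to be confirmed in vCenter.
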
The NSX Manager log /var/log/proton/nsxapi.log contains entries similar to the following example:

nsxapi.log:[TIMESTAMP]  INFO UfoIndexer-search_manager-0 AggTnStatusQueriesImpl 4305 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] node ########-####-####-####-#### heartbeat timeout, current [EPOCH_TIME], ccp [EPOCH_TIME], interval 360000 in milliseconds, isExpired:true

nsxapi.log:[TIMESTAMP]  INFO http-nio-127.0.0.1-7440-exec-982 AggTnStatusQueriesImpl 4305 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-####" subcomp="manager" username="UC"] node ########-####-####-####-#### heartbeat timeout, current [EPOCH_TIME], ccp [EPOCH_TIME], interval 360000 in milliseconds, isExpired:true
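
To find which nodes are affected, the heartbeat timeout entries can be extracted from nsxapi.log with a small parser. This is an illustrative sketch that assumes the message layout shown in the two examples above; the regular expression and function name are hypothetical, and the layout may differ between NSX versions.

```python
import re
from typing import Optional

# Matches the tail of the "heartbeat timeout" message shown above.
HEARTBEAT_RE = re.compile(
    r"node (?P<node>\S+) heartbeat timeout, "
    r"current (?P<current>\S+), ccp (?P<ccp>\S+), "
    r"interval (?P<interval_ms>\d+) in milliseconds, "
    r"isExpired:(?P<expired>true|false)"
)


def parse_heartbeat_timeout(line: str) -> Optional[dict]:
    """Return the node ID, timestamps, interval and expiry flag from a
    heartbeat timeout log line, or None if the line does not match."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    return {
        "node": m["node"],
        "current": m["current"],
        "ccp": m["ccp"],
        "interval_s": int(m["interval_ms"]) / 1000,  # 360000 ms -> 360.0 s
        "expired": m["expired"] == "true",
    }
```

Applied to each line of the log, this yields one record per timeout message; collecting the distinct "node" values gives the list of transport nodes whose heartbeat has expired.
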

Environment

VMware NSX

Cause

An ESXi host failure occurred while vSphere HA was disabled on the cluster.
This configuration prevented the Edge virtual machine from automatically restarting on a healthy host.

Resolution

Recommendation

It is highly recommended to enable vSphere HA so that the NSX Edge node can be restarted on a healthy ESXi host in the cluster, avoiding Edge failures.
vSphere HA should always be enabled for Edges running services, even in conjunction with an Active/Active Tier-0 gateway, because it allows the lost capacity to be recovered and the standby SRs to be redeployed before the standby relocation timeout expires.

Design Guide: Page 386, Section 7.6.3.1.3

Resolution

Recover the ESXi host that is currently in an unreachable state.
Validate that vSphere High Availability (HA) is enabled on the underlying compute cluster to ensure the Edge virtual machine restarts automatically during an ESXi host failure.
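
The HA validation can also be scripted. The sketch below assumes a cluster object has already been retrieved through pyVmomi (a vim.ClusterComputeResource); it reads the configurationEx.dasConfig.enabled property, which holds the vSphere HA setting. The helper name is hypothetical, and the stand-in objects in the demo exist only so the example runs without a vCenter connection.

```python
from types import SimpleNamespace


def ha_enabled(cluster) -> bool:
    """Return True if vSphere HA is enabled on the given cluster object.

    `cluster` is expected to expose configurationEx.dasConfig.enabled,
    as a pyVmomi vim.ClusterComputeResource does. A missing or unset
    value is treated as "not enabled".
    """
    das = getattr(cluster.configurationEx, "dasConfig", None)
    return bool(das is not None and das.enabled)


if __name__ == "__main__":
    # Stand-in for a real cluster object (assumption, for demonstration only).
    demo = SimpleNamespace(
        configurationEx=SimpleNamespace(dasConfig=SimpleNamespace(enabled=False))
    )
    print("vSphere HA enabled:", ha_enabled(demo))
```

Running such a check against every cluster that hosts Edge VMs helps catch a disabled HA configuration before a host failure exposes it.
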

Additional Information

Reference:

Refer to the Design Guide for best practices: Design Guide