Host Transport Node Status: Agent Degraded showing nsx_vdl2 being degraded.

Article ID: 411995

Products

VMware NSX

Issue/Introduction

  • In the NSX UI, on the System > Fabric > Hosts page, one or more ESX transport nodes are in a degraded state.
  • Drilling down into the degraded host shows that the NSX_VDL2 agent is in a degraded state.
  • You see messages similar to the following in the /var/log/proton/nsxapi.log file on the NSX Manager node:
    2025-10-25T22:15:32.658Z  INFO ShaVerticalMessageDispatcher1 ShaVerticalMessageHandler 5988 MONITORING [nsx@4413 comp="nsx-manager" level="INFO" subcomp="manager"] receive agent data AgentStatusAggregationModel{TransportNode =######-####-####-####-########...
    AgentInfo{AgentName=NSX_VDL2, Status=DEGRADED, errorDetail=All bfd tunnels are down for some of vteps., connectionStatus=null,

  • You see messages similar to the following logged by the nsx-sha agent on the affected ESX host:
    2025-10-31T16:32:30Z In(182) nsx-sha: NSX 2100889 - [nsx@4413 comp="nsx-esx" subcomp="sha" username="root" level="INFO"] Dhreport - reporting, tn_agent_status:{"oid": {"left":"########288277069942","right":"########022311121664"},"status":[{"type":"TN_AGENT","tn_agent_status":{"threat_state":{},"agent_state":{"total_status":"DEGRADED","up_count":17,"down_count":0,"agent_status":[{"type":"NSX_ENS","status":"UP","health_metrics":[{"metric_name":"last_status_update_time","value":"1761928322713","type":"TIMESTAMP","unit":"MILLISECOND"},{"metric_name":"uptime","value":"2495020174","type":"INTEGER","unit":"MILLISECOND"}]},{"type":"NSX_VDR","status":"UP","health_metrics":[{"metric_name":"last_status_update_time","value":"1761928322713","type":"TIMESTAMP","unit":"MILLISECOND"},{"metric_name":"uptime","value":"2495021408","type":"INTEGER","unit":"MILLISECOND"}]},{"type":"NSX_VDL2","status":"DEGRADED","error_detail":"All bfd tunnels are down for some of vteps."
  • You see messages similar to the following in the /var/run/log/vmkernel.log file on the affected ESX host:
    2025-10-28T22:04:10.183Z In(182) vmkernel: cpu48:2098448)VDL2isVmknicBfdWait:4385:[nsx@4413 comp="nsx-esx" subcomp="vdl2-24952112"][vmknic:vmk11 ID:0 switch:DvsPortset-0] state change to BFD down
    2025-10-28T22:04:10.183Z In(182) vmkernel: cpu48:2098448)VDL2SetVmknicBfdDown:1736:[nsx@4413 comp="nsx-esx" subcomp="vdl2-24952112"][vmknic:vmk10 ID:0 switch:DvsPortset-0] old state : 3, vmknic down
  • There may be VMs running on the affected host without issue.
  • The host is configured with more than one Tunnel Endpoint (TEP) address.
  • Checking the status of the TEP interfaces on the host shows that one of them is in a BFD_DOWN state (a combined quick check of these symptoms is sketched after this list):
    net-vdl2 -l -s <vSwitch name>
    ...
                VTEP Interface: vmk11
                        DVPort ID:              05f0af3f-####-####-####-5c429122df85
                        Switch Port ID:         ########
                        Endpoint ID:            1
                        VLAN ID:                110
                        Label:                  #####
                        Uplink Port ID:         ##########
                        Is Uplink Port LAG:     No
                        IP:                     192.168.10.9
                        Netmask:                255.255.255.224
                        Segment ID:             192.168.10.0
                        IPv6:                   ::
                        Prefix Length:          0
                        Segment ID6:            ::
                        GW IP:                  192.168.10.1
                        GW MAC:                 00:50:56:0d:00:0a
                        GW V6 IP:               ::
                        GW V6 MAC:              00:00:00:00:00:00
                        IP Acquire Timeout:     0
                        IPv6 Acquire Timeout:   0
                        Multicast Group Count:  0
                        Is DRVTEP:              No
                        State4:                 DOWN : BFD_DOWN/VALID_IP
                        State6:                 DOWN : INIT/NO_IP
    ...
  • Checking the BFD sessions on the host shows that there are no sessions associated with the vmk# interface that is in the BFD_DOWN state:
    nsxdp-cli bfd sessions list
    Remote                        Local                         local_disc          remote_disc         recvd               sent      local_state         local_diag                              client              flaps               bfd_type
    192.168.10.69                 192.168.10.3                  3d723636            ceecb168            265823              260781      up                  No Diagnostic                           vdl2                9                   Tunnel
    192.168.10.67                 192.168.10.3                  eb46069b            d932cc7d            265814              260713      up                  No Diagnostic                           vdl2                11                  Tunnel
    192.168.10.66                 192.168.10.3                  6385437d            f1fc09d0            265825              260662      up                  No Diagnostic                           vdl2                9                   Tunnel
    192.168.10.68                 192.168.10.3                  65247fe0            91492889            265794              260765      up                  No Diagnostic                           vdl2                9                   Tunnel
    Note: The sessions noted in this output are for a vmk# with IP address 192.168.10.3. In this example, vmk11 is the interface with the BFD_DOWN status and has IP address 192.168.10.9. This interface has no sessions shown here.

  • Testing communication over the vmk# in question shows that it is able to communicate with other TEP addresses in the cluster:
    vmkping -I vmk11 -S vxlan -d -s 1572 192.168.10.7
    PING 192.168.10.7 (192.168.10.7): 1572 data bytes
    1580 bytes from 192.168.10.7: icmp_seq=0 ttl=64 time=1.920 ms
    1580 bytes from 192.168.10.7: icmp_seq=1 ttl=64 time=0.741 ms
    1580 bytes from 192.168.10.7: icmp_seq=2 ttl=64 time=0.772 ms
  • Performing a packet capture on the suspect interface shows that no BFD traffic is being sent or received. For example, running a command similar to the following on the affected ESX host produces no output: pktcap-uw --vmk vmk11 --dir 2 -o - | tcpdump-uw -nnevv -r - 'udp port 3784'
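
The individual checks above can be combined into a quick verification. The following is a minimal sketch that only reuses the commands already shown in this article; it assumes the suspect interface is vmk11 with TEP IP 192.168.10.9, so substitute the switch name, interface, and IP address from your environment.

    # Show each VTEP interface and its IPv4 state on the switch
    net-vdl2 -l -s <vSwitch name> | grep -E "VTEP Interface|State4"

    # List BFD sessions whose local address is the suspect TEP IP.
    # No output here, combined with a "DOWN : BFD_DOWN" State4 above,
    # matches the symptoms of this issue.
    nsxdp-cli bfd sessions list | grep "192.168.10.9"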

Environment

VMware NSX

Cause

This issue occurs when there is a temporary interruption in BFD traffic between transport nodes. After the underlying cause of the interruption is resolved, the affected TEP on the host is left in the "All bfd tunnels are down..." state indefinitely. VMs may continue to run on the host without issue because they are all using the host's other TEP, which is not affected.
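
As an additional check, the manager's current view of the host's agent status can be queried through the NSX API. The sketch below is illustrative only: <nsx-manager>, <admin-user>, <password>, and <transport-node-uuid> are placeholders, and the exact API path and response fields can vary between NSX versions.

    # Query the reported status of the transport node, including its agents
    curl -k -u '<admin-user>:<password>' \
        "https://<nsx-manager>/api/v1/transport-nodes/<transport-node-uuid>/status"

A response that still reports NSX_VDL2 as DEGRADED with the "All bfd tunnels are down..." error detail, even though traffic over the TEP is otherwise working, is consistent with the stale state described above.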

Resolution

This is a known issue affecting VMware NSX and vSphere ESX hosts prepared as NSX transport nodes. There is currently no resolution.

To work around this issue, a VM must establish a new connection to an NSX-backed segment on the affected host. This can be accomplished via one of the following options:

  • Migrate an existing VM that is using an NSX-backed segment for its networking to the host.
  • Create a dummy VM on the host, configure it to use an NSX-backed segment for its network, and power it on.
  • Power cycle an existing VM on the host that is using an NSX-backed segment for its networking.

Note: This may not immediately resolve the issue, as the new connection is not guaranteed to be placed on the affected TEP interface. If the status does not recover, repeat the process until the affected TEP is no longer in the BFD_DOWN state. A command-line sketch of the power-cycle option is shown below.
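
As a reference for the power-cycle option, the following is a minimal sketch run from the affected host's ESX shell; <vmid> is a placeholder taken from the getallvms output, and performing the same power cycle from the vSphere Client is equally valid.

    # List the VMs registered on this host and note the vmid of a VM
    # attached to an NSX-backed segment
    vim-cmd vmsvc/getallvms

    # Hard power cycle that VM (power off, then power back on)
    vim-cmd vmsvc/power.off <vmid>
    vim-cmd vmsvc/power.on <vmid>

    # Re-check the affected VTEP; repeat with another VM if State4 still
    # shows DOWN : BFD_DOWN
    net-vdl2 -l -s <vSwitch name> | grep -E "VTEP Interface|State4"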

Additional Information

Alternatively, the affected host could be placed into maintenance mode and rebooted.
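
If that approach is taken, the host must be evacuated first. The following is a sketch of the reboot from the ESX shell, assuming all running VMs have already been migrated off the host (entering maintenance mode and rebooting from the vSphere Client is equally valid):

    # Enter maintenance mode (the host must have no running VMs)
    esxcli system maintenanceMode set --enable true

    # Reboot the host
    esxcli system shutdown reboot --reason "NSX_VDL2 TEP stuck in BFD_DOWN"

    # After the host comes back up, exit maintenance mode
    esxcli system maintenanceMode set --enable false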