Tunnels go down on an NSX Edge configured with multiple VTEPs after a single uplink on the hosting ESXi node is brought down.
Outage observed for North-South traffic affecting workload VMs.
Southbound traffic fails to switch over to the remaining active (UP) tunnels.
Edge configuration confirms TEP Group is disabled: "tn-tep-group-enabled": false.
Edge vmkernel logs (datapathd) indicate a 50% drop in active tunnels following the uplink failure.
Log evidence:
Before the uplink down event:
2023-12-12T07:12:12 NSX [nsx comp="nsx-edge" subcomp="datapathd" level="INFO"] Total tunnels: 440, up: 440, down: 0, unknown: 0, skipped: 0
After bringing down one uplink:
2023-12-12T07:13:55 NSX [nsx comp="nsx-edge" subcomp="datapathd" level="INFO"] Total tunnels: 440, up: 220, down: 220, unknown: 0, skipped: 0
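The comparison between the two log lines can be automated. The following is a minimal sketch, assuming the datapathd tunnel summary always follows the "Total tunnels: ..., up: ..., down: ..." layout shown in the excerpts; the function names parse_tunnel_summary and percent_tunnels_lost are illustrative helpers, not part of any NSX tooling.

```python
import re

# Assumption: the summary format matches the two log excerpts in this article.
SUMMARY = re.compile(
    r"Total tunnels: (?P<total>\d+), up: (?P<up>\d+), down: (?P<down>\d+), "
    r"unknown: (?P<unknown>\d+), skipped: (?P<skipped>\d+)"
)


def parse_tunnel_summary(line):
    """Return the tunnel counters from one log line, or None if absent."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def percent_tunnels_lost(before, after):
    """Percentage of previously-up tunnels that are no longer up."""
    if before["up"] == 0:
        return 0.0
    return 100.0 * (before["up"] - after["up"]) / before["up"]


before_line = (
    '2023-12-12T07:12:12 NSX [nsx comp="nsx-edge" subcomp="datapathd" level="INFO"] '
    "Total tunnels: 440, up: 440, down: 0, unknown: 0, skipped: 0"
)
after_line = (
    '2023-12-12T07:13:55 NSX [nsx comp="nsx-edge" subcomp="datapathd" level="INFO"] '
    "Total tunnels: 440, up: 220, down: 220, unknown: 0, skipped: 0"
)

before = parse_tunnel_summary(before_line)
after = parse_tunnel_summary(after_line)
print(percent_tunnels_lost(before, after))  # 50.0
```

A loss close to 1/N of the tunnels on an Edge with N VTEPs after a single uplink failure is consistent with the symptom described here.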
The NSX Edge is currently operating in native multi-VTEP mode. In this mode, the Edge does not dynamically select another active VTEP for outbound traffic when a path fails, so failover is not supported. Bidirectional Forwarding Detection (BFD) does not control traffic forwarding under this configuration. The traffic failure is therefore expected architectural behavior when "tn-tep-group-enabled" is set to false.
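The flag check can be scripted when reviewing a saved copy of the Edge configuration. This is a rough sketch, assuming the configuration is available as JSON text with "tn-tep-group-enabled" as a top-level key; the real output may nest the key differently, and the helper name tep_group_enabled is hypothetical.

```python
import json


def tep_group_enabled(config_text):
    """Return True only if "tn-tep-group-enabled" is explicitly true.

    Assumption: the key sits at the top level of a JSON object.
    A missing key is treated as disabled.
    """
    config = json.loads(config_text)
    return config.get("tn-tep-group-enabled") is True


# Sample input mirroring the value quoted in this article.
sample = '{"tn-tep-group-enabled": false}'

if tep_group_enabled(sample):
    print("TEP Group mode is enabled.")
else:
    print("TEP Group mode is disabled: no VTEP failover is expected.")
```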
For failover to work in this scenario, the Edge must be reconfigured to use TEP Group mode.
NSX 4.2.1 introduced Group TEP High Availability for Edge nodes based on BFD session state.
This feature addresses the TEP failure scenario: when a TEP Group is marked Down, another TEP Group takes over the traffic (see the Release Notes). This feature is not enabled by default.