Data path outage while experiencing an Edge-related failure or while testing Edge failover
Environment
VMware NSX-T Data Center
VMware NSX 4.0.0.1
Cause
In a multi-TEP configuration, the Edge maps traffic for overlay segments to individual TEPs. A TEP is considered to have failed only when there is a link down event on the network interface it is mapped to. A tunnel/BFD state change to down does not trigger a TEP failover.
Consider a 2 TEP configuration:
TEP1: IP1 and MAC1
TEP2: IP2 and MAC2
If TEP2 is considered failed due to a link down event, TEP2 will move to the same interface as TEP1 to continue processing traffic. TEP2 will now send and receive traffic using IP2/MAC2 from the same interface as TEP1.
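The failover rule described above can be sketched as a small model. This is only an illustration of the stated behaviour, not NSX code: the `Tep` class, the `handle_event` function, the event names, and the interface names `uplink-1` and `uplink-2` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tep:
    name: str
    ip: str
    mac: str
    interface: str  # interface the TEP currently sends and receives on


def handle_event(teps, event, interface):
    """Hypothetical model: only a link down event moves a TEP."""
    if event != "link_down":
        # A tunnel/BFD down event does not trigger a TEP failover.
        return teps
    survivors = [t.interface for t in teps if t.interface != interface]
    if not survivors:
        return teps  # no surviving interface to move to
    for t in teps:
        if t.interface == interface:
            # The TEP keeps its own IP and MAC; only the interface changes.
            t.interface = survivors[0]
    return teps


teps = [
    Tep("TEP1", "IP1", "MAC1", "uplink-1"),
    Tep("TEP2", "IP2", "MAC2", "uplink-2"),
]
handle_event(teps, "bfd_down", "uplink-2")   # TEP2 stays on uplink-2
handle_event(teps, "link_down", "uplink-2")  # TEP2 moves to uplink-1
```

After the second call, TEP2 is on the same interface as TEP1 but still uses IP2/MAC2, which is why the MAC address seen on that interface matters in the Edge VM case.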
Bare Metal Edge
Taking a physical NIC down on a Bare Metal Edge is a supported failover action. This link down event will trigger a TEP failover. Traffic works as expected.
Edge VM
Taking a physical NIC down on the ESXi host where the Edge VM runs is a supported failover action. This triggers a link down on the host, and the Edge vNIC will be mapped to the available NICs that are still up on the ESXi host. In this scenario there is no TEP failover; failover is handled by the ESXi server.
It is not supported to test a failover of an Edge VM by disconnecting the virtual NIC interface. This is not a valid real-world failure scenario, and it will not work. Default security settings on an ESXi portgroup or NSX segment would prevent TEP1's interface from transmitting with MAC2, a forged MAC that does not belong to that interface. For this to work, security settings would need to be tuned, e.g. promiscuous mode if on a portgroup, or MAC learning if on a segment. Hence, VMware advises not to consider vNIC failures when testing Edge VM failure scenarios.
Bare Metal Edge and Edge VM
Note that for both Bare Metal Edge and Edge VM there can be a corner case scenario where a TEP is considered Up because its associated uplink is up, but the TEP's tunnels are down. This condition can result in the blackholing of traffic for any segments mapped to that TEP.
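A check for this corner case might look like the sketch below. It assumes link and tunnel state have already been collected into plain dictionaries; the field names `link_up` and `tunnels`, the status strings, and the function `find_blackholed_teps` are hypothetical and do not reflect any NSX API output format.

```python
def find_blackholed_teps(teps):
    """Return names of TEPs whose uplink is up but which have no tunnel up."""
    suspects = []
    for tep in teps:
        tunnels = tep["tunnels"]
        # The uplink is up, so no TEP failover occurs, yet every tunnel is
        # down, so traffic for segments mapped to this TEP can be dropped.
        if tep["link_up"] and tunnels and not any(s == "UP" for s in tunnels):
            suspects.append(tep["name"])
    return suspects


# Hypothetical sample state for illustration only.
state = [
    {"name": "TEP1", "link_up": True, "tunnels": ["UP", "DOWN"]},
    {"name": "TEP2", "link_up": True, "tunnels": ["DOWN", "DOWN"]},
]
print(find_blackholed_teps(state))  # prints ['TEP2']
```

A TEP with at least one tunnel up is not flagged here. Whether a partially degraded TEP should also be reported is a design choice left to the operator.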
Resolution
This is known behaviour of NSX-T Data Center and is currently working as designed.
Workaround: NSX raises alarms for tunnel/BFD down events. These alarms should always be investigated and resolved to ensure a fully functional environment.