Edge TEP failover scenarios

Article ID: 324200

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • Applies to all NSX-T Data Center versions
  • Data path outage while experiencing an Edge related failure or while testing Edge failover


Environment

VMware NSX-T Data Center
VMware NSX 4.0.0.1

Cause

In a multi-TEP configuration, the Edge maps traffic for overlay segments to individual TEPs.
A TEP is considered to have failed when there is a link down event on the network interface it is mapped to.
A tunnel/BFD state change to down does not trigger a TEP failover.

Consider a two-TEP configuration:
TEP1: IP1 and MAC1
TEP2: IP2 and MAC2
If TEP2 is considered failed due to a link down event, TEP2 will move to the same interface as TEP1 to continue processing traffic.
TEP2 will now send and receive traffic using IP2/MAC2 from the same interface as TEP1.
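
The failover rule above can be sketched as a small Python model. This is an illustrative simulation only, not NSX code: the class names `Tep` and `Edge`, the methods, and the interface names `nic1` and `nic2` are assumptions introduced for the example. It shows that a link down event moves the TEP to a surviving interface while it keeps its own IP and MAC, and that a tunnel/BFD down event moves nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Tep:
    name: str
    ip: str
    mac: str
    interface: str  # interface the TEP is currently bound to (hypothetical names)


@dataclass
class Edge:
    teps: list
    link_up: dict = field(default_factory=dict)  # interface name -> link state

    def link_down(self, interface):
        """Handle a link down event: move TEPs off the failed interface."""
        self.link_up[interface] = False
        survivors = [name for name, up in self.link_up.items() if up]
        if not survivors:
            return  # no interface left to fail over to
        for tep in self.teps:
            if tep.interface == interface:
                # The TEP keeps its own IP and MAC; only the interface changes.
                tep.interface = survivors[0]

    def tunnel_down(self, tep_name):
        """A tunnel/BFD down event alone does not move any TEP."""
        return None


edge = Edge(
    teps=[Tep("TEP1", "IP1", "MAC1", "nic1"), Tep("TEP2", "IP2", "MAC2", "nic2")],
    link_up={"nic1": True, "nic2": True},
)
edge.tunnel_down("TEP2")  # TEP2 stays on nic2
edge.link_down("nic2")    # TEP2 moves to nic1, still using IP2/MAC2
```

After the last call, both TEPs share `nic1`, and TEP2 still sends and receives with IP2/MAC2.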


Bare Metal Edge
Taking a physical NIC down on a Bare Metal Edge is a supported failover action.
This link down event will trigger a TEP failover.
Traffic works as expected.


Edge VM
Taking a physical NIC down on the ESXi host where the Edge VM runs is a supported failover action.
This triggers a link down and the Edge vNIC will be mapped to the available NICs that are still up on the ESXi host.
In this scenario there is no TEP failover; failover is handled by the ESXi host.
Testing Edge VM failover by disconnecting the virtual NIC (vNIC) is not supported.
Besides not being a valid real-world failure scenario, it will not work.
Default security settings on an ESXi portgroup or NSX segment prevent TEP1's interface from transmitting with a forged MAC (MAC2) that does not belong to that interface.
For this to work, the security settings would need to be tuned, for example promiscuous mode on a portgroup or MAC learning on a segment.
Hence, VMware advises not to consider vNIC failures when testing Edge VM failure scenarios.
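
The reason a vNIC disconnect test fails can be illustrated with a simplified source-MAC check. This is a conceptual sketch, not how ESXi or NSX implement their security policies: the function name and the single `allow_forged` flag are assumptions that stand in for whatever security tuning is applied on the portgroup or segment.

```python
def forged_transmit_allowed(vnic_mac, frame_src_mac, allow_forged=False):
    """Return True if the port would forward a frame sent by the vNIC.

    With default settings (allow_forged=False, a hypothetical flag), a frame
    is only forwarded when its source MAC matches the MAC of the sending vNIC.
    """
    return allow_forged or frame_src_mac == vnic_mac


# TEP1's vNIC owns MAC1. After a vNIC disconnect, TEP2 would try to
# transmit from that same vNIC using MAC2.
tep1_frame_ok = forged_transmit_allowed("MAC1", "MAC1")
tep2_frame_ok = forged_transmit_allowed("MAC1", "MAC2")
tep2_frame_relaxed = forged_transmit_allowed("MAC1", "MAC2", allow_forged=True)
```

Here `tep2_frame_ok` is `False`: under default settings TEP2's traffic is dropped at the port, which is why the test does not behave like a real failover.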


Bare Metal Edge and Edge VM
Note: for both Bare Metal Edge and Edge VM, there is a corner case where a TEP is considered Up because its associated uplink is up, even though the TEP's tunnels are down.
This condition can result in the blackholing of traffic for any segments mapped to that TEP.
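
The corner case can be expressed as a simple check over per-TEP state. The sketch below is illustrative only: the dictionary layout, the field names `uplink_up` and `tunnels`, and the function name are assumptions, and the state would have to come from whatever monitoring data is available in a given environment.

```python
def blackholed_teps(teps):
    """Return names of TEPs whose uplink is up but whose tunnels are all down."""
    return [
        tep["name"]
        for tep in teps
        if tep["uplink_up"] and tep["tunnels"] and not any(tep["tunnels"])
    ]


# Hypothetical state: each entry in "tunnels" is True when that tunnel is up.
teps = [
    {"name": "TEP1", "uplink_up": True, "tunnels": [True, True]},
    {"name": "TEP2", "uplink_up": True, "tunnels": [False, False]},
]
at_risk = blackholed_teps(teps)  # ["TEP2"]
```

TEP2 is flagged because no link down event occurred, so no TEP failover is triggered even though none of its tunnels are passing traffic.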

Resolution

This is a known behaviour of NSX-T Data Center and is currently working as designed.

Workaround:
NSX has alarms that notify on tunnel/BFD down events. These should always be investigated and resolved to ensure a fully functional environment.