Edge TEP failover scenarios

Article ID: 324200

Products

VMware NSX

Issue/Introduction

  • All NSX versions
  • Datapath outage while experiencing an Edge-related failure or while testing Edge failover
  • Disconnecting one Edge VM vNIC results in an unexpected datapath outage



Environment

VMware NSX-T Data Center
VMware NSX

Cause

In a multi-TEP configuration, the Edge maps the traffic for each overlay segment to an individual TEP.

A TEP will be considered to have failed when there is a link down event on the network interface it is mapped to.

A tunnel/BFD state change to down does not trigger a TEP failover.



Consider a two-TEP configuration:

  • TEP1: IP1 and MAC1
  • TEP2: IP2 and MAC2
  • If TEP2 is considered failed due to a link down event, TEP2 will move to the same interface as TEP1 to continue processing traffic.
  • TEP2 will now send and receive traffic using IP2/MAC2 from the same interface as TEP1.
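
The two-TEP example above can be sketched as a small toy model. This is an illustration of the described behaviour only, not NSX code: the class names, the method names and the interface names fp-eth0 and fp-eth1 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Tep:
    name: str
    ip: str
    mac: str
    home_interface: str          # interface the TEP is normally mapped to
    active_interface: str = ""   # interface currently carrying its traffic

    def __post_init__(self):
        if not self.active_interface:
            self.active_interface = self.home_interface


class Edge:
    """Toy model of link-driven TEP failover (illustrative only)."""

    def __init__(self, teps):
        self.teps = {t.name: t for t in teps}
        self.link_up = {t.home_interface: True for t in teps}

    def link_down(self, interface):
        """A link down event moves every TEP on that interface to a surviving one."""
        self.link_up[interface] = False
        survivors = [i for i, up in self.link_up.items() if up]
        for tep in self.teps.values():
            if tep.active_interface == interface and survivors:
                tep.active_interface = survivors[0]

    def bfd_down(self, tep_name):
        """Tunnel/BFD state changes do not alter the TEP mapping in this model."""
        return None


edge = Edge([
    Tep("TEP1", "IP1", "MAC1", "fp-eth0"),
    Tep("TEP2", "IP2", "MAC2", "fp-eth1"),
])
edge.bfd_down("TEP2")        # no effect on the mapping
edge.link_down("fp-eth1")    # TEP2 moves to TEP1's interface
tep2 = edge.teps["TEP2"]
print(tep2.active_interface, tep2.ip, tep2.mac)  # fp-eth0 IP2 MAC2
```

After the link down event, TEP2 keeps its own IP and MAC but uses the interface that TEP1 is mapped to, which is the behaviour the following sections rely on.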

 

Bare Metal Edge

  • Taking a physical NIC down on a Bare Metal Edge is a supported failover action.
  • This link down event will trigger a TEP failover.
  • Traffic works as expected.

Edge VM

  • Taking a physical NIC down on the ESXi host where the Edge VM runs is a supported failover action.
  • This triggers a link down event, and the Edge vNIC is remapped to a physical NIC that is still up on the ESXi host.
  • In this scenario there is no TEP failover; the failover is handled by the ESXi host.
  • It is not supported to test a failover of an Edge VM by disconnecting the virtual NIC interface.
  • Besides not being a valid real-world failure scenario, it will not work.
  • Default security settings on an ESXi port group or NSX segment prevent TEP1's interface from transmitting with a forged MAC (MAC2) that does not belong to that interface.
  • For this to work, the security settings would need to be tuned, for example by enabling promiscuous mode on a port group or MAC learning on a segment.
  • Hence, VMware advises not to consider vNIC failures when testing Edge VM failure scenarios.
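
Why the vNIC disconnect test fails can be illustrated with a minimal sketch of a source-MAC check on a virtual port. This is a simplification under stated assumptions: the function and the allow_foreign_mac flag are hypothetical stand-ins for the security settings mentioned above, not actual vSphere or NSX setting names.

```python
def port_accepts_frame(vnic_mac, frame_src_mac, allow_foreign_mac=False):
    """Return True if the virtual port would forward the frame.

    allow_foreign_mac is a hypothetical flag standing in for whatever
    tuning permits source MACs that do not belong to the vNIC.
    """
    return allow_foreign_mac or frame_src_mac == vnic_mac


# After the vNIC behind TEP2 is disconnected, TEP2 transmits with MAC2
# through the vNIC whose own address is MAC1.
default_result = port_accepts_frame("MAC1", "MAC2")
tuned_result = port_accepts_frame("MAC1", "MAC2", allow_foreign_mac=True)
print(default_result, tuned_result)  # False True
```

With default settings the frame carrying MAC2 is dropped, so traffic for segments mapped to TEP2 is lost even though the TEP itself has failed over.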

Bare Metal Edge and Edge VM

Note: for both Bare Metal Edge and Edge VM, there is a corner case in which a TEP is considered up because its associated uplink is up, even though the TEP's tunnels are down.
This condition can blackhole traffic for any segments mapped to that TEP.
NSX 4.2.1 introduced Group TEP High Availability for Edge nodes, based on BFD session state, which handles this TEP failure scenario: the affected TEP Group is marked as down and the other TEP Group handles the traffic (see the Release Notes).
It is enabled via the API:

GET /policy/api/v1/infra/connectivity-global-config

{
    // ...
    "global_replication_mode_enabled": false,
    "is_inherited": false,
    "site_infos": [],
    "tep_group_config": {
        "enable_tep_grouping_on_edge": false <-------------
    },
    "resource_type": "GlobalConfig",
    "id": "global-config",
    "display_name": "default",
    "path": "/infra/global-config",
    // ...
}


PUT /policy/api/v1/infra/connectivity-global-config

{
    // ...
    "global_replication_mode_enabled": false,
    "is_inherited": false,
    "site_infos": [],
    "tep_group_config": {
        "enable_tep_grouping_on_edge": true <-------------
    },
    "resource_type": "GlobalConfig",
    "id": "global-config",
    "display_name": "default",
    "path": "/infra/global-config",
    // ...
}

This API setting enables both TEP Grouping and High Availability in NSX 4.2.1 and later.
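
The GET and PUT calls above amount to a read-modify-write of the global connectivity configuration. A minimal Python sketch follows, using only the standard library. The endpoint path and the enable_tep_grouping_on_edge field come from the calls above; the manager hostname, the use of basic authentication and all function names are assumptions for illustration. The sketch sends the complete GET body back in the PUT so that fields elided above are preserved unchanged.

```python
import base64
import copy
import json
import urllib.request

CONFIG_PATH = "/policy/api/v1/infra/connectivity-global-config"


def with_tep_grouping(config, enabled=True):
    """Return a copy of the GET body with only the TEP grouping flag changed."""
    updated = copy.deepcopy(config)
    updated.setdefault("tep_group_config", {})["enable_tep_grouping_on_edge"] = enabled
    return updated


def _request(manager, user, password, method, body=None, context=None):
    """Send one JSON request to the manager (basic authentication is assumed)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        f"https://{manager}{CONFIG_PATH}",
        data=data,
        method=method,
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, context=context) as resp:
        raw = resp.read()
        return json.loads(raw) if raw else {}


def enable_tep_grouping(manager, user, password, context=None):
    """GET the current config, set the flag, and PUT the whole body back."""
    current = _request(manager, user, password, "GET", context=context)
    return _request(manager, user, password, "PUT",
                    with_tep_grouping(current), context=context)


# Offline check of the transformation, using fields shown in the calls above.
sample = {
    "global_replication_mode_enabled": False,
    "tep_group_config": {"enable_tep_grouping_on_edge": False},
    "resource_type": "GlobalConfig",
}
result = with_tep_grouping(sample)
print(result["tep_group_config"]["enable_tep_grouping_on_edge"])  # True
```

A call such as enable_tep_grouping("nsx-manager.example.com", "admin", password) would apply the change against a live manager; the hostname here is a placeholder. Keeping the flag change in a separate pure function makes it possible to verify the request body without touching the environment.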

Resolution

This is known behaviour of NSX and it is working as designed.

NSX raises alarms for tunnel/BFD down events. These should always be investigated and resolved to ensure a fully functional environment.

Additional Information

For further troubleshooting assistance, please visit Troubleshooting NSX Edge High Availability.

 

If you are contacting Broadcom support about this issue, please provide the following:

  • NSX Edge log bundles for all Edges in the Edge Cluster
  • Ensure log date range covers the full date of the event(s) being investigated. When in doubt, retrieve logs for all time.
  • NSX Manager log bundles
  • ESXi host log bundles for all hosts supporting affected Edge VMs, and hosts whose TEPs are experiencing issues.
  • Text of any error messages seen in NSX GUI or command lines pertinent to the investigation

Handling Log Bundles for offline review with Broadcom support