BFD tunnel down when the VTEP gateway MAC is unresolved or changed in NSX-T

Article ID: 325114

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
This issue occurs when the Edge VM resides on an ESXi transport node. The BFD session between the ESXi host's local VTEP and the VTEP in the Edge VM shows as down when the destination MAC used for the VTEP needs to change to the VTEP gateway's MAC.

The BFD session status is shown in the Heatmap UI in NSX Manager, or on the host with a CLI command such as "net-vdl2 -M bfd -s <nsxswitch name>". To check the destination MAC of the BFD packets, capture them with the pktcap-uw command.

If the destination MAC is not the VTEP gateway's MAC, the host is affected by the issue described in this article. The CLI command to capture BFD packets is:

pktcap-uw --capture UplinkSndKernel --uplink <vmnic name> -o - | tcpdump-uw -enr - | egrep BFD
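
The destination MAC check can also be scripted. The following is a minimal sketch, assuming each captured line carries the link-level header in the usual tcpdump "-e" form ("<source MAC> > <destination MAC>, ..."). The function name bfd_dst_mac_mismatch is hypothetical, the gateway MAC must be supplied by you, and the actual capture output on your host may be laid out differently.

```shell
# Hypothetical helper (not part of NSX-T or ESXi).
# Reads tcpdump-style lines on stdin and prints the destination MAC of every
# BFD line whose destination MAC differs from the expected VTEP gateway MAC
# given as $1. No output means no mismatch was seen.
# Assumption: each line contains "<src MAC> > <dst MAC>," as printed with -e.
bfd_dst_mac_mismatch() {
    # Normalize the expected MAC to lowercase for comparison.
    gw_mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    awk -v gw="$gw_mac" '
        /BFD/ {
            for (i = 2; i < NF; i++) {
                # The first ">" preceded by a MAC address separates source and destination.
                if ($i == ">" && $(i - 1) ~ /^([0-9A-Fa-f][0-9A-Fa-f]:)+[0-9A-Fa-f][0-9A-Fa-f]$/) {
                    dst = tolower($(i + 1))
                    sub(/,$/, "", dst)          # drop the trailing comma
                    if (dst != gw) print dst
                    break
                }
            }
        }'
}
```

To use it, pipe the output of the capture command above into the function in place of "egrep BFD", passing the gateway MAC as the only argument, for example "... | tcpdump-uw -enr - | bfd_dst_mac_mismatch <VTEP gateway MAC>".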

Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

This issue occurs because the BFD logic does not monitor the gateway MAC change.

Resolution

This issue is resolved in VMware NSX-T Data Center 3.0.2, available at VMware Downloads.

Workaround:
To work around this issue, restart the netcpad service on the ESXi transport node that has the BFD session down, using this command:

/etc/init.d/netcpad restart

Alternatively, reset the affected VMs. To identify them, use a CLI command to check which VNIs and routing domains reference the BFD session.

For example: 

BFD count:    1
===========================
Local IP: 172.19.223.119, Remote IP: 172.19.223.120, Local State: up, Remote State: up, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000, mult: 3, isDisabled: 0, l2SpanCount: 1, l3SpanCount: 1
Roundtrip Latency: NOT READY
VNI List: 73729    
Routing Domain List: a050044f-8890-4060-b92f-8204143c740a


This BFD session is referenced by VNI 73729 and routing domain a050044f-8890-4060-b92f-8204143c740a. You need to reset all VMs located in this VNI or routing domain.
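
When a host has many BFD sessions, the VNI and routing domain references can be pulled out with a small filter. The following is a minimal sketch, assuming the output has exactly the layout of the example above. The function name bfd_session_refs is hypothetical and is not an NSX-T command.

```shell
# Hypothetical helper (not part of NSX-T or ESXi).
# Reads BFD session output on stdin and prints one summary line per session:
#   <remote IP> <local state> vni=<VNI list> rd=<routing domain list>
# Assumption: field names and line order match the example in this article.
bfd_session_refs() {
    awk '
        /^Local IP:/ {
            vni = ""
            # Extract the value that follows "Remote IP:" up to the next comma.
            remote = $0; sub(/.*Remote IP: */, "", remote); sub(/,.*/, "", remote)
            # Extract the value that follows "Local State:" up to the next comma.
            state = $0; sub(/.*Local State: */, "", state); sub(/,.*/, "", state)
        }
        /^VNI List:/ {
            vni = $0; sub(/^VNI List: */, "", vni); sub(/[ \t\r]+$/, "", vni)
        }
        /^Routing Domain List:/ {
            rd = $0; sub(/^Routing Domain List: */, "", rd); sub(/[ \t\r]+$/, "", rd)
            print remote, state, "vni=" vni, "rd=" rd
        }'
}
```

Piping the session listing into bfd_session_refs gives one line per session, so sessions whose local state is not "up" can be matched to the VNIs and routing domains whose VMs need a reset.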

Note: VMware recommends restarting the netcpad service first, because it is the simpler of the two workarounds.