BFD and BGP Flap observed when NSX Edge VMs are vMotioned

Article ID: 319087


Products

VMware NSX Networking

Issue/Introduction

  • This is known and expected behavior of Edge VMs during vMotion.


Symptoms:
  • When an active Edge VM is vMotioned from one ESXi host to another, a BFD and BGP flap is observed, which recovers within a few seconds.
Logs:
Edge frr.log

:30.725986 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-885] bfd status [Down] vrf [default] peer [xx:xx:xx:xx] local [xx:xx:xx:xx]

1/1/1  01:01:01 ZEBRA: MESSAGE: ZEBRA_INTERFACE_BFD_DEST_UPDATE xx:xx:xx:xx/32 on uplink-885 Down event

1/1/1 01:01:01 BGP: [xx:xx:xx:xx]: BFD Down

....

1/1/1 01:01:01 BGP: [xx:xx:xx:xx]: BFD Up

1/1/1 01:01:01 BGP: BFD status for peer xx:xx:xx:xx changed from Down -> Up

1/1/1 01:01:01 BGP: xx:xx:xx:xx [FSM] Timer (start timer expire).
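To quantify how long a flap lasted, the "BFD Down" and "BFD Up" timestamps in frr.log can be compared. The following is a minimal parsing sketch; the log path and timestamp format are assumptions (the timestamps in the excerpt above are redacted) and should be adjusted to match the actual log lines.

# Hedged sketch: measure the gap between "BFD Down" and "BFD Up" BGP events in frr.log.
# The log path and timestamp format are assumptions (FRR commonly logs
# "YYYY/MM/DD HH:MM:SS.ffffff"); adjust both to match the actual log lines.
import re
from datetime import datetime

LOG_PATH = "/var/log/frr/frr.log"          # hypothetical path
TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f"       # assumed FRR timestamp layout
PATTERN = re.compile(r"^(\S+ \S+) BGP: .*BFD (Down|Up)")

events = []
with open(LOG_PATH) as log:
    for line in log:
        match = PATTERN.match(line)
        if match:
            events.append((datetime.strptime(match.group(1), TIME_FORMAT), match.group(2)))

# Pair each Down event with the next Up event and report the outage duration.
for (down_time, state), (up_time, next_state) in zip(events, events[1:]):
    if state == "Down" and next_state == "Up":
        duration = (up_time - down_time).total_seconds()
        print(f"BFD flap at {down_time}: recovered after {duration:.1f} s")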

Edge syslog

BGP Down.

ventFeatureName="routing" eventSev="error" eventType="bgp_down"] In Router d304f6f0-1e1c-4ad8-b372-116c91da3b55, BGP neighbor 29288641-f5a2-4a29-b6bc-e64d5bd626ac (xx:xx:xx:xx) is down, reason: Network or config error.

...

BGP Up.

1/1/1 01:01:01 gtdc-lnsxedge-01.ims.cnp.local NSX 4928 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="29288641-f5a2-4a29-b6bc-e64d5bd626ac" tid="4965" level="ERROR" eventState="Off" eventFeatureName="routing" eventSev="error" eventType="bgp_down"] Context report: {"entity_id":"29288641-f5a2-4a29-b6bc-e64d5bd626ac","sr_id":"670a6103-3785-45ae-be9a-157f61e0ace9","lr_id":"d304f6f0-1e1c-4ad8-b372-116c91da3b55","bgp_neighbor_ip":"192.168.10.113","failure_reason":"BGP Established"}

1/1/1 01:01:01 gtdc-lnsxedge-01.ims.cnp.local bgpd 11191 - - %ADJCHANGE: neighbor xx:xx:xx:xx(Unknown) in vrf default Up


vmware.log for the vMotion

1/1/1 01:01:01| vmx| I125: MigrateVMXdrToSpec: type: 1 srcIp=<xx:xx:xx:xx> dstIp=<xx:xx:xx:xx> mid=1647b15e666ece27 uuid=4c4c4544-0031-4d10-8037-b1c04f315132 priority=yes checksumMemory=no maxDowntime=0 encrypted=0 resumeDuringPageIn=no latencyAware=yes diskOpFile= srcLogIp=<<unknown>> dstLogIp=<<unknown>> ftPrimaryIp=<<unknown>> ftSecondaryIp=<<unknown>>

1/1/1 01:01:01| vmx| I125: MigrateSetInfo: state=8 srcIp=<xx:xx:xx:xx> dstIp=<xx:xx:xx:xx> mid=1605446811184451111 uuid=4c4c4544-0031-4d10-8037-b1c04f315132 priority=high


Environment

VMware NSX-T Data Center

Cause

  • This is expected behavior when Edge VMs are vMotioned: during the vMotion switchover the Edge VM is briefly paused and cannot send or receive BFD packets, so if the BFD timers are aggressive the session is declared down and a BGP flap occurs (see the sketch below).
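For illustration, BFD declares a session down when no control packets are received for the receive interval multiplied by the declared-dead multiplier. The sketch below uses hypothetical timer values (not taken from this environment) to show why aggressive timers can expire within the brief pause of a vMotion switchover.

# Illustrative only: BFD detection time = receive interval x declared-dead multiplier.
# The timer values below are hypothetical examples, not defaults taken from this article.

def bfd_detection_time_ms(receive_interval_ms: int, declare_dead_multiple: int) -> int:
    """Time with no BFD packets received before the session is declared Down."""
    return receive_interval_ms * declare_dead_multiple

aggressive = bfd_detection_time_ms(receive_interval_ms=300, declare_dead_multiple=3)   # 900 ms
relaxed = bfd_detection_time_ms(receive_interval_ms=1000, declare_dead_multiple=3)     # 3000 ms

print(f"Aggressive timers: BFD declared Down after {aggressive} ms without packets")
print(f"Relaxed timers:    BFD declared Down after {relaxed} ms without packets")
# If the vMotion switchover pause exceeds the detection time, BFD (and with it BGP)
# flaps until the session re-establishes after the Edge VM resumes.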

Resolution

  • N/A


Workaround:
Recommendation 1:
  • Avoid vMotioning Edge VMs.
Note: vMotion of Edge VMs should not occur during normal operations and should be performed only when necessary (for example, during host upgrades). As a best practice, vMotion of Edge VMs should be performed only during a maintenance window.

Recommendation 2:
  • Increase the BFD timers to avoid flaps. The following document describes how to configure them (Manager mode); a hedged API sketch follows the link.
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-5B59FFAC-C31D-4801-840C-F52E6322D2C0.html
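
The same change can be scripted against the NSX Manager (Manager mode) REST API. The sketch below is a hedged example only: the endpoint path, payload field names, timer values, and credentials are assumptions and should be verified against the linked documentation and the API reference for your NSX version.

# Hedged sketch: raise BFD timers on a Tier-0 logical router via the NSX Manager API.
# The endpoint, payload fields, and values below are assumptions -- verify them against
# the NSX Manager API reference for your version before use.
import requests

NSX_MANAGER = "https://nsx-manager.example.com"     # hypothetical Manager address
ROUTER_ID = "d304f6f0-1e1c-4ad8-b372-116c91da3b55"  # logical router ID from the example logs above
AUTH = ("admin", "REPLACE_ME")                      # replace with real credentials

url = f"{NSX_MANAGER}/api/v1/logical-routers/{ROUTER_ID}/routing/bfd-config"

# Read the current config first; the PUT must carry the existing _revision value.
current = requests.get(url, auth=AUTH, verify=False).json()

current.update({
    "receive_interval": 1000,      # ms; example value intended to exceed a typical vMotion pause
    "transmit_interval": 1000,     # ms
    "declare_dead_multiple": 3,    # detection time = 1000 ms x 3 = 3 s
})

response = requests.put(url, json=current, auth=AUTH, verify=False)
response.raise_for_status()
print("BFD config updated:", response.json())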

Additional Information

Impact/Risks:
  • Because a BFD flap leads to a BGP flap, in scenarios where there is only one uplink on the Edge there is some datapath impact until connectivity between the BGP peers on the uplink is restored and BGP has reconverged. If there are multiple uplinks on the Edge, the datapath fails over to the next best path and the impact may be minimal.
  • It is possible that all uplinks experience a BFD flap during a vMotion, in which case there will be datapath impact.