BGP flap on T0 triggered by deletion of VRF with EVPN enabled



Article ID: 324562



Products

VMware NSX Networking

Issue/Introduction

Symptoms:

NSX-T version 3.2.1.2

External BGP neighborships run on a Tier-0 Gateway. One or more VRF gateways are connected to that Tier-0 Gateway. The VRF gateways have no BGP neighbors of their own, but BGP is 'Enabled' on them (on the VRF: BGP = ON, BGP Neighbors = 0).

EVPN and VNI are configured in the VRF gateway context.

If a VRF gateway is deleted, all BGP neighborships on the parent Tier-0 Gateway flap. They stay down for 2-3 minutes and then come back up.

The FRR logs on the Edge node show BGP down notifications stating that BGP has been "Administratively Shutdown."

2022/06/08 20:38:14.551555 BGP: %NOTIFICATION: sent to neighbor 10.99.32.1 6/2 (Cease/Administratively Shutdown) 0 bytes
2022/06/08 20:38:14.551589 BGP: %NOTIFICATION: sent to neighbor 10.99.66.1 6/2 (Cease/Administratively Shutdown) 0 bytes
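To confirm this symptom in a large log, the notifications can be filtered by code and subcode. The following is a minimal sketch that assumes the log lines have the same shape as the two lines above; the function name and regular expression are illustrative and not part of any NSX or FRR tooling.

```python
import re

# Assumed line shape, taken from the two sample notifications above.
NOTIFICATION_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) BGP: %NOTIFICATION: "
    r"sent to neighbor (?P<peer>\S+) (?P<code>\d+)/(?P<subcode>\d+) "
    r"\((?P<reason>[^)]*)\)"
)


def find_admin_shutdowns(lines):
    """Return (timestamp, peer) for each 6/2 (Cease/Administratively Shutdown) sent."""
    hits = []
    for line in lines:
        m = NOTIFICATION_RE.match(line.strip())
        if m and (m["code"], m["subcode"]) == ("6", "2"):
            hits.append((m["ts"], m["peer"]))
    return hits


sample = [
    "2022/06/08 20:38:14.551555 BGP: %NOTIFICATION: sent to neighbor 10.99.32.1 6/2 (Cease/Administratively Shutdown) 0 bytes",
    "2022/06/08 20:38:14.551589 BGP: %NOTIFICATION: sent to neighbor 10.99.66.1 6/2 (Cease/Administratively Shutdown) 0 bytes",
    "2022/06/08 20:38:14.524829 ZEBRA: rib_update : AFI_IP event 0",
]

for ts, peer in find_admin_shutdowns(sample):
    print(ts, peer)
```

Several peers shut down within the same second, as here, points to the session teardown being triggered locally on the Edge rather than by the individual neighbors.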

 

Around the same time, the FRR logs also contain many lines indicating deletion of L3VNI-specific configuration, EVPN MAC interface changes on the neighbors, and deletion of vxlan and loopback interfaces. These lines look similar to the following:

2022/06/08 20:38:14.500852 BGP: Processing EVPN MAC interface change on peer 10.99.32.1

2022/06/08 20:38:14.500877 BGP: Processing EVPN MAC interface change on peer 10.99.66.1

2022/06/08 20:38:14.524829 ZEBRA: rib_update : AFI_IP event 0

2022/06/08 20:38:14.524851 BGP: Rx L3-VNI DEL VRF VRF-208 VNI 150000

2022/06/08 20:38:14.524918 OSPF: Zebra: Interface[vxlan-150000] state change to down.

2022/06/08 20:38:14.524934 PIM: pim_ifp_down: vxlan-150000 index 64(0) flags 4098 metric 0 mtu 1500 operative 0

2022/06/08 20:38:14.526655 ZEBRA: Del L3-VNI 150000 intf vxlan-150000(64)

2022/06/08 20:38:14.526686 ZEBRA: MESSAGE: ZEBRA_INTERFACE_DELETE vxlan-150000(0) zclient: static

2022/06/08 20:38:14.526715 BGP: Rx L3-VNI DEL VRF VRF-208 VNI 150000

2022/06/08 20:38:14.526774 OSPF: Zebra: interface delete vxlan-150000 vrf default[0] index 64 flags 1002 metric 0 mtu 1500

2022/06/08 20:38:14.526775 PIM: pim_ifp_destroy: vxlan-150000 index 64(0) flags 4098 metric 0 mtu 1500 operative 0

2022/06/08 20:38:14.552068 BGP: 10.99.58.1: peer keepalive being removed, acquiring lock

2022/06/08 20:38:14.552072 BGP: 10.99.58.1: peer keepalive removed

2022/06/08 20:38:14.552147 BGP: %ADJCHANGE: neighbor 10.99.66.1(Unknown) in vrf default Down Neighbor deleted

2022/06/08 20:38:14.552292 BGP: 10.99.58.1(0x1d6d33c3de10): close file descriptor

2022/06/08 20:38:14.552300 BGP: bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0

2022/06/08 20:38:14.552331 BGP: 10.99.58.1 (0x1d6d33c3de10 -1) went from Established to Deleted

2022/06/08 20:38:14.552333 BGP: Peer 10.99.66.1 fd -1 send BGP_DOWN message to BGP adapter

2022/06/08 20:38:14.552349 BGP: BGP Adapter: Send BGP_DOWN for peer 10.99.58.1 (vrf: default)

2022/06/08 20:38:14.552509 BGP: bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0

2022/06/08 20:38:14.552541 BGP: Static announcement (0x1d6d33a29400 -1) went from Idle to Deleted

2022/06/08 20:38:14.553726 BGP: uninstalling evpn prefix [5]:[0][172.99.66.5/32]/320 as ip prefix 172.99.66.5/32 in vrf VRF-16

2022/06/08 20:38:14.554024 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE pimreg(0) zclient: bgp

2022/06/08 20:38:14.554030 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE uplink-307(0) zclient: bgp

2022/06/08 20:38:14.554033 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE uplink-347(0) zclient: bgp

2022/06/08 20:38:14.554036 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE vxlan-150001(0) zclient: bgp

2022/06/08 20:38:14.554247 BGP: Rx L3-VNI DEL VRF VRF-82 VNI 150002

2022/06/08 20:38:14.554253 BGP: [EC 33554467] Cannot process L3VNI 150002 Del - Could not find EVPN BGP instance

2022/06/08 20:38:14.554264 BGP: Rx L3-VNI DEL VRF VRF-35 VNI 150001

2022/06/08 20:38:14.554270 BGP: [EC 33554467] Cannot process L3VNI 150001 Del - Could not find EVPN BGP instance

2022/06/08 20:38:14.554382 BGP: Processing EVPN MAC interface change on peer 169.24.0.99

2022/06/08 20:38:14.554385 BGP: Processing EVPN MAC interface change on peer 169.24.0.99

 

 

Similar log lines also appear in log/rcpm/frr-reload.log on the Edge node:

2022/06/08 20:38:14,355 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf VRF-208

2022/06/08 20:38:14,450 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf

2022/06/08 20:38:14,556 INFO: Executed "vtysh -c conf t -c no router bgp 4200099999"
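The telling line in this excerpt is the last one: a successfully executed "no router bgp" with no VRF qualifier. A minimal sketch for spotting it is shown below, assuming the line format seen in the excerpt above; the helper name and patterns are illustrative only.

```python
import re

# Assumed format of a successful command line in frr-reload.log (from the excerpt above).
EXECUTED_RE = re.compile(r'Executed "vtysh -c conf t -c (?P<cmd>[^"]+)"')
# A BGP instance removal with nothing after the AS number, i.e. no "vrf <name>".
DEFAULT_BGP_DEL_RE = re.compile(r"^no router bgp \d+$")


def default_bgp_removals(lines):
    """Return executed commands that removed a BGP instance without a VRF qualifier."""
    removed = []
    for line in lines:
        m = EXECUTED_RE.search(line)
        if m and DEFAULT_BGP_DEL_RE.match(m["cmd"].strip()):
            removed.append(m["cmd"].strip())
    return removed


sample = [
    "2022/06/08 20:38:14,355 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf VRF-208",
    "2022/06/08 20:38:14,450 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf",
    '2022/06/08 20:38:14,556 INFO: Executed "vtysh -c conf t -c no router bgp 4200099999"',
]

print(default_bgp_removals(sample))
```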

 


Environment

VMware NSX-T Data Center

Cause

This is a bug: an additional API call is made during VRF deletion.

When a Tier-0 VRF that has a VNI configured and BGP enabled (even with no neighbors configured) is deleted, a "no router bgp" command for that VRF instance is issued to FRR. The command fails because FRR does not allow deletion of a VRF BGP instance that still has a VNI configured. As a result, the frr-reload logic issues multiple variants of the failed command, and one of them removes the default VRF BGP instance (where BGP is configured) from FRR.

Removing the default VRF BGP instance configuration from FRR causes BGP to flap.

 

2022/06/08 22:17:29,281 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf VRF-35

2022/06/08 22:17:29,369 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf

2022/06/08 22:17:29,497 INFO: Executed "vtysh -c conf t -c no router bgp 4200099999" <----------

 

However, BGP comes back up after a few seconds because RCPM pushes the default VRF BGP configuration to FRR again.
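The sequence in the log is consistent with a retry that drops the last word of a failed command and tries again. The sketch below models only that behavior and is not the actual frr-reload code; `run_with_fallback` and `fake_vtysh` are hypothetical names, and the accept/reject rules in `fake_vtysh` are assumptions chosen to reproduce the three log lines.

```python
def run_with_fallback(command, try_command):
    """Try `command`; on failure drop its last word and retry.

    `try_command` is a hypothetical stand-in for vtysh that returns True
    when the command is accepted. Returns the list of (command, accepted).
    """
    words = command.split()
    attempts = []
    while words:
        candidate = " ".join(words)
        ok = try_command(candidate)
        attempts.append((candidate, ok))
        if ok:
            break
        words = words[:-1]  # shorten the command by one word
    return attempts


def fake_vtysh(cmd):
    # Assumption for illustration: the VRF-scoped delete is rejected because a
    # VNI is still configured, the bare "... vrf" form is incomplete, and the
    # unqualified form (the default VRF BGP instance) is accepted.
    return cmd == "no router bgp 4200099999"


for cmd, ok in run_with_fallback("no router bgp 4200099999 vrf VRF-35", fake_vtysh):
    print("Executed" if ok else "Failed to execute", cmd)
```

Under these assumptions, the third and shortest variant succeeds, and it is exactly the command that deletes the default VRF BGP instance.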

Resolution

This bug will be fixed in NSX-T version 3.2.3.0


Workaround:

To prevent the BGP flap, remove all EVPN VNI-specific configuration from the VRF before deleting the VRF instance.
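The ordering can be captured as a simple pre-delete check. This is a sketch only: the dictionary shape (`name`, `evpn_vni`) is a hypothetical representation of a VRF gateway for illustration and does not correspond to an NSX API object.

```python
def deletion_plan(vrf):
    """Return the ordered steps for deleting a VRF gateway without a BGP flap.

    `vrf` is a hypothetical dict such as {"name": "VRF-208", "evpn_vni": 150000};
    `evpn_vni` is None when no VNI is configured in the VRF context.
    """
    steps = []
    if vrf.get("evpn_vni") is not None:
        # The VNI must be removed first, otherwise the VRF delete hits the bug.
        steps.append(f"remove EVPN VNI {vrf['evpn_vni']} from {vrf['name']}")
    steps.append(f"delete VRF gateway {vrf['name']}")
    return steps


for step in deletion_plan({"name": "VRF-208", "evpn_vni": 150000}):
    print(step)
```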


Additional Information

In short: unconfigure the VNI in the VRF context, then delete the VRF.