You are using NSX-T version 3.2.1.2
External BGP neighborships are configured on a Tier-0 Gateway router. One or more VRF gateways are connected to this Tier-0 Gateway router. The VRF gateways have no BGP neighborships of their own, but BGP is 'Enabled' on them (on the VRF: BGP = ON, BGP Neighbors = 0).
EVPN and VNI are configured in the VRF gateway context.
If a VRF gateway is deleted, it triggers a flap on all the BGP neighborships on the parent Tier-0 Gateway router. They remain down for 2-3 minutes, then come back UP again.
The frr logs on the Edge node contain BGP down notifications stating that BGP has been "Administratively Shutdown":
2022/06/08 20:38:14.551555 BGP: %NOTIFICATION: sent to neighbor ##.##.##.## 6/2 (Cease/Administratively Shutdown) 0 bytes
2022/06/08 20:38:14.551589 BGP: %NOTIFICATION: sent to neighbor ##.##.##.## 6/2 (Cease/Administratively Shutdown) 0 bytes
Around the same time, the frr logs also contain many lines indicating deletion of L3VNI-specific configuration, EVPN MAC interface changes on the neighbors, and deletion of vxlan and loopback interfaces. These log lines look similar to the following:
2022/06/08 20:38:14.500852 BGP: Processing EVPN MAC interface change on peer ##.##.##.##
2022/06/08 20:38:14.500877 BGP: Processing EVPN MAC interface change on peer ##.##.##.##
2022/06/08 20:38:14.524829 ZEBRA: rib_update : AFI_IP event 0
2022/06/08 20:38:14.524851 BGP: Rx L3-VNI DEL VRF VRF-208 VNI 150000
2022/06/08 20:38:14.524918 OSPF: Zebra: Interface[vxlan-150000] state change to down.
2022/06/08 20:38:14.524934 PIM: pim_ifp_down: vxlan-150000 index 64(0) flags 4098 metric 0 mtu 1500 operative 0
2022/06/08 20:38:14.526655 ZEBRA: Del L3-VNI 150000 intf vxlan-150000(64)
2022/06/08 20:38:14.526686 ZEBRA: MESSAGE: ZEBRA_INTERFACE_DELETE vxlan-150000(0) zclient: static
2022/06/08 20:38:14.526715 BGP: Rx L3-VNI DEL VRF VRF-208 VNI 150000
2022/06/08 20:38:14.526774 OSPF: Zebra: interface delete vxlan-150000 vrf default[0] index 64 flags 1002 metric 0 mtu 1500
2022/06/08 20:38:14.526775 PIM: pim_ifp_destroy: vxlan-150000 index 64(0) flags 4098 metric 0 mtu 1500 operative 0
2022/06/08 20:38:14.552068 BGP: ##.##.##.##: peer keepalive being removed, acquiring lock
2022/06/08 20:38:14.552072 BGP: ##.##.##.##: peer keepalive removed
2022/06/08 20:38:14.552147 BGP: %ADJCHANGE: neighbor ##.##.##.##(Unknown) in vrf default Down Neighbor deleted
2022/06/08 20:38:14.552292 BGP: ##.##.##.##(0x1d6d33c3de10): close file descriptor
2022/06/08 20:38:14.552300 BGP: bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0
2022/06/08 20:38:14.552331 BGP: ##.##.##.## (0x1d6d33c3de10 -1) went from Established to Deleted
2022/06/08 20:38:14.552333 BGP: Peer ##.##.##.## fd -1 send BGP_DOWN message to BGP adapter
2022/06/08 20:38:14.552349 BGP: BGP Adapter: Send BGP_DOWN for peer ##.##.##.## (vrf: default)
2022/06/08 20:38:14.552509 BGP: bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0
2022/06/08 20:38:14.552541 BGP: Static announcement (0x1d6d33a29400 -1) went from Idle to Deleted
2022/06/08 20:38:14.553726 BGP: uninstalling evpn prefix [5]:[0][##.##.##.##/32]/320 as ip prefix ##.##.##.##/32 in vrf VRF-16
2022/06/08 20:38:14.554024 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE pimreg(0) zclient: bgp
2022/06/08 20:38:14.554030 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE uplink-307(0) zclient: bgp
2022/06/08 20:38:14.554033 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE uplink-347(0) zclient: bgp
2022/06/08 20:38:14.554036 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UPDATE/DELETE vxlan-150001(0) zclient: bgp
2022/06/08 20:38:14.554247 BGP: Rx L3-VNI DEL VRF VRF-82 VNI 150002
2022/06/08 20:38:14.554253 BGP: [EC 33554467] Cannot process L3VNI 150002 Del - Could not find EVPN BGP instance
2022/06/08 20:38:14.554264 BGP: Rx L3-VNI DEL VRF VRF-35 VNI 150001
2022/06/08 20:38:14.554270 BGP: [EC 33554467] Cannot process L3VNI 150001 Del - Could not find EVPN BGP instance
2022/06/08 20:38:14.554382 BGP: Processing EVPN MAC interface change on peer ##.##.##.##
2022/06/08 20:38:14.554385 BGP: Processing EVPN MAC interface change on peer ##.##.##.##
The following log lines also appear in log/rcpm/frr-reload.log on the Edge node:
2022/06/08 20:38:14,355 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf VRF-208
2022/06/08 20:38:14,450 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf
2022/06/08 20:38:14,556 INFO: Executed "vtysh -c conf t -c no router bgp 4200099999"
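To check an existing frr-reload.log for this signature, the pattern above can be searched for programmatically. The following is a minimal sketch, assuming the log entries have exactly the format shown in this article. The function names `parse_entry` and `find_default_bgp_removals` are hypothetical helpers written for this illustration and are not part of NSX-T or FRR.

```python
import re

# Assumption: entries look like the frr-reload.log lines quoted in this article.
ENTRY = re.compile(
    r'INFO: (Failed to execute|Executed) "?vtysh -c conf t -c '
    r'no router bgp (\d+)([^"<]*)'
)


def parse_entry(line):
    """Return (executed, asn, trailing_args) or None if the line does not match."""
    match = ENTRY.search(line)
    if match is None:
        return None
    status, asn, rest = match.groups()
    return status == "Executed", asn, rest.split()


def find_default_bgp_removals(lines):
    """Find cases where a failed 'no router bgp <asn> vrf <name>' is later
    followed by an executed bare 'no router bgp <asn>' (default instance)."""
    hits = []
    failed_vrf = None  # (asn, vrf name) of the latest failed per-VRF removal
    for line in lines:
        parsed = parse_entry(line)
        if parsed is None:
            continue
        executed, asn, args = parsed
        if not executed and len(args) == 2 and args[0] == "vrf":
            failed_vrf = (asn, args[1])
        elif executed and not args and failed_vrf and failed_vrf[0] == asn:
            hits.append((asn, failed_vrf[1]))
            failed_vrf = None
    return hits


sample = [
    "2022/06/08 20:38:14,355 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf VRF-208",
    "2022/06/08 20:38:14,450 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf",
    '2022/06/08 20:38:14,556 INFO: Executed "vtysh -c conf t -c no router bgp 4200099999"',
]
for asn, vrf in find_default_bgp_removals(sample):
    print(f"default BGP instance {asn} removed after failed deletion of {vrf}")
```

A successful per-VRF removal (an "Executed" entry that still carries a VRF name) is deliberately not reported, since only the bare command affects the default instance.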
This is a known issue in which an additional API call is made during VRF deletion.
When a T0-VRF that is configured with a VNI and has BGP enabled (even with no neighbors configured) is deleted, a "no router" command is issued to FRR for that VRF instance. This command fails because FRR does not allow deletion of a VRF BGP instance that has a VNI configured. As a result, multiple variants of the failed command are issued (as per frr-reload logic), and one of them removes the default VRF BGP instance from FRR (where BGP is configured).
Removal of the default VRF BGP instance configuration from FRR causes BGP to flap.
2022/06/08 22:17:29,281 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf VRF-35
2022/06/08 22:17:29,369 INFO: Failed to execute vtysh -c conf t -c no router bgp 4200099999 vrf
2022/06/08 22:17:29,497 INFO: Executed "vtysh -c conf t -c no router bgp 4200099999" <----------
However, BGP comes back UP after a few seconds, because RCPM pushes the default VRF BGP configuration to FRR again.
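The sequence of commands in the frr-reload log is consistent with a retry that drops the last word of a failed command and tries again. The following is a simplified model of that behaviour, not the actual frr-reload code. `run_with_fallback` and `fake_frr` are hypothetical names, and `fake_frr` is only a stand-in that rejects the per-VRF removal and accepts the bare command.

```python
def run_with_fallback(command, execute):
    """Try a command; on failure drop its last word and retry.

    Returns the list of (command, succeeded) attempts in order.
    """
    attempts = []
    words = command.split()
    while words:
        candidate = " ".join(words)
        succeeded = execute(candidate)
        attempts.append((candidate, succeeded))
        if succeeded:
            break
        words = words[:-1]  # shorten the command by one trailing word
    return attempts


def fake_frr(command):
    """Stand-in for FRR (assumption): only the bare removal is accepted."""
    return command == "no router bgp 4200099999"


attempts = run_with_fallback("no router bgp 4200099999 vrf VRF-35", fake_frr)
for candidate, succeeded in attempts:
    print("Executed" if succeeded else "Failed to execute", candidate)
```

In this model the third attempt no longer names a VRF, so it applies to the default BGP instance, which matches the "Executed" line marked with the arrow in the log excerpt.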
This issue will be fixed in NSX-T version 3.2.3.0
Workaround:
To make sure BGP doesn't flap, remove all EVPN VNI-specific configuration from the VRF before deleting the VRF instance.
In short: un-configure the VNI in the VRF context first, and then delete the VRF.
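The ordering in the workaround can be expressed as a small pre-delete check. This is a sketch only: the dictionary keys `name` and `vni` are a hypothetical representation of a VRF gateway chosen for this example and do not correspond to NSX-T API fields.

```python
def deletion_plan(vrf):
    """Return the ordered steps for removing a VRF gateway safely.

    `vrf` is a hypothetical dict with a 'name' and an optional 'vni'.
    """
    steps = []
    if vrf.get("vni") is not None:
        # The VNI must be un-configured before the VRF itself is deleted.
        steps.append(f"remove VNI {vrf['vni']} from {vrf['name']}")
    steps.append(f"delete {vrf['name']}")
    return steps


for step in deletion_plan({"name": "VRF-208", "vni": 150000}):
    print(step)
```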