/var/log/frr/frr.log on the Edge shows "Hold timer expire" for the Inter SR BGP IPs (169.254.0.130 or 169.254.0.131) every few seconds (every 5 seconds in the sample below):
root@edge:~# grep 'Hold timer expire' /var/log/frr/frr.log
2021/06/23 20:35:10.854171 BGP: 169.254.0.130 [FSM] Hold timer expire
2021/06/23 20:35:15.854550 BGP: 169.254.0.130 [FSM] Hold timer expire
2021/06/23 20:35:20.855850 BGP: 169.254.0.130 [FSM] Hold timer expire
2021/06/23 20:35:25.857472 BGP: 169.254.0.130 [FSM] Hold timer expire
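To confirm that the expirations recur at a steady interval rather than sporadically, the grep output can be post-processed. The following is a minimal sketch, not an NSX tool: the function name hold_timer_intervals is hypothetical, and it assumes the log lines follow the layout shown above (date, time, "BGP:", peer IP).

```python
from collections import defaultdict
from datetime import datetime


def hold_timer_intervals(lines):
    """Return {peer_ip: [seconds between consecutive 'Hold timer expire' events]}."""
    stamps = defaultdict(list)
    for line in lines:
        if "Hold timer expire" not in line:
            continue
        parts = line.split()
        # Assumed layout: <date> <time> BGP: <peer_ip> [FSM] Hold timer expire
        if len(parts) < 4:
            continue
        try:
            when = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S.%f")
        except ValueError:
            continue  # skip lines whose timestamp does not match the assumed format
        stamps[parts[3]].append(when)
    return {
        peer: [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]
        for peer, times in stamps.items()
    }


sample = [
    "2021/06/23 20:35:10.854171 BGP: 169.254.0.130 [FSM] Hold timer expire",
    "2021/06/23 20:35:15.854550 BGP: 169.254.0.130 [FSM] Hold timer expire",
    "2021/06/23 20:35:20.855850 BGP: 169.254.0.130 [FSM] Hold timer expire",
    "2021/06/23 20:35:25.857472 BGP: 169.254.0.130 [FSM] Hold timer expire",
]
intervals = hold_timer_intervals(sample)
print(intervals)
```

For the sample lines, every interval is approximately 5 seconds, which points to a peering that never stays established rather than an occasional timeout.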
/var/log/syslog shows the BGP state flapping constantly:
root@edge:~# grep "state=BGP" /var/log/syslog
2021-06-15T22:30:40.999931+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_UP
2021-06-15T22:30:44.993615+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_DOWN
2021-06-15T22:30:46.009058+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_UP
2021-06-15T22:30:49.994782+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_DOWN
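When the syslog is large, counting the UP and DOWN alarms per peer gives a quick measure of how badly a peering is flapping. The sketch below is illustrative only: count_bgp_transitions and the regular expression are assumptions based on the alarm text shown above, and the sample lines are trimmed to the part of the message that the pattern inspects.

```python
import re
from collections import Counter

# Assumed message shape: "Alarm for BGP <peer_ip>, ... state=BGP_UP|BGP_DOWN"
ALARM_RE = re.compile(r"Alarm for BGP (\S+?),.*state=(BGP_UP|BGP_DOWN)")


def count_bgp_transitions(lines):
    """Count BGP_UP / BGP_DOWN alarms per peer IP."""
    counts = Counter()
    for line in lines:
        match = ALARM_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    'Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_UP',
    'Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_DOWN',
    'Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_UP',
    'Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_DOWN',
]
counts = count_bgp_transitions(sample)
for (peer, state), n in sorted(counts.items()):
    print(peer, state, n)
```

A roughly equal and steadily growing number of BGP_UP and BGP_DOWN alarms for the same Inter SR peer matches the flapping pattern described here.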
VMware NSX
If the size of a BGP UPDATE packet exchanged between Inter SR interfaces exceeds the MTU anywhere along the datapath, the packet is dropped and the Inter SR BGP peering flaps while trying to reach the Established state.
See the Additional Information section for details on how to identify this issue.
In general, the Inter-SR port MTU, Global logical MTU (or Edge VTEP MTU), ESX PNIC MTU, and TOR MTU must have the following relationship:
In the case of Federation, ICMP errors ("Fragmentation needed") should be enabled on the TOR so that the Edge can perform PMTU discovery and fragment packets as needed.
These resolutions are recommended to account for the GENEVE encapsulation overhead.
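To illustrate why the underlay needs headroom for GENEVE, the arithmetic can be sketched as follows. This is a simplified estimate, not an NSX-provided calculation: the function names are hypothetical, an IPv4 underlay without VLAN tags on the inner frame is assumed, and the GENEVE option length is left as a parameter because it depends on the deployment.

```python
# Fixed header sizes in bytes (standard protocol values).
INNER_ETHERNET = 14   # Ethernet header of the encapsulated frame
GENEVE_BASE = 8       # GENEVE base header, excluding options
UDP = 8
OUTER_IPV4 = 20       # outer IPv4 header without IP options


def encapsulated_ip_size(inner_ip_bytes, geneve_options_bytes=0):
    """Size of the outer IP packet that carries an inner IP packet over GENEVE."""
    return (inner_ip_bytes + INNER_ETHERNET + GENEVE_BASE
            + geneve_options_bytes + UDP + OUTER_IPV4)


def fits_underlay(inner_ip_bytes, underlay_mtu, geneve_options_bytes=0):
    """True if the encapsulated packet fits the underlay MTU without fragmentation."""
    return encapsulated_ip_size(inner_ip_bytes, geneve_options_bytes) <= underlay_mtu


print(encapsulated_ip_size(1500))   # 1550 bytes before any GENEVE options
print(fits_underlay(1500, 1500))    # False: a full-size inner packet does not fit
print(fits_underlay(1500, 1600))    # True: headroom covers the encapsulation
```

Under these assumptions, a 1500-byte inner packet needs at least 1550 bytes on the underlay, plus any GENEVE options, so every hop between the Edges must carry a larger MTU than the Inter-SR port.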
Example command to check the number of routes on the T0:
edge(tier0_sr)> get route | count via
Number of lines that match pattern 'via': ####
edge(tier0_sr)> get interfaces
edge(tier0_sr)> exit
edge> set capture session 1 interface <UUID> direction dual
edge> set capture session 1 file <filename>
edge(tier0_sr)> ping 169.254.0.131 size 1500 dfbit enable
This scenario, in which the BGP UPDATE packet exceeds the MTU, can occur when enough prefixes are advertised to make the UPDATE packet larger than the MTU.
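As a rough guide to how many prefixes are "enough", the size of an UPDATE can be estimated from the BGP-4 message layout: a 19-byte header, two 2-byte length fields, the path attributes, and one NLRI entry per prefix (1 length byte plus the prefix bytes). The sketch below is an estimate only: the function names are hypothetical, the 30-byte path attribute size is an assumed placeholder, and it treats all prefixes as sharing one attribute set and one prefix length.

```python
import math

BGP_HEADER = 19     # marker (16) + length (2) + type (1)
UPDATE_FIXED = 4    # withdrawn routes length (2) + total path attribute length (2)


def nlri_bytes(prefix_len):
    """Bytes used by one NLRI entry: 1 length byte plus the significant prefix bytes."""
    return 1 + math.ceil(prefix_len / 8)


def update_size(prefix_count, prefix_len=24, path_attr_bytes=30):
    """Estimated UPDATE size; path_attr_bytes is an assumed value, not a measured one."""
    return (BGP_HEADER + UPDATE_FIXED + path_attr_bytes
            + prefix_count * nlri_bytes(prefix_len))


def prefixes_to_exceed(payload_limit, prefix_len=24, path_attr_bytes=30):
    """Smallest prefix count whose estimated UPDATE is larger than payload_limit bytes."""
    fixed = BGP_HEADER + UPDATE_FIXED + path_attr_bytes
    return max(0, (payload_limit - fixed) // nlri_bytes(prefix_len) + 1)


# 1460 = 1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers (no options).
print(prefixes_to_exceed(1460))            # 352 /24 prefixes under these assumptions
print(update_size(351), update_size(352))  # 1457 1461
```

Because BGP runs over TCP, the packet on the wire is a TCP segment rather than a single UPDATE, but once the pending UPDATE data exceeds what one segment can carry, full-size segments are sent and the MTU along the datapath becomes the limiting factor. The route count from get route | count via can be compared against this kind of estimate.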