/var/log/frr/frr.log on the Edge shows "Hold timer expire" for the Inter SR BGP IPs (169.254.0.130 or 169.254.0.131) every few seconds:
root@edge:~# grep 'Hold timer expire' /var/log/frr/frr.log
2021/06/23 20:35:10.854171 BGP: 169.254.0.130 [FSM] Hold timer expire
2021/06/23 20:35:15.854550 BGP: 169.254.0.130 [FSM] Hold timer expire
2021/06/23 20:35:20.855850 BGP: 169.254.0.130 [FSM] Hold timer expire
2021/06/23 20:35:25.857472 BGP: 169.254.0.130 [FSM] Hold timer expire

/var/log/syslog shows the BGP state flapping constantly:
root@edge:~# grep "state=BGP" /var/log/syslog
2021-06-15T22:30:40.999931+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_UP
2021-06-15T22:30:44.993615+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_DOWN
2021-06-15T22:30:46.009058+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_UP
2021-06-15T22:30:49.994782+00:00 NSX 5182 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP 169.254.0.130, peer_uuid: <UUID> in SR: <UUID>, state=BGP_DOWN

VMware NSX
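When the logs are long, the two symptoms above can be summarized with a small script instead of reading the grep output by eye. The following is a minimal sketch, not a VMware tool: the function names are hypothetical, and the regular expressions assume the exact line formats shown in the excerpts above.

```python
import re
from datetime import datetime

# Patterns assume the frr.log and syslog formats shown in the excerpts above.
HOLD_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) BGP: (\S+) \[FSM\] Hold timer expire"
)
STATE_RE = re.compile(r"Alarm for BGP ([\d.]+),.*state=(BGP_UP|BGP_DOWN)")


def hold_timer_intervals(lines):
    """Return {peer: [seconds between consecutive 'Hold timer expire' entries]}."""
    last_seen = {}
    intervals = {}
    for line in lines:
        match = HOLD_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f")
        peer = match.group(2)
        if peer in last_seen:
            delta = (stamp - last_seen[peer]).total_seconds()
            intervals.setdefault(peer, []).append(delta)
        last_seen[peer] = stamp
    return intervals


def count_transitions(lines):
    """Return {peer: number of BGP_UP <-> BGP_DOWN state changes}."""
    last_state = {}
    transitions = {}
    for line in lines:
        match = STATE_RE.search(line)
        if not match:
            continue
        peer, state = match.groups()
        if peer in last_state and last_state[peer] != state:
            transitions[peer] = transitions.get(peer, 0) + 1
        last_state[peer] = state
    return transitions
```

Both functions accept any iterable of lines, so an open file object for /var/log/frr/frr.log or /var/log/syslog can be passed directly. A peer with regularly spaced hold timer expiries and a steadily growing transition count matches the flapping pattern described here.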
If the size of a BGP Update packet sent between Inter SR interfaces exceeds the MTU anywhere along the datapath, the packet is dropped and the Inter SR BGP peering flaps while trying to become established.
See the Additional Information section for details on how to identify this issue.
In general, the Inter-SR port MTU, the Global logical MTU (or Edge VTEP MTU), the ESX PNIC MTU, and the TOR MTU must have the following relationship:
In the case of Federation, ICMP errors ("Fragmentation needed") should be enabled on the TOR so that the Edge can perform PMTU discovery and fragment packets as needed.
These resolutions are recommended to account for the GENEVE overhead.
Example command to check the number of routes on the T0:
edge(tier0_sr)> get route | count via
Number of lines that match pattern 'via': ####
edge(tier0_sr)> get interfaces
edge(tier0_sr)> exit
edge> set capture session 1 interface <UUID> direction dual
edge> set capture session 1 file <filename>
edge(tier0_sr)> ping 169.254.0.131 size 1500 dfbit enable
A BGP UPDATE packet can exceed the MTU when enough prefixes are advertised that the resulting Update packet is larger than the MTU.
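To get a rough feel for how many prefixes it takes, the size of an UPDATE can be estimated from the message layout in RFC 4271. The sketch below is an approximation under stated assumptions: IPv4 NLRI only, prefixes that share one set of path attributes, no TCP options, and one UPDATE per packet (in practice TCP is a byte stream and segments are sized by the MSS). The function names and the 30-byte path attribute figure in the usage note are hypothetical.

```python
# Fixed sizes from RFC 4271 and the IPv4/TCP base headers.
BGP_HEADER = 19      # marker (16) + length (2) + type (1)
UPDATE_FIXED = 4     # withdrawn routes length (2) + total path attribute length (2)
IPV4_HEADER = 20     # without IP options
TCP_HEADER = 20      # without TCP options (assumption)


def nlri_bytes(prefix_len):
    """Bytes used by one IPv4 NLRI entry: 1 length byte + the prefix octets."""
    return 1 + (prefix_len + 7) // 8


def update_bytes(prefix_lens, path_attr_bytes):
    """Approximate size of one BGP UPDATE message carrying the given prefixes."""
    nlri_total = sum(nlri_bytes(p) for p in prefix_lens)
    return BGP_HEADER + UPDATE_FIXED + path_attr_bytes + nlri_total


def prefixes_until_mtu(mtu, prefix_len, path_attr_bytes):
    """Largest number of equal-length prefixes whose UPDATE fits in one IPv4 packet."""
    budget = (mtu - IPV4_HEADER - TCP_HEADER
              - BGP_HEADER - UPDATE_FIXED - path_attr_bytes)
    return max(budget // nlri_bytes(prefix_len), 0)
```

For example, `prefixes_until_mtu(1500, 24, 30)` estimates how many /24 prefixes fit before a single UPDATE no longer fits in a 1500-byte packet, assuming 30 bytes of path attributes. Compare the result with the route count returned by the `get route | count via` command to judge whether the T0 is likely to generate full-size packets.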