IPv6 BGP sessions are flapping on the Edge Transport Node

Article ID: 386597


Products

VMware NSX

Issue/Introduction

  • IPv6 BGP sessions are configured on Tier-0 interfaces.
  • The BGP session continuously flaps with reason 'hold timer expiry' (a way to confirm this from the Edge CLI is shown below).
  • Once the BGP session is established, BGP UPDATE message retransmissions may be observed (for example, on the Edge uplink interfaces).
  • The session resets because a BGP keepalive message is not received from the peer within the configured hold timer.
  • The alarm "MTU mismatch within same transport zone" may trigger on NSX Manager.

Environment

  • VMware NSX
  • VMware NSX-T Data Center

Cause

  • MTU mismatch: the MTU on the Edge's logical (external) interface is higher than the MTU on the Edge's physical ports (fp-eth interfaces).
  • BGP on the Edge uses the MTU of the fp-eth interface, while the peer uses the MTU configured on its interface in the path. As a result, the peer sends packets larger than the Edge can receive; such packets are dropped and continuously retransmitted by the peer. A quick way to compare the two MTU values is shown in the sketch below.
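
The two MTU values can be compared directly on the Edge node. The following is a minimal sketch (exact output columns vary by NSX version; the VRF number 1 is assumed, matching the examples later in this article):

> get interfaces                     (at the Edge node level: note the MTU of the fp-ethX ports)
> get logical-routers                (find the VRF of the impacted Tier-0 SR)
> vrf 1
(tier0_sr[1])> get interfaces        (inside the VRF: note the MTU of the Tier-0 uplink/external interface)

If the logical interface MTU is higher than the fp-eth MTU, packets sized to the logical MTU cannot traverse the physical port.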

Resolution

  • Verify and fix the MTU along the entire path between the BGP neighbors.
  • For MTU settings in NSX 4.2, refer to the following documentation: Guidance to Set Maximum Transmission Unit.
  • Verify that the MTU on the Edge's source logical interface and on the BGP peer's interface are the same (a REST API cross-check is sketched below).
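
As a cross-check from the NSX Manager side, the global MTU defaults can be read over the REST API. This is a minimal sketch, assuming admin credentials and a placeholder Manager hostname; per-interface and per-profile settings, where configured, override these globals, and field names should be verified against the API guide for your version:

curl -k -u admin 'https://<nsx-manager>/api/v1/global-configs/SwitchingGlobalConfig'
curl -k -u admin 'https://<nsx-manager>/api/v1/global-configs/RoutingGlobalConfig'

In the responses, "physical_uplink_mtu" applies to the physical uplinks (fp-eth), while "logical_uplink_mtu" applies to Tier-0 uplink interfaces.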

Additional Information

The following procedure can be used to confirm an MTU inconsistency between BGP peers:

  1. Find the source IP address and the IP address of the BGP neighbor:
    a. In the NSX UI, find the impacted Tier-0.
    b. Click the three vertical dots.
    c. Select "Generate BGP Summary".
    d. Identify the impacted peer(s).
  2. SSH to the Edge node as admin.
  3. Find the SR instance VRF for the impacted Tier-0 gateway:
    > get logical-routers
  4. Enter this instance:
    > vrf <VRF/UUID/Name of the impacted Tier-0 SR>
  5. Ping the neighbour using the IP addresses found in step 1.d:
    a. For IPv4:
      > ping <neighbour_IPv4> source <local_IPv4> dfbit enable size <MTU minus 28>
      For example, if the MTU on the Tier-0 external interface is set to 9000, use the value 8972 for the "size" parameter (9000 - 28 = 8972, where the 28 bytes represent the IPv4 header (20) and the ICMP header (8)).
    b. For IPv6:
      > ping6 <neighbour_IPv6> source <local_IPv6> size <MTU minus 48>
      For example, with the same 9000 MTU, use the value 8952 (9000 - 48 = 8952, where the 48 bytes represent the IPv6 header (40) and the ICMPv6 header (8)).

The output below shows a local failure: the MTU on the external interface is not large enough to accommodate the ping (8873 bytes of ICMP data plus 28 bytes of headers is 8901 bytes, which exceeds the interface's 8900-byte MTU):
edge02(tier0_sr[1])> ping 192.168.132.254 source 192.168.132.2 dfbit enable size 8873
PING 192.168.132.254 (192.168.132.254) from 192.168.132.2: 8873 data bytes
36 bytes from 192.168.132.254: frag needed and DF set (MTU 8900)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 22c5 0000   0 0000  40  01 8de6 192.168.132.2  192.168.132.254

36 bytes from 192.168.132.254: frag needed and DF set (MTU 8900)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 22c5 0000   0 0000  40  01 8de6 192.168.132.2  192.168.132.254

The following output shows a failure where the MTU configured in the uplink profile or on the BGP neighbor is lower than the MTU on the external interface. The 8872-byte ping (8900 bytes on the wire) fits the local interface and is sent, but it is silently dropped along the path:
edge02(tier0_sr[1])> ping 192.168.132.254 source 192.168.132.2 dfbit enable size 8872
PING 192.168.132.254 (192.168.132.254) from 192.168.132.2: 8872 data bytes
^C
--- 192.168.132.254 ping statistics ---
6 packets transmitted, 0 packets received, 100.0% packet loss
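
Once the MTU is consistent along the path, the large ping should succeed and the BGP session should stop flapping. A quick way to confirm is from the same VRF context (a minimal sketch; column layout varies by NSX version):

edge02(tier0_sr[1])> get bgp neighbor summary

The impacted neighbor should show an Established state, with an uptime that keeps increasing instead of resetting at each hold timer expiry.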