Virtual machines running in Overlay Segments are losing network connectivity when Edge TEP group is enabled
Article ID: 384700
Products
VMware NSX
Issue/Introduction
TEP Groups use multiple TEPs on an Edge Node more effectively by performing flow-based load balancing of traffic across the TEPs.
This feature provides higher bidirectional North-South throughput when a dedicated Edge Node is used for a Tier-0 gateway.
The issue occurs when the TEP Group feature is enabled and the first VM is connected to a Segment/Logical Switch (LS).
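To illustrate the flow-based load balancing described above, here is a minimal sketch (not NSX source code, and all names are hypothetical): each flow is hashed to one TEP, so every packet of a given flow leaves through the same TEP while different flows spread across the group.

```python
# Illustrative sketch of flow-based TEP load balancing: hash the flow
# tuple to pick a TEP, so a flow is pinned to one TEP while distinct
# flows are distributed across the TEP Group.
import zlib

def select_tep(flow_tuple, teps):
    """Return the TEP chosen for this flow; the mapping is stable per flow."""
    key = "|".join(str(f) for f in flow_tuple).encode()
    return teps[zlib.crc32(key) % len(teps)]

teps = ["tep0", "tep1"]
flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 443)  # src, dst, proto, sport, dport
assert select_tep(flow, teps) == select_tep(flow, teps)  # same flow, same TEP
```

The stable per-flow mapping avoids packet reordering within a flow while still using all TEPs in the group.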
Environment
VMware NSX 4.2
Cause
In an environment with multiple Logical Segments (LS) under the same Tier-1 Logical Router (LR), when the first VM is powered on, all of the LS and TEP Group configuration is pushed down to the host in a single transaction.
The Data Plane (DP) takes time to realize all of the networks before it reports the VNI joins to the Local Control Plane (LCP). At the same time, the BFD app in the LCP sees that the TEP Groups need to be added to the LS span and tries to push the BFD configuration to the DP.
Because the DP has not yet created the logical networks and the backplane VNI has not been reported to the LCP, BFD miscategorizes the LS UUID of the backplane LS as a routing domain and sends it to the DP, so the BFD up count is never updated for that backplane VNI.
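The race above can be pictured as a simple lookup. The sketch below is hypothetical (names and structure are assumptions, not NSX internals): the LCP classifies an LS UUID against the set of VNIs the DP has already reported, so if the backplane VNI join arrives late, the UUID falls through to the routing-domain case.

```python
# Hypothetical sketch of the miscategorization race: the LCP classifies
# an LS UUID by checking whether the DP has reported its VNI join yet.
# If the report has not arrived (the race), the backplane LS is
# mistakenly treated as a routing domain.
def classify_ls(ls_uuid, reported_vnis):
    """Classify an LS UUID based on VNIs the DP has reported so far."""
    return "backplane_ls" if ls_uuid in reported_vnis else "routing_domain"

reported = set()  # DP has not yet reported the backplane VNI join
assert classify_ls("bp-ls-uuid", reported) == "routing_domain"  # miscategorized
reported.add("bp-ls-uuid")  # report arrives later, but too late
assert classify_ls("bp-ls-uuid", reported) == "backplane_ls"
```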
In a deployment where the Edge TEPs are in different underlay subnets than the ESXi TEPs, BUM traffic on the backplane VNI (such as ARP resolution of the next hop by the Distributed Router (VDR)) would be sent to the MTEPs in each of the Edge underlay subnets.
MTEP election favors TEPs with a BFD up count greater than 0.
Since all TEPs have a BFD up count of zero, the overlay module does not perform MTEP replication of the ARP packets to any Edge, and North-South communication is disrupted.
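The failure mode can be summarized with a small sketch of the election rule stated above (again a hypothetical illustration, not NSX source): only TEPs with a BFD up count greater than zero are eligible, so when every candidate reports zero, no MTEP is elected and BUM replication to that subnet is skipped.

```python
# Illustrative sketch of MTEP election: only TEPs whose BFD up count is
# greater than zero are eligible. If every candidate reports zero (the
# bug condition), no MTEP is elected and BUM traffic is not replicated.
def elect_mtep(candidates):
    """candidates: list of (tep_name, bfd_up_count). Return elected TEP or None."""
    eligible = [(name, count) for name, count in candidates if count > 0]
    if not eligible:
        return None  # no replication target -> BUM traffic is dropped
    # Prefer the highest BFD up count (tie-break by name for determinism).
    return max(eligible, key=lambda t: (t[1], t[0]))[0]

assert elect_mtep([("edge-tep1", 0), ("edge-tep2", 0)]) is None  # bug condition
assert elect_mtep([("edge-tep1", 2), ("edge-tep2", 0)]) == "edge-tep1"
```

With all counts stuck at zero, `elect_mtep` returns `None`, which corresponds to the overlay module skipping MTEP replication and the resulting loss of North-South connectivity.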