Overlay VMs need to reach the external VLAN gateway and VLAN-backed VMs through edge bridging for cross-subnet communication.
When the active bridge edge enters Maintenance Mode (MM), traffic is redirected without problems to its bridge HA peer, which becomes the new active bridge. When the edge exits MM and becomes active again, traffic loss is observed.
The traffic loss lasts from seconds to minutes after the active bridge fails back.
VMware NSX
The root cause is that when a large number of VLAN MAC addresses must be synced from the active edge to the standby edge, the mac-sync full-sync message processing logic hits the limit of the edge software learning queue (queue size 512 per edge), which results in MAC address loss.
Workaround:
1. If the total VLAN workload exceeds 500 VLAN MAC addresses, use multiple edge bridge clusters to carry these workloads.
2. After the edge exits MM, issue the manual mac-sync re-sync command to make sure the mac-sync table is synced between the bridge HA pair, then display the table to check the result:
edge-appctl -t /var/run/vmware/edge/dpd.ctl mac-sync/request-sync <bridge port uuid>
edge-appctl -t /var/run/vmware/edge/dpd.ctl mac-sync/show-table <bridge port uuid> | json_pp
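The two workaround steps can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names `clusters_needed` and `resync_bridge_port` are hypothetical and not part of NSX, the 500-address threshold is the one given in step 1, and the `edge-appctl` invocations are copied unchanged from the commands above. It assumes a POSIX shell is available on the edge node.

```shell
#!/bin/sh
# Threshold from workaround step 1: at most 500 VLAN MAC addresses per bridge cluster.
MAC_LIMIT=500

# clusters_needed <total VLAN MAC count>
# Hypothetical helper: rounds total / MAC_LIMIT up to the next whole number.
clusters_needed() {
    echo $(( ($1 + MAC_LIMIT - 1) / MAC_LIMIT ))
}

# resync_bridge_port <bridge port uuid>
# Hypothetical wrapper around the two commands from workaround step 2:
# request a re-sync, then print the resulting mac-sync table.
resync_bridge_port() {
    ctl=/var/run/vmware/edge/dpd.ctl
    edge-appctl -t "$ctl" mac-sync/request-sync "$1" || return 1
    edge-appctl -t "$ctl" mac-sync/show-table "$1" | json_pp
}
```

For example, `clusters_needed 1200` prints `3`, and `resync_bridge_port <bridge port uuid>` would be run on the edge once it has exited MM.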
Resolution:
The resolution is to move full-sync processing from the fast path threads to the slow path thread so that the learning queue only handles messages with batched MAC addresses.
With the fix, each edge can process up to 7000 MAC addresses during a full sync.
Fixed Version - 4.2.1