Cross-subnet traffic drops after an edge bridge fails back on exiting maintenance mode

Article ID: 375758

Updated On:

Products

VMware NSX

Issue/Introduction

Overlay VMs need to reach an external VLAN gateway and VLAN-backed VMs through edge bridging for cross-subnet (different subnet) communication.
When the active bridge edge enters Maintenance Mode (MM), traffic is redirected without problems to the HA peer, which becomes the new active bridge. When the edge exits MM and becomes active again, traffic loss is observed.

The traffic loss lasts from seconds to minutes after the active bridge fails back.

Environment

VMware NSX

Cause

The root cause is that when a large number of VLAN MAC addresses must be synced from the active edge to the standby edge, the mac-sync full-sync message processing logic hits the limit of the edge software learning queue (queue size 512 per edge), which results in MAC loss.
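
The effect of the queue limit can be illustrated with a deliberately simplified model. The sketch below assumes that each MAC address takes one learning queue entry during a full sync and that nothing drains from the queue while the burst arrives; both are assumptions made for illustration, not documented edge behavior. The function name is hypothetical.

```shell
#!/bin/sh
# Simplified worst-case model of the learning queue during a full sync.
# ASSUMPTIONS (not from the article): one queue entry per MAC address,
# and no entries are drained while the full-sync burst is arriving.
QUEUE_SIZE=512   # learning queue size per edge, as stated in the Cause section

# Print how many MAC addresses would not fit in the queue for a burst of $1 MACs.
macs_lost_in_burst() {
    if [ "$1" -gt "$QUEUE_SIZE" ]; then
        echo $(( $1 - QUEUE_SIZE ))
    else
        echo 0
    fi
}

macs_lost_in_burst 2000
```

Under these assumptions, any full sync larger than the queue size loses the excess entries, which is consistent with the MAC loss described above. Actual loss on a live edge depends on how quickly the queue is drained.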


Resolution

Workaround:

1. If the total VLAN workload exceeds 500 VLAN MAC addresses, use multiple edge bridge clusters to carry these workloads.

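2. After the edge exits MM, issue the manual mac-sync re-sync command to make sure the mac-sync table is synced between the bridge HA pair. The first command below requests the re-sync; the second displays the mac-sync table for verification.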

edge-appctl -t /var/run/vmware/edge/dpd.ctl mac-sync/request-sync <bridge port uuid>
edge-appctl -t /var/run/vmware/edge/dpd.ctl mac-sync/show-table <bridge port uuid> | json_pp
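
The two workaround steps can be sketched as small shell helpers. This is a minimal sketch, assuming a POSIX shell is available where the commands are run. The function names `bridge_clusters_needed` and `resync_bridge_port` and the variables `EDGE_APPCTL` and `DPD_CTL` are hypothetical and exist only for this example. The `edge-appctl` invocations are the ones listed above, and the 500-MAC threshold comes from workaround 1.

```shell
#!/bin/sh
# Hypothetical helpers around the workaround steps in this article.
# EDGE_APPCTL and DPD_CTL default to the command and control socket shown above.
EDGE_APPCTL="${EDGE_APPCTL:-edge-appctl}"
DPD_CTL="${DPD_CTL:-/var/run/vmware/edge/dpd.ctl}"

# Workaround 1: print how many edge bridge clusters are needed so that each
# carries at most 500 VLAN MAC addresses (second argument overrides the limit).
bridge_clusters_needed() {
    per_cluster="${2:-500}"
    echo $(( ($1 + per_cluster - 1) / per_cluster ))
}

# Workaround 2: request a mac-sync re-sync for one bridge port, then show
# the resulting mac-sync table. Takes the bridge port UUID as its argument.
resync_bridge_port() {
    if [ -z "$1" ]; then
        echo "usage: resync_bridge_port <bridge port uuid>" >&2
        return 2
    fi
    "$EDGE_APPCTL" -t "$DPD_CTL" mac-sync/request-sync "$1" || return 1
    "$EDGE_APPCTL" -t "$DPD_CTL" mac-sync/show-table "$1"
}
```

For example, `bridge_clusters_needed 1200` prints `3`. After the edge exits MM, `resync_bridge_port <bridge port uuid> | json_pp` runs both commands for that port and pretty-prints the table, as in the commands above.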

Resolution:

The fix moves full-sync processing from the fast path threads to the slow path thread, so that the learning queue only handles messages carrying batched MAC addresses.
With the fix, each edge can process up to 7000 MAC addresses during a full sync.

Fixed Version - 4.2.1