Customers use “Net/TeamPolicyUpDelay” to mitigate traffic blackholing when upgrading their physical switch infrastructure.
When all the uplinks on the DVS flap, the fallback SHOTGUN mode kicks in for the period set by “Net/TeamPolicyUpDelay”. During this period, duplicate traffic is seen, causing a traffic outage.
While the fallback mode is active, packets from a given MAC are sent on both uplinks. The underlay physical switches may report a MAC flap between the uplinks of the same ESXi host and see repeated MAC moves. The MAC is ultimately added to the frozen list, causing a traffic outage.
Packet captures taken on the physical switches confirm that the duplicate packets were sent out by the ESXi host.
Environment
VMware NSX
VMware vSphere ESXi
Cause
If the teaming policy on the DVS includes the 'SHOTGUN' flag, the expected behavior when no uplinks are available/up is to send the packets on the vSwitch via all available uplinks. This causes the duplicate packets. To check the teaming policy flags, execute:
nsxdp-cli vswitch teaming policy get --dvs-alias <switch_name>
flags: BEST_EFFORT SHOTGUN NOTIFY_SWITCH
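As a minimal sketch, the flags line shown above can be checked programmatically once the command output has been saved as text. The helper names (parse_flags, has_shotgun) are hypothetical, and the only output format assumed is the single "flags:" line quoted in this article; the real command prints other fields that this sketch ignores.

```python
def parse_flags(output: str) -> list[str]:
    """Return the teaming flags found on the first 'flags:' line, or []."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Assumption: flags appear as 'flags: A B C' on a single line.
        if sep and key.strip().lower() == "flags":
            return value.split()
    return []


def has_shotgun(output: str) -> bool:
    """True if the SHOTGUN flag is present in the saved command output."""
    return "SHOTGUN" in parse_flags(output)


if __name__ == "__main__":
    sample = "flags: BEST_EFFORT SHOTGUN NOTIFY_SWITCH"
    print(parse_flags(sample))
    print(has_shotgun(sample))
```

If has_shotgun returns True for a switch that also has a non-zero TeamPolicyUpDelay, that switch is a candidate for the behavior described in this article.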
Along with the 'SHOTGUN' teaming policy, if 'TeamPolicyUpDelay' is set, the uplinks are placed in 'fallback' mode for the duration of the delay timer, counted from the time they actually came up. This avoids failing traffic over to an uplink that is not yet stable. However, when both links flap, duplicate packets are sent out via all the uplinks during this period, which may not be desirable from the perspective of the underlay physical switches. To check the configured delay, execute:
nsxdp-cli vswitch runtime get
TeamPolicyUpDelay: 1800000
A value of 1800000 (milliseconds) corresponds to 30 minutes: the uplink will not be used for 30 minutes after its link status comes up.
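The interaction between the delay timer and the SHOTGUN fallback can be illustrated with a simplified model. This is only a sketch of the behavior described above, not the ESXi teaming implementation: the Uplink class, the select_uplinks function, the uplink names, and the choice of the first stable uplink are all assumptions made for illustration.

```python
from dataclasses import dataclass

MS_PER_MINUTE = 60_000


def up_delay_minutes(delay_ms: int) -> float:
    """Convert a TeamPolicyUpDelay value in milliseconds to minutes."""
    return delay_ms / MS_PER_MINUTE


@dataclass
class Uplink:
    name: str          # hypothetical uplink name
    link_up: bool      # current physical link state
    up_since_ms: int   # time at which the link last came up


def select_uplinks(uplinks, now_ms, up_delay_ms, shotgun=True):
    """Return the names of the uplinks a packet would be sent on (toy model)."""
    # An uplink counts as usable only after it has been up for the full delay.
    stable = [u.name for u in uplinks
              if u.link_up and now_ms - u.up_since_ms >= up_delay_ms]
    if stable:
        return stable[:1]  # simplification: use the first stable uplink
    if shotgun:
        # Fallback: no usable uplink, so send on every uplink that has link.
        return [u.name for u in uplinks if u.link_up]
    return []


if __name__ == "__main__":
    delay = 1_800_000
    links = [Uplink("uplink1", True, 0), Uplink("uplink2", True, 0)]
    print(up_delay_minutes(delay))
    print(select_uplinks(links, 60_000, delay))     # one minute after the flap
    print(select_uplinks(links, 1_800_000, delay))  # after the delay expires
```

In this model, every packet sent within the 30 minutes after both links come back up leaves on both uplinks, which is the duplicate traffic that the physical switches observe as MAC moves.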
Resolution
Workaround: None
Fixed version: This issue is fixed in NSX 4.2.1.3, 4.2.2, and future releases of NSX.
The fixed versions introduce an advanced config parameter, “/Net/TeamingIgnoreShotgun”, that makes teaming ignore the shotgun capability. This prevents the fallback mode in such cases and avoids duplicate traffic.