Connectivity issue between two VMs running under heavy load and using uplinks that are part of a LAG.

Article ID: 380622

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • The VMs are in two different VLAN port groups and run on two different hosts.
  • The ESXi uplinks are part of a LAG.
  • The ESXi vmnic adapters are Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controllers running a firmware version higher than 219.x.
  • The VMs are under heavy load.
  • In this scenario, when one of the uplinks fails, connectivity between the two VMs is interrupted for about 40-60 seconds, after which traffic resumes automatically.
  • During the interruption, either ARP resolution does not complete or, if it does complete, the ICMP echo replies do not reach the source VM. All traffic is affected, not only ICMP.
  • The ARP reply or ICMP echo reply is seen on the source VM's uplink. However, it does not reach the source VM's switch port because the packet is tagged with the wrong VLAN.
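
The wrong-VLAN symptom can be confirmed by inspecting the 802.1Q tag of the reply frames captured on the source VM's uplink. The following is a minimal Python sketch of that check, not a vendor tool. It assumes the raw Ethernet frame bytes have already been extracted from a packet capture (the capture step itself is not shown). The function names, the MAC addresses, and the VLAN IDs 10 and 20 are hypothetical values chosen for illustration.

```python
import struct
from typing import Optional

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def vlan_id(frame: bytes) -> Optional[int]:
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 16:
        return None  # too short to carry a VLAN tag
    # Bytes 12-13 hold the TPID, bytes 14-15 the Tag Control Information.
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None
    return tci & 0x0FFF  # the VLAN ID is the low 12 bits of the TCI


def has_wrong_vlan(frame: bytes, expected_vlan: int) -> bool:
    """True if the frame is untagged or tagged with a different VLAN."""
    return vlan_id(frame) != expected_vlan


def build_tagged_frame(vlan: int) -> bytes:
    """Build a dummy tagged ARP frame for demonstration (hypothetical MACs)."""
    dst = bytes.fromhex("ffffffffffff")
    src = bytes.fromhex("005056000001")
    tag_and_type = struct.pack("!HHH", TPID_8021Q, vlan & 0x0FFF, 0x0806)
    return dst + src + tag_and_type + bytes(28)  # 28-byte zeroed ARP payload


if __name__ == "__main__":
    # Assumption: the source VM's port group uses VLAN 10,
    # but the reply arrives on the uplink tagged with VLAN 20.
    reply = build_tagged_frame(20)
    print("VLAN on reply:", vlan_id(reply))
    print("Wrong VLAN for port group 10:", has_wrong_vlan(reply, 10))
```

If the reply frames carry a VLAN ID other than the one configured on the source VM's port group, the virtual switch will not deliver them to the VM, which matches the behavior described above.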

Environment

VMware vSphere ESXi

Cause

The issue is caused by VEB (Virtual Ethernet Bridging) mode. According to the NIC documentation (https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Configuration-adapter/tunneling-configuration-examples.html#tunneling-configuration-examples_title__bookmark256), this mode creates an internal bridge within the NIC for VM-to-VM communication. Ethernet frames traverse this internal bridge, so they may never go down the wire to the physical switch if the destination is on the same physical NIC, even when the port/vmnic is different. Because the frames bypass the physical switch, they keep the same (wrong) VLAN tag and are not received by the destination. The RX filters programmed on the vmnic may influence this behavior, so high load, which causes RX filters to be applied on a non-default queue, may have an impact.

Resolution

VEB is generally used in SR-IOV environments. If SR-IOV is not in use, VEB can be safely disabled. To disable VEB, set "Default EVB Mode" to "None" in the NIC settings in the server BIOS.