Troubleshooting ESXi host or Virtual Machines losing network connectivity during network failback

Article ID: 405043

Products

VMware vCenter Server

VMware vSphere ESXi

Issue/Introduction

  • Network reachability problems can occur for Virtual Machines (VMs) or VMkernel adapters on an ESXi host after a failed physical uplink recovers and traffic fails back to it.
  • While initial traffic failover to a remaining active uplink typically proceeds without disruption, connectivity issues may manifest immediately following the failback event.
  • Failback is set to Yes in the teaming and failover policy of the affected port group.
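
The current Failback setting can be confirmed from an SSH session to the host; a minimal check, assuming a standard vSwitch named vSwitch0 and a port group named "VM Network" (substitute the names used in your environment):

    # Show the vSwitch teaming and failover policy, including the Failback flag
    esxcli network vswitch standard policy failover get -v vSwitch0

    # Show the policy of a specific port group (which may override the vSwitch setting)
    esxcli network vswitch standard portgroup policy failover get -p "VM Network"

For Distributed Switch port groups, the equivalent setting is found in the port group's Teaming and failover policy in the vSphere Client.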

Environment

VMware ESXi

VMware vCenter Server

Cause

This issue arises from a potential mismatch between the network path expected by the ESXi host and the path used by the upstream network device (e.g., physical switch or router) during a failover and subsequent failback event.

Consider the following sequence:

  1. A VM (e.g., "Test-VM-01") is initially sending and receiving traffic via vmnic0.
  2. vmnic0 experiences a failure. Network teaming rules trigger a failover, and "Test-VM-01" begins sending and receiving its traffic via vmnic1. During this phase, connectivity remains stable.
  3. vmnic0 recovers and comes back online. The ESXi host's teaming policy initiates a failback, and "Test-VM-01" is directed to resume sending its traffic via vmnic0.
  4. At this point, the ESXi host expects all traffic for "Test-VM-01" to arrive on vmnic0. However, the upstream network device (or the sender, such as a jumpbox initiating a ping) may still be sending incoming packets for "Test-VM-01" towards the ESXi host via vmnic1.
  5. Since the ESXi host is no longer expecting traffic for "Test-VM-01" on vmnic1 (it expects it on vmnic0), the packets arriving on vmnic1 for "Test-VM-01" are filtered by ESXi and do not reach the VM. This results in network reachability issues for "Test-VM-01".
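
Which uplink the host has pinned a VM's port to can also be checked directly; as a sketch (the world ID below is a placeholder taken from the first command's output):

    # List the networking world IDs of running VMs
    esxcli network vm list

    # Show the VM's port details; the "Team Uplink" field is the uplink on
    # which ESXi expects the VM's traffic
    esxcli network vm port list -w <world_id>

If the upstream device keeps delivering the VM's inbound frames on a different uplink than the one reported here after failback, the mismatch described in step 5 occurs.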

Resolution

This issue can be verified by running simultaneous packet captures on vmnic0 and vmnic1. The following steps can be performed (a few optional supporting commands are sketched after the steps):

  1. Open an SSH session to the ESXi host.

  2. Run the following command to start packet captures on both uplinks at once (--dir 2 captures traffic in both directions; the & runs the first capture in the background so the two run simultaneously):

    1. pktcap-uw --uplink vmnic0 --dir 2 --ng -o /vmfs/volumes/<datastore_name>/vmnic0.pcapng & pktcap-uw --uplink vmnic1 --dir 2 --ng -o /vmfs/volumes/<datastore_name>/vmnic1.pcapng

  3. Initiate a continuous ping to the Virtual Machine from an external machine (for example, a jumpbox).

  4. Verify which uplink the Virtual Machine is using by running the following command in a second SSH session to the host:

    1. netdbg vswitch instance list

    2. For this example, assume the VM is passing its traffic via vmnic0.

  5. Bring down vmnic0 using the following command:

    1. esxcli network nic down -n vmnic0

  6. Verify that the Virtual Machine is now using vmnic1 to pass the traffic using the following command:

    1. netdbg vswitch instance list

  7. Bring up vmnic0 using the following command:

    1. esxcli network nic up -n vmnic0

  8. Ping drops should start appearing at this stage.

  9. Verify that the Virtual Machine is now using vmnic0 to pass the traffic using the following command:

    1. netdbg vswitch instance list

  10. Wait for the pings to become successful again.

  11. Run the following command to end the packet captures:

    1. kill $(lsof |grep pktcap-uw |awk '{print $1}'| sort -u)

  12. Copy the capture files from /vmfs/volumes/<datastore_name> to your system and open them using Wireshark.
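
A few optional supporting commands for the steps above, offered as a sketch (adapter, datastore, and file names are examples):

    # Steps 5 and 7: confirm the uplink's admin and link status after bringing it down or up
    esxcli network nic list

    # Step 11: confirm that no pktcap-uw sessions remain after the kill command
    lsof | grep pktcap-uw

    # Step 12: optionally merge both captures into a single file for side-by-side
    # review (mergecap ships with the Wireshark desktop installation)
    mergecap -w merged.pcapng vmnic0.pcapng vmnic1.pcapng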

While reviewing the capture files, you can observe a RARP packet sent out on behalf of the Virtual Machine (the vSwitch's Notify Switches mechanism) when vmnic0 was brought back up, announcing that the Virtual Machine is now reachable over vmnic0. However, ICMP Echo Requests were still being received over vmnic1.
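
In Wireshark, display filters can help isolate the relevant packets; for example (the VM's IP address below is a placeholder):

    # Show RARP frames (EtherType 0x8035), i.e., the failback announcement
    eth.type == 0x8035

    # Show where the ICMP Echo Requests for the VM actually arrive
    icmp && ip.addr == <VM_IP>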

To troubleshoot this further, please engage your Networking team or Switch vendor.