Packet loss and/or disconnection in some virtual machines is suspected - determining scope and troubleshooting approach

Article ID: 423469

Products

VMware vSphere ESXi, VMware NSX

Issue/Introduction

In troubleshooting, "scope" refers to the boundaries and extent of the problem or environment being investigated. Defining it is a critical first step: it avoids wasting time on irrelevant areas and keeps the investigation focused on the issue at hand. Defining the scope involves asking specific questions to determine exactly what is working, what is not, and under what circumstances the problem occurs.

When packet loss or disconnection is suspected, the root cause is not always related to VMware networking.

Here are some examples of possible root causes not related to VMware networking:

  1. The VM is powering off and on, or being rebooted.

  2. The VM scheduling on the ESXi hypervisor is intermittent or failing.

  3. The networking stack inside the guest OS is not processing packets correctly.

Resolution

  1. If the VM is powering off and on, or being rebooted, investigate why this is happening.

  2. If VM scheduling on the ESXi hypervisor is intermittent or failing, this may be caused by VMFS heartbeat issues affecting the VM's file(s). These can be identified by reviewing the ESXi logs. Refer to KB 318897 Understanding lost access to volume messages in ESXi.

  3. The packet flow in and out of a VM can be determined using techniques in KB 341568 Packet capture on ESXi using the pktcap-uw tool.

    • Packets sent by the guest can be observed at capture point VnicTx. If those same packets are also seen at capture point UplinkSndKernel, the ESXi networking stack can be ruled out for outbound traffic.

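    • The same applies in the receive direction: if packets received by the ESXi host at capture point UplinkRcvKernel are also seen at capture point VnicRx, the ESXi networking stack can be ruled out for inbound traffic.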
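
The capture-point comparison in step 3 can be sketched in Python as follows. This is only an illustration, not part of the KB procedure. It assumes each capture has already been reduced to a list of per-packet identifiers (for example, ICMP sequence numbers from ping traffic). The function name missing_downstream and the sample values are hypothetical.

```python
from collections import Counter


def missing_downstream(upstream_ids, downstream_ids):
    """Return identifiers seen at the upstream capture point but not downstream.

    upstream_ids:   packet identifiers from the first capture point on the path
                    (VnicTx for outbound traffic, UplinkRcvKernel for inbound).
    downstream_ids: identifiers from the last capture point on the path
                    (UplinkSndKernel for outbound, VnicRx for inbound).
    """
    remaining = Counter(downstream_ids)
    missing = []
    for packet_id in upstream_ids:
        if remaining[packet_id] > 0:
            remaining[packet_id] -= 1   # matched one downstream copy
        else:
            missing.append(packet_id)   # never reached the downstream point
    return missing


# Hypothetical ICMP sequence numbers extracted from two captures.
vnic_tx = [1, 2, 3, 4, 5]
uplink_snd_kernel = [1, 2, 4, 5]

lost = missing_downstream(vnic_tx, uplink_snd_kernel)
if lost:
    print("Not seen at UplinkSndKernel:", lost)
else:
    print("All packets were seen at both capture points")
```

An empty result in both directions is consistent with ruling out the ESXi networking stack. A non-empty result identifies the specific packets to investigate further.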

If none of the above is suspected, the next step is to document the following 3 aspects of the symptoms:

  1. Frequency -- how often, and on what schedule, are the symptoms observed?

  2. Duration -- for how long are the symptoms observed?

  3. Spread -- how widespread are the symptoms (for example, a single VM or a subset of VMs, a single ESXi host or a subset of ESXi hosts, one or more clusters, one or more datacenters, one or more vCenters)?
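
One possible way to keep these three aspects consistent across observations is to record each episode in a structured form and summarize it before opening a case. The Python sketch below is illustrative only. The Episode record, the summarize function, and all VM names, host names and timestamps are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Episode:
    """One observed episode of packet loss or disconnection (hypothetical record)."""
    start: datetime
    end: datetime
    vm: str
    host: str


def summarize(episodes):
    """Summarize frequency, duration, and spread for a list of episodes."""
    ordered = sorted(episodes, key=lambda e: e.start)
    # Frequency: gaps between the start times of consecutive episodes.
    gaps = [b.start - a.start for a, b in zip(ordered, ordered[1:])]
    # Duration: how long each episode lasted.
    durations = [e.end - e.start for e in ordered]
    return {
        "episodes": len(ordered),
        "gaps_between_starts": gaps,
        "longest_duration": max(durations, default=timedelta(0)),
        # Spread: which VMs and ESXi hosts were affected.
        "vms": sorted({e.vm for e in ordered}),
        "hosts": sorted({e.host for e in ordered}),
    }


# Hypothetical observations; names and times are placeholders.
observed = [
    Episode(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 5), "vm-a", "esx-01"),
    Episode(datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 2, 2, 3), "vm-b", "esx-01"),
]
summary = summarize(observed)
print(summary["episodes"], summary["vms"], summary["hosts"])
```

In this sample, regular gaps between start times would suggest a scheduled trigger, and a single entry under hosts would suggest the spread is limited to one ESXi host.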

Once these are documented, open a case as described in KB 142884 Creating and managing Broadcom support request (SR) cases, and provide the details along with host logs from the ESXi hosts where the symptomatic VMs reside.