Isolating network issues to a physical host NIC

Products

VMware vSphere ESXi
VMware NSX

Issue/Introduction

NIC failures can be deceptive when the vSphere OS does not detect any issue, such as a link state down event (criteria 128).
The NIC driver provides most of the health information needed for detection of physical NIC issues.
This article addresses events when the vSphere OS successfully passes packets to the NIC, yet the packets do not arrive at the next hop. The physical link state reports as up.

Environment

VMware NSX
VMware vSphere ESXi

Cause

To detect whether the issue is the NIC or the physical cable between NIC and the next hop interface, captures at the host and the next hop upstream switch are necessary.

Starting with point 1, this diagram shows an ARP request leaving a virtual machine, traversing the virtual switch, getting sent to the physical NIC, and leaving the ESXi host. This is the standard expected behavior for all packets leaving an ESXi host and entering the physical network.

Testing at the host:

Ensure Source VM is sending packets to the vswitch
Testing Setup:
Start a continuous ping from the source VM to its gateway
Identify switchport number:
esxcli vm process list | less
Search list for VM name and record the World ID (WID)
esxcli network vm port list -w <WID>
Record the switchport number and uplink.
Alternatively esxtop option "n" can be used to identify the switchport and uplink if the list is short.
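
The World ID lookup above can be scripted rather than paged through with less. The following is a minimal sketch, not a VMware-provided tool: wid_for_vm is a hypothetical helper name, and it assumes that each VM block in the esxcli vm process list output prints its "World ID:" line before its "Display Name:" line. It reads the listing on stdin so it can be tried against saved output.

```shell
#!/bin/sh
# Sketch: print the World ID for a given VM display name.
# wid_for_vm is a hypothetical helper, not part of ESXi.
# Assumption: within each VM block, "World ID:" appears before "Display Name:".
wid_for_vm() {
    awk -v name="$1" '
        /^[ \t]*World ID:/ { wid = $NF }               # remember the most recent World ID
        /^[ \t]*Display Name:/ {
            sub(/^[ \t]*Display Name:[ \t]*/, "")      # strip the label
            sub(/[ \t\r]+$/, "")                       # strip trailing whitespace
            if ($0 == name) print wid                  # exact match on the display name
        }
    '
}

# Assumed usage on an ESXi host ("my-vm" is a placeholder name):
#   WID=$(esxcli vm process list | wid_for_vm "my-vm")
#   esxcli network vm port list -w "$WID"
```

The match is exact, so a VM whose display name contains spaces must be quoted when passed to the helper.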
Packet captures
1. Packet Trace:
  1. The trace shows what points the packet traverses through the virtual network datapath (vSwitch).
  2. The ARP request leaves the VM via the VnicTx point. It passes through the ESXi host network stack to the UplinkSndKernel point, which is the last point where vSphere can see it.
  3. This is where the packet is handed off to the physical NIC of the host.
  4. This command is proof of the handling of the packet:
    pktcap-uw --trace --ip <IP of VM or destination>
    
    The assumption made at this point is the packet has been given to the NIC and the NIC has sent it to the next upstream interface.
2. Switchport capture:
  
  pktcap-uw --switchport <switchport ID> --capture VnicTx,VnicRx --ng -o - |tcpdump-uw -enr -
  This is a bidirectional capture showing the VM's transmission of packets, and the VM's receipt of incoming packets.
  Focus on what packets are visible. Note whether the only ARP packets seen are ARP requests ("who-has ...").
3. Uplink Capture:
  pktcap-uw --uplink <vmnicX> --capture UplinkSndKernel,UplinkRcvKernel --ng -o - |tcpdump-uw -enr -
  This is a bidirectional capture showing the ESXi host's transmission of packets, and the host's receipt of incoming packets.
  Note if the ARP request packet is seen at UplinkSndKernel and if there is an ARP reply seen at UplinkRcvKernel.
  
  Based on the observations, the next steps will require similar testing and captures performed at the next upstream switch interface.
  If the uplink capture on the host shows ARP requests at the UplinkSndKernel capture point and no ARP reply at UplinkRcvKernel, then it is assumed that the NIC sent the packet upstream.
  The upstream test is performed by someone who has access to the upstream physical switch.
  
  There are two scenarios that can occur:

1. The ARP request arrives at the upstream switch and is answered with an ARP reply.
2. The ARP request does not arrive at the upstream switch and no ARP reply is generated.
    
    This second scenario is the case where it is assumed that the ARP request was forwarded by the NIC to the switch. Testing at the switch shows a healthy switch interface with no statistical indication of issues on the connection medium (i.e. no CRC errors are incrementing at the interface). No ARP reply will be generated since no ARP request arrives at the interface.
    
    The ARP request is seen at the UplinkSndKernel capture point, thus the ESXi OS has delivered the packet to the physical NIC driver and the assumption is made that the packet is now on the wire heading to the next switchport upstream.
    
    However, the upstream switchport is not receiving the ARP request, and testing has found the upstream switchport to be healthy and the connection medium to be error free. This leaves only the physical NIC as suspect. The next check is to migrate the test source VM to another host and determine if the issue follows it to the next host. If the issue does not follow the VM, then the issue is isolated to the original host. If the issue moves with the VM, then more investigation is needed.
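
The request/reply check used in the switchport and uplink captures can be summarized with a small filter. This is a minimal sketch under stated assumptions: arp_summary is a hypothetical helper name, the input is text output saved from tcpdump-uw, and ARP lines are assumed to use tcpdump's usual wording ("Request who-has ..." and "Reply ... is-at ..."). It only counts lines. It does not prove that a packet reached the wire.

```shell
#!/bin/sh
# Sketch: count ARP requests and replies in tcpdump-style text read on stdin.
# arp_summary is a hypothetical helper, not part of ESXi.
# Assumption: ARP lines contain "Request who-has" or "Reply <ip> is-at".
arp_summary() {
    awk '
        /Request who-has/ { req++ }     # outgoing "who-has" queries
        /Reply .* is-at/  { rep++ }     # answers carrying a MAC address
        END {
            printf "requests=%d replies=%d\n", req, rep
            if (req > 0 && rep == 0)
                print "requests only: continue testing at the upstream switch"
        }
    '
}

# Assumed usage, where uplink_capture.txt is a placeholder for saved
# tcpdump-uw text output from the uplink capture:
#   arp_summary < uplink_capture.txt
```

Reading from a saved file rather than a live pipeline means the summary still prints after the capture has been stopped.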

Resolution

Replace the physical NIC in the ESXi host.

Additional Information

This method does not cover what might be seen in the nicinfo.sh data or logs.
This is a method for troubleshooting this specific issue live.
There will not be any criteria 128 error messages seen in the host's vmkernel.log or vobd.log.
Packet captures are assumed to have been used to verify the VM is sending ARP requests.
Packet captures show ARP requests leaving the capture point UplinkSndKernel towards the upstream switch.