VM cannot connect to a specific Destination IP over a specific VMNIC within a blade server chassis

Article ID: 372437

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

A virtual machine residing on an ESXi host within a blade server chassis can ping most IP addresses, including its default gateway, but cannot ping one specific destination IP.

Initial state of the environment while the issue is occurring:
VM01, located on ESXi Host-1, attempts to ping a specific destination IP address and fails.
VM01 then pings other IP addresses and is successful. It can also ping its default gateway successfully.
The VM is supported by vmnic1 for all network traffic.
Network redundancy may or may not be part of this situation, depending upon the architecture. In this example, vmnic0 and vmnic1 are both supporting the port group the VM is connected to.

vmnic1 is shut down and vmnic0 now supports the VM's network traffic.
VM01, located on ESXi Host-1, attempts to ping the same destination IP address (on another subnet) and is successful.
VM01 then pings other IPs and is successful. It can also ping its default gateway successfully.
The VM is supported by vmnic0 for all network traffic.

At this point the issue can be defined as follows: when the VM is supported by vmnic1, connections to the specific destination IP address fail; when vmnic0 supports the same VM, connections to that destination IP address succeed.

 

The vmnics are virtualized physical NICs presented to the ESXi host by the blade chassis. The chassis typically has an A and B Fabric Interconnect (FI) topology. Often the vmnics are connected for redundancy, one to FI-A and the second to FI-B.

This is a generic example of a blade server chassis for illustration. 
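A simplified text representation of that generic topology, as described above (the vmnic names and fabric labels are illustrative only):

VM01 -> port group -> vSwitch -> vmnic0 (presented by the blade's VIC) -> Fabric Interconnect A -> upstream network
                              -> vmnic1 (presented by the blade's VIC) -> Fabric Interconnect B -> upstream network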

Environment

ESXi hosts running in a blade server chassis

Cause

Environmental issues with the blade chassis. The issue is most often found with the interconnects; however, issues with other chassis components can also affect the data path. That investigation is outside the scope of this article and of VMware by Broadcom.

Resolution

Troubleshooting will consist of proving that packets are leaving the ESXi host via the involved vmnic uplinks. The uplinks are mapped directly to vmnics.

Open a console in the virtual machine's guest OS. Perform a ping to the VM's default gateway IP.

Commands:
  • pktcap-uw (ESXi)
  • traceroute / tracert (Guest OS)
  • ping (Guest OS)
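For reference, the guest OS commands take the following general forms. The IP addresses shown are placeholders only and should be replaced with the VM's default gateway and the problematic destination IP from your environment.

Windows guest:
ping -t 192.168.10.1          (continuous ping of the default gateway; stop with Ctrl+C)
tracert 10.20.30.40           (trace the path to the problematic destination)

Linux guest:
ping 192.168.10.1             (continuous by default; stop with Ctrl+C)
traceroute 10.20.30.40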

 

Testing:

ICMP traffic may not be allowed in some environments for security reasons. It will be necessary to identify what traffic can be generated for testing purposes to capture packets arriving at and leaving from the vmnic toward the destination IP address. For this example, ICMP is being used; example commands are shown after the list below.

  • Perform a ping to the problematic destination IP.
  • The ping to the problematic destination IP fails.
  • Ping the default gateways of the VM and of the destination host.
  • Does the source VM successfully ping its own default gateway?
  • Does it successfully ping the destination's default gateway?
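As an illustration of the sequence above, using placeholder addresses only (10.20.30.40 for the problematic destination, 192.168.10.1 for the VM's default gateway, and 10.20.30.1 for the destination's default gateway), the pings from the guest OS might look like:

ping 10.20.30.40     (problematic destination - fails while vmnic1 carries the traffic)
ping 192.168.10.1    (VM's own default gateway - should succeed)
ping 10.20.30.1      (destination host's default gateway)
ping 8.8.8.8         (optional external check, if outbound ICMP is permitted)

Record the result of each ping for later comparison.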


Record the outcomes. Essentially, if we can successfully ping 8.8.8.8 and the source default gateway, the ESXi host networking has been validated as working per design. Once vSphere hands the packet off to the default gateway, vSphere's responsibility has been met. The next hops are outside of any influence the ESXi host has over packet forwarding. The physical infrastructure now takes over delivery of the packets to the destination. The following tests are for the benefit of the customer, to show that the ESXi hosts are performing as designed and that packets are leaving as expected. This proves that the issue is upstream of the ESXi host.

  • vmnic1 (problematic vmnic)
  • Perform a traceroute and observe what hops are displayed in the output (an example is shown after the vmnic0 list below).
  • Do you see the first hop (default gateway)?
  • Do you see other hops after the first hop?
  • Record the above outcomes.

  • vmnic0 (known good vmnic)
  • Perform a traceroute and observe what hops are displayed in the output.
  • Do you see the first hop (default gateway)?
  • Do you see other hops after the first hop?
  • Record the above outcomes and compare them to those of the problematic vmnic.
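The following is a minimal sketch of how this comparison can be run, assuming traffic is steered to one uplink at a time by temporarily bringing the other uplink down from the ESXi Shell; confirm this is acceptable in your environment first, and replace the vmnic names and destination IP (placeholders below) with your own values.

esxcli network nic down -n vmnic0        (forces the traffic onto vmnic1)

From the guest OS:
traceroute 10.20.30.40                   (Linux)
tracert 10.20.30.40                      (Windows)

esxcli network nic up -n vmnic0          (restore vmnic0)
esxcli network nic down -n vmnic1        (repeat the traceroute with vmnic0 active)
esxcli network nic up -n vmnic1          (restore vmnic1 when finished)

Compare which hops appear in each traceroute output.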

Using the capture command native to ESXi, validate that ICMP requests and replies are present on the uplink. The source VM should be sending continuous pings to the problematic destination IP address.

pktcap-uw --uplink vmnicX --capture UplinkSndKernel,UplinkRcvKernel --ip <IP of Problematic VM> --proto 0x01 -c 100 -o - | tcpdump-uw -enr - -nn
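For illustration only, with placeholder values (vmnic1 as the problematic uplink and 192.168.10.50 as the source VM's IP), the command would be run as:

pktcap-uw --uplink vmnic1 --capture UplinkSndKernel,UplinkRcvKernel --ip 192.168.10.50 --proto 0x01 -c 100 -o - | tcpdump-uw -enr - -nn

In this command, --uplink selects the physical uplink to capture on, the UplinkSndKernel and UplinkRcvKernel capture points cover packets sent to and received from that uplink, --ip filters on the VM's IP address, --proto 0x01 limits the capture to ICMP, -c 100 stops after 100 packets, and -o - pipes the raw capture to tcpdump-uw for on-screen decoding.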

When the issue is occurring, only ICMP requests will be seen; no ICMP replies will be seen. In our example vmnic1 is problematic, so we see only requests.
Next, change the destination of the continuous ping to 8.8.8.8 (Google Public DNS). The same pktcap-uw command is used. The capture will now show ICMP requests and replies arriving at and leaving the vmnic.
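Only the guest-side ping target changes for this verification; the capture command on the ESXi host stays the same. For example:

ping -t 8.8.8.8      (Windows, continuous)
ping 8.8.8.8         (Linux, continuous by default)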
At this point it can be said that the issue is not within ESXi. The packets are handed off to the chassis interconnect via the Virtual Interface Card (VIC), which in turn hands the packets to the Fabric Interconnect (FI). ESXi has performed its duties and the packets are now in the realm of the chassis packet-forwarding functions.