High amount of packet loss observed on physical NIC(s) caused by buffer exhaustion

Products

VMware vCenter Server

Issue/Introduction

If any NIC shows errors, engage the hardware vendor to troubleshoot them, because the errors originate on the physical NIC and are only reported to the ESXi host. This article provides additional insight into the drops themselves to help with the vendor's investigation; however, the drops/errors occur outside of ESXi, on hardware/software that VMware does not manage.

This article describes packet drops seen on physical NICs (vmnics), specifically drops caused by the physical NIC running out of buffers. The issue may be seen on only one NIC or on multiple NICs on a host. Similarly, it may be isolated to a single host or seen on multiple hosts.

Any VM whose traffic uses a NIC that is dropping packets may be impacted by this behavior.

In certain environments, you may also see an alarm in the vCenter UI for "High pNic error rate". See vSAN -- Alarm about high pNIC error rate being detected.

There may also be logs identified on one or more hosts related to packet loss. See Error stats for pnic reported in the hostd logs.

When you run the command below, replacing "vmnicX" with the NIC identifier in question (e.g. "vmnic3"), the "Receive packets dropped" counter shows a value greater than 0:

esxcli network nic stats get -n vmnicX
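To check just that one counter, the output can be filtered. The following is a minimal sketch, not a supported tool: `rx_dropped` is a hypothetical helper name, and it assumes the counter appears on a line of the form "Receive packets dropped: <number>".

```shell
# Hypothetical helper (name is an assumption): reads the output of
# "esxcli network nic stats get" on stdin and prints only the value of
# the "Receive packets dropped" counter.
rx_dropped() {
    # Split each line on the colon, strip anything non-numeric from the value.
    awk -F': *' '/Receive packets dropped/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Usage on a host (vmnic3 is only an example name):
#   esxcli network nic stats get -n vmnic3 | rx_dropped
```

Running it twice, some time apart, shows whether the counter is still increasing or reflects a past event.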

Further, when you run the following command, which displays the private NIC stats for all the NICs on the host, you may see a buffer-related counter with a value greater than 0, often equal to the "Receive packets dropped" value noted above.

/usr/lib/vmware/vm-support/bin/nicinfo.sh | less

Please be aware that the above command pulls driver statistics, which are vendor-specific, so the output is not expected to look identical between hosts or NICs that use different NIC drivers. Variations between versions of the same driver are also common.

Further, some vendors do not categorize buffer errors at all in the above stats, while others name the counter differently. For instance, many Cisco NICs label the counter "rx_no_bufs", while Mellanox NICs track the same errors with a counter labelled "outOfBuffer".
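Because the private stats output can be long, it may help to filter it down to the buffer-related counters. The sketch below assumes only the two counter names mentioned above; `buffer_counters` is a hypothetical helper name, and the pattern would need to be extended for other drivers.

```shell
# Hypothetical helper (name is an assumption): reads driver statistics on
# stdin and keeps only lines naming a known buffer-exhaustion counter.
# The pattern covers just the two names cited in this article.
buffer_counters() {
    grep -iE 'rx_no_bufs|outOfBuffer'
}

# Usage on a host:
#   /usr/lib/vmware/vm-support/bin/nicinfo.sh | buffer_counters
```

An empty result does not prove that no buffer exhaustion occurred, since a driver may use a different counter name or not report the condition at all.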

NOTE: There is a known issue with certain Mellanox cards where buffer exhaustion is present but is not tracked in the standard output of the "stats get" command above. See Receive Missed Errors detected on Mellanox pNICs after updating the driver or ESXi for more information.

Environment

VMware ESXi (all versions)

Cause

Physical NICs have a queue that temporarily stores packets while they wait to be processed. This queue is the buffer, also called a ring. In certain environments or conditions, the buffer fills up before the queued packets can be passed on, so newly arriving packets are dropped.
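The behavior can be illustrated with a toy model. The sketch below is not driver code and does not reflect any vendor's implementation. It simply assumes a ring with a fixed number of slots, a fixed number of packets arriving per tick, and a fixed number drained per tick, with all names and numbers being hypothetical.

```shell
# Toy model only: simulate_ring RING_SIZE ARRIVE DRAIN TICKS
# Prints how many packets were dropped because the ring was full.
simulate_ring() {
    ring_size=$1; arrive=$2; drain=$3; ticks=$4
    queued=0; dropped=0; t=0
    while [ "$t" -lt "$ticks" ]; do
        # Packets arrive; anything beyond the ring capacity is dropped.
        queued=$((queued + arrive))
        if [ "$queued" -gt "$ring_size" ]; then
            dropped=$((dropped + queued - ring_size))
            queued=$ring_size
        fi
        # The host drains up to DRAIN packets from the ring.
        queued=$((queued - drain))
        if [ "$queued" -lt 0 ]; then
            queued=0
        fi
        t=$((t + 1))
    done
    echo "$dropped"
}

# Example: arrivals outpace draining, so the ring eventually fills.
#   simulate_ring 512 100 80 100
# Example: draining keeps up, so nothing is dropped.
#   simulate_ring 512 100 120 100
```

In this model a larger ring only delays the first drop when arrivals consistently outpace draining, while it can fully absorb a short burst.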

Resolution

Please note that the buffer/ring is not managed by VMware; it is a component of the NIC and its driver. It is therefore recommended to engage the server vendor for NIC tuning guidance or other remediation, because increasing the ring size or taking other mitigating steps may lead to unintended issues elsewhere.
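When discussing tuning with the vendor, it can be useful to know the ring size currently in use and the maximum the driver reports. The following read-only sketch assumes the "esxcli network nic ring" namespace is available for the driver in question and that its output contains a line of the form "RX: <number>". `rx_ring_size` is a hypothetical helper name.

```shell
# Hypothetical helper (name is an assumption): reads the output of
# "esxcli network nic ring current get" or "... ring preset get" on stdin
# and prints the RX ring size. Lines such as "RX Mini:" are ignored.
rx_ring_size() {
    awk '$1 == "RX:" { print $2; exit }'
}

# Read-only usage on a host (vmnic3 is only an example name):
#   esxcli network nic ring current get -n vmnic3 | rx_ring_size   # size in use
#   esxcli network nic ring preset get -n vmnic3 | rx_ring_size    # maximum reported by the driver
```

These commands only display values and change nothing. Any change to the ring size should follow the vendor's guidance.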

See Troubleshooting NIC errors and other network traffic faults in ESXi for more information.

Because this behavior generally indicates a high volume of traffic, it may be worthwhile, while engaging the server vendor, to also investigate the network for any unexpected increase in traffic. Although the behavior is often simply a consequence of more traffic than the hardware can handle in its current configuration, it may also be caused by a broadcast storm or another intermittent or unplanned influx of traffic.