Troubleshooting network receive traffic faults and other NIC errors in ESXi

Products

VMware vSphere ESXi

Issue/Introduction

When you run a command such as the following at the ESXi command line (logged in with root privileges via SSH or a KVM server console), you may notice that some counters are greater than zero on one or more physical network adapters (vmnics) on one or more ESXi hosts:

esxcli network nic stats get -n vmnic2

NIC statistics for vmnic2
Packets received: 701280499176
Packets sent: 687061948450
Bytes received: 664124780523852
Bytes sent: 676938646792793
Receive packets dropped: 2452783244
Transmit packets dropped: 0
Multicast packets received: 976222150
Broadcast packets received: 0
Multicast packets sent: 0
Broadcast packets sent: 0
Total receive errors: 0
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 0
Transmit aborted errors: 0
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0

The counters above are reset to zero when the ESXi host is rebooted.

To put the magnitude of the numbers in context, run the command "uptime" at the command line to see how long the counters have been accumulating.
There is no way to clear these counters without unloading and reloading the device driver, which may produce unpredictable results and possible adverse impacts to workloads and processes on the host.
If you wish to clear the counters, we recommend placing the host into Maintenance Mode and rebooting it once that is complete.
Because the counters are accumulators, there is no way to determine after the fact exactly when they were incremented.
However, if you wish to, for example, monitor the "Receive packets dropped" counter, you can use the "watch" command at the command line, as in the following example:

watch 'esxcli network nic stats get -n vmnic2 | grep "Receive packets dropped"'

Normally, in a healthy environment, any values with the string "errors" in their description would be zero (or if not, then very small, especially as a percentage of the overall total bytes / packets sent and received).
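To make "very small as a percentage" concrete, the following is a minimal sketch that reads the stats output and prints "Receive packets dropped" as a percentage of "Packets received". The function name drop_pct is hypothetical, and the choice of "Packets received" as the denominator is an assumption for illustration; it is not an official threshold or formula.

```shell
# drop_pct (hypothetical helper): read `esxcli network nic stats get` output
# on stdin and print "Receive packets dropped" as a percentage of
# "Packets received". Prints "n/a" if no packets were received.
drop_pct() {
    awk -F': *' '
        /^ *Packets received:/        { rx = $2 }
        /^ *Receive packets dropped:/ { dropped = $2 }
        END {
            if (rx > 0) printf "%.4f\n", (dropped / rx) * 100
            else print "n/a"
        }'
}

# Usage on an ESXi host (vmnic2 is the example NIC from this article):
#   esxcli network nic stats get -n vmnic2 | drop_pct
```

The same pattern can be adapted to any other counter in the output by changing the label that is matched.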

However, it is possible that at some time since the ESXi host was rebooted, there was a condition that caused some additions to the counters, and whatever condition caused those additions was subsequently cured without a reboot.
So, we suggest that you use a "watch" command in real time such as the example above to see what is happening during your investigation.
The "watch" command re-runs the command entered (in the above example, esxcli network nic stats get -n vmnic2 | grep "Receive packets dropped") at a regular interval, every two seconds by default, so you can observe whether the counter is increasing in real time.
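If you want a record of when the counter moves rather than a live display, a small sampling loop can print timestamped deltas. The following is a rough sketch, not a supported tool: the function names counter_value and sample_drops are hypothetical, and the interval and sample count are whatever you choose to pass in.

```shell
# counter_value NAME (hypothetical helper): print the value of the counter
# labelled NAME from `esxcli network nic stats get` output on stdin.
counter_value() {
    awk -F': *' -v name="$1" '
        { label = $1; sub(/^ +/, "", label) }
        label == name { print $2 }'
}

# sample_drops NIC INTERVAL COUNT (hypothetical helper): every INTERVAL
# seconds, COUNT times, print a timestamp, the current value of
# "Receive packets dropped", and the change since the previous sample.
sample_drops() {
    nic=$1; interval=$2; count=$3
    prev=$(esxcli network nic stats get -n "$nic" | counter_value "Receive packets dropped")
    i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$interval"
        cur=$(esxcli network nic stats get -n "$nic" | counter_value "Receive packets dropped")
        echo "$(date '+%Y-%m-%d %H:%M:%S') $nic dropped=$cur delta=$((cur - prev))"
        prev=$cur
        i=$((i + 1))
    done
}

# Usage on an ESXi host: sample vmnic2 every 10 seconds, 6 times.
#   sample_drops vmnic2 10 6
```

A non-zero delta during the sampling window indicates the condition is still occurring, as opposed to being left over from earlier in the host's uptime.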

Note: This solution is for ESXi host physical NICs.

For a similar solution regarding the guest virtual NICs, see Large packet loss in the guest OS using VMXNET3 in ESXi (324556).
Because these are physical NICs, the hardware vendor should troubleshoot them when errors are present.
In some cases, the cause may be upstream from the physical NICs, along the data path that packets travel in the environment's network infrastructure.

Environment

ESXi

Resolution

If any of the NICs show errors with the above command, open a case with the hardware vendor to troubleshoot the physical NIC errors and follow these steps:

1) Check that the driver/firmware of the vmnic is up to date. To check the driver, follow Determining Network/Storage firmware and driver version in ESXi (323110). For more information on drivers and firmware, see FAQ: Recommendation for Driver/Firmware (318542).
2) If the driver is up to date, you may be able to avoid FIFO or Missed errors by increasing the Rx buffer ring size on the physical NIC.

Note: These changes impact network adapter performance and must be validated by the hardware vendor who supplies the network adapter.

FIFO or Missed errors (one or the other, not necessarily both) will increment and accumulate if the physical NIC cannot handle the peak load of incoming packets with the current rx ring buffer size.

Use the following commands to check the preset maximum and the current rx ring buffer size:

esxcli network nic ring preset get -n vmnicX
esxcli network nic ring current get -n vmnicX

Review the output and check the current rx ring buffer size compared to the preset maximum rx ring buffer size the NIC supports.
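The comparison can also be scripted. The sketch below assumes the ring output contains a line of the form "RX: <number>" (alongside lines such as "RX Mini:" and "RX Jumbo:", which are ignored); verify this against the output on your host. The function names ring_rx and rx_ring_status are hypothetical.

```shell
# ring_rx (hypothetical helper): print the RX value from
# `esxcli network nic ring current|preset get` output on stdin.
# Assumes a line of the form "RX: <number>".
ring_rx() {
    awk -F': *' '/^ *RX:/ { print $2 + 0 }'
}

# rx_ring_status CURRENT MAX (hypothetical helper): report whether the
# current rx ring size is below the preset maximum.
rx_ring_status() {
    if [ "$1" -lt "$2" ]; then
        echo "rx ring $1 of $2: can be increased"
    else
        echo "rx ring $1 of $2: already at preset maximum"
    fi
}

# Usage on an ESXi host (vmnicX is a placeholder for the NIC name):
#   cur=$(esxcli network nic ring current get -n vmnicX | ring_rx)
#   max=$(esxcli network nic ring preset get -n vmnicX | ring_rx)
#   rx_ring_status "$cur" "$max"
```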

As an example, for a NIC that supports a preset maximum of 4096 and currently uses a smaller rx ring buffer size, you could first try increasing the current value to 1024, then to 2048 (if 1024 was not enough to prevent errors), and finally to 4096.
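The stepwise approach in the example can be sketched as a small helper that suggests the next value to try. Doubling the current size and capping it at the preset maximum is an assumption that generalizes the 1024, 2048, 4096 sequence above; the function name next_rx_size is hypothetical, and the hardware vendor's guidance takes precedence over this sketch.

```shell
# next_rx_size CURRENT MAX (hypothetical helper): print the next rx ring
# size to try, doubling CURRENT but never exceeding MAX.
next_rx_size() {
    cur=$1; max=$2
    if [ "$cur" -ge "$max" ]; then
        # Already at (or above) the preset maximum: nothing larger to try.
        echo "$max"
        return
    fi
    next=$((cur * 2))
    if [ "$next" -gt "$max" ]; then
        next=$max
    fi
    echo "$next"
}

# Usage: with a current size of 512 and a preset maximum of 4096,
# repeated calls suggest 1024, then 2048, then 4096.
#   next_rx_size 512 4096
```

Increase one step at a time and re-check the FIFO and Missed counters before moving to the next size.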

Note: Perform the following step with the ESXi host in maintenance mode, to avoid any potential production impact. Use an out-of-band connection (iLO, DRAC, etc.) when changing the rx ring buffer size, so that you can change it back if the connection is disrupted because of the change.
In the command below, replace "number" with the new ring size. (Use "-t number" to change the transmit buffer if needed.)

esxcli network nic ring current set -n vmnicX -r number

IMPORTANT NOTES:

1) These changes impact network adapter performance and must be validated by the hardware vendor who supplies the network adapter. If there are any questions on the above commands, refer to the hardware vendor.

2) Some hardware vendors also have ways of increasing the default ring buffer size, as part of "Tuning Guidelines" for "Virtual Interface Cards".