This article discusses unusually high pNIC error rate alarms in vSphere environments that use Mellanox NICs. The issue has been reported with ConnectX-4 and ConnectX-6 NICs using the nmlx5_core driver.
VMware vSphere ESXi 7.0.x
VMware vSphere ESXi 8.0.x
VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x
VMware vSAN 7.x
VMware vSAN 8.x
Due to changes in how the nmlx5_core driver reports Rx errors for Mellanox NICs, ring buffer issues may cause high pNIC error rate alarms. Rx miss errors occur when the Rx processing thread cannot pull packets out of the NIC driver's Rx ring buffer quickly enough.
A NIC's internal memory is queued First In, First Out (FIFO). When a NIC hits an "out of buffer" condition, it cannot process new packets until the queue buffer has been flushed, which discards all queued IO before new traffic starts. Because of the large volume of dropped packets, one such queue dump within a five-minute window can trigger the "High pNIC error rate" vSAN alarm. A momentary queue dump may not cause observable issues, but sustained drops will impact vSAN performance.
VMware is aware that this alarm can be triggered by a single point-in-time queue drop like the one described above. However, losing 1% of vSAN packets can have a large impact on a cluster, so the alert is valid in its calculation even when it misses the overall intent of the alarm.
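The queue-dump behavior described above can be illustrated with a small simulation. This is a simplified sketch, not the driver's actual logic: the function name `simulate_rx_ring`, its parameters, and the per-tick arrival model are all hypothetical, and it assumes that an overflow discards every queued packet along with the packet that triggered it.

```python
def simulate_rx_ring(arrivals, drain_per_tick, ring_size):
    """Simulate a bounded FIFO Rx ring (hypothetical model, not driver code).

    arrivals       -- packets arriving in each tick
    drain_per_tick -- packets the Rx thread can pull out per tick
    ring_size      -- number of slots in the ring
    Returns (delivered, dropped).
    """
    queued = delivered = dropped = 0
    for arriving in arrivals:
        for _ in range(arriving):
            if queued == ring_size:
                # "Out of buffer": assume the whole queue is dumped,
                # together with the packet that could not be stored.
                dropped += queued + 1
                queued = 0
            else:
                queued += 1
        # The Rx thread drains what it can at the end of the tick.
        drained = min(drain_per_tick, queued)
        queued -= drained
        delivered += drained
    return delivered, dropped


# A 10-packet burst overruns a 4-slot ring but fits in a 16-slot ring.
small_ring = simulate_rx_ring([10], drain_per_tick=2, ring_size=4)
large_ring = simulate_rx_ring([10], drain_per_tick=2, ring_size=16)
```

Under these assumptions the same burst is lost entirely with the small ring and absorbed with the larger one, which is why ring size affects how often drops are recorded.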
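To see how a single burst of drops can be enough to raise the alarm, the sketch below computes an error rate from two counter snapshots taken at the start and end of a window. The exact formula vSAN uses is not stated in this article, so the calculation (missed packets divided by missed plus received packets), the 1% default threshold, and the names `pnic_error_rate` and `alarm_triggered` are assumptions for illustration only.

```python
def pnic_error_rate(rx_start, rx_end, missed_start, missed_end):
    """Fraction of packets missed during a window (assumed formula).

    Each argument is a cumulative counter value sampled at the start
    or end of the window.
    """
    received = rx_end - rx_start
    missed = missed_end - missed_start
    total = received + missed
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing to report
    return missed / total


def alarm_triggered(rate, threshold=0.01):
    """Return True when the rate reaches the assumed 1% threshold."""
    return rate >= threshold


# One queue dump of 20,000 packets against 1,000,000 received packets.
burst_rate = pnic_error_rate(0, 1_000_000, 0, 20_000)
burst_alarm = alarm_triggered(burst_rate)
```

With these illustrative numbers the rate is just under 2%, so one short burst crosses the assumed threshold even though traffic was clean for the rest of the window.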
Workaround options include:
The physical ring buffer of the NICs may be adjusted to reduce the frequency of alerts by following the steps in "Troubleshooting NIC errors and other network traffic faults in ESXi".
Packet drops can negatively impact vSAN performance. A 1% drop rate impacts 10% of IOPS throughput for vSAN, as detailed in "vSAN Networking – Network Oversubscription".
Receive Missed Errors detected on Mellanox pNICs
High pNIC error rate, which is exceeding the expected threshold of 100%
Alarm about high pNIC error rate being detected
The vSAN alarm "High physical NIC error rate" is detected on vSAN hosts that use Mellanox NICs