Seeing "High pNIC error rate" vSAN alarm on vSAN Hosts with Mellanox NIC Cards

Article ID: 403958

Updated On:

Products

VMware vSAN 8.x

Issue/Introduction

This article discusses the "High pNIC error rate" vSAN alarm being raised on vSphere hosts with Mellanox NICs. The issue has been reported with ConnectX-4 and ConnectX-6 NICs using the nmlx5_core driver.

Environment

VMware vSphere ESXi 7.0.x
VMware vSphere ESXi 8.0.x

VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x

VMware vSAN 7.x
VMware vSAN 8.x

Cause

Due to changes in how the nmlx5_core driver reports Rx errors for Mellanox NICs, ring buffer issues may cause "High pNIC error rate" alarms. Rx miss errors are seen when the Rx processing thread is unable to pull packets out of the NIC driver's Rx ring buffer.
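The Rx miss behavior described above can be illustrated with a toy model. The sketch below is not driver code: the function name `simulate_rx_ring` and all of its parameters are hypothetical, and it only assumes that a packet arriving at a full ring is counted as a miss.

```python
from collections import deque


def simulate_rx_ring(ring_size, arrivals_per_tick, drained_per_tick, ticks):
    """Toy model of an Rx ring: count packets missed because the ring was full."""
    ring = deque()
    received = 0
    missed = 0
    for _ in range(ticks):
        # The NIC places arriving packets in the ring; a full ring means a miss.
        for _ in range(arrivals_per_tick):
            if len(ring) < ring_size:
                ring.append(1)
                received += 1
            else:
                missed += 1
        # The Rx processing thread drains at most drained_per_tick packets.
        for _ in range(min(drained_per_tick, len(ring))):
            ring.popleft()
    return received, missed


# Same traffic, two ring sizes: the larger ring absorbs the short burst.
small = simulate_rx_ring(ring_size=4, arrivals_per_tick=3, drained_per_tick=2, ticks=4)
large = simulate_rx_ring(ring_size=8, arrivals_per_tick=3, drained_per_tick=2, ticks=4)
```

In this model a larger ring only delays the first miss. If arrivals keep outpacing the drain rate, misses eventually appear regardless of ring size.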

First In, First Out (FIFO) is the queueing mechanism for a NIC's internal memory. When a NIC hits an "out of buffer" condition, it cannot process new packets until the queue buffer is flushed, which discards all queued IO before new traffic starts. If such a queue dump occurs within a five-minute window, the large volume of dropped packets can trigger the "High pNIC error rate" vSAN alarm. A momentary queue dump may not cause observable issues, but sustained drops will impact vSAN performance.
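To see how a single queue dump can dominate a five-minute window, the sketch below computes a miss rate from two counter samples. It makes two assumptions that this article does not state: the rate is taken as missed packets divided by delivered plus missed packets, and the 1% default comes from the figure mentioned in the Resolution section. The `NicSample` type and function names are hypothetical, and the exact formula used by the vSAN health check may differ.

```python
from dataclasses import dataclass


@dataclass
class NicSample:
    """Cumulative pNIC counters at one point in time (hypothetical structure)."""
    rx_packets: int  # packets successfully received
    rx_missed: int   # packets missed because the Rx ring was full


def rx_error_rate(start, end):
    """Fraction of packets missed between two samples (assumed formula)."""
    delivered = end.rx_packets - start.rx_packets
    missed = end.rx_missed - start.rx_missed
    total = delivered + missed
    return missed / total if total > 0 else 0.0


def exceeds_threshold(start, end, threshold=0.01):
    """True when the miss rate over the window is above the assumed 1% threshold."""
    return rx_error_rate(start, end) > threshold


# Two samples taken five minutes apart; one burst of 20,000 misses in the window.
window_start = NicSample(rx_packets=1_000_000, rx_missed=0)
window_end = NicSample(rx_packets=1_990_000, rx_missed=20_000)
alarm = exceeds_threshold(window_start, window_end)
```

With these illustrative numbers the burst accounts for just under 2% of the window's traffic, so the check reports a breach even though the drops happened at a single point in time.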

 

Resolution

VMware is aware that this alarm may be triggered by a single point-in-time queue dump like the one described above. However, losing 1% of vSAN packets can have a large impact on a cluster, so the alert is valid in its calculation even when a one-off event is not what the alarm is intended to catch.

Workaround options include:

  1. Ignore the alert unless it triggers more than once
  2. Increase the Ring Buffer size on your NIC(s)


The physical ring buffer size of the NICs can be adjusted to reduce the frequency of alerts by following the article "Troubleshooting NIC errors and other network traffic faults in ESXi".
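As a rough illustration of the ring buffer adjustment, the sketch below compares the current and maximum supported Rx ring sizes and builds the matching `esxcli network nic ring current set` command as an argument list. It assumes the two text inputs are the outputs of `esxcli network nic ring current get -n <vmnic>` and `esxcli network nic ring preset get -n <vmnic>`, in a `Key: value` layout such as `RX: 1024`. The helper names, the sample values, and the choice to raise Rx to the preset maximum are illustrative only and are not recommendations from this article; follow the referenced article for the supported procedure.

```python
def parse_ring_output(text):
    """Parse 'Key: value' lines such as 'RX: 1024' into a dict of ints."""
    sizes = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            sizes[key.strip()] = int(value.strip())
    return sizes


def ring_resize_command(nic, current_text, preset_text):
    """Return an esxcli argument list that raises Rx to the preset maximum.

    Returns None when the current Rx size is already at or above the preset.
    """
    current = parse_ring_output(current_text)
    preset = parse_ring_output(preset_text)
    if current.get("RX", 0) >= preset.get("RX", 0):
        return None
    return ["esxcli", "network", "nic", "ring", "current", "set",
            "-n", nic, "-r", str(preset["RX"])]


# Illustrative outputs only; real values depend on the NIC and driver.
sample_current = "   RX: 1024\n   RX Mini: 0\n   RX Jumbo: 0\n   TX: 1024\n"
sample_preset = "   RX: 8192\n   RX Mini: 0\n   RX Jumbo: 0\n   TX: 8192\n"
command = ring_resize_command("vmnic0", sample_current, sample_preset)
```

The sketch only builds the command and does not run it, so the proposed change can be reviewed before it is applied on a host.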

Additional Information