vSAN Performance Degradation Due to Physical NIC Errors

Article ID: 404164


Products

VMware vSAN

Issue/Introduction

Symptoms

The vSAN environment may experience intermittent performance degradation due to underlying physical NIC-level errors on ESXi hosts within the cluster.

Observations:

  • The following observations were made during the period of performance impact:
    • Intermittent vSAN congestion was observed.
    • Latency metrics from the vSAN DOM layer remained within acceptable thresholds.
    • High Outstanding I/Os (OIOs) were reported, coinciding with spikes in TCP retransmissions.
    • Physical NIC errors were detected on multiple hosts, particularly on the vmnic of the vSAN master host, suggesting a network-layer issue contributing to the congestion.
  • Additionally, the vSAN performance graphs in vCenter report high Virtual SCSI Latency for one or more virtual machines.
  • VM performance is impacted.
  • You may see entries similar to the following in /var/run/log/vmkernel.log on the affected hosts:
    2025-01-22T07:55:19.430Z cpu89:3788044)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
    2025-01-22T07:55:19.430Z cpu89:3788044) [HB state abcdef02 offset 3801088 gen 109 stampUS 6218700352409 uuid <UUID> jrnl  drv 24.82 lockImpl 4 ip <IP ADDRESS>]
    2025-01-22T07:55:20.118Z cpu90:6655841)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
    2025-01-22T07:55:20.118Z cpu90:6655841) [HB state abcdef02 offset 3801088 gen 121 stampUS 6218700352341 uuid <UUID> jrnl  drv 24.82 lockImpl 4 ip <IP ADDRESS>]
    
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [39358042:0x45bb3ef77ac0] => Stuck descriptor
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [39358042:0x45bb3ef77ac0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2133: [39358042:0x45bb3ef77ac0] => op(0x45bafe66cb00), CSN(10322), rangemapKey(5214828), rangeOffset(3801088), rangeLen(4096), retries(0)
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2142: [39358042:0x45bb3ef77ac0] => Inclusive commit list empty
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [27154287:0x45bb3efc86c0] => Stuck descriptor
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [27154287:0x45bb3efc86c0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
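
To confirm whether the same heartbeat timeout and stuck-descriptor messages are being logged, the excerpted strings can be searched for directly. The following is a minimal sketch for an ESXi shell; the message strings are taken from the log excerpt above:

    # Search the live vmkernel log for the messages shown above
    grep -E "Waiting for timed out HB|Stuck descriptor" /var/run/log/vmkernel.log

    # Rotated logs, if present, can be checked the same way
    zcat /var/run/log/vmkernel.*.gz | grep -E "Waiting for timed out HB|Stuck descriptor"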

Environment

VMware vSAN (All Versions)

Cause

The root cause of the intermittent congestion was traced to physical-layer errors on the network interfaces of the vSAN master host. These errors include:

  • CRC Errors
  • Receive Length Errors
  • Total Receive Errors
  • High Receive Packets dropped

These errors are typically indicative of issues at the physical or link layer, such as:

  • Faulty or damaged cables
  • Switch port issues
  • Incompatible or outdated NIC firmware/drivers

Sample NIC statistics:

   NIC statistics for vmnic#:
      Packets received: 105755375855
      Packets sent: 108966259970
      Bytes received: 258195390331005
      Bytes sent: 116896956146947
      Receive packets dropped: 0
      Transmit packets dropped: 0
      Multicast packets received: 15709284
      Broadcast packets received: 2908328
      Multicast packets sent: 492497
      Broadcast packets sent: 20034
      Total receive errors: 6324
      Receive length errors: 2
      Receive over errors: 0
      Receive CRC errors: 6308
      Receive frame errors: 0
      Receive FIFO errors: 0
      Receive missed errors: 0
      Total transmit errors: 0
      Transmit aborted errors: 0
      Transmit carrier errors: 0
      Transmit FIFO errors: 0
      Transmit heartbeat errors: 0
      Transmit window errors: 0

/var/log/hostd.log confirms the NIC-level errors:

2025-07-01T11:05:00.021Z warning hostd[2103163] [Originator@6876 sub=Statssvc] Error stats for pnic: vmnic#
--> errorsRx: 6276
--> RxLengthErrors: 2
--> RxCRCErrors: 6260
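
Similarly, the per-NIC error counters reported by hostd can be pulled out of the log with a simple filter; this is a minimal sketch using the field names from the excerpt above:

    # Extract the pNIC error-statistics entries logged by hostd
    grep -E "Error stats for pnic|errorsRx|RxLengthErrors|RxCRCErrors" /var/log/hostd.log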

Resolution

Step 1: Check NIC Statistics on All vSAN Hosts

  • Run the following command on all ESXi hosts in the cluster:

esxcli network nic stats get -n vmnicX

Note: Replace vmnicX with the actual vmnic (e.g., vmnic3, vmnic5).

  • Review the output for each NIC and check for the following parameters:

    • Total Receive Errors
    • Receive CRC Errors

  • If non-zero values are observed consistently across one or more NICs, it may indicate a physical-layer issue.
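
Rather than querying each NIC by hand, the statistics can be collected for every physical NIC on a host in one pass. The sketch below assumes an ESXi shell with the standard BusyBox awk and grep utilities and simply filters the error and drop counters from the command above:

    # Print only the error/drop counters for every physical NIC on this host
    for nic in $(esxcli network nic list | awk 'NR>2 {print $1}'); do
        echo "=== $nic ==="
        esxcli network nic stats get -n "$nic" | grep -iE "errors|dropped"
    done

Re-running the loop after a few minutes helps distinguish counters that are still increasing from stale values accumulated since the last reboot or driver reload.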

Step 2: Remediate Identified Issues

If physical errors are confirmed:

  • Verify physical connectivity (check/replace cables).
  • Review switch port status and logs.
  • Ensure NIC firmware and drivers are updated to the latest supported versions for your ESXi release.
  • Engage your network hardware vendor for further investigation if errors persist after basic hardware checks.
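
When checking driver and firmware levels, the currently loaded versions can be read per NIC as sketched below (vmnicX is a placeholder, as in Step 1) and compared against the Broadcom Compatibility Guide for the host's ESXi build:

    # Show the driver name, driver version, and NIC firmware version
    esxcli network nic get -n vmnicX | grep -iE "driver|firmware|version"

    # List the installed driver VIB; replace <driver-name> with the value of
    # the "Driver" field reported above
    esxcli software vib list | grep -i <driver-name>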

 

Additional Information

CRC (Cyclic Redundancy Check) errors in a network indicate that data has been corrupted during transmission. These errors occur when the checksum calculated by the receiving device does not match the checksum sent with the data. The CRC mechanism is crucial for maintaining data integrity in networks, and the presence of CRC errors often points to underlying issues such as faulty hardware, electrical noise, or other transmission problems.
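
As a minimal illustration of the mechanism (using the standard POSIX cksum utility rather than anything vSAN-specific), changing a single byte of a payload produces a completely different checksum, which is how the receiver detects corruption:

    # Same payload -> same CRC; one flipped byte -> different CRC
    printf 'example frame payload' | cksum
    printf 'examplf frame payload' | cksum   # one byte changed

The NIC performs the equivalent comparison in hardware on every received Ethernet frame and increments the Receive CRC errors counter when the computed and transmitted values do not match.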

Understand Cyclic Redundancy Check Errors

vSAN Networking – Network Oversubscription 

Edge failovers caused by CPU lockups on the Edge leading to the BFD tunnels\process to time out