vSAN Performance Degradation Due to Physical NIC Errors

Article ID: 404164


Products

VMware vSAN

Issue/Introduction

Symptoms

The vSAN environment may experience intermittent performance degradation due to underlying physical NIC-level errors on ESXi hosts within the cluster.

Observations:

  • The following observations were made during the period of performance impact:
    • Intermittent vSAN congestion was observed.
    • Latency metrics from the vSAN DOM layer remained within acceptable thresholds.
    • High Outstanding I/Os (OIOs) were reported, coinciding with spikes in TCP retransmissions.
    • Physical NIC errors were detected on multiple hosts, particularly on the vmnic of the vSAN master host, suggesting a network-layer issue contributing to the congestion.
  • Additionally, the vSAN performance graphs in vCenter report high Virtual SCSI Latency for one or more virtual machines.
  • VM performance is impacted.
  • You may see entries similar to the following in /var/run/log/vmkernel.log on the affected hosts:
    2025-01-22T07:55:19.430Z cpu89:3788044)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
    2025-01-22T07:55:19.430Z cpu89:3788044) [HB state abcdef02 offset 3801088 gen 109 stampUS 6218700352409 uuid <UUID> jrnl  drv 24.82 lockImpl 4 ip <IP ADDRESS>]
    2025-01-22T07:55:20.118Z cpu90:6655841)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
    2025-01-22T07:55:20.118Z cpu90:6655841) [HB state abcdef02 offset 3801088 gen 121 stampUS 6218700352341 uuid <UUID> jrnl  drv 24.82 lockImpl 4 ip <IP ADDRESS>]
    
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [39358042:0x45bb3ef77ac0] => Stuck descriptor
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [39358042:0x45bb3ef77ac0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2133: [39358042:0x45bb3ef77ac0] => op(0x45bafe66cb00), CSN(10322), rangemapKey(5214828), rangeOffset(3801088), rangeLen(4096), retries(0)
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2142: [39358042:0x45bb3ef77ac0] => Inclusive commit list empty
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [27154287:0x45bb3efc86c0] => Stuck descriptor
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [27154287:0x45bb3efc86c0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
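
To confirm whether the same heartbeat timeout and stuck-descriptor messages are being logged, the excerpted strings can be searched for directly. The following is a minimal sketch for an ESXi shell; the message strings are taken from the log excerpt above:

    # Search the live vmkernel log for the messages shown above
    grep -E "Waiting for timed out HB|Stuck descriptor" /var/run/log/vmkernel.log

    # Rotated logs, if present, can be checked the same way
    zcat /var/run/log/vmkernel.*.gz | grep -E "Waiting for timed out HB|Stuck descriptor"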

Environment

VMware vSAN (All Versions)

Cause

The root cause of the intermittent congestion was traced to physical-layer errors on the network interfaces of the vSAN master host. These errors include:

  • CRC Errors
  • Receive Length Errors
  • Total Receive Errors
  • High Receive Packets dropped

These errors are typically indicative of issues at the physical or link layer, such as:

  • Faulty or damaged cables
  • Switch port issues
  • Incompatible or outdated NIC firmware/drivers

Sample NIC statistics:

   NIC statistics for vmnic#:
      Packets received: 105755375855
      Packets sent: 108966259970
      Bytes received: 258195390331005
      Bytes sent: 116896956146947
      Receive packets dropped: 0
      Transmit packets dropped: 0
      Multicast packets received: 15709284
      Broadcast packets received: 2908328
      Multicast packets sent: 492497
      Broadcast packets sent: 20034
      Total receive errors: 6324
      Receive length errors: 2
      Receive over errors: 0
      Receive CRC errors: 6308
      Receive frame errors: 0
      Receive FIFO errors: 0
      Receive missed errors: 0
      Total transmit errors: 0
      Transmit aborted errors: 0
      Transmit carrier errors: 0
      Transmit FIFO errors: 0
      Transmit heartbeat errors: 0
      Transmit window errors: 0

/var/log/hostd.log confirms the NIC-level errors:

2025-07-01T11:05:00.021Z warning hostd[2103163] [Originator@6876 sub=Statssvc] Error stats for pnic: vmnic#
--> errorsRx: 6276
--> RxLengthErrors: 2
--> RxCRCErrors: 6260
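
Similarly, the per-NIC error counters reported by hostd can be pulled out of the log with a simple filter; this is a minimal sketch using the field names from the excerpt above:

    # Extract the pNIC error-statistics entries logged by hostd
    grep -E "Error stats for pnic|errorsRx|RxLengthErrors|RxCRCErrors" /var/log/hostd.log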

Resolution

Step 1: Check NIC Statistics on All vSAN Hosts

  • Run the following command on all ESXi hosts in the cluster:

esxcli network nic stats get -n vmnicX

Note: Replace vmnicX with the actual vmnic (e.g., vmnic3, vmnic5).

  • Review the output for each NIC and check for the following parameters:

    • Total Receive Errors
    • Receive CRC Errors

  • If non-zero values are observed consistently across one or more NICs, it may indicate a physical-layer issue.
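
Rather than querying each NIC by hand, the statistics can be collected for every physical NIC on a host in one pass. The sketch below assumes an ESXi shell with the standard BusyBox awk and grep utilities and simply filters the error and drop counters from the command above:

    # Print only the error/drop counters for every physical NIC on this host
    for nic in $(esxcli network nic list | awk 'NR>2 {print $1}'); do
        echo "=== $nic ==="
        esxcli network nic stats get -n "$nic" | grep -iE "errors|dropped"
    done

Re-running the loop after a few minutes helps distinguish counters that are still increasing from stale values accumulated since the last reboot or driver reload.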

Step 2: Remediate Identified Issues

If physical errors are confirmed:

  • Verify physical connectivity (check/replace cables).
  • Review switch port status and logs.
  • Ensure NIC firmware and drivers are updated to the latest supported versions for your ESXi release.
  • Engage your network hardware vendor for further investigation if errors persist after basic hardware checks.
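
When checking driver and firmware levels, the currently loaded versions can be read per NIC as sketched below (vmnicX is a placeholder, as in Step 1) and compared against the Broadcom Compatibility Guide for the host's ESXi build:

    # Show the driver name, driver version, and NIC firmware version
    esxcli network nic get -n vmnicX | grep -iE "driver|firmware|version"

    # List the installed driver VIB; replace <driver-name> with the value of
    # the "Driver" field reported above
    esxcli software vib list | grep -i <driver-name>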

 

Additional Information

CRC (Cyclic Redundancy Check) errors in a network indicate that data has been corrupted during transmission. These errors occur when the checksum calculated by the receiving device does not match the checksum sent with the data. The CRC mechanism is crucial for maintaining data integrity in networks, and the presence of CRC errors often points to underlying issues such as faulty hardware, electrical noise, or other transmission problems.
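
As a minimal illustration of the mechanism (using the standard POSIX cksum utility rather than anything vSAN-specific), changing a single byte of a payload produces a completely different checksum, which is how the receiver detects corruption:

    # Same payload -> same CRC; one flipped byte -> different CRC
    printf 'example frame payload' | cksum
    printf 'examplf frame payload' | cksum   # one byte changed

The NIC performs the equivalent comparison in hardware on every received Ethernet frame and increments the Receive CRC errors counter when the computed and transmitted values do not match.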

Understand Cyclic Redundancy Check Errors

vSAN Networking – Network Oversubscription 

Edge failovers caused by CPU lockups on the Edge leading to the BFD tunnels\process to time out