The vSAN environment may experience intermittent performance degradation due to underlying physical NIC-level errors on ESXi hosts within the cluster.
Observation:
Physical NIC errors (CRC and receive length errors) were observed on the vmnic of the vSAN master host, suggesting a network-layer issue contributing to congestion. The vmkernel log from the same period shows heartbeat timeouts (HBX) and stuck vSAN DOM descriptors:

2025-01-22T07:55:19.430Z cpu89:3788044)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
2025-01-22T07:55:19.430Z cpu89:3788044) [HB state abcdef02 offset 3801088 gen 109 stampUS 6218700352409 uuid <UUID> jrnl drv 24.82 lockImpl 4 ip <IP ADDRESS>]
2025-01-22T07:55:20.118Z cpu90:6655841)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
2025-01-22T07:55:20.118Z cpu90:6655841) [HB state abcdef02 offset 3801088 gen 121 stampUS 6218700352341 uuid <UUID> jrnl drv 24.82 lockImpl 4 ip <IP ADDRESS>]
2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [39358042:0x45bb3ef77ac0] => Stuck descriptor
2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [39358042:0x45bb3ef77ac0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2133: [39358042:0x45bb3ef77ac0] => op(0x45bafe66cb00), CSN(10322), rangemapKey(5214828), rangeOffset(3801088), rangeLen(4096), retries(0)
2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2142: [39358042:0x45bb3ef77ac0] => Inclusive commit list empty
2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [27154287:0x45bb3efc86c0] => Stuck descriptor
2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [27154287:0x45bb3efc86c0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
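To check whether a host is logging these signatures, the vmkernel log can be searched for the heartbeat-timeout and stuck-descriptor strings shown in the excerpts above. This is a minimal sketch run from the ESXi shell; the message strings are taken from this article and the default log path is assumed:

# Search the vmkernel log for heartbeat timeout and stuck DOM descriptor messages
grep -E "Waiting for timed out HB|Stuck descriptor" /var/log/vmkernel.log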
VMware vSAN (All Versions)
The root cause of the intermittent congestion was traced to physical-layer errors on the network interfaces of the vSAN master host. These errors include:
- Receive CRC errors
- Receive length errors
These errors are typically indicative of issues at the physical or link layer, such as:
- Faulty or damaged cabling
- Failing SFP/transceiver modules or NIC hardware
- Problematic switch ports
- Electrical noise or other transmission problems
Sample NIC Statistics -
NIC statistics for vmnic#:
   Packets received: 105755375855
   Packets sent: 108966259970
   Bytes received: 258195390331005
   Bytes sent: 116896956146947
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 15709284
   Broadcast packets received: 2908328
   Multicast packets sent: 492497
   Broadcast packets sent: 20034
   Total receive errors: 6324
   Receive length errors: 2
   Receive over errors: 0
   Receive CRC errors: 6308
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0
/var/log/hostd.log confirms the NIC-level errors:
2025-07-01T11:05:00.021Z warning hostd[2103163] [Originator@6876 sub=Statssvc] Error stats for pnic: vmnic#
--> errorsRx: 6276
--> RxLengthErrors: 2
--> RxCRCErrors: 6260
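The same warning can be confirmed on other hosts by searching hostd.log for the message string shown above (a minimal sketch assuming the default log path):

# Search hostd.log for per-pnic error statistics warnings
grep -i "Error stats for pnic" /var/log/hostd.log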
Step 1: Check NIC Statistics on All vSAN Hosts
Run the following command on all ESXi hosts in the cluster:
esxcli network nic stats get -n vmnicX
Note: Replace vmnicX with the actual vmnic (e.g., vmnic3, vmnic5).
Review the output for each NIC and check for the following parameters:
- Total receive errors
- Receive CRC errors
- Receive length errors
- Receive frame, FIFO, and missed errors
- Receive and transmit packets dropped
A non-zero and growing value for any of these counters typically points to a physical or link-layer problem on that uplink. To collect these counters from every vmnic in one pass, see the sketch after this list.
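The sketch below assumes ESXi shell (SSH) access; it iterates over the physical NICs reported by esxcli and prints only the error and drop counters for each one:

# List all physical NICs and print their error/drop counters
for nic in $(esxcli network nic list | awk '/^vmnic/ {print $1}'); do
    echo "=== ${nic} ==="
    esxcli network nic stats get -n "${nic}" | grep -Ei 'error|dropped'
done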
Step 2: Remediate Identified Issues
If physical errors are confirmed:
- Inspect and, if necessary, replace the cabling and SFP/transceiver modules on the affected uplink.
- Check the corresponding switch port for errors and move the uplink to a different port if the errors persist.
- Verify that the NIC driver and firmware are at supported levels (see the sketch after this list), and engage the hardware vendor if the NIC itself is suspected.
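When engaging the hardware vendor or validating driver/firmware levels, the relevant versions can be pulled directly from the host. This is a minimal sketch; replace vmnicX with the affected uplink:

# Show driver name, driver version, and firmware version for the affected NIC
esxcli network nic get -n vmnicX | grep -iE 'driver|firmware|version'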
CRC (Cyclic Redundancy Check) errors in a network indicate that data has been corrupted during transmission. These errors occur when the checksum calculated by the receiving device does not match the checksum sent with the data. CRC checks are crucial for maintaining data integrity in networks, and the presence of CRC errors often points to underlying issues such as faulty hardware, electrical noise, or other transmission problems.
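As a conceptual illustration (not an ESXi diagnostic step), the POSIX cksum utility, which computes a CRC over its input, shows how even a single corrupted byte changes the checksum and causes a mismatch between sender and receiver:

# "Sender" computes a CRC over the payload
printf 'payload-data' > /tmp/frame.bin
cksum /tmp/frame.bin
# Simulate a single corrupted byte in transit; the "receiver" CRC no longer matches
printf 'payload-dbta' > /tmp/frame.bin
cksum /tmp/frame.bin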
Understand Cyclic Redundancy Check Errors
vSAN Networking – Network Oversubscription
Edge failovers caused by CPU lockups on the Edge leading to the BFD tunnels/process to time out