In a 6-node vSAN cluster, virtual machines on one host experience high guest-level write latency, as shown in the Performance section under the Monitor tab of the affected host in the vCenter UI. Latency spikes of over 200 ms occur approximately every 1-2 minutes. The remaining nodes in the cluster operate within normal parameters, with write latency consistently below 1 ms.
vSAN OSA, ESXi 8.0U3.
High TCP error rates were observed on the impacted host's vSAN traffic interface, including duplicate and out-of-order packets, with bursts that coincide with increases in I/O and throughput. The error rates indicate that network instability contributed to the vSAN storage latency experienced on the host.
TCP errors originate outside of the ESXi software stack. Seek further guidance from your networking team, NIC vendor, or network vendor.
The affected host shows noticeable spikes in tcpSackRecvBlocksRate, tcpRcvDupAckRate, and tcpTxRetransmitRate, which indicate potential network congestion or packet loss. The resulting TCP retransmissions and duplicate acknowledgments could be contributing to the observed write latency spikes.
The tcpSackRecvBlocksRate metric refers to the rate at which TCP Selective Acknowledgment (SACK) blocks are received.
The tcpRcvDupAckRate measures the rate at which the host receives duplicate TCP acknowledgments.
The tcpTxRetransmitRate represents the rate at which the host retransmits TCP segments.
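As a rough illustration of how these three counters can be compared across hosts, the following Python sketch flags any host whose peak rate for a metric is far above the cluster median. It is not a vSAN API or tool: the function name `find_outlier_hosts`, the host names, the sample values, the `factor` and `floor` thresholds, and the input layout are all assumptions made for this example, and the rates would need to be exported from the performance views manually.

```python
from statistics import median

# Metric names as they appear in this article.
METRICS = ("tcpSackRecvBlocksRate", "tcpRcvDupAckRate", "tcpTxRetransmitRate")


def find_outlier_hosts(samples, factor=10.0, floor=1.0):
    """Return {host: [metric, ...]} for hosts whose peak rate stands out.

    samples: {host: {metric: [rate, rate, ...]}} (hypothetical layout).
    A host is flagged for a metric when its peak exceeds
    factor * max(cluster median of per-host peaks, floor).
    factor and floor are illustrative thresholds, not vendor guidance.
    """
    flagged = {}
    for metric in METRICS:
        # Peak value per host for this metric.
        peaks = {host: max(series[metric]) for host, series in samples.items()}
        # The floor keeps a near-zero median from flagging trivial noise.
        baseline = max(median(peaks.values()), floor)
        for host, peak in peaks.items():
            if peak > factor * baseline:
                flagged.setdefault(host, []).append(metric)
    return flagged


if __name__ == "__main__":
    # Hypothetical values: five quiet hosts and one host with bursts.
    quiet = {m: [0, 1, 0, 2] for m in METRICS}
    noisy = {m: [1, 450, 2, 500] for m in METRICS}
    cluster = {f"esx0{i}": quiet for i in (1, 2, 4, 5, 6)}
    cluster["esx03"] = noisy
    print(find_outlier_hosts(cluster))
```

Comparing against the cluster median rather than a fixed number reflects the pattern described here, where one host deviates while the other five remain normal. A flagged host is only a hint to look at its physical network path, not a diagnosis.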
Engage the physical network vendor to perform a thorough verification of the network infrastructure, including diagnostics for potential hardware faults such as defective SFP modules, damaged or improperly seated cables, and any other layer 1 connectivity issues.