Understanding vSAN Network TCP/IP Expectations and Thresholds

Products

VMware vSAN

Issue/Introduction

Symptoms

Intermittent vSAN performance degradation or high write latency spikes.

Purpose

This article provides guidance on identifying and interpreting critical TCP/IP error types within a VMware vSAN environment using the net-stats utility.

net-stats is a command-line utility on ESXi hosts used for deep network troubleshooting. It collects detailed statistics for VMkernel ports, vmnic uplinks, and virtual machine network adapters, giving insight into packet flow, worldlets (CPU threads), and resource usage.

Environment

vSAN 8.x, vSAN 9.x

Resolution

If metrics exceed the Red (Critical) threshold:

Check Physical Layer: Inspect NIC statistics for CRC or Receive Length errors (esxcli network nic stats get -n vmnicX). Replace suspected faulty cables or SFP modules.
Monitor Congestion: Check vSAN-specific congestion metrics in vCenter to see if the network layer is throttling storage I/O.

Additional Information

Critical TCP/IP Error Types and Thresholds

vSAN performance is highly sensitive to network health. The following three error types are the most frequent indicators of underlying physical or logical network issues.

Error Type	vSAN Healthy Target	Warning Threshold (Yellow)	Critical Threshold (Red)
Out-of-Order (OO)	< 0.1%	0.1% - 0.5%	> 1.0%
Retransmissions (rexmit)	~0.0%	0.1% - 0.5%	> 0.5%
Duplicate ACKs (dups)	< 0.1%	0.1% - 0.5%	> 1.0%
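
The thresholds above can be applied mechanically once an error rate is expressed as a percentage. The following Python sketch is illustrative only and is not part of any VMware tooling; the names THRESHOLDS and classify are hypothetical. It also makes two assumptions the table does not state: rates that fall between the upper Warning bound (0.5%) and the Critical bound (1.0%) for OO and dups are treated as Warning, and any retransmission rate below 0.1% is treated as healthy.

```python
# Thresholds from the table above, in percent of packets.
# Each entry: (lower bound of Warning, value above which the rate is Critical).
THRESHOLDS = {
    "oo": (0.1, 1.0),
    "rexmit": (0.1, 0.5),
    "dups": (0.1, 1.0),
}


def classify(metric: str, pct: float) -> str:
    """Return 'healthy', 'warning', or 'critical' for an error rate in percent."""
    warning_min, critical_min = THRESHOLDS[metric]
    if pct > critical_min:
        return "critical"
    # Assumption: the table leaves 0.5%-1.0% undefined for oo and dups;
    # this sketch treats that gap as Warning.
    if pct >= warning_min:
        return "warning"
    return "healthy"


if __name__ == "__main__":
    for metric, pct in [("oo", 0.76), ("rexmit", 0.6), ("dups", 0.05)]:
        print(metric, pct, classify(metric, pct))
```

Keeping the bounds in a single dictionary means a site that tunes its own limits only has to change one place.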

Out-of-Order (OO) Packets

Explanation: These are data segments that arrive at the destination in a different sequence than they were sent. This typically occurs due to multiple active network paths, varying delays across routes, or switch load-balancing algorithms.
Interpretation: High OO rates without retransmissions suggest all packets arrived safely but at different times. While TCP can reassemble these, high rates introduce "jitter" and processing overhead, which impacts time-sensitive vSAN I/O.

Retransmissions (rexmit)

Explanation: A retransmission occurs when the sender assumes a segment was lost, because no acknowledgment arrived within the expected timeout, and resends it.
Interpretation: This is a direct indicator of packet loss in the underlay network, often caused by defective SFP modules, damaged cables, or extreme switch congestion.

Duplicate ACKs (dups)

Explanation: The receiver sends these to indicate it has received a newer packet while still missing an earlier one in the sequence.
Interpretation: Frequent duplicate ACKs trigger "fast retransmit" mechanisms. In vSAN, bursts of these often correspond to throughput increases and can cause guest VM write latency to spike significantly.
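
The interpretations given for the three error types can be combined into a rough triage rule. The Python sketch below is a hedged illustration of that reasoning, not a diagnostic tool; the function name triage and the elevated cutoff of 0.1% (borrowed from the lower Warning bound in the table) are assumptions made for this example.

```python
def triage(oo_pct: float, rexmit_pct: float, dups_pct: float,
           elevated: float = 0.1) -> list[str]:
    """Map error rates (in percent) to the interpretations described above."""
    hints = []
    if oo_pct >= elevated and rexmit_pct < elevated:
        # Out-of-order packets without retransmissions: packets arrived, but late.
        hints.append("reordering without loss: review multipathing and "
                     "switch load-balancing")
    if rexmit_pct >= elevated:
        # Retransmissions are a direct indicator of packet loss in the underlay.
        hints.append("packet loss: inspect SFP modules, cables, and "
                     "switch congestion")
    if dups_pct >= elevated:
        # Duplicate ACKs trigger fast retransmit and can raise write latency.
        hints.append("duplicate ACKs: expect fast retransmits and possible "
                     "guest VM write latency spikes")
    return hints


if __name__ == "__main__":
    for hint in triage(oo_pct=0.76, rexmit_pct=0.0, dups_pct=0.0):
        print(hint)
```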

Interpreting `net-stats` Output

To collect TCP statistics for analysis, run the following command on the affected ESXi host. The results are written to /tmp/netstats.out.

net-stats -A -t WwQqihV -i 30 -o /tmp/netstats.out

Analyzing "Red Flag" Scenarios

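Review the tcptx (transmit) and tcprx (receive) sections of the output. To compare a counter against the thresholds, express it as a percentage of the pps (packets per second) value in the same section.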

Scenario A: High Out-of-Order (OO)

tcprx: { "pps": 73769, ... "oo": 558.8 }

Calculation: (558.8 / 73,769) * 100 ≈ 0.76%
Status: This is above the Warning range and approaching the critical 1% threshold. It indicates potential pathing or load-balancing issues in the physical network.

Scenario B: High Retransmission and Duplicate ACKs

tcptx: { "pps": 65647, ... "rexmit": 126.6 }, tcprx: { "pps": 73769, ... "dups": 216.7 }

Calculation (rexmit): (126.6 / 65,647) * 100 ≈ 0.19%
Calculation (dups): (216.7 / 73,769) * 100 ≈ 0.29%
Status: Both values fall within the Warning (Yellow) range. While below critical levels, the presence of both retransmissions and duplicate ACKs indicates intermittent packet loss.
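
The calculations in both scenarios follow the same pattern: divide the error counter by the pps value from the same section and multiply by 100. A minimal Python sketch of that arithmetic follows. The helper name error_rate_pct is hypothetical, and the dictionaries contain only the fields shown in the excerpts above; the fields elided by "..." are not reproduced, and the layout of the full output file is not assumed.

```python
def error_rate_pct(errors_per_sec: float, pps: float) -> float:
    """Return an error counter as a percentage of packets per second."""
    if pps <= 0:
        # Avoid division by zero on an idle port.
        return 0.0
    return errors_per_sec / pps * 100.0


# Values taken from Scenario A and Scenario B.
tcptx = {"pps": 65647, "rexmit": 126.6}
tcprx = {"pps": 73769, "oo": 558.8, "dups": 216.7}

rates = {
    "oo": error_rate_pct(tcprx["oo"], tcprx["pps"]),
    "rexmit": error_rate_pct(tcptx["rexmit"], tcptx["pps"]),
    "dups": error_rate_pct(tcprx["dups"], tcprx["pps"]),
}

if __name__ == "__main__":
    for name, pct in rates.items():
        print(f"{name}: {pct:.2f}%")
```

Note that rexmit is divided by the transmit pps, while oo and dups are divided by the receive pps, matching the sections in which each counter appears.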