Virtual machines hosted on vSAN are experiencing slowness

Products

VMware vSAN

Issue/Introduction

Symptoms:

Virtual machines residing on vSAN experience latency every week during a specific time interval.
Backup jobs are running in the backend during the same interval.
Latency is seen only when the VM is on a vSAN datastore. No issues are reported while the virtual machines reside on VMFS or NFS volumes.
No alerts are reported in vSAN Skyline Health.

Environment

VMware vSAN 8.x

Cause

The performance degradation was observed due to a surge in virtual machine I/O which pushed the internal cluster network to its maximum limits.

This heavy load consumed all available bandwidth, which directly caused the RDT network latency.

Because the network was at peak capacity, it began dropping packets, leading to transient transport errors.

These errors forced the vSAN storage layer into a continuous retry state as it struggled to deliver data.

This loop of network congestion and storage retries created a cluster-wide bottleneck, resulting in the performance degradation seen at the VM level.

Cause Justification:

When I/O Trip Analyzer is used to validate the latency, it reports the latency at the network layer, indicating network hardware issues or network congestion.

For instructions on using I/O Trip Analyzer, refer to: Use vSAN I/O Trip Analyzer

For example, in an environment with underlying issues at the network layer, I/O Trip Analyzer marks the problem areas in red.

Clicking the red icon shows further details identifying which layer has the issue. In this case, the issue is seen at the networking layer.

In addition, processing the vSAN performance data in Humbug shows that latency is reported at the cluster level whenever there is a spike in IOPS. (Collect the vSAN performance data by following the steps documented in Collecting vSAN Performance Service data for vSAN performance issues, and contact Broadcom Support to have the data processed in Humbug.)

After the data is processed in Humbug, select DOM OSA & ESA: Global client under DOM OSA & ESA to view the latency reported at the cluster level.

The latency spikes are seen only when there is a spike in IOPS and there are outstanding I/Os.
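The correlation described here can also be checked against exported time-series samples. The following is a minimal sketch only: the sample values, the thresholds, and the function name `latency_spikes_with_iops` are hypothetical and are not taken from the environment or from any vSAN tool.

```python
from typing import List, Tuple

# (timestamp, iops, latency_ms); all values below are hypothetical examples
Sample = Tuple[str, float, float]

def latency_spikes_with_iops(
    samples: List[Sample],
    iops_threshold: float,
    latency_threshold_ms: float,
) -> Tuple[List[str], List[str]]:
    """Split latency spikes into those that coincide with an IOPS spike
    and those that do not."""
    coincident: List[str] = []
    isolated: List[str] = []
    for timestamp, iops, latency_ms in samples:
        if latency_ms >= latency_threshold_ms:
            if iops >= iops_threshold:
                coincident.append(timestamp)
            else:
                isolated.append(timestamp)
    return coincident, isolated

# Hypothetical samples for illustration only
samples = [
    ("16:40", 12000.0, 1.2),
    ("16:45", 95000.0, 48.0),
    ("16:50", 98000.0, 52.5),
    ("16:55", 11000.0, 1.1),
]
coincident, isolated = latency_spikes_with_iops(samples, 50000.0, 20.0)
print("latency spikes with IOPS spike:", coincident)
print("latency spikes without IOPS spike:", isolated)
```

An empty second list is consistent with latency being driven by the I/O surge rather than by an unrelated fault.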

Network statistics show that an increase in the number of packets causes a sudden rise in throughput that approaches the maximum capacity of the 25 Gbps uplink (3.125 GB/s). At this utilization, almost all available bandwidth is consumed and the transport layer experiences retransmissions and TCP/IP errors.

On checking further, very high RDT latency is reported under RDT: Host Stats, indicating that the underlying network cannot handle the load.
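The uplink capacity figure follows from dividing the line rate by 8 bits per byte: 25 Gbit/s corresponds to 3.125 GB/s in decimal units, which is roughly 2.91 GiB/s in binary units. The sketch below works through that arithmetic; the observed throughput value used for the utilization example is a hypothetical number, not a measurement from this case.

```python
def link_capacity_bytes_per_s(link_gbps: float) -> float:
    """Convert a nominal link rate in Gbit/s to bytes per second."""
    return link_gbps * 1e9 / 8

def utilization(observed_bytes_per_s: float, link_gbps: float) -> float:
    """Fraction of the nominal link capacity consumed by the observed throughput."""
    return observed_bytes_per_s / link_capacity_bytes_per_s(link_gbps)

capacity = link_capacity_bytes_per_s(25)
print(f"{capacity / 1e9:.3f} GB/s")     # decimal gigabytes per second
print(f"{capacity / 2**30:.3f} GiB/s")  # binary gibibytes per second

# Hypothetical observed throughput of 3.0 GB/s on a 25 Gbps uplink
print(f"{utilization(3.0e9, 25):.0%} utilized")
```

Protocol overhead means the usable payload throughput is somewhat lower than the nominal figure, so sustained throughput close to this value indicates a saturated link.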

In addition to this, VMK_STORAGE_RETRY_OPERATION events are reported in vsantracesUrgent.log indicating that this is a transient condition.

2026-01-08T16:48:21.441607 [16361809] [cpu73] [OWNER] DOMTraceOwnerObjectDoInitRDT:10548: {'obj': 0x45xxxxxxxxxx, 'objUuid': 'cb67e75e-####-####-####-#############', 'retryCount': 1, 'retryLimit': 40, 'vsan unloading': False, 'status': 'VMK_STORAGE_RETRY_OPERATION'}
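To gauge how often these retries occur, matching entries can be extracted from the log. This is a minimal sketch that assumes each entry has the same layout as the sample line above; the function name `parse_retry_trace` is hypothetical, and the field layout in other builds may differ.

```python
import re
from typing import Optional

def parse_retry_trace(line: str) -> Optional[dict]:
    """Extract the timestamp and retry counters from a trace line that
    reports VMK_STORAGE_RETRY_OPERATION; return None for other lines."""
    if "VMK_STORAGE_RETRY_OPERATION" not in line:
        return None
    count = re.search(r"'retryCount': (\d+)", line)
    limit = re.search(r"'retryLimit': (\d+)", line)
    if count is None or limit is None:
        return None
    return {
        "timestamp": line.split(" ", 1)[0],  # first whitespace-delimited field
        "retryCount": int(count.group(1)),
        "retryLimit": int(limit.group(1)),
    }

# The sample entry from this article, masked values left as-is
sample_line = (
    "2026-01-08T16:48:21.441607 [16361809] [cpu73] [OWNER] "
    "DOMTraceOwnerObjectDoInitRDT:10548: {'obj': 0x45xxxxxxxxxx, "
    "'objUuid': 'cb67e75e-####-####-####-#############', 'retryCount': 1, "
    "'retryLimit': 40, 'vsan unloading': False, "
    "'status': 'VMK_STORAGE_RETRY_OPERATION'}"
)
print(parse_retry_trace(sample_line))
```

Applying the function to each line of the log and grouping the results by timestamp shows whether the retries cluster around the weekly latency window.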

The performance data shows a sudden increase in the IOPS received by vSAN, which leads to network saturation. The saturated network in turn forces the vSAN layer into a retry state.

Resolution

To address this issue, determine the source of the I/O spikes. The source lies within guest-level applications or OS processes.

Identifying the specific application causing this demand falls outside the scope of ESXi management and requires investigation at the VM guest level.