Virtual machines residing on a vSAN datastore experience latency every week during a specific time interval.
Backup jobs run in the background during the same interval.
Latency is seen only while the VM resides on the vSAN datastore. No issues were reported while the same virtual machines resided on VMFS or NFS volumes.
VMware vSAN 8.x
The performance degradation is observed whenever virtual machine IOPS spike. Each spike in IOPS coincides with elevated RDT network latency, which causes latency at the cluster level. Users in turn see that latency at the virtual machine layer.
When vSAN I/O Trip Analyzer is used to validate the latency, it reports latency at the network layer, which points to a network hardware issue or network congestion.
To use vSAN I/O Trip Analyzer, refer to: Use vSAN I/O Trip Analyzer
For example, in an environment with underlying issues at the network layer, vSAN I/O Trip Analyzer marks the affected areas in red.
Clicking the red icon shows further details that identify which layer has the issue. In this case, the issue is at the networking layer.
In addition, processing the vSAN performance data in Humbug shows latency at the cluster level whenever IOPS spike. (Collect the vSAN performance data by following the steps documented in Collecting vSAN Performance Service data for vSAN performance issues, then contact Broadcom Support to have the data processed in Humbug.)
After the data is processed in Humbug, under DOM OSA & ESA select DOM OSA & ESA: Global client to see the latency reported at the cluster level.
The graphs show latency spikes only when IOPS spike and outstanding IOs accumulate.
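The "only when" part of this observation can be sanity-checked against exported samples. The following is a minimal sketch, not a Humbug or vSAN API: the sample layout, the function name `unexplained_latency_spikes`, the data values, and both thresholds are hypothetical and chosen only for illustration. It lists any latency spike that occurs without a matching IOPS spike; an empty result is consistent with the pattern described above.

```python
def unexplained_latency_spikes(samples, iops_threshold, latency_threshold_ms):
    """Return timestamps where latency is high while IOPS stays below the threshold."""
    return [
        s["time"]
        for s in samples
        if s["latency_ms"] >= latency_threshold_ms and s["iops"] < iops_threshold
    ]


# Hypothetical samples; real values would come from the collected performance data.
samples = [
    {"time": "02:00", "iops": 1200, "latency_ms": 0.8},
    {"time": "02:05", "iops": 9800, "latency_ms": 14.2},
    {"time": "02:10", "iops": 10200, "latency_ms": 16.5},
    {"time": "02:15", "iops": 1100, "latency_ms": 0.9},
]

# Thresholds are assumptions; pick values that fit the environment's baseline.
outliers = unexplained_latency_spikes(
    samples, iops_threshold=5000, latency_threshold_ms=5.0
)
print(outliers or "every latency spike coincides with an IOPS spike")
```

If the list is not empty, some latency spikes happen without load, and a cause other than network saturation under load is worth investigating.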
Further inspection shows very high RDT latency under RDT: Host Stats, indicating that the underlying network cannot handle the load.
From the above snippets, it is clear that whenever there is a spike in IOPS, the RDT network latency also increases, leading to a higher number of outstanding IOs and, consequently, latency at the cluster level.
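One way to put a number on this relationship is a Pearson correlation between the IOPS series and the RDT latency series. The sketch below is illustrative only: the two series are hypothetical values, not output from Humbug or the vSAN Performance Service, and `pearson` is a helper written for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps.
iops = [1200, 1300, 9800, 10200, 1250, 1100]
rdt_latency_ms = [0.4, 0.5, 6.2, 7.1, 0.5, 0.4]

r = pearson(iops, rdt_latency_ms)
print(f"Pearson r = {r:.2f}")
```

A coefficient close to 1 supports the reading that RDT latency rises with load. It does not by itself identify the cause, which is why the network review described in the next paragraph is still needed.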
The observed latency directly correlates with spikes in IOPS, indicating that the network is unable to handle the increased load. If this workload is expected, the current network configuration may not be sufficient and should be reviewed with your solution architect. If the network should support this workload, engage your networking team to investigate potential bottlenecks or performance constraints at the network layer.