Virtual machines hosted on vSAN are experiencing slowness


Article ID: 415001


Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Virtual machines residing on vSAN experience latency every week during a specific time interval.

  • Backup jobs run in the background during the same interval.

  • Latency is seen only while the virtual machine is on the vSAN datastore. No issues were reported while the virtual machines resided on VMFS or NFS volumes.

  • No alerts are reported in vSAN Skyline Health.
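The first two symptoms can be sanity-checked by testing whether high-latency samples cluster inside the weekly backup window. The following is a minimal sketch, not a vSAN tool: the window (Sundays 01:00 to 03:00), the 20 ms threshold, the sample data, and every function name are assumptions made for illustration, and the samples would need to come from your own monitoring export.

```python
from datetime import datetime

# Assumed weekly backup window: Sundays 01:00-03:00 (hypothetical values).
BACKUP_WEEKDAY = 6            # datetime.weekday(): Monday=0 ... Sunday=6
BACKUP_START_HOUR = 1
BACKUP_END_HOUR = 3
LATENCY_THRESHOLD_MS = 20.0   # assumed spike threshold; tune for your environment


def in_backup_window(ts: datetime) -> bool:
    """Return True if the timestamp falls inside the assumed backup window."""
    return (ts.weekday() == BACKUP_WEEKDAY
            and BACKUP_START_HOUR <= ts.hour < BACKUP_END_HOUR)


def share_of_spikes_in_window(samples):
    """Fraction of latency spikes that fall inside the backup window.

    samples: list of (timestamp, latency_ms) tuples.
    Returns None when no sample exceeds the threshold.
    """
    spikes = [ts for ts, latency in samples if latency > LATENCY_THRESHOLD_MS]
    if not spikes:
        return None
    return sum(in_backup_window(ts) for ts in spikes) / len(spikes)


# Made-up VM latency samples, for illustration only.
samples = [
    (datetime(2024, 1, 7, 1, 30), 45.0),   # Sunday, inside the window
    (datetime(2024, 1, 7, 2, 15), 60.0),   # Sunday, inside the window
    (datetime(2024, 1, 8, 10, 0), 2.0),    # Monday, no spike
    (datetime(2024, 1, 10, 14, 0), 25.0),  # Wednesday, spike outside the window
    (datetime(2024, 1, 14, 1, 45), 52.0),  # Sunday, inside the window
]

share = share_of_spikes_in_window(samples)
print(f"Share of latency spikes inside the backup window: {share:.2f}")
```

A share close to 1 suggests that the latency follows the backup schedule; it does not by itself identify which layer is responsible.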

Environment

VMware vSAN 8.x

Cause

Performance degradation is observed whenever virtual machine IOPS spike. Each IOPS spike coincides with elevated RDT (Reliable Datagram Transport) network latency. This causes latency at the cluster level, which users then experience as latency at the virtual machine layer.

Cause Justification:

When I/O Trip Analyzer is used to validate the latency, it reports latency at the network layer, which indicates network hardware issues or network congestion.

To use I/O Trip Analyzer, refer to: Use vSAN I/O Trip Analyzer

For example, in an environment with underlying issues at the network layer, I/O Trip Analyzer marks the problem areas in red.

Clicking the red icon shows further details that identify the affected layer. In this case, the issue is at the networking layer.

In addition, processing the vSAN performance data in Humbug shows latency at the cluster level whenever IOPS spike. (Collect the vSAN performance data by following the steps documented in Collecting vSAN Performance Service data for vSAN performance issues, then contact Broadcom Support to process the data in Humbug.)

After the data is processed in Humbug, under DOM OSA & ESA, select DOM OSA & ESA: Global client to see the latency reported at the cluster level.

The latency spikes appear only when IOPS spike and there are outstanding I/Os.

Further checking shows very high RDT latency under RDT: Host Stats, which indicates that the underlying network cannot handle the load.

Taken together, these observations show that whenever IOPS spike, RDT network latency also increases, leading to a higher number of outstanding I/Os and, consequently, latency at the cluster level.
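The relationship described above can be quantified outside Humbug if per-interval IOPS and RDT latency samples are available. The following is a minimal sketch, assuming the two series have already been exported as equally spaced, time-aligned lists. The sample values, variable names, and the `pearson` helper are hypothetical and are not part of any vSAN or Humbug interface.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Made-up, time-aligned samples (one value per collection interval).
iops = [1000, 1200, 8000, 9000, 1100, 7500]
rdt_latency_ms = [0.3, 0.4, 6.0, 7.5, 0.3, 5.5]

r = pearson(iops, rdt_latency_ms)
print(f"Correlation between IOPS and RDT latency: {r:.3f}")
```

A coefficient close to 1 is consistent with the pattern described in this article. Correlation alone does not prove that the network is the bottleneck, so it should be read alongside the I/O Trip Analyzer and RDT: Host Stats findings.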

Resolution

The observed latency directly correlates with spikes in IOPS, indicating that the network is unable to handle the increased load. If this workload is expected, the current network configuration may not be sufficient and should be reviewed with your solution architect. If the network should support this workload, engage your networking team to investigate potential bottlenecks or performance constraints at the network layer.