Understanding vSAN Performance issues potentially caused by hardware/networking issues

Article ID: 370008

Updated On:

Products

VMware vSAN

Issue/Introduction

  • VMware vSAN is a distributed storage solution that is fully integrated into VMware vSphere. By aggregating local storage devices in each host across a cluster, vSAN is a unique and innovative approach to providing cluster-wide shared storage and data services to all virtual workloads running in a cluster. While it eliminates many of the design, operation, and performance challenges associated with three-tier architectures using storage arrays, it introduces additional considerations when diagnosing and mitigating performance issues that may be storage related.
  • vSAN's performance capabilities depend heavily on the underlying hardware powering the platform. Discrete components such as CPU, storage controllers, storage devices, cache devices, network cards, and network switches all contribute to the effective performance capabilities of vSAN.
  • Because vSAN is a network-based object storage system distributed across multiple ESXi hosts, it is not uncommon for a single poorly performing disk, or a single host with networking issues, to impact the performance of the entire cluster.
  • A lack of sufficient and predictable performance can impact not only the VMs that run in an environment, but also the consumers who use the applications on those VMs.

Environment

VMware vSAN

Cause

As vSAN is a distributed file system that uses resources on all hosts, any delay in communication between hosts in the cluster, whether caused by network issues or by poorly performing disks, can impact the vSAN environment.

This can be seen in several ways:

  • Network events on host(s)
  • Disk-based latency on host(s)
  • Congestion on host(s)


When a disk or disk group is not performing as well as the rest of the cluster, it can delay I/O for the rest of the environment. This can impact vSAN backend performance, and even the guest VMs, if it is not identified and corrected.
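As a rough illustration of spotting a disk that is "not performing as well as the rest of the cluster", the sketch below flags any disk whose latency is far above the cluster median. This is a minimal sketch, not a vSAN tool: the function name `flag_slow_disks`, the disk labels, the latency values, and the 5x threshold are all hypothetical and chosen only for illustration. Real values would come from vSAN Skyline Health or the vCenter performance metrics.

```python
from statistics import median


def flag_slow_disks(latency_us_by_disk, factor=5.0):
    """Return the disks whose latency exceeds `factor` times the cluster median.

    `latency_us_by_disk` maps a disk label to its observed latency in
    microseconds. The threshold `factor` is an assumption, not a vSAN default.
    """
    baseline = median(latency_us_by_disk.values())
    return sorted(
        disk
        for disk, latency_us in latency_us_by_disk.items()
        if latency_us > factor * baseline
    )


# Hypothetical per-disk latencies in microseconds (illustrative values only).
example_latencies = {
    "host1-disk1": 900,
    "host2-disk1": 1100,
    "host3-disk1": 1000,
    "host4-disk1": 250000,
}

if __name__ == "__main__":
    print(flag_slow_disks(example_latencies))
```

Comparing against the median rather than the mean keeps a single very slow disk from inflating the baseline it is judged against.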

Resolution

Properly administer the vSAN environment and stay on top of any vSAN Skyline Health alerts that have triggered, so that issues are addressed as soon as possible.

Review vSAN Skyline Health and the vCenter performance metrics for vSAN for anything that is causing delayed I/O. This can be a host that is having a network issue or a disk that is showing physical device latency.

If you have reviewed the entire cluster and have not found a cause for the perceived latency, test the hosts for network loss and disk latency.

Network

To review the environment for network latency, refer to Troubleshooting the vSAN Network and determine whether there is any network loss or high latency. If there is, work with your network team to correct it.

Disk

Disks can display latency in several ways, but the most common is a "performance has deteriorated" message in the vmkernel logs. "I/O latency increased from average value" can also be seen.

The example below shows a disk whose I/O latency deteriorated from an average of 10382 microseconds to a peak of 2656761 microseconds (about 2.7 seconds), then improved to 522372 microseconds and finally to 102114 microseconds.

2024-06-12T11:24:01.880Z cpu31:2098014)WARNING: ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has deteriorated. I/O latency increased from average value of 10382 microseconds to 653962 microseconds.
2024-06-12T11:24:01.887Z cpu31:2098014)WARNING: ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has deteriorated. I/O latency increased from average value of 10382 microseconds to 1314049 microseconds.
2024-06-12T11:24:05.576Z cpu39:2098007)WARNING: ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has deteriorated. I/O latency increased from average value of 10382 microseconds to 2656761 microseconds.
2024-06-12T11:24:11.024Z cpu23:2098011)ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has improved. I/O latency reduced from 2656761 microseconds to 522372 microseconds.
2024-06-12T11:24:16.038Z cpu7:2098009)ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has improved. I/O latency reduced from 522372 microseconds to 102114 microseconds.
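When many such warnings are present, it can help to summarize them per device. The following is a minimal sketch, assuming the log lines follow the exact wording shown above; the function name `worst_latency_by_device` is hypothetical, and the sample lines are copied from the excerpt in this article. It extracts the device identifier, the reported average, and the highest latency seen for each device.

```python
import re

# Matches the "performance has deteriorated" warning in the form shown above.
DETERIORATED = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<now>\d+) microseconds"
)


def worst_latency_by_device(lines):
    """Return {device: (average_us, worst_us)} across all deterioration warnings."""
    worst = {}
    for line in lines:
        match = DETERIORATED.search(line)
        if match is None:
            continue  # not a deterioration warning (e.g. "has improved" lines)
        avg_us, now_us = int(match["avg"]), int(match["now"])
        previous = worst.get(match["device"])
        if previous is None or now_us > previous[1]:
            worst[match["device"]] = (avg_us, now_us)
    return worst


# Lines copied from the vmkernel log excerpt in this article.
SAMPLE_LINES = [
    "2024-06-12T11:24:01.880Z cpu31:2098014)WARNING: ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has deteriorated. I/O latency increased from average value of 10382 microseconds to 653962 microseconds.",
    "2024-06-12T11:24:05.576Z cpu39:2098007)WARNING: ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has deteriorated. I/O latency increased from average value of 10382 microseconds to 2656761 microseconds.",
    "2024-06-12T11:24:11.024Z cpu23:2098011)ScsiDeviceIO: 1513: Device naa.5000c5009a063c9b performance has improved. I/O latency reduced from 2656761 microseconds to 522372 microseconds.",
]

if __name__ == "__main__":
    for device, (avg_us, worst_us) in worst_latency_by_device(SAMPLE_LINES).items():
        print(f"{device}: average {avg_us} us, worst {worst_us} us")
```

The function accepts any iterable of lines, so it can be fed an open log file as well as a list. A device that appears repeatedly with a worst value far above its average is a candidate for the disk troubleshooting steps that follow.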

To correct this, follow How to troubleshoot vSAN OSA disk issues to remove and recreate the impacted disk or disk group. If the issue reappears, remove the disk or disk group and contact your hardware vendor for a replacement.

If further vSAN performance troubleshooting is required, follow Collecting vSAN Performance Service data for vSAN performance issues to collect logs for the entire cluster, then open a case with VMware by Broadcom for further assistance.

Additional Information

For additional information, refer to the article Troubleshooting vSAN Performance.