VMware vSAN (All Versions)
Because vSAN is a distributed file system that uses resources on every host in the cluster, any delay in communication between hosts, whether caused by network issues, poorly performing disks, or other hardware issues, can impact the vSAN environment.
This can be seen in a number of ways.
When a disk or disk group does not perform as well as the rest of the cluster, it can delay I/O and impact the rest of the environment. This can degrade vSAN backend performance and, if not identified and corrected, can also impact the guest VMs.
Administer the vSAN environment proactively and address any triggered vSAN Skyline Health alerts as soon as possible.
Review vSAN Skyline Health and the vCenter performance metrics for vSAN for anything that is causing delayed I/O. This can be a host with network issues, a hardware issue, or a disk that is showing physical device latency.
If you have reviewed the entire cluster and have not found a cause for the perceived latency, review the individual hosts for network loss and disk latency.
Network
To review the environment for network latency, please reference Troubleshooting the vSAN Network and determine if you have any network loss or high latency. If so, please work with your network team to correct this.
Disk
Disks can display latency in several ways, but the most common is a "performance has deteriorated" message in the vmkernel logs, which also reports "I/O latency increased from average value".
The example below shows a disk whose I/O latency deteriorates from an average of 10382 microseconds to 1314049 and then 2656761 microseconds, before improving to 522372 and finally 102114 microseconds.
YYYY-MM-DD cpu31:209##14)WARNING: ScsiDeviceIO: 1513: Device naa.ID performance has deteriorated. I/O latency increased from average value of 10382 microseconds to 1314049 microseconds.
YYYY-MM-DD cpu39:209##07)WARNING: ScsiDeviceIO: 1513: Device naa.ID performance has deteriorated. I/O latency increased from average value of 10382 microseconds to 2656761 microseconds.
YYYY-MM-DD cpu23:209##11)ScsiDeviceIO: 1513: Device naa.ID performance has improved. I/O latency reduced from 2656761 microseconds to 522372 microseconds.
YYYY-MM-DD cpu7:209##09)ScsiDeviceIO: 1513: Device naa.ID performance has improved. I/O latency reduced from 522372 microseconds to 102114 microseconds.
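When many devices log these messages, it can help to summarize them per device. The following is a minimal Python sketch, not a VMware tool, that assumes the log lines follow the wording shown in the example above. The names `parse_latency_event` and `worst_latency_by_device` are hypothetical helpers introduced for illustration only.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional

# Matches the "performance has deteriorated/improved" messages as worded in
# the example above. Assumption: other builds may word the message differently.
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has (?P<state>deteriorated|improved)\. "
    r"I/O latency (?:increased|reduced) from (?:average value of )?"
    r"(?P<before>\d+) microseconds to (?P<after>\d+) microseconds"
)


class LatencyEvent(NamedTuple):
    device: str
    state: str       # "deteriorated" or "improved"
    before_us: int   # latency before the change, in microseconds
    after_us: int    # latency after the change, in microseconds


def parse_latency_event(line: str) -> Optional[LatencyEvent]:
    """Return the latency event described by one log line, or None."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return LatencyEvent(
        match["device"], match["state"],
        int(match["before"]), int(match["after"]),
    )


def worst_latency_by_device(lines: Iterable[str]) -> Dict[str, int]:
    """Map each device to the highest latency reported while deteriorating."""
    worst: Dict[str, int] = {}
    for line in lines:
        event = parse_latency_event(line)
        if event is not None and event.state == "deteriorated":
            worst[event.device] = max(worst.get(event.device, 0), event.after_us)
    return worst


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < copy_of_vmkernel_log
    for device, latency_us in sorted(worst_latency_by_device(sys.stdin).items()):
        print(f"{device}: peak reported latency {latency_us} microseconds")
```

Run against a copy of the vmkernel log, this lists the peak reported latency for each device, which makes it easier to spot the disk that stands out from the rest of the cluster.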
To correct this, please follow How to troubleshoot vSAN OSA disk issues to remove and recreate the impacted disk/disk group. If the issue reappears, please remove the disk/disk group and reach out to your hardware vendor for a replacement.
If further vSAN performance troubleshooting is required, follow Collecting vSAN Performance Service data for vSAN performance issues to collect logs for the entire cluster, then open a case with VMware by Broadcom for further assistance.
Please refer to the following article for additional information on Troubleshooting vSAN performance.