VMs on a vSAN cluster experience high read latency.
The vSAN performance graphs (vSphere Client > vSAN Cluster > Monitor > vSAN - Performance) show no latency issues on the vSAN physical disks.
However, the vSAN backend latency graph shows high read latency.
VMware vSAN OSA 7.x (Hybrid cluster - SSD cache and HDD capacity devices)
VMware vSAN OSA 8.x (Hybrid cluster - SSD cache and HDD capacity devices)
High read latency on a Hybrid vSAN OSA cluster can be caused by an improperly sized cache.
In Hybrid vSAN OSA clusters, the flash caching device must provide at least 10 percent of the anticipated storage that virtual machines are expected to consume, not including replicas such as mirrors.
In Hybrid vSAN OSA clusters, both reads and writes are served through the cache device.
If data requested by a VM or workload is not present on the cache device, it must first be retrieved from the capacity device into the cache and then sent to the VM or workload.
When the cache-to-capacity ratio is low (less than 10%), these cache misses become more frequent, which can increase read latency.
For example:
A 2-node vSAN cluster with 2 disk groups per host.
Each disk group has one 1.45 TB cache device and four 5.46 TB capacity devices.
In Hybrid vSAN OSA 7.x and 8.x clusters, only 600 GB of each cache device can be utilized. The larger vSAN cache support applies only to All-Flash vSAN configurations.
Thus the total cache for the vSAN cluster is 600 GB * 2 * 2 / 1024 = 2.34 TB
The total capacity of the cluster is 5.46 TB * 4 * 2 * 2 = 87.36 TB.
Suppose 3.64 TB of each capacity disk is consumed. The total utilization of the vSAN datastore is then 3.64 TB * 4 * 2 * 2 = 58.24 TB.
Excluding the RAID 1 replica, the datastore utilization is 58.24 / 2 = 29.12 TB.
Finally, the cache-to-capacity ratio in this cluster is (2.34 / 29.12) * 100 ≈ 8.04%.
This is less than the minimum of 10%.
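The arithmetic in this example can be reproduced with a short script. The following is only a sketch for checking the sizing by hand: the function and parameter names are hypothetical, it does not call any vSAN API, and the 600 GB per-disk-group limit and the RAID 1 divisor of 2 are taken from the example above.

```python
# Sketch of the cache-to-consumed-capacity check from the example.
# All names are hypothetical; nothing here queries vSAN.

GB_PER_TB = 1024
HYBRID_CACHE_LIMIT_GB = 600  # usable cache per disk group in a hybrid cluster


def cache_ratio_percent(hosts, disk_groups_per_host, cache_device_tb,
                        consumed_per_capacity_disk_tb, capacity_disks_per_group,
                        replica_copies=2):
    """Return usable cache as a percentage of consumed capacity, excluding replicas."""
    disk_groups = hosts * disk_groups_per_host
    # Only the first 600 GB of each cache device counts toward the cache.
    usable_cache_per_group_tb = min(cache_device_tb,
                                    HYBRID_CACHE_LIMIT_GB / GB_PER_TB)
    total_cache_tb = usable_cache_per_group_tb * disk_groups
    consumed_tb = (consumed_per_capacity_disk_tb
                   * capacity_disks_per_group * disk_groups)
    # RAID 1 stores two copies, so halve the raw consumption.
    consumed_without_replicas_tb = consumed_tb / replica_copies
    return total_cache_tb / consumed_without_replicas_tb * 100


ratio = cache_ratio_percent(hosts=2, disk_groups_per_host=2,
                            cache_device_tb=1.45,
                            consumed_per_capacity_disk_tb=3.64,
                            capacity_disks_per_group=4)
print(f"cache ratio: {ratio:.2f}%")
print("below 10% minimum" if ratio < 10 else "meets 10% minimum")
```

The script reports roughly 8.05%, marginally higher than the figure in the text because it does not round the cache total to 2.34 TB before dividing. Either way the ratio is below the 10% minimum.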
To resolve this issue, the following options are available:
1. Increase the vSAN cluster's cache by adding more disk groups to the hosts.
2. Migrate VMs off the impacted Hybrid vSAN datastore to reduce datastore utilization.
Option 1 is the preferred long-term fix. Option 2 is not ideal, but it can provide immediate relief from the read latency issue.
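To gauge how far either option has to go, the same figures can be turned around. The sketch below is a rough estimate only: the variable names are hypothetical, it assumes every added disk group contributes 600 GB of usable cache, and it assumes consumption stays at the 29.12 TB (excluding replicas) used in the example.

```python
import math

GB_PER_TB = 1024
USABLE_CACHE_PER_GROUP_TB = 600 / GB_PER_TB  # hybrid limit per disk group
MIN_RATIO = 0.10                             # 10% sizing guideline

# Figures from the example cluster (assumed unchanged).
current_disk_groups = 4
consumed_without_replicas_tb = 29.12

current_cache_tb = USABLE_CACHE_PER_GROUP_TB * current_disk_groups

# Option 1: disk groups needed so that cache >= 10% of consumed capacity.
required_cache_tb = consumed_without_replicas_tb * MIN_RATIO
required_disk_groups = math.ceil(required_cache_tb / USABLE_CACHE_PER_GROUP_TB)
extra_disk_groups = max(0, required_disk_groups - current_disk_groups)

# Option 2: consumed capacity (excluding replicas) the current cache covers.
max_consumed_tb = current_cache_tb / MIN_RATIO
data_to_migrate_tb = max(0.0, consumed_without_replicas_tb - max_consumed_tb)

print(f"additional disk groups needed: {extra_disk_groups}")
print(f"data to migrate (excluding replicas): {data_to_migrate_tb:.2f} TB")
```

For the example cluster this works out to one additional disk group, or about 5.68 TB of VM data (before replicas) moved off the datastore. Actual placement of new disk groups across hosts still has to follow the cluster's own design constraints.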
Refer to the documents below for more information: