vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

Products

VMware vSAN

Issue/Introduction

This article explains the vSAN performance diagnostics issue "The vSAN cache may not be sized correctly" and describes possible ways to address it.

Symptoms:
You see a message in vSAN performance diagnostics that says:

The vSAN cache may not be sized correctly

Environment

VMware vSAN 6.6.x
VMware vSAN 6.7.x

Cause

This message indicates that read performance in a hybrid vSAN cluster may be limited by the vSAN caching tier. In a hybrid vSAN cluster, the read cache stores data read from the hard disk drives (HDDs), so that subsequent reads of the same data are not affected by hard disk latency. The following metrics indicate whether the read cache is performing optimally:

Read Cache Hit Rate
Evictions
Cache Invalidations

A screenshot of this issue follows:

Resolution

What are possible ways to address low Read Cache Hit Rate?

The Read Cache Hit Rate metric indicates what percentage of reads are served from the read cache for the specified disk group. A low Read Cache Hit Rate limits read performance because more reads must be served from the hard disk tier. If you encounter this message, consider the following steps:

1. Note how long the Read Cache Hit Rate stays below the threshold. If the duration is short and the Read Cache Hit Rate is rising continuously, the cache is probably warming up and performance will improve with time. This is expected, and no action is needed other than monitoring read IOPS. A screenshot below shows the Read Cache Hit Rate increasing to 100% with time.
2. Check whether the Read Cache Hit Rate is uniform across all disk groups or lower in one disk group than in the others. If it is much lower on one disk group, the read IO pattern is imbalanced across the disk groups in the cluster. In such cases, performance may be improved by increasing the “Number of disk stripes per object” in the vSAN storage policy. Read more about vSAN storage policies.
3. If the Read Cache Hit Rate is low and uniform across all disk groups, the working data set in use probably cannot be cached in the vSAN caching tier. In that case, increase the size of the vSAN caching tier by adding more disk groups. Alternatively, you can tune the working set of the benchmark by doing one of the following:

- Decrease the number of active VMs on this cluster
- Reduce the number of VM disks accessed by the benchmark
- Reduce the size of the data accessed by the benchmark
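
The three checks above can be sketched as a small triage function. This is only an illustration of the reasoning, not part of vSAN: the function name, the input format (a list of hit rate samples in percent per disk group, oldest first), and both thresholds are assumptions made for the example. vSAN performance diagnostics applies its own thresholds, which are not stated in this article.

```python
# Hypothetical thresholds, chosen only for illustration.
LOW_HIT_RATE = 90.0    # percent; below this a disk group counts as "low"
IMBALANCE_GAP = 20.0   # percentage points between best and worst disk group


def triage_hit_rate(samples_by_group):
    """Classify Read Cache Hit Rate samples per disk group.

    samples_by_group maps a disk group name to a list of hit rate
    samples in percent, oldest first (an assumed input format).
    Returns "ok", "warming_up", "imbalanced", or "cache_too_small".
    """
    latest = {group: samples[-1] for group, samples in samples_by_group.items()}
    low = {group for group, rate in latest.items() if rate < LOW_HIT_RATE}
    if not low:
        return "ok"

    def rising(samples):
        # True when every sample is higher than the one before it.
        return len(samples) >= 2 and all(
            later > earlier for earlier, later in zip(samples, samples[1:])
        )

    # Step 1: low but continuously rising suggests the cache is warming up.
    if all(rising(samples_by_group[group]) for group in low):
        return "warming_up"

    # Step 2: a large gap between disk groups suggests imbalanced read IO.
    spread = max(latest.values()) - min(latest.values())
    if spread >= IMBALANCE_GAP:
        return "imbalanced"

    # Step 3: low and roughly uniform suggests the working set does not fit.
    return "cache_too_small"
```

For example, `triage_hit_rate({"dg1": [50, 48, 45], "dg2": [95, 96, 97]})` falls through to the imbalance check, which corresponds to the storage policy suggestion in step 2.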

What are possible ways to address high evictions?

Evictions indicate how often contents are evicted from the read cache because the cache is fully populated. Evictions typically mean that the working set size is larger than the size of the read cache. Follow steps (2) and (3) under the previous question for a possible remedy. The following screenshot shows high evictions:

What are possible ways to address high cache invalidations?

Cache invalidations indicate the number of writes to the same address offset as data already held in the read cache. When a write operation to an IO address follows a read operation on that address, the contents of the read cache must be updated. Such an eviction is referred to as a cache invalidation. When there are too many invalidations, read performance can be affected because the cache is in constant churn. If you encounter cache invalidations, consider the following options:

1. Note how long cache invalidations stay above the threshold. If the intervals are short and infrequent, this is expected, and you probably do not need to take any action.
2. Check whether the cache invalidations are uniform across all disk groups or much higher in one disk group than in the others.
3. Tune your benchmark to avoid writes after reads on the same offset. For example, a benchmark that does only sequential writes will not run into cache invalidations.
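
To make the difference between evictions and cache invalidations concrete, the following sketch models a read cache as a simple LRU structure and counts both events. It is a toy model under stated assumptions, not a description of how vSAN implements its caching tier: the class `ToyReadCache`, the LRU replacement policy, the block-granular offsets, and the choice to drop a cached block on write are all hypothetical.

```python
from collections import OrderedDict


class ToyReadCache:
    """Toy LRU read cache (hypothetical model, not vSAN's implementation)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # cached offsets, least recently used first
        self.hits = self.misses = self.evictions = self.invalidations = 0

    def read(self, offset):
        if offset in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(offset)  # mark as most recently used
            return
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # cache is full: evict the LRU block
            self.evictions += 1
        self.blocks[offset] = None

    def write(self, offset):
        # Assumption: a write to a cached offset drops the cached copy.
        if offset in self.blocks:
            del self.blocks[offset]
            self.invalidations += 1

    def hit_rate(self):
        reads = self.hits + self.misses
        return 100.0 * self.hits / reads if reads else 0.0


def scan(capacity, working_set, passes=5):
    """Read every offset of the working set in order, several times."""
    cache = ToyReadCache(capacity)
    for _ in range(passes):
        for offset in range(working_set):
            cache.read(offset)
    return cache


fits = scan(capacity=100, working_set=80)      # working set fits in the cache
too_big = scan(capacity=100, working_set=120)  # working set exceeds the cache

read_then_write = ToyReadCache(100)
for offset in range(50):
    read_then_write.read(offset)
    read_then_write.write(offset)              # write after read, same offset

sequential_writes = ToyReadCache(100)
for offset in range(50):
    sequential_writes.write(offset)            # nothing cached, nothing to invalidate

print(fits.hit_rate(), fits.evictions)
print(too_big.hit_rate(), too_big.evictions)
print(read_then_write.invalidations, sequential_writes.invalidations)
```

In this model, evictions appear only when the working set exceeds the cache capacity, while invalidations appear only when writes land on offsets that were read earlier, which matches the distinction drawn in the two questions above.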