This error indicates that one or more disk groups are experiencing congestion. The affected disk groups are listed, along with the type of congestion.
What does congestion mean?
Congestion is a feedback mechanism that reduces the rate of incoming IO requests from the vSAN DOM client layer to a level that the vSAN disk groups can service. The reduction is achieved by introducing an IO delay equivalent to the delay the IO would have incurred at the bottlenecked lower layer. This effectively shifts latency from the lower layers to the ingress without changing the overall throughput of the system. It avoids unnecessary queuing and tail drops in the vSAN LSOM layer, and therefore avoids wasting CPU cycles on processing IO requests that might eventually be dropped. Hence, regardless of the type of congestion, temporary and small congestion values are usually fine and do not noticeably affect system performance. However, sustained and large congestion values may lead to higher latency and lower throughput than desired, and therefore warrant attention and resolution in order to achieve better benchmark performance.
How is congestion reported?
vSAN measures and reports congestion as a scalar value between 0 and 255. The IO delay introduced increases exponentially as the congestion value grows.
What are the possible ways to deal with congestion?
Check if the congestion is sustained and high (>50). In many cases, a high value of congestion is due to a misconfigured or poorly performing system. If you consistently see a high value of congestion, check the following:
- The maximum supported queue depth of the IO controller and the devices. A maximum supported queue depth lower than 100 may cause issues. Check whether the controller is certified in the vSAN HCL (see the example commands after this list).
- Incorrect firmware or device driver versions. Refer to the VMware HCL for vSAN-compatible versions.
- Incorrect sizing. Undersized cache tier disks or memory can lead to high congestion values.
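These checks can be started from the ESXi shell. The commands below are standard esxcli commands; the exact output fields may vary slightly between ESXi releases:
esxcli storage core device list      # note the "Device Max Queue Depth" field for each device
esxcli storage core adapter list     # lists the storage controllers and the driver each one uses
esxcli software vib list             # lists installed VIBs (including driver packages) and their versions for comparison against the HCL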
If none of the above applies, investigate whether the benchmark can be tuned to reduce congestion. Pay attention to whether:
- (1) The congestion is present across all disk groups, or
- (2) One or two disk groups show abnormally higher congestion than the others (a command for mapping devices to their disk groups follows this list).
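Per-disk and per-disk-group congestion values can typically be viewed in the vSAN performance charts. To map the physical devices shown there to their disk groups, the following standard command can be run on each host; the exact field names may vary slightly between vSAN releases:
esxcli vsan storage list     # each device entry includes a "VSAN Disk Group UUID" identifying its disk group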
In case (1), it is likely that the vSAN cluster backend cannot keep up with the IO workload. If possible, tune the benchmark by:
- Turning off some VMs, or
- Reducing the number of outstanding IOs/threads in each VM, or
- For write workloads, reducing the size of the working set.
In case (2), where congestion on one disk group is far higher than on the other disk groups in the system, there is an imbalance in write IO activity across the disk groups. If this happens consistently, try increasing the number of disk stripes in the vSAN storage policy that was used to create the VM disks.
What are the common types of congestion that are reported, and how can I address them?
The types of congestion and remedies for each type are listed below:
- SSD Congestion: SSD congestion is typically raised when the active working set of write IOs for a specific disk group is much larger than the size of that disk group's cache tier. In both hybrid and all-flash vSAN clusters, data is first written to the write cache (also known as the write buffer). A process known as de-staging moves the data from the write buffer to the capacity disks. The write cache absorbs a high write rate, ensuring that write performance is not limited by the capacity disks. However, if a benchmark fills the write cache at a very fast rate, the de-staging process may not be able to keep pace with the arriving IO rate. In such cases, SSD congestion is raised to signal the vSAN DOM client layer to slow IOs down to a rate that the vSAN disk group can handle.
Remedy: To avoid SSD congestion, tune the size of the VM disks that the benchmark uses. For the best results, we recommend that the size of the VM disks (the active working set) be no larger than 40% of the cumulative size of the write caches across all disk groups. Keep in mind that for a hybrid vSAN cluster, the size of the write cache is 30% of the size of the cache tier disk. In an all-flash cluster, the size of the write cache equals the size of the cache tier disk, but no more than 600GB.
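As a worked example with hypothetical numbers: in an all-flash cluster with 4 disk groups, each with an 800GB cache tier device, each write buffer is capped at 600GB, so the cumulative write buffer size is 4 x 600GB = 2400GB and the recommended active working set is at most 40% of that, or roughly 960GB. In a hybrid cluster with the same 4 x 800GB cache tier devices, each write buffer is 30% of 800GB = 240GB, the cumulative write buffer size is 960GB, and the recommended working set is at most roughly 384GB.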
- Log Congestion: Log congestion is typically raised when the vSAN LSOM logs (which store the metadata of IO operations that have not yet been de-staged) consume significant space in the write cache.
Typically, a large volume of small writes on a small working set can generate a large number of vSAN LSOM log entries and cause this type of congestion. Additionally, if the benchmark does not issue 4K-aligned IOs, the number of IOs on the vSAN stack gets inflated to account for 4K alignment. The higher number of IOs can lead to log congestion.
Remedy: Check whether your benchmark aligns IO requests on 4K boundaries. If alignment is not the issue, check whether your benchmark uses a very small working set (one where the total size of the accessed VM disks is less than 10% of the size of the caching tier; see above for how to calculate the caching tier size). If so, increase the working set to 40% of the caching tier size. If neither of these two conditions holds, you will need to reduce write traffic by either reducing the number of outstanding IOs that your benchmark issues or decreasing the number of VMs that the benchmark creates.
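To continue the hypothetical all-flash example above, and reading the caching tier size as the cumulative write buffer size calculated there (2400GB), a working set smaller than about 240GB would count as very small, and the first step would be to grow it toward the roughly 960GB suggested by the 40% guideline.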
- Component Congestion (Comp-Congestion): This congestion indicates a large volume of outstanding commit operations for some components, caused by IO requests to those components being queued. This can lead to higher latency. Typically, a heavy volume of writes to only a few VM disks causes this congestion.
Remedy: Increase the number of VM disks that your benchmark uses, and make sure that the benchmark does not concentrate its IOs on just a few VM disks.
- Memory and Slab Congestion: Memory and slab congestion usually mean that the vSAN LSOM layer is running out of heap memory space or slab space to maintain its internal data structures. vSAN provisions a certain amount of system memory for its internal operations. However, if a benchmark aggressively issues IOs without any throttling, it can cause vSAN to use up all of its allocated memory space.
Remedy: Reduce the working set of your benchmark. Alternatively, while experimenting with benchmarks, increase the following settings to raise the amount of memory reserved for the vSAN LSOM layer. Note that these settings are per disk group, and we do not recommend using them on a production cluster. They are ESXi advanced configuration options and can be changed with esxcfg-advcfg or esxcli (see KB 1038578) as follows:
/LSOM/blPLOGCacheLines, default=128K, increase to 512K
/LSOM/blPLOGLsnCacheLines, default=4K, increase to 32K
/LSOM/blLLOGCacheLines, default=128, increase to 32K
Example:
esxcfg-advcfg --get /LSOM/blPLOGLsnCacheLines
Value of blPLOGLsnCacheLines is 4096
esxcfg-advcfg --set 32768 /LSOM/blPLOGLsnCacheLines
Value of blPLOGLsnCacheLines is 32768
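The same options can also be read and set through the esxcli advanced settings namespace (standard esxcli syntax, shown here with the same target value as above):
esxcli system settings advanced list -o /LSOM/blPLOGLsnCacheLines
esxcli system settings advanced set -o /LSOM/blPLOGLsnCacheLines -i 32768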