vSAN Health Service - Physical Disk Health – Congestion

Article ID: 326891


Products

VMware vSAN

Issue/Introduction

This article explains the Physical Disk Health - Congestion check in the vSAN Health Service and describes why it might report an error.

Environment

VMware vSAN 6.0.x
VMware vSAN 7.0.x
VMware vSAN 8.0.x

Resolution

Q: What does the Physical Disk Health - Congestion check do?

Congestion in vSAN happens when the lower layers fail to keep up with the I/O rate of the higher layers. If this health status is not green (OK), vSAN is still using the disk, but the disk is running with reduced performance, potentially severely reduced. This shows up as low throughput/IOPS and high latency for vSAN objects using this disk group. In these cases, congestion applies to all objects on the disk group.

Q: What does it mean when it is in an error state?

Typical reasons for congestion are faulty or incorrectly sized hardware, misbehaving storage controller firmware, bad controller drivers, a low queue depth on the controller, or problems in the software. For example, if the flash cache device is not sized correctly, virtual machines performing a lot of write operations could fill up the write buffers on the flash cache device. In hybrid configurations, these buffers need to be destaged to magnetic disks. To make room for the now very frequent destaging operations, congestion might be used to slow down the writes from the virtual machine.
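The write-buffer example above can be illustrated with a toy model. This is a minimal sketch and not vSAN's actual congestion algorithm: the function name `simulate_write_buffer`, its parameters, and the numbers used are hypothetical, and all quantities share one arbitrary unit. It only shows that once the buffer is full, admitted writes drop to the destage rate.

```python
def simulate_write_buffer(capacity, ingest_rate, destage_rate, steps):
    """Toy model (not vSAN's algorithm): writes admitted at each time step.

    All quantities use the same arbitrary unit, for example GB and GB per step.
    """
    fill = 0.0
    admitted_per_step = []
    for _ in range(steps):
        # Destaging frees buffer space first (cache tier -> capacity tier).
        fill -= min(fill, destage_rate)
        # New writes are admitted only while free buffer space remains;
        # anything beyond that is held back, which is the slowdown the VM sees.
        admitted = min(ingest_rate, capacity - fill)
        fill += admitted
        admitted_per_step.append(admitted)
    return admitted_per_step


# Hypothetical numbers: the VM writes three times faster than the buffer destages.
history = simulate_write_buffer(capacity=10.0, ingest_rate=3.0,
                                destage_rate=1.0, steps=8)
print(history)  # [3.0, 3.0, 3.0, 3.0, 2.0, 1.0, 1.0, 1.0]
```

In this sketch the virtual machine writes at full speed until the buffer fills, after which it is limited to the destage rate. A larger buffer delays that point but does not remove it while the write rate stays above the destage rate.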
 
One common scenario is a high read cache miss rate, which can also lead to congestion and slow down virtual machine read I/O.
 
High congestion could be the root cause of virtual machine storage performance degradation, operation failures, or even ESXi hosts becoming unresponsive.
 

Q: How does one troubleshoot and fix the error state?

Under high load, when vSAN is operating at its maximum performance, a low amount of congestion (typically a value under 200) is expected and is not a cause for concern. However, any congestion value above 0 combined with low throughput/IOPS indicates an issue. This health check is green (OK) for congestion values below 200, yellow (warning) for values between 200 and 220, and red (alert) for values above 220. The maximum congestion value is 255.

Note: For versions earlier than vSAN 6.7 U1, the thresholds remain 32 (yellow) and 64 (red).
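The thresholds above can be summarized as a small lookup. This is a sketch for reading the numbers, not code from the vSAN Health Service: the names `Status`, `congestion_status`, and `legacy_thresholds` are hypothetical, and the handling of values that fall exactly on a threshold is an assumption noted in the comments.

```python
from enum import Enum

MAX_CONGESTION = 255  # maximum congestion value stated in this article


class Status(Enum):
    GREEN = "OK"
    YELLOW = "warning"
    RED = "alert"


def congestion_status(value: int, legacy_thresholds: bool = False) -> Status:
    """Map a congestion value to the health check color (illustrative only)."""
    if not 0 <= value <= MAX_CONGESTION:
        raise ValueError(f"congestion must be in 0..{MAX_CONGESTION}, got {value}")
    if legacy_thresholds:
        # Versions earlier than 6.7 U1: 32 (yellow) and 64 (red).
        # Assumption: a value equal to a threshold already triggers that state.
        if value >= 64:
            return Status.RED
        if value >= 32:
            return Status.YELLOW
        return Status.GREEN
    # Current thresholds: below 200 green, 200 to 220 yellow, above 220 red.
    # Assumption: both 200 and 220 count as yellow.
    if value > 220:
        return Status.RED
    if value >= 200:
        return Status.YELLOW
    return Status.GREEN


print(congestion_status(150).value)                          # OK
print(congestion_status(210).value)                          # warning
print(congestion_status(40, legacy_thresholds=True).value)   # warning
```

A green result does not by itself rule out a problem: as noted above, any congestion value above 0 combined with low throughput/IOPS still indicates an issue.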
 
VMware recommends engaging VMware Support on congestion-related issues to ensure the root cause is identified.


Additional Information

For more information on collecting VMware vSAN logs, see How to collect vSAN support logs and upload to VMware by Broadcom.
 