A host in the vSAN cluster is experiencing SSD congestion exceeding 100. Per KB vSAN Health Service - Physical Disk Health – Congestion, a value below 200 is acceptable. However, any congestion value above 0 combined with low throughput/IOPS indicates an issue.
Active rebalance is in progress.
Placing the host into maintenance mode in an attempt to alleviate the SSD congestion resulted in the congestion migrating to another host in the cluster.
Operational congestion was found on host 2:
================================================
Thu Jun 12 16:48:35 UTC 2025
524a7fde-5535-963a-047d-############
memCongestion:0
slabCongestion:0
ssdCongestion:137 ---> [Yellow]
iopsCongestion:0
logCongestion:0
compCongestion:0
mdCongestion:0
memCongestionLocalMax:0
slabCongestionLocalMax:0
ssdCongestionLocalMax:137 ---> [Yellow]
iopsCongestionLocalMax:0
logCongestionLocalMax:0
compCongestionLocalMax:0
mdCongestionLocalMax:0
524a7fde-5535-963a-047d-############
Write buffer size (GB): 600
Write buffer free (GB): 83.868
Write buffer usage (%): 86.022 ---> [Yellow]
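The write buffer usage percentage above follows directly from the size and free values. A minimal Python sketch, assuming the `name:value` layout shown in the output above, that recomputes the percentage and lists any congestion counter above 0. The helper names are hypothetical and are not part of any vSAN tool:

```python
import re

# Excerpt of the output above, with the "---> [Yellow]" annotations dropped.
SAMPLE = """\
memCongestion:0
slabCongestion:0
ssdCongestion:137
iopsCongestion:0
logCongestion:0
Write buffer size (GB): 600
Write buffer free (GB): 83.868
"""

def nonzero_congestion(text):
    """Return {counter: value} for every congestion counter above 0."""
    found = {}
    for name, value in re.findall(r"^(\w*Congestion\w*):(\d+)", text, re.MULTILINE):
        if int(value) > 0:
            found[name] = int(value)
    return found

def write_buffer_usage_pct(text):
    """Usage % = (size - free) / size * 100, from the two write buffer lines."""
    size = float(re.search(r"Write buffer size \(GB\):\s*([\d.]+)", text).group(1))
    free = float(re.search(r"Write buffer free \(GB\):\s*([\d.]+)", text).group(1))
    return (size - free) / size * 100

print(nonzero_congestion(SAMPLE))
print(round(write_buffer_usage_pct(SAMPLE), 3))
```

With the values above, (600 - 83.868) / 600 gives the 86.022% reported in the output.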
Run esxcli vsan debug resync summary get to see overall resync progress, and esxcli vsan debug resync list to see more detail, such as the objects being resynced and the source and destination hosts.
[root@Host2:~] esxcli vsan health cluster get -t physdiskcapacity
Disk capacity yellow
Checks the free space on physical disks in the vSAN cluster.
Disks with issues
Host Disk Capacity Free Space Rebalance State UUID
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.##.##.14 Local ATA Disk (naa.################) yellow 18.74% (634.82GB of 3387.72GB) Reactive rebalance task is in progress 52c91b22-28a2-5434-d784-############
10.##.##.14 Local ATA Disk (naa.################) yellow 18.71% (639.43GB of 3417.78GB) Reactive rebalance task is in progress 52d42dae-57ae-b422-58d1-############
10.##.##.14 Local ATA Disk (naa.################) yellow 18.71% (639.43GB of 3417.78GB) Reactive rebalance task is in progress 52024175-81d4-081d-0dfd-############
10.##.##.14 Local ATA Disk (naa.################) yellow 18.74% (634.82GB of 3387.72GB) Reactive rebalance task is in progress 52305267-21df-621d-6eb9-############
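The Free Space percentage in each row can be cross-checked against the GB figures beside it. A minimal Python sketch, assuming the row layout shown above; the function name is hypothetical:

```python
import re

# First row of the table above, copied as-is (identifiers remain masked).
ROW = ("10.##.##.14 Local ATA Disk (naa.################) yellow 18.74% "
       "(634.82GB of 3387.72GB) Reactive rebalance task is in progress "
       "52c91b22-28a2-5434-d784-############")

def free_space_pct(row):
    """Recompute free space % from the '(<free>GB of <capacity>GB)' field."""
    match = re.search(r"\(([\d.]+)GB of ([\d.]+)GB\)", row)
    if match is None:
        raise ValueError("row does not contain a '(xGB of yGB)' field")
    free_gb, capacity_gb = float(match.group(1)), float(match.group(2))
    return free_gb / capacity_gb * 100

print(f"{free_space_pct(ROW):.2f}%")
```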
VMware vSAN [All Versions]
Congestion is a flow control mechanism used by vSAN. Whenever there is a bottleneck in a lower layer of vSAN (closer to the physical storage devices), vSAN uses this flow control mechanism to relieve the bottleneck in the lower layer and instead reduce the rate of incoming I/O at the Client VMs. See KB Understanding Congestion in vSAN for more details.
Placing a host that is experiencing SSD congestion into maintenance mode is not ideal, because it is likely to migrate the congestion to another host in the cluster, especially if a rebalance resync is already in progress. Placing the host into maintenance mode tells vSAN to stop using the storage resources of that host. Once the delay timer hits its default of 60 minutes, a rebuild resync is introduced on top of the rebalance resync, increasing both the amount of data to resync and the time to completion.
If VM performance is impacted by an active vSAN resync and SSD congestion, it is best to throttle the resync on the host suffering from SSD congestion to limit the incoming resync I/O to that host. This can be done with the command esxcli vsan resync throttle set --level <0-512mb>, adjusting the level until you find the right balance between VM performance and time to complete the resync. See KB Understanding host level resync management improvements for more details.
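When trying several throttle levels, it can help to generate the command lines consistently. A minimal Python sketch, assuming the 0-512 range quoted above. The helper name and the example levels are hypothetical, and the script only prints the commands, it does not run them:

```python
def throttle_command(level):
    """Build the esxcli argument list for a given resync throttle level.

    The 0-512 range comes from the text above; it has not been checked
    against any particular ESXi build.
    """
    if not isinstance(level, int) or not 0 <= level <= 512:
        raise ValueError("level must be an integer from 0 to 512")
    return ["esxcli", "vsan", "resync", "throttle", "set", "--level", str(level)]

# Hypothetical candidate levels; pick values based on observed VM performance.
for level in (256, 128, 64):
    print(" ".join(throttle_command(level)))
```

Each printed line is meant to be run on the congested host, observing VM latency and resync progress before trying the next level.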