Virtual machines in the vSAN cluster are reporting as invalid - log congestion reported on vSAN diskgroup


Article ID: 418349


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • All or most of the virtual machines residing on the vSAN datastore are marked as invalid

  • The vCenter Server is down because its VM resides on the vSAN datastore and is also marked as invalid

  • None of the objects are in an inaccessible state; although the virtual machines report as invalid, almost all of the objects are healthy

    This can be validated with the following command:

    esxcli vsan debug object health summary get
    Health Status                                              Number Of Objects
    -----------------------------------------------------------------------------
    remoteAccessible                                                           0
    inaccessible                                                               0
    reduced-availability-with-no-rebuild                                       1
    reduced-availability-with-no-rebuild-delay-timer                           1
    reducedavailabilitywithpolicypending                                       0
    reducedavailabilitywithpolicypendingfailed                                 0
    reduced-availability-with-active-rebuild                                  26
    reducedavailabilitywithpausedrebuild                                       0
    data-move                                                                  0
    nonavailability-related-reconfig                                           0
    nonavailabilityrelatedincompliancewithpolicypending                        0
    nonavailabilityrelatedincompliancewithpolicypendingfailed                  0
    nonavailability-related-incompliance                                       0
    nonavailabilityrelatedincompliancewithpausedrebuild                        0
    healthy                                                                  233
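
    If accessibility needs to be cross-checked at the individual virtual machine level, the per-VMDK debug view can also be consulted. A minimal sketch (the exact columns vary by build):

    esxcli vsan debug vmdk list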

  • Physical disk and congestion issues are reported in vSAN Skyline Health:

    esxcli vsan health cluster list
    Health Test Name                                  Status
    ---------------------------------------------------------------------------
    Overall health findings                           red (Physical disk issue)
    Physical disk                                     red
    Operation health                                  yellow
    Congestion                                        red
    Component limit health                            green
    Component metadata health                         green
    Memory pools (heaps)                              green
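
    For more detail on an individual finding, the specific check can be queried by name. A sketch, using the test name exactly as it appears in the list output above:

    esxcli vsan health cluster get -t "Congestion"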

  • The congestion counters indicate log congestion. Use the following command to validate the congestion levels per disk group:

    # For each disk group (keyed by its cache-tier UUID), dump the LSOM congestion counters
    for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
        echo "$ssd"
        vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion
    done

    Sample output:
    52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    memCongestion:0
    slabCongestion:0
    ssdCongestion:0
    iopsCongestion:0
    logCongestion:252
    compCongestion:0
    maxDeleteCongestion:0
    mdDeleteCongestion:0
    memCongestionLocalMax:0
    slabCongestionLocalMax:0
    ssdCongestionLocalMax:0
    iopsCongestionLocalMax:0
    logCongestionLocalMax:252
    compCongestionLocalMax:0
    mdDeleteCongestionLocalMax:0

  • Execute the following command to validate vSAN LLOG and PLOG consumption levels. In this case, the output indicates high PLOG consumption.

    # Every 30 seconds: print non-zero congestion counters, then the LLOG/PLOG
    # consumption (in GiB) for every disk group on the host. Stop with Ctrl+C.
    while true; do
        clear
        echo "================================================"
        date
        for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
            echo -e "$ssd NOTE: it will not display anything if zero"
            vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Congestion:" | grep -v ":0"
        done
        for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
            llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by LLOG" | awk -F: '{print $2}')
            plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by PLOG" | awk -F: '{print $2}')
            llogGib=$(echo $llogTotal | awk '{print $1 / 1073741824}')
            plogGib=$(echo $plogTotal | awk '{print $1 / 1073741824}')
            allGibTotal=$(expr $llogTotal \+ $plogTotal | awk '{print $1 / 1073741824}')
            echo -e "\n  $ssd \n"
            echo " LLOG consumption: $llogGib"
            echo " PLOG consumption: $plogGib"
            echo " Total log consumption: $allGibTotal"
        done
        sleep 30
    done

    Sample output:

    Mon Nov 10 04:17:41 UTC 2025
    52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx NOTE: it will not display anything if zero logCongestion:252
     
    52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

    LLOG consumption: 0.311802
    PLOG consumption: 23.6882
    Total log consumption: 24

  • The default log congestion low and high limits are 16 GB and 24 GB respectively:

    esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB
    16

    esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB
    24
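
    The values above can be correlated in one pass. The sketch below simply combines the commands already shown in this article to print each disk group's total log consumption next to the configured limits; the field names and unit conversion are the same as in the monitoring loop above:

    low=$(esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB | awk '{print $NF}')
    high=$(esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB | awk '{print $NF}')
    for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
        info=$(vsish -e get /vmkModules/lsom/disks/$ssd/info)
        llog=$(echo "$info" | grep "Log space consumed by LLOG" | awk -F: '{print $2}')
        plog=$(echo "$info" | grep "Log space consumed by PLOG" | awk -F: '{print $2}')
        # Convert bytes to GiB and print alongside the configured low/high limits
        echo "$llog $plog $low $high" | awk -v dg="$ssd" '{printf "%s total log: %.2f GiB (low limit %s GB, high limit %s GB)\n", dg, ($1+$2)/1073741824, $3, $4}'
    done

    In the sample output above, the total log consumption of roughly 24 GiB has reached the high limit, which matches the reported logCongestion value of 252 (congestion counters are reported on a 0-255 scale).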

Environment

VMware vSAN 8.x (applicable to vSAN OSA only)

Cause

Virtual machines are marked as invalid due to very high log congestion on the disk group, which is caused by a failed capacity disk in the vSAN disk group.

Relog is an internal vSAN process that frees up space in the LSOM layer by reclaiming log entries. Relog does not run on a device that remains in a repair state for a long time; when relog on the failed capacity disk does not happen, the PLOG builds up, log congestion rises to its maximum, and I/O is throttled, which leads to latency at the VM level and to the virtual machines being reported as invalid.

Cause Validation:

In the /var/run/log/vsandevicemonitord.log file, the following events are reported, indicating that DDH (Dying Disk Handling) has detected that the disk exceeded the I/O latency threshold during the monitoring interval.
 
WARNING - WRITE Average Latency on VSAN device naa.xxxxxxx has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
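
To see how often the device has tripped the threshold, the log can be filtered directly. A simple grep sketch; the message text may vary slightly between releases:

grep -i "exceeded threshold" /var/run/log/vsandevicemonitord.log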

Events from vmkernel.log indicate that the data evacuation task is in progress and a VOB message is reported indicating that the log congestion threshold has been reached.

2025-11-10T04:33:50.291Z In(182) vmkernel: cpu16:2098902)LSOM: LSOMEventNotify:8407: Throttled: Waiting for open component count to drop to zero on disk 52bcxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ----> problematic capacity disk

2025-11-10T04:33:54.275Z In(182) vmkernel: cpu9:2098902)LSOM: LSOMThrowCongestionVOB:482: Throttled: vSAN node <hostname> maximum LogCong in 52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx reached.
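
These entries can be pulled out of the live log in one step. A sketch; the message wording may differ slightly between builds:

grep -E "LSOMEventNotify|LogCong" /var/run/log/vmkernel.log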

Resolution

To address this issue, remove the faulty capacity drive from the disk group:

  • Place the host into maintenance mode with the "Ensure accessibility" option

  • Remove the failed capacity drive from the disk group. While attempting to remove the failed capacity disk, the error below might be encountered because the DDH mechanism is also trying to unmount the disk, and that unmount is hung due to the open component count. In such cases, reboot the host to clear the lock and then remove the failed capacity drive from the disk group (a command-line sketch follows the error output below)

    A general system error occurred: Failed to get VsanInfo operation lock for diskOpLock, an operation is currently in progress(locked pid: 0), error: /tmp/.vsanDiskOpLock.lock.LOCK: timeout waiting for lock after 30 seconds. Lock is currently held by process 2314628 (vsanesxcmd: /usr/lib/vmware/vsan/bin/vsanesxcmd storage diskgroup unmount -d naa.5000xxxxxxxxxxxxxx)
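
    Once the host is in maintenance mode and any stale lock has been cleared, the same steps can also be performed from the ESXi command line. A minimal sketch, assuming the naa.5000xxxxxxxxxxxxxx device from the error above is the failed capacity disk:

    # Enter maintenance mode with the "Ensure accessibility" vSAN data evacuation mode
    esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

    # Remove the failed capacity disk from its disk group
    esxcli vsan storage remove -d naa.5000xxxxxxxxxxxxxx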

As soon as the problematic drive is removed from the disk group, the log congestion clears automatically. If the problem persists, reach out to Broadcom Support.