All or most of the virtual machines residing on the vSAN datastores are marked as invalid
vCenter Server is down because it resides on the vSAN datastore and is also marked as invalid
None of the objects are in an inaccessible state; although the virtual machines are reported as invalid, almost all of the objects are in a healthy state
This can be validated by using the below command:
esxcli vsan debug object health summary get

Health Status                                                Number Of Objects
------------------------------------------------------------------------------
remoteAccessible                                             0
inaccessible                                                 0
reduced-availability-with-no-rebuild                         1
reduced-availability-with-no-rebuild-delay-timer             1
reducedavailabilitywithpolicypending                         0
reducedavailabilitywithpolicypendingfailed                   0
reduced-availability-with-active-rebuild                     26
reducedavailabilitywithpausedrebuild                         0
data-move                                                    0
nonavailability-related-reconfig                             0
nonavailabilityrelatedincompliancewithpolicypending          0
nonavailabilityrelatedincompliancewithpolicypendingfailed    0
nonavailability-related-incompliance                         0
nonavailabilityrelatedincompliancewithpausedrebuild          0
healthy                                                      233
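If more granular detail is needed, the per-VMDK object state can also be reviewed from a host. A minimal sketch using the vSAN debug namespace (output columns vary by ESXi build; treat this as an optional additional check, not part of the original validation):

# List vSAN objects backing VM disks on this host, with their reported health
esxcli vsan debug vmdk list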
Physical disk issues and congestion issues are reported in the vSAN Skyline Health
esxcli vsan health cluster list
---------------------------------------------------------------------------
Health Test Name                 Status
Overall health findings          red (Physical disk issue)
Physical disk                    red
Operation health                 yellow
Congestion                       red
Component limit health           green
Component metadata health        green
Memory pools (heaps)             green

The congestion levels indicate log congestion. Use the below command to validate the congestion levels:
for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do echo $ssd; vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion; done

52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
   memCongestion:0
   slabCongestion:0
   ssdCongestion:0
   iopsCongestion:0
   logCongestion:252
   compCongestion:0
   maxDeleteCongestion:0
   mdDeleteCongestion:0
   memCongestionLocalMax:0
   slabCongestionLocalMax:0
   ssdCongestionLocalMax:0
   iopsCongestionLocalMax:0
   logCongestionLocalMax:252
   compCongestionLocalMax:0
   mdDeleteCongestionLocalMax:0
Execute the following command to validate vSAN LLOG and PLOG consumption levels. In this case, the output indicates high PLOG consumption.
while true; do
  clear
  echo "================================================"
  date
  # Show non-zero congestion counters per disk group (nothing is printed if all values are zero)
  for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
    echo -e "$ssd NOTE: it will not display anything if zero"
    vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Congestion:" | grep -v ":0"
  done
  # Report LLOG/PLOG consumption in GiB per disk group
  for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
    llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by LLOG" | awk -F \: '{print $2}')
    plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by PLOG" | awk -F \: '{print $2}')
    llogGib=$(echo $llogTotal | awk '{print $1 / 1073741824}')
    plogGib=$(echo $plogTotal | awk '{print $1 / 1073741824}')
    allGibTotal=$(expr $llogTotal \+ $plogTotal | awk '{print $1 / 1073741824}')
    echo -e "\n $ssd \n"
    echo " LLOG consumption: $llogGib"
    echo " PLOG consumption: $plogGib"
    echo " Total log consumption: $allGibTotal"
  done
  sleep 30
done
Sample output:

Mon Nov 10 04:17:41 UTC 2025

52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx NOTE: it will not display anything if zero
   logCongestion:252

 52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

 LLOG consumption: 0.311802
 PLOG consumption: 23.6882
 Total log consumption: 24
The observed total log consumption (~24 GiB) has reached the configured log congestion limits, which can be checked with the following advanced settings:

esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB
16

esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB
24
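As a convenience, the per-disk-group log consumption can be compared against the configured high limit in one pass. A minimal sketch combining only the commands shown above, assuming the numeric value is the last field of the esxcfg-advcfg output:

highLimit=$(esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB | awk '{print $NF}')
for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
  llog=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by LLOG" | awk -F \: '{print $2}')
  plog=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by PLOG" | awk -F \: '{print $2}')
  # Convert bytes to GiB and print alongside the high log congestion limit
  echo "$llog $plog $highLimit" | awk -v uuid=$ssd '{printf "%s total log consumption: %.2f GiB (high limit: %s GB)\n", uuid, ($1 + $2) / 1073741824, $3}'
done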
VMware vSAN 8.x (applicable for vSAN OSA only)
Virtual machines are marked as invalid due to very high log congestion on the disk group, which is caused by a failed capacity disk in the vSAN disk group.
Relog is an internal vSAN process used to free up space in the LSOM layer through log reclamation. Relog does not run against a device that remains in a repair state for a long time, so when relog cannot proceed on the failed capacity disk, the PLOG builds up, leading to log congestion and latency at the VM level.
In the /var/run/log/vsandevicemonitord.log file, the following events are reported, indicating that DDH (Degraded Device Handling) has detected that the disk exceeded the I/O latency threshold during the monitoring interval.
WARNING - WRITE Average Latency on VSAN device naa.xxxxxxx has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
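To quickly check how often DDH has flagged the device, these warnings can be filtered from the log. A minimal sketch using the message text shown above:

grep -i "exceeded threshold" /var/run/log/vsandevicemonitord.log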
Events in vmkernel.log indicate that the data evacuation task is in progress, and a VOB message is reported indicating that the log congestion threshold has been reached.
2025-11-10T04:33:50.291Z In(182) vmkernel: cpu16:2098902)LSOM: LSOMEventNotify:8407: Throttled: Waiting for open component count to drop to zero on disk 52bcxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ----> problematic capacity disk
2025-11-10T04:33:54.275Z In(182) vmkernel: cpu9:2098902)LSOM: LSOMThrowCongestionVOB:482: Throttled: vSAN node <hostname> maximum LogCong in 52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx reached.
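These messages can be located on the host with a simple filter. A minimal sketch using the function names from the entries above (assuming the default log location /var/run/log/vmkernel.log):

grep -E "LSOMEventNotify|LSOMThrowCongestionVOB" /var/run/log/vmkernel.log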
To address this issue, remove the faulty capacity drive from the disk group:
Place the host into maintenance mode with the Ensure Accessibility option.
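If preferred, this can also be done from the host command line. A sketch, assuming maintenance mode is entered with the Ensure Accessibility vSAN data migration mode:

esxcli system maintenanceMode set -e true -m ensureObjectAccessibility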
Remove the failed capacity drive from the disk group. While attempting to remove the failed capacity disk, the error below might be encountered, because the DDH mechanism is also attempting to unmount the disk, and that unmount operation hangs while there is a non-zero open component count. In such cases, reboot the host to release the lock and then remove the failed capacity drive from the disk group.
A general system error occurred: Failed to get VsanInfo operation lock for diskOpLock, an operation is currently in progress(locked pid: 0), error: /tmp/.vsanDiskOpLock.lock.LOCK: timeout waiting for lock after 30 seconds. Lock is currently held by process 2314628 (vsanesxcmd: /usr/lib/vmware/vsan/bin/vsanesxcmd storage diskgroup unmount -d naa.5000xxxxxxxxxxxxxx)

As soon as the problematic drive is removed from the disk group, the log congestion will be automatically addressed. If the problem persists, reach out to Broadcom Support.
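For reference, a minimal command-line sketch of the disk removal step above, where naa.5000xxxxxxxxxxxxxx is the masked device name from the error message (substitute the actual failed capacity device identified on the host):

esxcli vsan storage list                              # identify the failed capacity device and its disk group
esxcli vsan storage remove -d naa.5000xxxxxxxxxxxxxx  # remove the capacity disk from the disk group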