Virtual machines in the vSAN cluster are reporting as invalid - log congestion reported on vSAN diskgroup


Article ID: 418349


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • All or most of the virtual machines residing on the vSAN datastore are marked as invalid

  • The vCenter Server is down because its VM resides on the vSAN datastore and is also marked as invalid

  • None of the objects are in an inaccessible state; although the virtual machines report as invalid, almost all of the objects are healthy

    This can be validated with the following command:

    esxcli vsan debug object health summary get
    Health Status                                              Number Of Objects
    -----------------------------------------------------------------------------
    remoteAccessible                                                           0
    inaccessible                                                               0
    reduced-availability-with-no-rebuild                                       1
    reduced-availability-with-no-rebuild-delay-timer                           1
    reducedavailabilitywithpolicypending                                       0
    reducedavailabilitywithpolicypendingfailed                                 0
    reduced-availability-with-active-rebuild                                  26
    reducedavailabilitywithpausedrebuild                                       0
    data-move                                                                  0
    nonavailability-related-reconfig                                           0
    nonavailabilityrelatedincompliancewithpolicypending                        0
    nonavailabilityrelatedincompliancewithpolicypendingfailed                  0
    nonavailability-related-incompliance                                       0
    nonavailabilityrelatedincompliancewithpausedrebuild                        0
    healthy                                                                  233
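
    If accessibility needs to be cross-checked at the individual virtual machine level, the per-VMDK debug view can also be consulted. A minimal sketch (the exact columns vary by build):

    esxcli vsan debug vmdk list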

  • Physical disk and congestion issues are reported in vSAN Skyline Health:

    esxcli vsan health cluster list
    Health Test Name                                  Status
    ---------------------------------------------------------------------------
    Overall health findings                           red (Physical disk issue)
    Physical disk                                     red
    Operation health                                  yellow
    Congestion                                        red
    Component limit health                            green
    Component metadata health                         green
    Memory pools (heaps)                              green
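
    For more detail on an individual finding, the specific check can be queried by name. A sketch, using the test name exactly as it appears in the list output above:

    esxcli vsan health cluster get -t "Congestion"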

  • The congestion counters indicate log congestion. Use the following command to validate the congestion levels per disk group:

    # For each disk group (keyed by its cache-tier UUID), dump the LSOM congestion counters
    for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
        echo "$ssd"
        vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion
    done

    Sample output:
    52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    memCongestion:0
    slabCongestion:0
    ssdCongestion:0
    iopsCongestion:0
    logCongestion:252
    compCongestion:0
    maxDeleteCongestion:0
    mdDeleteCongestion:0
    memCongestionLocalMax:0
    slabCongestionLocalMax:0
    ssdCongestionLocalMax:0
    iopsCongestionLocalMax:0
    logCongestionLocalMax:252
    compCongestionLocalMax:0
    mdDeleteCongestionLocalMax:0

  • Execute the following command to validate vSAN LLOG and PLOG consumption levels. In this case, the output indicates high PLOG consumption.

    # Every 30 seconds: print non-zero congestion counters, then the LLOG/PLOG
    # consumption (in GiB) for every disk group on the host. Stop with Ctrl+C.
    while true; do
        clear
        echo "================================================"
        date
        for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
            echo -e "$ssd NOTE: it will not display anything if zero"
            vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Congestion:" | grep -v ":0"
        done
        for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
            llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by LLOG" | awk -F: '{print $2}')
            plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by PLOG" | awk -F: '{print $2}')
            llogGib=$(echo $llogTotal | awk '{print $1 / 1073741824}')
            plogGib=$(echo $plogTotal | awk '{print $1 / 1073741824}')
            allGibTotal=$(expr $llogTotal \+ $plogTotal | awk '{print $1 / 1073741824}')
            echo -e "\n  $ssd \n"
            echo " LLOG consumption: $llogGib"
            echo " PLOG consumption: $plogGib"
            echo " Total log consumption: $allGibTotal"
        done
        sleep 30
    done

    Sample output:

    Mon Nov 10 04:17:41 UTC 2025
    52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx NOTE: it will not display anything if zero logCongestion:252
     
    52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

    LLOG consumption: 0.311802
    PLOG consumption: 23.6882
    Total log consumption: 24

  • The default log congestion low and high limits are 16 GB and 24 GB respectively:

    esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB
    16

    esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB
    24
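
    The values above can be correlated in one pass. The sketch below simply combines the commands already shown in this article to print each disk group's total log consumption next to the configured limits; the field names and unit conversion are the same as in the monitoring loop above:

    low=$(esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB | awk '{print $NF}')
    high=$(esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB | awk '{print $NF}')
    for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
        info=$(vsish -e get /vmkModules/lsom/disks/$ssd/info)
        llog=$(echo "$info" | grep "Log space consumed by LLOG" | awk -F: '{print $2}')
        plog=$(echo "$info" | grep "Log space consumed by PLOG" | awk -F: '{print $2}')
        # Convert bytes to GiB and print alongside the configured low/high limits
        echo "$llog $plog $low $high" | awk -v dg="$ssd" '{printf "%s total log: %.2f GiB (low limit %s GB, high limit %s GB)\n", dg, ($1+$2)/1073741824, $3, $4}'
    done

    In the sample output above, the total log consumption of roughly 24 GiB has reached the high limit, which matches the reported logCongestion value of 252 (congestion counters are reported on a 0-255 scale).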

Environment

VMware vSAN 8.x (applicable to vSAN OSA only)

Cause

Virtual machines are marked as invalid due to very high log congestion on the disk group, which is caused by a failed capacity disk in the vSAN disk group.

Relog is an internal vSAN process that frees up space in the LSOM layer by reclaiming log entries. Relog does not run on a device that remains in a repair state for a long time; when relog on the failed capacity disk does not happen, the PLOG builds up, log congestion rises to its maximum, and I/O is throttled, which leads to latency at the VM level and to the virtual machines being reported as invalid.

Cause Validation:

In the /var/run/log/vsandevicemonitord.log file, the following events are reported, indicating that DDH (Dying Disk Handling) has detected that the disk exceeded the I/O latency threshold during the monitoring interval.
 
WARNING - WRITE Average Latency on VSAN device naa.xxxxxxx has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
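
To see how often the device has tripped the threshold, the log can be filtered directly. A simple grep sketch; the message text may vary slightly between releases:

grep -i "exceeded threshold" /var/run/log/vsandevicemonitord.log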

Events from vmkernel.log indicate that the data evacuation task is in progress and a VOB message is reported indicating that the log congestion threshold has been reached.

2025-11-10T04:33:50.291Z In(182) vmkernel: cpu16:2098902)LSOM: LSOMEventNotify:8407: Throttled: Waiting for open component count to drop to zero on disk 52bcxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ----> problematic capacity disk

2025-11-10T04:33:54.275Z In(182) vmkernel: cpu9:2098902)LSOM: LSOMThrowCongestionVOB:482: Throttled: vSAN node <hostname> maximum LogCong in 52d9xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx reached.
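
These entries can be pulled out of the live log in one step. A sketch; the message wording may differ slightly between builds:

grep -E "LSOMEventNotify|LogCong" /var/run/log/vmkernel.log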

Resolution

To address this issue, remove the faulty capacity drive from the disk group:

  • Place the host into maintenance mode with the "Ensure accessibility" option

  • Remove the failed capacity drive from the disk group. While attempting to remove the failed capacity disk, the error below might be encountered because the DDH mechanism is also trying to unmount the disk, and that unmount is hung due to the open component count. In such cases, reboot the host to clear the lock and then remove the failed capacity drive from the disk group (a command-line sketch follows the error output below)

    A general system error occurred: Failed to get VsanInfo operation lock for diskOpLock, an operation is currently in progress(locked pid: 0), error: /tmp/.vsanDiskOpLock.lock.LOCK: timeout waiting for lock after 30 seconds. Lock is currently held by process 2314628 (vsanesxcmd: /usr/lib/vmware/vsan/bin/vsanesxcmd storage diskgroup unmount -d naa.5000xxxxxxxxxxxxxx)
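
    Once the host is in maintenance mode and any stale lock has been cleared, the same steps can also be performed from the ESXi command line. A minimal sketch, assuming the naa.5000xxxxxxxxxxxxxx device from the error above is the failed capacity disk:

    # Enter maintenance mode with the "Ensure accessibility" vSAN data evacuation mode
    esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

    # Remove the failed capacity disk from its disk group
    esxcli vsan storage remove -d naa.5000xxxxxxxxxxxxxx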

As soon as the problematic drive is removed from the disk group, the log congestion clears automatically. If the problem persists, reach out to Broadcom Support.