vSAN Disk Group shows as unhealthy and cache disk shows as evacuated

Article ID: 438445

Products

VMware vSAN

Issue/Introduction

Symptoms :

  • The vSAN disk group reports an unhealthy state on the ESXi host, and the cache disk status shows as evacuated
  • Deduplication and compression: Disabled

Issue Validation :

  • From ESXi Host > Configure > Storage Devices, the disk is visible and appears attached
  • Output of esxcli vsan storage list confirms:
    • Disk is present in CMMDS
    • Disk is mounted and actively used by the host
    • No checksum or on-disk format inconsistencies detected

      naa.#####9:
         Device: naa.5####
         Display Name: naa.######
         Is SSD: true
         VSAN UUID:#####
         VSAN Disk Group UUID: #####4
         VSAN Disk Group Name: ######
         Used by this host: true
         In CMMDS: true
         On-disk format version: 19
         Deduplication: false
         Compression: false
         Checksum: 180###5381
         Checksum OK: true
         Is Capacity Tier: false
         Encryption Metadata Checksum OK: true
         Encryption: false
         DiskKeyLoaded: false
         Is Mounted: true
         Creation Time: Tue Apr  7 07:22:36 2026
  • Object health validation indicates all vSAN objects are healthy
  • Additional validation from iLO confirms that the disk is physically visible
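
The fields checked during issue validation can be filtered out of the command output instead of being read by eye. The following is a minimal sketch, assuming the output layout shown above; the function name summarize_vsan_disk is hypothetical and not part of any VMware tooling.

```shell
# Hypothetical helper: print only the fields used for issue validation.
# Reads `esxcli vsan storage list` output on stdin.
summarize_vsan_disk() {
    awk -F': ' '
        /^ *(Device|Used by this host|In CMMDS|Checksum OK|Is Capacity Tier|Is Mounted): / {
            sub(/^ +/, "", $1)      # drop the leading indentation
            print $1 "=" $2
        }'
}
```

Example usage on the host: esxcli vsan storage list | summarize_vsan_disk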

Environment

VMware vSAN 8.x (OSA)

Cause

LLOG accumulation occurs when the Commit Flusher stops moving data from the write buffer to the PLOG. This backlog is typically caused by high latency or hardware failures on the underlying cache disks.

  • The disk evacuation is caused by the vSAN Dying Disk Handling (DDH) mechanism.
  • When vSAN detects excessive log congestion in the cache tier during monitoring intervals, it proactively:
    • Marks the disk group as unhealthy
    • Initiates data evacuation

Cause Validation :

  • Congestion metrics indicate logCongestion reaching 252, exceeding the threshold

  • Command to check congestion:

    for ssd in $(localcli vsan storage list | grep "5###94" | awk '{print $5}' | sort -u); do echo $ssd; vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion; done

    [root@C####3:/vmfs/volumes/6####/log] for ssd in $(localcli vsan storage list |grep "5###94"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done

    memCongestion:0
    slabCongestion:0
    ssdCongestion:0
    iopsCongestion:0
    logCongestion:252
    compCongestion:0
    maxDeleteCongestion:0
    mdDeleteCongestion:0
    memCongestionLocalMax:0
    slabCongestionLocalMax:0
    ssdCongestionLocalMax:0
    iopsCongestionLocalMax:0
    logCongestionLocalMax:252
    compCongestionLocalMax:0
    mdDeleteCongestionLocalMax:0

    ================================================
    ########### NOTE: it will not display anything if zero
    logCongestion:252
    52f###-f###-###-###-#####
    LLOG consumption: 23.9982
    PLOG consumption: 0.00183868
    Total log consumption: 24
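
To spot the offending counter quickly, the name:value lines printed by the vsish command can be filtered against a limit. This is a minimal sketch; the function name congestion_over is hypothetical, and using 200 as the limit is an assumption based on the threshold reported in the LSOM log messages in this article.

```shell
# Hypothetical helper: print congestion counters above a limit.
# Reads "name:value" lines (as printed by the vsish command) on stdin.
congestion_over() {
    limit="$1"
    awk -F: -v limit="$limit" '
        /Congestion/ && ($2 + 0) > (limit + 0) {
            gsub(/[ \t]/, "", $1)   # strip any indentation around the name
            print $1 ": " ($2 + 0)
        }'
}
```

Example usage on the host: vsish -e get /vmkModules/lsom/disks/$ssd/info | congestion_over 200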

  • Persistent LSOM congestion: both /var/run/log/vobd.log and /var/run/log/vmkernel.log show LSOM MemCong, SSDCong, and LogCong exceeding the threshold of 200 (current values 202 to 204). The congestion repeats over time rather than appearing as a one-off spike, which means that LSOM (the vSAN Local Log-Structured Object Manager) cannot keep up with I/O, especially in the cache/log layer.

      2026-04-07T15:31:17.644Z In(182) vmkernel: cpu38:27301040)LSOM: LSOMThrowAsyncCongestionVOB:550: LSOM MemCong in ##### Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 204.
      2026-04-07T15:37:30.897Z In(182) vmkernel: cpu37:27301040)LSOM: LSOMThrowAsyncCongestionVOB:550: LSOM SSDCong in ######4 Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 204.
      2026-04-16T01:59:48.986Z In(182) vmkernel: cpu23:27301040)LSOM: LSOMThrowAsyncCongestionVOB:550: LSOM LogCong in 5##### Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
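
Whether the congestion is persistent or a single spike is easier to judge from a count per congestion type. The sketch below assumes the message format shown in the excerpt above; the function name count_lsom_congestion is hypothetical.

```shell
# Hypothetical helper: count LSOM congestion events per type.
# Reads vmkernel.log lines on stdin.
count_lsom_congestion() {
    grep 'LSOMThrowAsyncCongestionVOB' |
        sed -n 's/.*LSOM \([A-Za-z]*Cong\) in .*/\1/p' |
        sort | uniq -c | awk '{print $2 ": " $1}'
}
```

Example usage on the host: count_lsom_congestion < /var/run/log/vmkernel.log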

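  • SSD log corruption / reinitialization: /var/run/log/vmkernel.log shows that the vSAN log structure on the cache disk was corrupted. The system attempted to recover the SSD log, recovery failed because both checkpoints were invalid, and the log was reinitialized.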

     2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOGInitDescForIO:1153: device: 52######94 Recovering ssdlog. It might take a while...
     2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOG_IsValidCP:214: device: 52######4 Invalid checkpoint magic/version. Magic 0x0,ver 0:0
     2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOG_IsValidCP:214: device: 5#### Invalid checkpoint magic/version. Magic 0x0,ver 0:0
     2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOG_Recover:340: device: 5######Both checkpoints are invalid.. Disk needs to be initialized
     2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOGInitDescForIO:1157: device: 5######### SSD is not initialized, initializing...

  • Backend storage/controller errors: /var/run/log/vmkernel.log shows I/O failures at the controller level.

     2026-04-07T12:12:32.252Z In(182) vmkernel: cpu19:2097829)ScsiDeviceIO: 4580: Cmd(0x45bae7d6ddc0) 0x2a, CmdSN 0x1bf65aef from world 0 to dev "#####9" failed H:0xc D:0x0 P:0x0
     2026-04-07T12:12:32.450Z Wa(180) vmkwarning: cpu2:2097827)WARNING: HPP: HppScsiThrottleLogForDevice:585: Cmd 0x2a (0x45bac84dabc0, 0) to dev "####9" on path "vmhba1:C0:T8:L0" Failed:
     2026-04-07T12:12:32.450Z Wa(180) vmkwarning: cpu2:2097827)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. hppAction = 3
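
To gauge how often commands fail at the controller level, the host/device/plugin status triplets can be tallied. This is a minimal sketch that assumes the H:/D:/P: notation shown in the excerpt above; the function name count_scsi_status is hypothetical, and the sketch does not decode the status codes.

```shell
# Hypothetical helper: count SCSI status triplets (H:/D:/P:).
# Reads vmkernel.log lines on stdin.
count_scsi_status() {
    grep -o 'H:0x[0-9a-f]* D:0x[0-9a-f]* P:0x[0-9a-f]*' |
        sort | uniq -c | awk '{print $2, $3, $4 ": " $1}'
}
```

Example usage on the host: grep 'naa.5' /var/run/log/vmkernel.log | count_scsi_status, where the grep pattern stands in for the ID of the affected device.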

  • vSAN later marks the device unhealthy: /var/run/log/vobd.log shows that vSAN considers the disk group log congested and the device unhealthy.

    2026-04-17T07:27:36.784Z In(14) vobd[2097588]:  [vSANCorrelator] 10290776758635us: [vob.vsan.lsom.diskunhealthy] vSAN device 52######## is unhealthy.
    2026-04-17T07:27:36.784Z In(14) vobd[2097588]:  [vSANCorrelator] 10290886526702us: [esx.problem.vob.vsan.lsom.diskunhealthy] vSAN device #######4 is unhealthy.
    2026-04-17T07:27:36.784Z In(14) vobd[2097588]:  [vSANCorrelator] 10290776758643us: [vob.vsan.lsom.diskgrouplogcongested] vSAN diskgroup ###### log is congested.
    2026-04-17T07:27:36.784Z In(14) vobd[2097588]:  [vSANCorrelator] 10290886526788us: [esx.problem.vob.vsan.lsom.diskgrouplogcongested] vSAN diskgroup 5######## log is congested.

  • Automatic evacuation triggered: /var/run/log/vsandevicemonitord.log shows that vSAN then evacuated data from the cache disk, treating the congestion as a failure scenario.

     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: WARNING - Maximum log congestion on VSAN device naa.### 2/2 times.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Found congestion: Evacuating disk naa.5####..
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Exception getting SMART health status for vSAN disk naa###9.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Critical SMART health attributes for VSAN device naa.#### are shown below.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Uncorrectable sectors: 0.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Reported uncorrectable sectors: 0.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Sector reallocation events: 0.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Sectors successfully reallocated: 0.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Pending sector reallocations: 0.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Disk command timeouts: 0.
     2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Tier 1 (naa.5#####9) failure due to log congestion.

  • Final state: /var/run/log/vsandevicemonitord.log shows that the disk group has failed permanently and the issue is no longer transient.

     2026-04-17T07:37:37Z In(14) vsandevicemonitord[2099430]: Device naa.#####9 state is DISKGROUP_UNDER_PERM_ERROR
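
The DDH sequence described above (evacuation, log-congestion failure, permanent error) can be summarized from the vsandevicemonitord log in one pass. The sketch below matches only the message fragments quoted in this article; the function name ddh_summary is hypothetical.

```shell
# Hypothetical helper: summarize DDH activity.
# Reads vsandevicemonitord log lines on stdin.
ddh_summary() {
    awk '
        /Evacuating disk/               { evac++ }
        /failure due to log congestion/ { fail++ }
        /DISKGROUP_UNDER_PERM_ERROR/    { perm++ }
        END {
            print "evacuations: " (evac + 0)
            print "log-congestion failures: " (fail + 0)
            print "permanent-error states: " (perm + 0)
        }'
}
```

A non-zero count of permanent-error states corresponds to the final state shown above, in which the issue is no longer transient.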

Resolution

  • Place the affected ESXi host into Maintenance Mode, ensuring you select the Ensure Accessibility option.
  • Reboot the affected ESXi host.
  • Once the host is back online, verify the health status of the affected disk and follow the applicable scenario below:

    Scenario A (Workaround): Disk state is healthy. If the reboot clears the error and the disk shows as healthy, rebuild the disk group:

    • Remove the affected disk group.
    • Re-create the disk group and add it back to the vSAN cluster.
    • Exit Maintenance Mode on the ESXi host.

    Scenario B (Resolution): Disk state remains unhealthy (hardware failure). If the disk still shows as unhealthy after the reboot, the drive has entered a permanent error state:

    • Keep the host in Maintenance Mode.
    • Contact your hardware vendor to dispatch a replacement for the failed disk.
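
For hosts managed from the ESXi shell, the Scenario A steps map roughly onto esxcli commands. The sketch below only prints the command sequence for review and runs nothing. The function name print_recovery_plan and the device IDs passed to it are hypothetical, and the sketch assumes the disk group is removed and re-created with the same cache and capacity devices. Confirm each command against the esxcli help on the host before running it.

```shell
# Hypothetical helper: print (do not run) an esxcli sequence for Scenario A.
# Usage: print_recovery_plan <cache-device> <capacity-device>...
print_recovery_plan() {
    cache="$1"; shift
    disks=""
    for d in "$@"; do disks="$disks -d $d"; done
    # Enter Maintenance Mode with the Ensure Accessibility option, then reboot.
    echo "esxcli system maintenanceMode set -e true -m ensureObjectAccessibility"
    echo "esxcli system shutdown reboot -r 'vSAN disk group recovery'"
    # After the reboot: check the disk state before touching the disk group.
    echo "esxcli vsan storage list"
    # Only if the disk is healthy: remove and re-create the disk group.
    echo "esxcli vsan storage remove -s $cache"
    echo "esxcli vsan storage add -s $cache$disks"
    echo "esxcli system maintenanceMode set -e false"
}
```

Example usage: print_recovery_plan naa.CACHE naa.CAP1 naa.CAP2, where the three device names are placeholders for the affected disk group's devices. If the disk remains unhealthy after the reboot, stop after the storage list check and follow Scenario B.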