vSAN Disk Group shows as unhealthy and cache disk shows as evacuated

Products

VMware vSAN

Issue/Introduction

Symptoms:

A vSAN disk group reports an unhealthy state on an ESXi host, and the cache disk status shows as evacuated.

Deduplication and compression: Disabled

Issue Validation:

From ESXi Host > Configure > Storage Devices, the disk is visible and appears attached.
Validation using esxcli vsan storage list confirms:
- Disk is present in CMMDS
- Disk is mounted and actively used by the host
- No checksum or on-disk format inconsistencies detected
  
  naa.#####9:
  Device: naa.5####
  Display Name: naa.######
  Is SSD: true
  VSAN UUID:#####
  VSAN Disk Group UUID: #####4
  VSAN Disk Group Name: ######
  Used by this host: true
  In CMMDS: true
  On-disk format version: 19
  Deduplication: false
  Compression: false
  Checksum: 180###5381
  Checksum OK: true
  Is Capacity Tier: false
  Encryption Metadata Checksum OK: true
  Encryption: false
  DiskKeyLoaded: false
  Is Mounted: true
  Creation Time: Tue Apr 7 07:22:36 2026

Object health validation indicates that all vSAN objects are healthy.
Additional validation from iLO confirms that the disk is physically visible.

Environment

VMware vSAN 8.x (OSA)

Cause

LLOG accumulation occurs when the Commit Flusher stops moving data from the write buffer to the PLOG. This backlog is typically caused by high latency or hardware failures on the underlying cache disks.

The disk evacuation is triggered by the vSAN Dying Disk Handling (DDH) mechanism.
When vSAN detects excessive log congestion in the cache tier during monitoring intervals, it proactively:
- Marks the disk group as unhealthy
- Initiates data evacuation

Cause Validation:

Congestion metrics show logCongestion at 252, which exceeds the threshold.
Command to check congestion:
for ssd in $(localcli vsan storage list | grep "5###94" | awk '{print $5}' | sort -u); do echo $ssd; vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion; done

[root@C####3:/vmfs/volumes/6####/log] for ssd in $(localcli vsan storage list |grep "5###94"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done

memCongestion:0
slabCongestion:0
ssdCongestion:0
iopsCongestion:0
logCongestion:252
compCongestion:0
maxDeleteCongestion:0
mdDeleteCongestion:0
memCongestionLocalMax:0
slabCongestionLocalMax:0
ssdCongestionLocalMax:0
iopsCongestionLocalMax:0
logCongestionLocalMax:252
compCongestionLocalMax:0
mdDeleteCongestionLocalMax:0

================================================

NOTE: Nothing is displayed if the value is zero.
logCongestion:252 52f###-f###-###-###-##### LLOG consumption: 23.9982 PLOG consumption: 0.00183868 Total log consumption: 24
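The counter listing above is easier to scan when only the values over the threshold are printed. The following is a minimal sketch, not part of the original procedure: flag_congestion is a hypothetical helper name, and the default threshold of 200 is an assumption taken from the "Congestion Threshold: 200" value in the vmkernel log lines in this article.

```shell
#!/bin/sh
# flag_congestion [threshold]
# Reads "name:value" lines on stdin (the output of the vsish | grep Congestion
# loop) and prints only the counters whose value is greater than the threshold.
# Hypothetical helper; the default of 200 is assumed from the vmkernel log.
flag_congestion() {
    threshold="${1:-200}"
    awk -F: -v t="$threshold" '
        NF >= 2 {
            # Strip surrounding whitespace from the name and the value.
            gsub(/[ \t]/, "", $1)
            gsub(/[ \t]/, "", $2)
            # Ignore lines whose value is not a plain integer.
            if ($2 ~ /^[0-9]+$/ && $2 + 0 > t + 0) print $1 "=" $2
        }'
}
```

On the host it could be appended to the existing pipeline, for example: vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion | flag_congestion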

Persistent LSOM congestion: Both /var/run/log/vobd.log and /var/run/log/vmkernel.log show LSOM MemCong, SSDCong, and LogCong exceeding the congestion threshold of 200 (current values of 202 to 204). The events repeat over several days rather than appearing as a one-off spike, which means that LSOM (the vSAN Local Log-Structured Object Manager) cannot keep up with I/O, especially in the cache/log layer.

2026-04-07T15:31:17.644Z In(182) vmkernel: cpu38:27301040)LSOM: LSOMThrowAsyncCongestionVOB:550: LSOM MemCong in ##### Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 204.
2026-04-07T15:37:30.897Z In(182) vmkernel: cpu37:27301040)LSOM: LSOMThrowAsyncCongestionVOB:550: LSOM SSDCong in ######4 Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 204.
2026-04-16T01:59:48.986Z In(182) vmkernel: cpu23:27301040)LSOM: LSOMThrowAsyncCongestionVOB:550: LSOM LogCong in 5##### Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
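To judge whether congestion is persistent rather than a single spike, the matching vmkernel lines can be counted per congestion type. This is a rough sketch that assumes the exact message layout shown in the excerpt above; count_lsom_congestion is a hypothetical helper name.

```shell
#!/bin/sh
# count_lsom_congestion
# Reads vmkernel log lines on stdin and prints "<type> <events> <max value>"
# for each LSOM congestion type that exceeded its threshold.
# Hypothetical helper; assumes the message layout shown in this article.
count_lsom_congestion() {
    awk '
        /LSOMThrowAsyncCongestionVOB/ && /Congestion State: Exceeded/ {
            type = ""; cur = ""
            for (i = 1; i <= NF; i++) {
                # "LSOM MemCong in <uuid>": the type follows the bare word LSOM.
                if ($i == "LSOM" && $(i+1) ~ /Cong$/) type = $(i+1)
                # "Current Congestion: 204.": the value follows the label.
                if ($i == "Current" && $(i+1) == "Congestion:") cur = $(i+2)
            }
            sub(/\.$/, "", cur)   # drop the trailing period
            if (type != "") {
                count[type]++
                if (cur + 0 > max[type] + 0) max[type] = cur + 0
            }
        }
        END { for (t in count) print t, count[t], max[t] }' | LC_ALL=C sort
}
```

A possible invocation on the host is: count_lsom_congestion < /var/run/log/vmkernel.log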

SSD log corruption / reinitialization: /var/run/log/vmkernel.log shows that the vSAN log structure on the cache disk had no valid checkpoint. The system attempted to recover the SSD log, found both checkpoints invalid, and reinitialized the SSD.

2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOGInitDescForIO:1153: device: 52######94 Recovering ssdlog. It might take a while...
2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOG_IsValidCP:214: device: 52######4 Invalid checkpoint magic/version. Magic 0x0,ver 0:0
2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOG_IsValidCP:214: device: 5#### Invalid checkpoint magic/version. Magic 0x0,ver 0:0
2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOG_Recover:340: device: 5######Both checkpoints are invalid.. Disk needs to be initialized
2026-04-07T07:22:36.997Z In(182) vmkernel: cpu17:2099564 opID=e511c2e4)LSOMCommon: SSDLOGInitDescForIO:1157: device: 5######### SSD is not initialized, initializing...

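Backend storage/controller errors: /var/run/log/vmkernel.log shows write commands (SCSI opcode 0x2a) to the device failing with non-zero host status (H:0xc and H:0x8), which indicates I/O failures at the controller level.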

2026-04-07T12:12:32.252Z In(182) vmkernel: cpu19:2097829)ScsiDeviceIO: 4580: Cmd(0x45bae7d6ddc0) 0x2a, CmdSN 0x1bf65aef from world 0 to dev "#####9" failed H:0xc D:0x0 P:0x0
2026-04-07T12:12:32.450Z Wa(180) vmkwarning: cpu2:2097827)WARNING: HPP: HppScsiThrottleLogForDevice:585: Cmd 0x2a (0x45bac84dabc0, 0) to dev "####9" on path "vmhba1:C0:T8:L0" Failed:
2026-04-07T12:12:32.450Z Wa(180) vmkwarning: cpu2:2097827)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. hppAction = 3
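When many such failures are logged, grouping them by SCSI opcode and host status shows whether one failure pattern dominates. The sketch below only handles the single-line ScsiDeviceIO format shown above (the two-line HPP warnings are not parsed); summarize_scsi_failures is a hypothetical helper name.

```shell
#!/bin/sh
# summarize_scsi_failures
# Reads vmkernel log lines on stdin and prints "<opcode> <host status> <count>"
# for failed commands reported by ScsiDeviceIO.
# Hypothetical helper; assumes the "Cmd(0x...) 0x2a, ... failed H:0x.." layout.
summarize_scsi_failures() {
    awk '
        /ScsiDeviceIO/ && / failed H:0x/ {
            op = ""; h = ""
            for (i = 1; i <= NF; i++) {
                # The opcode is the field right after "Cmd(<address>)".
                if ($i ~ /^Cmd\(0x[0-9a-f]+\)$/) { op = $(i+1); sub(/,$/, "", op) }
                # The host status is the field that starts with "H:0x".
                if ($i ~ /^H:0x[0-9a-f]+$/) h = $i
            }
            if (op != "" && h != "") n[op " " h]++
        }
        END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}
```

A possible invocation on the host is: summarize_scsi_failures < /var/run/log/vmkernel.log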

Device marked unhealthy: /var/run/log/vobd.log shows vSAN subsequently reporting the disk group log as congested and the device as unhealthy.

2026-04-17T07:27:36.784Z In(14) vobd[2097588]: [vSANCorrelator] 10290776758635us: [vob.vsan.lsom.diskunhealthy] vSAN device 52######## is unhealthy.
2026-04-17T07:27:36.784Z In(14) vobd[2097588]: [vSANCorrelator] 10290886526702us: [esx.problem.vob.vsan.lsom.diskunhealthy] vSAN device #######4 is unhealthy.
2026-04-17T07:27:36.784Z In(14) vobd[2097588]: [vSANCorrelator] 10290776758643us: [vob.vsan.lsom.diskgrouplogcongested] vSAN diskgroup ###### log is congested.
2026-04-17T07:27:36.784Z In(14) vobd[2097588]: [vSANCorrelator] 10290886526788us: [esx.problem.vob.vsan.lsom.diskgrouplogcongested] vSAN diskgroup 5######## log is congested.

Automatic evacuation triggered: /var/run/log/vsandevicemonitord.log shows vSAN then evacuating data from the cache disk, treating the log congestion as a failure scenario.

2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: WARNING - Maximum log congestion on VSAN device naa.### 2/2 times.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Found congestion: Evacuating disk naa.5####..
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Exception getting SMART health status for vSAN disk naa###9.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Critical SMART health attributes for VSAN device naa.#### are shown below.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Uncorrectable sectors: 0.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Reported uncorrectable sectors: 0.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Sector reallocation events: 0.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Sectors successfully reallocated: 0.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Pending sector reallocations: 0.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Disk command timeouts: 0.
2026-04-17T07:27:36Z In(14) vsandevicemonitord[2099430]: Tier 1 (naa.5#####9) failure due to log congestion.

Final state: /var/run/log/vsandevicemonitord.log shows the disk group in a permanent error state, so the issue is no longer transient.

2026-04-17T07:37:37Z In(14) vsandevicemonitord[2099430]: Device naa.#####9 state is DISKGROUP_UNDER_PERM_ERROR
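To confirm the most recent state recorded for each device, the "state is" lines from the vsandevicemonitord log can be reduced to the last entry per device. This is a minimal sketch that assumes the "Device <name> state is <STATE>" layout shown above; last_device_state is a hypothetical helper name.

```shell
#!/bin/sh
# last_device_state
# Reads vsandevicemonitord log lines on stdin and prints "<device> <state>"
# using the last "Device <name> state is <STATE>" line seen for each device.
# Hypothetical helper; assumes the message layout shown in this article.
last_device_state() {
    awk '
        / state is / {
            for (i = 1; i <= NF; i++)
                if ($i == "Device" && $(i+2) == "state" && $(i+3) == "is")
                    state[$(i+1)] = $(i+4)   # later lines overwrite earlier ones
        }
        END { for (d in state) print d, state[d] }' | LC_ALL=C sort
}
```

A device that still reports DISKGROUP_UNDER_PERM_ERROR in this output after the reboot corresponds to Scenario B in the Resolution.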

Resolution

Place the affected ESXi host into Maintenance Mode, ensuring you select the Ensure Accessibility option.
Reboot the affected ESXi host.
Once the host is back online, verify the health status of the affected disk and follow the applicable scenario below:

Scenario A (Workaround): Disk State is Healthy. If the reboot clears the error and the disk shows as healthy, rebuild the disk group with the following steps:

Remove the affected disk group.

Re-create and add the disk group back to the vSAN cluster.

Exit Maintenance Mode on the ESXi host.

Scenario B (Resolution): Disk State Remains Unhealthy (Hardware Failure). If the disk still shows as unhealthy after the reboot, the drive has entered a permanent error state.

Keep the host in Maintenance Mode.

Contact your hardware vendor to dispatch a replacement for the failed disk.
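The maintenance mode step in the Resolution can also be performed from the ESXi shell instead of the vSphere Client. The sketch below wraps the esxcli maintenance mode commands in two hypothetical helper functions; the ESXCLI variable is an assumption added here so the command can be previewed (for example with ESXCLI=echo) before it is run on a host.

```shell
#!/bin/sh
# ESXCLI can be overridden (for example ESXCLI=echo) to print the command
# instead of executing it. Both helper names below are hypothetical.
ESXCLI="${ESXCLI:-esxcli}"

# Enter maintenance mode with the vSAN "Ensure Accessibility" data option.
enter_maintenance_ensure_accessibility() {
    $ESXCLI system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility
}

# Show whether the host is currently in maintenance mode.
show_maintenance_state() {
    $ESXCLI system maintenanceMode get
}
```

Running show_maintenance_state after the reboot is one way to confirm that the host is still in maintenance mode before the disk group is removed or the disk is replaced.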