Degraded hard disk on one host caused VMs to become inaccessible on other hosts in a vSAN cluster

Products

VMware vSAN

Issue/Introduction

A hard disk on a vSAN host shows as degraded in the hardware console.
Virtual machines running on other hosts in the cluster may report as inaccessible.
Multiple vSAN objects show as inaccessible.

[root@Host_1:~] esxcli vsan debug object health summary get
Health Status                                              Number Of Objects
---------------------------------------------------------  -----------------
remoteAccessible                                           0
inaccessible                                               21
reduced-availability-with-no-rebuild                       54
reduced-availability-with-no-rebuild-delay-timer           0
reducedavailabilitywithpolicypending                       0
reducedavailabilitywithpolicypendingfailed                 0
reduced-availability-with-active-rebuild                   2
healthy                                                    0
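To pick out only the problem states from this summary, the output can be filtered with a small helper. The following is a minimal sketch, not a supported tool: the function name vsan_unhealthy_states is hypothetical, and it assumes each data row consists of a state name followed by a single numeric count, as in the output above.

```shell
# Hypothetical helper: reads the health summary on stdin and prints every
# state that has a non-zero object count, except "healthy".
# Header and separator rows are skipped because their second field is not a number.
vsan_unhealthy_states() {
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ && $2 > 0 && $1 != "healthy" { print $1 ": " $2 }'
}

# Example use on an ESXi host (not executed here):
#   esxcli vsan debug object health summary get | vsan_unhealthy_states
```

With the summary shown above, the helper would list the inaccessible, reduced-availability-with-no-rebuild, and reduced-availability-with-active-rebuild states and omit all rows with a count of 0.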

Environment

VMware vSAN (All Versions)

Cause

This issue occurs when unrecoverable medium errors on a vSAN capacity drive cause SSD congestion on the disk group.

Unrecoverable errors prompt the system to initiate multiple I/O retries. These retries cause outstanding I/Os to queue up while they wait to be de-staged from the cache tier to the capacity tier. When the write cache (cache tier) on the disk group becomes overloaded with pending data, the result is SSD congestion.
Because no pending I/Os are completing on the capacity disk, the vSAN elevator halts the de-staging process completely. The severe SSD congestion combined with the stalled de-staging causes the vSAN objects (and the VMs that rely on them) to become inaccessible.

The /var/run/log/vmkernel.log file reports the SCSI sense data "0x3 0x11 0x0" (sense key 0x3 MEDIUM ERROR, ASC/ASCQ 0x11/0x0 unrecovered read error) on the capacity disk, confirming unrecoverable medium errors on the disk.

YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu12:2098074)WARNING: HPP: HppScsiThrottleLogForDevice:585: Cmd 0x28 (0x45bf4dad2140, 0) to dev "naa.###############" on path "vmhba5:C2:T5:L0" Failed:
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu12:2098074)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. hppAction = 1
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu4:2097569)WARNING: LSOMCommon: IORETRYInsertQEntryOnError:1714: Throttled: Restarting IO's retry from the top, for queue 0x4500c2abbe08
YYYY-MM-DDTHH:MM:SS.ZZ In(182) vmkernel: cpu6:2098054)LSOMCommon: IORETRY_handleCompletionOnError:2040: Throttled: 0x45be78fcf2c0 IO type 264 (READ) isOrdered:NO isSplit:NO isEncr:NO since 1501 msec status Read error
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu1:2118403)WARNING: PLOG: DDPCompleteDDPWrite:8182: Throttled: DDP write failed Read error callback [email protected]#0.0.0.1, diskgroup ########-####-####-####-############ txnScopeIdx 9
YYYY-MM-DDTHH:MM:SS.ZZ In(182) vmkernel: cpu1:2118403)PLOG: PLOGDeviceMarkUnhealthy:381: Moving disk ########-####-####-####-############ to unheatlhy no evacuation state, ioStatus: Read error
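To find which devices are reporting these medium errors, the log can be scanned for the sense data string. The following is a rough sketch with a hypothetical function name (medium_error_devices). It assumes the "Valid sense data" line belongs to the device named on the most recent preceding line that contains to dev "...", as in the excerpt above; on a busy host, interleaved messages from different devices could be misattributed, so treat the counts as a pointer rather than proof.

```shell
# Hypothetical helper: reads vmkernel log text on stdin and prints
# "<count> <device>" for each device whose failed command is followed by
# sense data 0x3 0x11 0x0.
medium_error_devices() {
    awk '
        /to dev "/ {
            # Remember the device named on the most recent failed-command line.
            split($0, parts, "to dev \"")
            split(parts[2], rest, "\"")
            dev = rest[1]
        }
        /Valid sense data: 0x3 0x11 0x0/ && dev != "" { count[dev]++ }
        END { for (d in count) print count[d], d }
    '
}

# Example use on an ESXi host (not executed here):
#   medium_error_devices < /var/run/log/vmkernel.log
```

Because these messages are throttled in the log, a low count does not mean the errors are rare.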

The /var/run/log/vmkernel.log file also reports that the disk group is experiencing SSD congestion.

YYYY-MM-DDTHH:MM:SS.ZZ cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.

YYYY-MM-DDTHH:MM:SS.ZZ cpu5:33450)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node [email protected] maximum SSD ########-####-####-####-############ congestion reached.

The following script can be used to check the current congestion values on each vSAN disk group:

[root@Host_1:~] for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done;

########-####-####-####-############ ------> Disk group UUID
memCongestion:0
slabCongestion:0
ssdCongestion:252
iopsCongestion:0
logCongestion:0
compCongestion:0
maxDeleteCongestion:0
mdDeleteCongestion:0
memCongestionLocalMax:0
slabCongestionLocalMax:0
ssdCongestionLocalMax:252
iopsCongestionLocalMax:0
logCongestionLocalMax:0
compCongestionLocalMax:0
mdDeleteCongestionLocalMax:0
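When a host has several disk groups, it can help to reduce this output to only the groups whose SSD congestion is above a limit. The following is a minimal sketch under stated assumptions: the function name flag_ssd_congestion is hypothetical, the default limit of 200 is taken from the "Congestion Threshold: 200" log line above, and the input is assumed to be a disk group UUID line followed by name:value counter lines, as printed by the loop above.

```shell
# Hypothetical helper: reads the loop output on stdin and prints each disk
# group whose ssdCongestion exceeds the limit (first argument, default 200).
flag_ssd_congestion() {
    awk -v limit="${1:-200}" '
        { gsub(/^[ \t]+|[ \t]+$/, "") }          # trim surrounding whitespace
        # A non-empty line without ":" is taken to be the disk group UUID.
        $0 != "" && index($0, ":") == 0 { uuid = $1; next }
        {
            split($0, kv, ":")
            if (kv[1] == "ssdCongestion" && (kv[2] + 0) > (limit + 0))
                print uuid, "ssdCongestion=" (kv[2] + 0)
        }
    '
}

# Example use on an ESXi host (not executed here): pipe the loop above into
# the helper, optionally passing a different limit as the first argument.
```

Only the exact ssdCongestion counter is compared; ssdCongestionLocalMax and the other counters are ignored in this sketch.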

Resolution

Engage your hardware vendor to replace the faulty disk.
Before replacing the disk, review: Requirements when replacing disks in a vSAN cluster

Workaround:
To temporarily restore object accessibility, place the affected host into maintenance mode with the "Ensure accessibility" option.
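From the ESXi shell, maintenance mode with the "Ensure accessibility" option can be requested with esxcli. The wrapper below is a cautious sketch, not part of the original procedure: the function name and the RUN variable are hypothetical, and by default it only prints the command so it can be reviewed before anything changes on the host.

```shell
# Hypothetical wrapper: prints the maintenance mode command by default and
# runs it only when RUN=1 is set in the environment.
enter_mm_ensure_accessibility() {
    cmd="esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd            # executes on the ESXi host
    else
        echo "$cmd"     # dry run: show the command only
    fi
}

# Example use on the affected ESXi host (not executed here):
#   enter_mm_ensure_accessibility          # review the command
#   RUN=1 enter_mm_ensure_accessibility    # enter maintenance mode
```

The same action is available in the vSphere Client when placing the host into maintenance mode.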

Additional Information

Understanding Congestion in vSAN
SSD log buildup can cause poor performance in a VMware vSAN Cluster
Requirements when replacing disks in a vSAN cluster