[root@Host_1:~] esxcli vsan debug object health summary get
Health Status                                              Number Of Objects
---------------------------------------------------------  -----------------
remoteAccessible                                                           0
inaccessible                                                              21
reduced-availability-with-no-rebuild                                      54
reduced-availability-with-no-rebuild-delay-timer                           0
reducedavailabilitywithpolicypending                                       0
reducedavailabilitywithpolicypendingfailed                                 0
reduced-availability-with-active-rebuild                                   2
healthy                                                                    0
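The summary above can be checked from a script. The following is a minimal sketch, assuming the two-column layout shown in the output above; the function name count_inaccessible is hypothetical and not part of esxcli.

```shell
# Hypothetical helper: read the health summary on stdin and print the
# number of objects reported in the "inaccessible" row.
# Assumes the status name is the first field and the count the second.
count_inaccessible() {
    awk '$1 == "inaccessible" { print $2 }'
}

# Example usage on a host:
#   esxcli vsan debug object health summary get | count_inaccessible
```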
VMware vSAN (All Versions)
This issue occurs when unrecoverable medium errors on a vSAN capacity drive cause SSD congestion on the disk group.
Unrecoverable errors prompt the system to initiate multiple I/O retries. These retries cause outstanding I/Os to queue up while they wait to be de-staged from the cache tier to the capacity tier. When the write cache (cache tier) on the disk group becomes overloaded with pending data, the result is SSD congestion.
Because no pending I/Os are completing on the capacity disk, the vSAN elevator halts the de-staging process completely. The severe SSD congestion combined with the stalled de-staging causes the vSAN objects (and the VMs that rely on them) to become inaccessible.
/var/run/log/vmkernel.log reports the SCSI sense data "0x3 0x11 0x0" (sense key 0x3, Medium Error; ASC/ASCQ 0x11/0x00, unrecovered read error) for the capacity disk, confirming unrecoverable medium errors on the disk.
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu12:2098074)WARNING: HPP: HppScsiThrottleLogForDevice:585: Cmd 0x28 (0x45bf4dad2140, 0) to dev "naa.###############" on path "vmhba5:C2:T5:L0" Failed:
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu12:2098074)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. hppAction = 1
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu4:2097569)WARNING: LSOMCommon: IORETRYInsertQEntryOnError:1714: Throttled: Restarting IO's retry from the top, for queue 0x4500c2abbe08
YYYY-MM-DDTHH:MM:SS.ZZ In(182) vmkernel: cpu6:2098054)LSOMCommon: IORETRY_handleCompletionOnError:2040: Throttled: 0x45be78fcf2c0 IO type 264 (READ) isOrdered:NO isSplit:NO isEncr:NO since 1501 msec status Read error
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu1:2118403)WARNING: PLOG: DDPCompleteDDPWrite:8182: Throttled: DDP write failed Read error callback [email protected]#0.0.0.1, diskgroup ########-####-####-####-############ txnScopeIdx 9
YYYY-MM-DDTHH:MM:SS.ZZ In(182) vmkernel: cpu1:2118403)PLOG: PLOGDeviceMarkUnhealthy:381: Moving disk ########-####-####-####-############ to unheatlhy no evacuation state, ioStatus: Read error
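To find which devices logged this sense data, the log can be scanned with awk. This is a rough sketch rather than a supported tool: the function name find_medium_errors is hypothetical, and it assumes that the "to dev" line for a failed command directly precedes its "Valid sense data" line, as in the excerpt above. If entries for several devices are interleaved, the attribution may be wrong.

```shell
# Hypothetical helper: count "0x3 0x11 0x0" sense-data entries per device.
# Usage: find_medium_errors /var/run/log/vmkernel.log
find_medium_errors() {
    awk '
        /to dev "/ {
            # Remember the device named on the most recent failed-command line.
            dev = $0
            sub(/.*to dev "/, "", dev)
            sub(/".*/, "", dev)
        }
        /Valid sense data: 0x3 0x11 0x0/ {
            # Attribute the medium error to the last device seen (assumption).
            count[dev]++
        }
        END {
            # Print "<occurrences> <device>" for each device with medium errors.
            for (d in count) print count[d], d
        }
    ' "$1"
}
```

A device that appears here with a growing count is a likely candidate for the faulty capacity disk, but the result should be confirmed against the hardware vendor's diagnostics.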
/var/run/log/vmkernel.log also reports that the disk group is experiencing SSD congestion.
YYYY-MM-DDTHH:MM:SS.ZZ cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
YYYY-MM-DDTHH:MM:SS.ZZ cpu5:33450)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node [email protected] maximum SSD ########-####-####-####-############ congestion reached.

[root@Host_1:~] for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done;
########-####-####-####-############ ------> Disk group UUID
   memCongestion:0
   slabCongestion:0
   ssdCongestion:252
   iopsCongestion:0
   logCongestion:0
   compCongestion:0
   maxDeleteCongestion:0
   mdDeleteCongestion:0
   memCongestionLocalMax:0
   slabCongestionLocalMax:0
   ssdCongestionLocalMax:252
   iopsCongestionLocalMax:0
   logCongestionLocalMax:0
   compCongestionLocalMax:0
   mdDeleteCongestionLocalMax:0
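On hosts with several disk groups, the congestion output can be filtered so that only the congested groups are shown. The sketch below is an illustration under stated assumptions: the function name flag_congested is hypothetical, the default limit of 200 is taken from the "Congestion Threshold: 200" value in the log excerpt above, and the input is assumed to be a UUID line followed by name:value pairs as printed by the loop above.

```shell
# Hypothetical filter: print disk groups whose ssdCongestion exceeds a limit.
# Reads the output of the congestion loop on stdin.
# Usage: <congestion loop> | flag_congested [limit]
flag_congested() {
    threshold="${1:-200}"   # assumed default, taken from the log excerpt
    awk -v limit="$threshold" '
        # A non-empty line without a colon is treated as the disk group UUID.
        !/:/ { if (NF) uuid = $1; next }
        {
            # Inspect every name:value pair on the line.
            for (i = 1; i <= NF; i++) {
                split($i, kv, ":")
                val = kv[2] + 0
                if (kv[1] == "ssdCongestion" && val > limit)
                    print uuid, "ssdCongestion=" val
            }
        }
    '
}
```

For the output shown above, this would report the disk group UUID with ssdCongestion=252, because 252 is above the assumed limit of 200.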
Please engage your hardware vendor to replace the faulty disk.
For the replacement procedure, see "Requirements when replacing disks in a vSAN cluster".
Workaround:
To temporarily restore object accessibility, place the host into maintenance mode using the "Ensure accessibility" option.
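From the ESXi shell, the workaround can be sketched with the esxcli maintenance mode command and its ensureObjectAccessibility vSAN mode. The wrapper name enter_maintenance_mode and the DRY_RUN variable are hypothetical conveniences for this example. By default the function only prints the command so that it can be reviewed before anything is changed on the host.

```shell
# Hypothetical wrapper around the esxcli maintenance mode command.
# DRY_RUN=1 (default): print the command only.
# DRY_RUN=0: run it on the host.
enter_maintenance_mode() {
    cmd="esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Example usage:
#   enter_maintenance_mode             # prints the command for review
#   DRY_RUN=0 enter_maintenance_mode   # places the host into maintenance mode
#   esxcli system maintenanceMode get  # shows the current maintenance mode state
```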