Disk failure in vSAN ESA Cluster

Products

VMware vSAN

Issue/Introduction

Symptom:

Operation health and vSAN object health alerts are raised on the vSAN cluster.

Steps: Select vSAN Cluster > Monitor > Skyline Health
An absent vSAN disk error is reported for the vSAN node in the Stuck I/O state.

Steps: Select vSAN Cluster > Monitor > Skyline Health > Under "Operation health", select "Troubleshoot".
In the vCenter UI, Skyline Health (vSAN cluster > Monitor > vSAN > Skyline Health > Physical Disk > Operation health) may report a permanent disk failure for the vSAN disk, as seen in the figure below:

The vSAN disk shows up in a detached state.

Steps: Select vSAN Cluster > Configure > Disk Management.

Validation:

Running "vdq -iH" returns an error indicating that a device cannot be opened.

[root@esx-0l :~ ] vdq -iH
VsanUtil: : ReadFromDevice: Failed to open , errno (2)
VsanUtil: : GetVsanStoragePoolDisks: Error occurred 'Failed to open device ', create disk with null id
SingleTierDisks:
"singleTier" : [
"eui.################1###############",
"eui.################2###############",
"eui.################3###############",
"eui.################4###############",
"eui.################5###############",
"eui.################6###############",
"eui.################7###############",
"eui.################8###############",
"eui.################9###############",
"eui.################10##############",
"eui.################11##############",
"eui.################12##############",
"eui.################13##############",
"eui.################14##############",
"eui.################15##############",
"eui.################16##############"
]
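
As a rough aid for reviewing this output, the sketch below separates the "VsanUtil:" error lines from the device identifiers that are still listed. It assumes the "vdq -iH" output has been captured as text and looks like the excerpt above; the function name summarize_vdq and the set of identifier prefixes (eui, naa, t10) are illustrative assumptions, not part of any VMware tool.

```python
import re

# Quoted device identifiers such as "eui.…", "naa.…" or "t10.…" (assumed prefixes).
DEVICE_RE = re.compile(r'"((?:eui|naa|t10)\.[^"]+)"')


def summarize_vdq(output: str) -> dict:
    """Split captured `vdq -iH` text into error lines and listed devices."""
    errors = [
        line.strip()
        for line in output.splitlines()
        if line.strip().startswith("VsanUtil:")
    ]
    devices = DEVICE_RE.findall(output)
    return {"errors": errors, "devices": devices}


if __name__ == "__main__":
    # Shortened sample modeled on the excerpt above.
    sample = (
        "VsanUtil: : ReadFromDevice: Failed to open , errno (2)\n"
        "SingleTierDisks:\n"
        '"singleTier" : [\n'
        '"eui.################1###############",\n'
        '"eui.################2###############"\n'
        "]\n"
    )
    result = summarize_vdq(sample)
    print(len(result["errors"]), "error line(s);",
          len(result["devices"]), "device(s) listed")
```

Any "VsanUtil:" error line in the result suggests that at least one pool device could not be read, which is the condition described in this article.
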
The following command shows the CMMDS membership of every disk in the storage pool. In this example, one disk reports "In CMMDS: false".

[root@esx-0l :~ ] esxcli vsan storagepool list | grep -i cmmds
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: false
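
On hosts with many disks, the true/false lines can be tallied rather than counted by eye. The following is a minimal sketch that assumes the filtered output above has been saved as text; count_cmmds_states is a hypothetical helper name.

```python
from collections import Counter


def count_cmmds_states(text: str) -> Counter:
    """Tally the 'In CMMDS: true/false' lines in captured esxcli output."""
    counts = Counter()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Only lines of the form "In CMMDS: <value>" are counted.
        if sep and key.strip().lower() == "in cmmds":
            counts[value.strip().lower()] += 1
    return counts


if __name__ == "__main__":
    # Same shape as the output above: 16 disks in CMMDS, 1 not.
    sample = "In CMMDS: true\n" * 16 + "In CMMDS: false\n"
    counts = count_cmmds_states(sample)
    print("in CMMDS:", counts["true"], "not in CMMDS:", counts["false"])
```

A non-zero "false" count indicates that at least one storage pool disk is no longer published in CMMDS.
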

Environment

VMware vSphere vSAN 8.x

Cause

I/O is stuck outside of ESXi (in the controller or firmware) and neither completes nor responds to an abort request. If the device or controller does not respond to the abort within 120 seconds (the default timeout), vSAN takes the disk or disk group offline to avoid affecting the entire vSAN cluster.

The following events are logged in "/var/run/log/vsanmgmt.log":

YYYY-MM-DDThh:mm:ss.msZ In(14) vsand[2101102]: [opID=23325f16-8602 VsanLsomHealth::checkDiskState] Got devResState from devsTelemetry for disk 527be2c9-####-####-####-f3f9b5467257: DISK_UNDER_STUCK_IO
YYYY-MM-DDThh:mm:ss.msZ In(14) vsand[2101102]: [opID=23325f16-8602 VsanHealthSystemImpl::_QueryPhysicalDiskHealthSummary] Disk 527be2c9-####-####-####-f3f9b5467257 cmmds health status: {'healthFlags': 0, 'timestamp': 42578920677, 'healthReason': 0} , LSOM telemetry status: STUCK_IO_ERROR

The following events are logged in "/var/run/log/vobd.log":

YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094150us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba0:C0:T2:L0 changed state from on (device ID: eui.################17##############)
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094842us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device eui.################17##############. Path vmhba0:C0:T2:L0 is down. Affected datastores: Unknown.
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094185us: [vob.scsi.device.state.permanentloss] Device :eui.################17############## has been removed or is permanently inaccessible.
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094972us: [esx.problem.scsi.device.state.permanentloss] Device: eui.################17############## has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[vSANCorrelator] 2672996104426us: [vob.vsan.pdl.offline] vSAN device 527be2c9-####-####-####-f3f9b5467257 has gone offline.
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[vSANCorrelator] 2672996104441us: [esx.problem.vob.vsan.pdl.offline] vSAN device 527be2c9-####-####-####-f3f9b5467257 has gone offline.
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672997219440us: [vob.scsi.device.state.permanentloss.noopens] Permanently inaccessible device :eui.################17############## has no more open connections. It is now safe to unmount datastores (if any) and delete the device.
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672997219412us: [esx.problem.scsi.device.state.permanentloss.noopens] Permanently inaccessible device: eui.################17############## has no more opens. It is now safe to unmount datastores (if any): Unknown and delete the device
YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672997219912us: [vob.scsi.scsipath.remove] Remove path: vmhba0:C0:T2:L0
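
To review a longer vobd.log for the same sequence, the event IDs and the affected device can be pulled out with a short script. This is a sketch only: it assumes the log lines follow the format shown above, and scan_vobd is a hypothetical helper name.

```python
import re

# Event ID in square brackets, e.g. [vob.vsan.pdl.offline].
EVENT_RE = re.compile(r"\[((?:vob|esx)\.[\w.]+)\]")
# Device named in a permanent-loss event; the log prints either
# "Device :<id>" or "Device: <id>".
PDL_DEVICE_RE = re.compile(r"permanentloss\]\s+Device\s*:\s*(\S+)")


def scan_vobd(lines):
    """Return (event IDs in log order, set of devices in permanent loss)."""
    events = []
    lost_devices = set()
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            events.append(event.group(1))
        device = PDL_DEVICE_RE.search(line)
        if device:
            lost_devices.add(device.group(1))
    return events, lost_devices


if __name__ == "__main__":
    # Two lines modeled on the excerpt above (device ID shortened).
    sample = [
        "YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 1us: "
        "[vob.scsi.device.state.permanentloss] Device :eui.EXAMPLE has been "
        "removed or is permanently inaccessible.",
        "YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[vSANCorrelator] 2us: "
        "[vob.vsan.pdl.offline] vSAN device 527be2c9-EXAMPLE has gone offline.",
    ]
    events, lost = scan_vobd(sample)
    print(events)
    print(lost)
```

The returned event list preserves log order, so a path-down event followed by permanent-loss and vSAN offline events matches the pattern shown in this article.
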

The vmkernel.log contains entries indicating that ESXi marked a storage device as offline after it exceeded the maximum number of I/O retry attempts:

2025-06-26T11:17:07.965Z In(182) vmkernel: cpu71:2255837)WOBTREE: IOLayer_SetDeviceOffline:8596: OfflineDevice t10.NVMe____INTEL_##############____________________################:2 status=Maximum kernel-level retries exceeded errType TRANSIENT

The vmkernel.log file contains an entry indicating that ESXi detected a stuck I/O operation on a storage device, which typically suggests the device is unresponsive or failing to complete I/O requests:

2025-06-26T11:17:16.082Z In(182) vmkernel: cpu112:2286648)StorageDeviceIO: 5697: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____INTEL_##############____________________################
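
When vmkernel.log is large, the devices named in stuck I/O events can be extracted as follows. This is a minimal sketch that assumes the message format shown above; stuck_io_devices is a hypothetical helper name.

```python
import re

# Device name that follows the stuck I/O event text in vmkernel.log.
STUCK_IO_RE = re.compile(
    r"FDS_DEV_EVENT_REPORT_STUCK_IO event for device (\S+)"
)


def stuck_io_devices(lines):
    """Return devices reported with a stuck I/O event, in first-seen order."""
    seen = []
    for line in lines:
        match = STUCK_IO_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    # Line modeled on the excerpt above (device name shortened).
    sample = [
        "2025-06-26T11:17:16.082Z In(182) vmkernel: cpu112:2286648)"
        "StorageDeviceIO: 5697: FDS_DEV_EVENT_REPORT_STUCK_IO event for "
        "device t10.NVMe____EXAMPLE",
    ]
    print(stuck_io_devices(sample))
```

The resulting device names can be passed on to the hardware vendor along with the log excerpts.
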

Resolution

These issues are caused by faulty hardware or firmware bugs. Open a case with the hardware vendor.

Additional Information

How to handle lost or stuck I/O on a host in vSAN cluster