Disk failure in vSAN ESA Cluster

Article ID: 398026

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptom:

  • Operation health and vSAN object health alerts appear on the vSAN cluster.

    Steps: Select vSAN Cluster > Monitor > Skyline Health 


  • An "Absent vSAN disk" error is reported for the affected vSAN node, with a Stuck I/O state.

    Steps: Select vSAN Cluster > Monitor > Skyline Health > Under "Operation health", select "Troubleshoot".

  • In the vCenter UI, Skyline Health (vSAN cluster > Monitor > vSAN > Skyline Health > Physical Disk > Operation health) may report a permanent disk failure for the vSAN disk.

  • The vSAN disk appears in a detached state.

    Steps: Select vSAN Cluster > Configure > Disk Management.

Validation:

  • Running "vdq -iH" returns an error indicating that a device cannot be opened.

    [root@esx-0l :~ ] vdq -iH
    VsanUtil: : ReadFromDevice: Failed to open , errno (2)
    VsanUtil: : GetVsanStoragePoolDisks: Error occurred 'Failed to open device ', create disk with null id
    SingleTierDisks:
         "singleTier" : [
                        "eui.################1###############",
                        "eui.################2###############",
                        "eui.################3###############",
                        "eui.################4###############",
                        "eui.################5###############",
                        "eui.################6###############",
                        "eui.################7###############",
                        "eui.################8###############",
                        "eui.################9###############",
                        "eui.################10##############",
                        "eui.################11##############",
                        "eui.################12##############",
                        "eui.################13##############",
                        "eui.################14##############",
                        "eui.################15##############",
                        "eui.################16##############"
                        ]

  • The following command lists every disk in the storage pool, filtered to its CMMDS membership. In this example, one disk reports "In CMMDS: false".

    [root@esx-0l :~ ] esxcli vsan storagepool list | grep -i cmmds
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: true
    In CMMDS: false
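
The two validation checks above can be condensed into a short summary. The following is a minimal sketch, not an official tool: the function names vdq_summary and cmmds_summary are hypothetical, and the parsing assumes the output looks like the excerpts shown in this article (device identifiers quoted and prefixed with eui., naa., or t10., and one "In CMMDS:" line per disk).

```shell
# Hypothetical helper: summarize `vdq -iH` output read from stdin.
# Counts quoted device identifiers and lines reporting "Failed to open".
vdq_summary() {
    awk '
        /Failed to open/         { errors++ }
        /"(eui|naa|t10)\.[^"]*"/ { disks++ }
        END { printf "listed_disks=%d open_errors=%d\n", disks, errors }
    '
}

# Hypothetical helper: summarize `esxcli vsan storagepool list` output
# read from stdin by counting "In CMMDS: true" and "In CMMDS: false" lines.
cmmds_summary() {
    awk -F": *" '
        tolower($1) ~ /in cmmds$/ {
            v = tolower($2)
            gsub(/[ \t\r]+$/, "", v)   # drop trailing whitespace
            count[v]++
        }
        END { printf "in_cmmds=%d not_in_cmmds=%d\n", count["true"], count["false"] }
    '
}

# Usage on an ESXi host (2>&1 is used in case vdq prints errors on stderr):
#   vdq -iH 2>&1 | vdq_summary
#   esxcli vsan storagepool list | cmmds_summary
```

A non-zero open_errors value together with a non-zero not_in_cmmds value would match the symptoms described in this article. The disk identifiers themselves still need to be read from the full command output.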

Environment

VMware vSphere vSAN 8.x

Cause

I/O is stuck outside of ESXi (in the controller or firmware) and neither completes nor responds to an abort request. If the device or controller does not respond to the abort within 120 seconds (the default timeout), vSAN takes the disk or disk group offline to avoid affecting the entire vSAN cluster.

  • "/var/run/log/vsanmgmt.log" shows the following events:

    YYYY-MM-DDThh:mm:ss.msZ In(14) vsand[2101102]: [opID=23325f16-8602 VsanLsomHealth::checkDiskState] Got devResState from devsTelemetry for disk 527be2c9-####-####-####-f3f9b5467257: DISK_UNDER_STUCK_IO
    YYYY-MM-DDThh:mm:ss.msZ In(14) vsand[2101102]: [opID=23325f16-8602 VsanHealthSystemImpl::_QueryPhysicalDiskHealthSummary] Disk 527be2c9-####-####-####-f3f9b5467257 cmmds health status: {'healthFlags': 0, 'timestamp': 42578920677, 'healthReason': 0} , LSOM telemetry status: STUCK_IO_ERROR


  • "/var/run/log/vobd.log" shows the following events:

    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094150us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba0:C0:T2:L0 changed state from on (device ID: eui.################17##############)
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094842us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device eui.################17##############. Path vmhba0:C0:T2:L0 is down. Affected datastores: Unknown.
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094185us: [vob.scsi.device.state.permanentloss] Device :eui.################17############## has been removed or is permanently inaccessible.
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672996094972us: [esx.problem.scsi.device.state.permanentloss] Device: eui.################17############## has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[vSANCorrelator] 2672996104426us: [vob.vsan.pdl.offline] vSAN device 527be2c9-####-####-####-f3f9b5467257 has gone offline.
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[vSANCorrelator] 2672996104441us: [esx.problem.vob.vsan.pdl.offline] vSAN device 527be2c9-####-####-####-f3f9b5467257 has gone offline.
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672997219440us: [vob.scsi.device.state.permanentloss.noopens] Permanently inaccessible device :eui.################17############## has no more open connections. It is now safe to unmount datastores (if any) and delete the device.
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672997219412us: [esx.problem.scsi.device.state.permanentloss.noopens] Permanently inaccessible device: eui.################17############## has no more opens. It is now safe to unmount datastores (if any): Unknown and delete the device
    YYYY-MM-DDThh:mm:ss.msZ In(14) vobd[524562]:[scsiCorrelator] 2672997219912us: [vob.scsi.scsipath.remove] Remove path: vmhba0:C0:T2:L0

 

  • The vmkernel.log contains entries indicating that ESXi marked a storage device as offline after it exceeded the maximum number of I/O retry attempts:

    2025-06-26T11:17:07.965Z In(182) vmkernel: cpu71:2255837)WOBTREE: IOLayer_SetDeviceOffline:8596: OfflineDevice t10.NVMe____INTEL_##############____________________################:2 status=Maximum kernel-level retries exceeded errType TRANSIENT

  • The vmkernel.log file contains an entry indicating that ESXi detected a stuck I/O operation on a storage device, which typically suggests the device is unresponsive or failing to complete I/O requests:

    2025-06-26T11:17:16.082Z In(182) vmkernel: cpu112:2286648)StorageDeviceIO: 5697: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____INTEL_##############____________________################
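
The log signatures listed above can be pulled out of the three logs with a few filters. This is a minimal sketch under stated assumptions: the function names are hypothetical, the patterns are derived only from the sample lines in this article, and the vmkernel.log path in the usage comment is assumed to be /var/run/log/vmkernel.log because the article does not state it.

```shell
# Hypothetical helper: print the vSAN disk UUIDs that vsanmgmt.log
# (read from stdin) reports as DISK_UNDER_STUCK_IO or STUCK_IO_ERROR.
stuck_io_disks() {
    grep -E 'DISK_UNDER_STUCK_IO|STUCK_IO_ERROR' |
        sed -n 's/.*[Dd]isk \([0-9a-fA-F-]\{36\}\).*/\1/p' |
        sort -u
}

# Hypothetical helper: print the device IDs that vobd.log (read from
# stdin) reports as removed or permanently inaccessible.
pdl_devices() {
    grep -F 'esx.problem.scsi.device.state.permanentloss]' |
        sed -n 's/.*Device: *\([^ ]*\) has been removed.*/\1/p' |
        sort -u
}

# Hypothetical helper: print one line per offline or stuck I/O event
# found in vmkernel.log (read from stdin).
vmkernel_disk_events() {
    sed -n \
        -e 's/.*IOLayer_SetDeviceOffline.*OfflineDevice \([^ ]*\) status=\(.*\)$/offline \1 \2/p' \
        -e 's/.*FDS_DEV_EVENT_REPORT_STUCK_IO event for device \([^ ]*\).*/stuck_io \1/p'
}

# Usage on an ESXi host (the vmkernel.log path is an assumption):
#   stuck_io_disks       < /var/run/log/vsanmgmt.log
#   pdl_devices          < /var/run/log/vobd.log
#   vmkernel_disk_events < /var/run/log/vmkernel.log
```

The functions read from stdin so that they can also be run against logs extracted from a support bundle. The device and UUID values they print are useful to include when opening a case with the hardware vendor.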

Resolution

  • These issues are caused by faulty hardware or firmware bugs. Open a case with the hardware vendor.

Additional Information