How to handle lost or stuck I/O on a host in vSAN cluster

Article ID: 326885

Products

VMware vSAN

Issue/Introduction

This article provides information on resolving stuck I/O in a vSAN environment.
 
Impact/Risks:
Lost or wedged I/O is I/O that is stuck outside of ESXi (in the device controller or firmware): it never completes, and either it does not respond to an abort or the abort itself never completes.
 
Because the I/O is stuck outside of ESXi, the only option ESXi has is to send an abort. If the device or controller does not respond to the abort within 120 seconds (the default timeout), vSAN takes the disk or disk group (DG) offline to avoid affecting the entire vSAN cluster.

Symptoms:

If I/O is stuck or lost on the storage controller or the storage disk, the ESXi storage stack tries to abort it with a task management request and displays console messages such as:

2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO succeeded for device naa.55cd2e404b7736d0 and the IO did not complete. WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be
2021-06-22T12:02:08.408Z cpu30:1001397101)retried.
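
When reviewing a large log, a short script can pull out these abort messages. The following Python sketch is illustrative only: the regular expression is derived from the single sample message above, the function name `find_uncompleted_aborts` is hypothetical, and other ESXi builds may word the message differently.

```python
import re

# Pattern derived from the sample message in this article (an assumption,
# not an official message specification).
ABORT_RE = re.compile(
    r"TaskMgmt op to cancel IO succeeded for device (?P<device>\S+) "
    r"and the IO did not complete\. WorldId (?P<world>\d+), "
    r"Cmd (?P<cmd>0x[0-9a-fA-F]+), CmdSN = (?P<cmdsn>0x[0-9a-fA-F]+)"
)


def find_uncompleted_aborts(lines):
    """Return one dict per abort message whose I/O did not complete."""
    hits = []
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            # The timestamp is the first whitespace-separated token.
            hits.append({"timestamp": line.split(" ", 1)[0], **match.groupdict()})
    return hits


# The two console lines quoted above.
sample_log = [
    "2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: "
    "PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO succeeded "
    "for device naa.55cd2e404b7736d0 and the IO did not complete. "
    "WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be",
    "2021-06-22T12:02:08.408Z cpu30:1001397101)retried.",
]

for hit in find_uncompleted_aborts(sample_log):
    print(hit["timestamp"], hit["device"], hit["cmd"], hit["cmdsn"])
```

The device field gives the naa identifier of the disk on which the abort was attempted.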


If such a lost I/O is found on a host, vSAN takes the affected disk offline so that it does not affect other hosts in the cluster.

The following alert appears in /var/run/log/vobd.log:
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: [vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.237Z: [vSANCorrelator] 19607829404us: [esx.problem.vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline


If the cache device in a non-dedup disk group encounters stuck I/O, or if any disk in a dedup disk group encounters stuck I/O, the entire disk group is set to the offline state.

2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827040us: [vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607828405us: [esx.problem.vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.
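
To tell which devices hit stuck I/O directly and which went offline only through propagation, the vobd.log events can be classified by event ID. This is a minimal sketch that assumes the line layout of the samples in this article; the function name `stuck_io_devices` and the labels "direct" and "propagated" are illustrative, not VMware terminology.

```python
import re

# Layout assumed from the vobd.log samples in this article.
VOB_RE = re.compile(
    r"^(?P<timestamp>\S+): \[vSANCorrelator\] \d+us: "
    r"\[(?P<event>[\w.]+)\] vSAN device (?P<uuid>[0-9a-f-]+) "
)


def stuck_io_devices(lines):
    """Map each vSAN device UUID to 'direct' or 'propagated'."""
    devices = {}
    for line in lines:
        match = VOB_RE.match(line)
        if not match:
            continue
        event, uuid = match.group("event"), match.group("uuid")
        if event.endswith("vsan.lsom.stuckiooffline"):
            # A direct stuck I/O event takes precedence over a propagated one.
            devices[uuid] = "direct"
        elif event.endswith("vsan.lsom.stuckiopropagated"):
            devices.setdefault(uuid, "propagated")
    return devices


# The four events quoted in this article.
sample_vobd = [
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: [vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline.",
    "2021-06-22T12:04:04.237Z: [vSANCorrelator] 19607829404us: [esx.problem.vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline",
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827040us: [vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.",
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607828405us: [esx.problem.vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.",
]

for uuid, kind in stuck_io_devices(sample_vobd).items():
    print(uuid, kind)
```

Each event appears twice in the log (once with the `vob.` prefix and once with the `esx.problem.vob.` prefix), so the sketch keys the result on the device UUID to avoid double counting.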


In vCenter you'll see


The following health alert is also shown if the cache device in a non-dedup disk group encounters stuck I/O. A similar health alert is shown if any disk in a dedup disk group encounters stuck I/O.

esxcli vsan health cluster get -t 'Operation health'
Operation health red


Checks the operation health of the physical disks for all hosts in the vSAN cluster 

Disks with issues
| Host | Disk | Overall health | Metadata health | Operational health | In CMMDS/VSI | Operational State Description | Recommendation | UUID |
|---|---|---|---|---|---|---|---|---|
| 10.158.64.25 | Local ATA Disk (naa.55cd2e404b766b2c) | red | red | red | Yes/Yes | Stuck I/O is detected | Migrate the workload and power cycle the host | 52e9c739-e025-c001-eb29-62d02f0df0bc |
| 10.158.64.25 | Local ATA Disk (naa.55cd2e404b7733c8) | red | red | red | Yes/Yes | Stuck I/O is detected | Migrate the workload and power cycle the host | 52f4590c-149f-3e04-2e48-26249e39f8e6 |
| 10.158.64.25 | Local ATA Disk (naa.55cd2e404b7736d0) | red | red | red | Yes/Yes | Stuck I/O is detected | Migrate the workload and power cycle the host | 5296eb1f-c017-68b0-9c97-dea29ae522f8 |
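
The ScsiDeviceIO messages identify a disk by its naa ID, while the vobd.log events use the vSAN device UUID; the health rows above carry both. The sketch below extracts the host, naa ID, and UUID from rows flagged with stuck I/O. It assumes each row is available as one line of text, either whitespace-separated or pipe-separated, with the host in the first column; the real esxcli output layout may differ, and `stuck_disks` is a hypothetical helper name.

```python
import re

NAA_RE = re.compile(r"\((naa\.[0-9a-f]+)\)")
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def stuck_disks(rows):
    """Return (host, naa_id, vsan_uuid) for each row flagged with stuck I/O."""
    found = []
    for row in rows:
        if "Stuck I/O is detected" not in row:
            continue
        naa = NAA_RE.search(row)
        uuid = UUID_RE.search(row)
        # Treat pipes as whitespace so both row layouts are accepted.
        tokens = row.replace("|", " ").split()
        if naa and uuid and tokens:
            found.append((tokens[0], naa.group(1), uuid.group(0)))
    return found


# Rows copied from the health output in this article.
sample_rows = [
    "10.158.64.25 Local ATA Disk (naa.55cd2e404b766b2c) red red red Yes/Yes Stuck I/O is detected Migrate the workload and power cycle the host 52e9c739-e025-c001-eb29-62d02f0df0bc",
    "10.158.64.25 Local ATA Disk (naa.55cd2e404b7733c8) red red red Yes/Yes Stuck I/O is detected Migrate the workload and power cycle the host 52f4590c-149f-3e04-2e48-26249e39f8e6",
    "10.158.64.25 Local ATA Disk (naa.55cd2e404b7736d0) red red red Yes/Yes Stuck I/O is detected Migrate the workload and power cycle the host 5296eb1f-c017-68b0-9c97-dea29ae522f8",
]

for host, naa_id, vsan_uuid in stuck_disks(sample_rows):
    print(host, naa_id, vsan_uuid)
```

In this sample, the disk naa.55cd2e404b7736d0 named in the abort message maps to UUID 5296eb1f-c017-68b0-9c97-dea29ae522f8, the device reported in the stuckiooffline event.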

vSAN Skyline Health in vCenter



Environment

VMware vSAN 6.7.x
VMware vSAN 7.0.x
VMware vSAN 8.0.x

Resolution

Migrate the workload and power cycle the host. After the power cycle, collect a vm-support bundle along with the driver and firmware logs. These issues are caused by faulty hardware or firmware bugs, so the customer needs to collect the hardware logs (storcli and/or sascli logs) and open a case with the hardware vendor.

As of ESXi 7.0 U3, the disk or disk group is set to the offline state when stuck I/O is detected. In ESXi 7.0 releases before 7.0 U3, and in ESXi 6.7 from 6.7 U3 onwards, the host instead fails with a PSOD to ensure that it doesn't affect other hosts in the cluster.
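
The version-dependent behavior can be summarized as a small lookup. This sketch encodes only what the paragraph above states; the function name `stuck_io_response` and the returned strings are illustrative, and releases outside the stated ranges are reported as not covered rather than guessed.

```python
def stuck_io_response(major, minor, update):
    """Return how the host reacts to stuck I/O for a given ESXi release.

    Encodes only the ranges stated in this article:
    7.0 U3 and later -> disk or disk group set offline;
    6.7 U3 up to (but not including) 7.0 U3 -> host PSOD.
    """
    version = (major, minor, update)
    if version >= (7, 0, 3):
        return "disk or disk group set offline"
    if version >= (6, 7, 3):
        return "host PSOD"
    return "not covered by this article"


for release in [(6, 7, 3), (7, 0, 2), (7, 0, 3), (8, 0, 0)]:
    print(release, stuck_io_response(*release))
```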