How to handle lost or stuck I/O on a host in a vSAN cluster
Article ID: 326885
Updated On:
Products
VMware vSAN
Issue/Introduction
This article provides information on resolving stuck I/O in a vSAN environment.
Impact/Risks: Lost or wedged I/O is I/O that is stuck outside of ESXi (in the device controller or firmware): it never completes, and it either does not respond to an abort or the abort itself never completes. Because the I/O is stuck outside of ESXi, the only option ESXi has is to send an abort. If the device or controller does not respond to the abort within 120 seconds (the default timeout), vSAN takes the disk or disk group (DG) offline to avoid affecting the entire vSAN cluster.
Symptoms:
If I/O is stuck or lost on the storage controller or the storage disk, the ESXi storage stack tries to abort it using a task management request and logs console messages such as the following:
2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO succeeded for device naa.55cd2e404b7736d0 and the IO did not complete. WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be
2021-06-22T12:02:08.408Z cpu30:1001397101)retried.
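When triaging many log lines, a message like the one above can be picked out with a regular expression. The following is a minimal Python sketch, not a VMware tool. The function name `find_lost_io` and the field names are hypothetical. The pattern is derived only from the single sample message shown here, so other ESXi builds may word the message differently.

```python
import re

# Pattern built from the sample console message in this article.
# Captures the device ID, world ID, SCSI opcode and command serial number.
LOST_IO_RE = re.compile(
    r"TaskMgmt op to cancel IO succeeded for device (?P<device>\S+) "
    r"and the IO did not complete\. "
    r"WorldId (?P<world_id>\d+), Cmd (?P<cmd>0x[0-9a-fA-F]+), "
    r"CmdSN = (?P<cmd_sn>0x[0-9a-fA-F]+)"
)


def find_lost_io(lines):
    """Return one dict of captured fields per log line that matches."""
    hits = []
    for line in lines:
        match = LOST_IO_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# The sample message quoted in this article (first physical log line).
sample_lines = [
    "2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: "
    "PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO "
    "succeeded for device naa.55cd2e404b7736d0 and the IO did not complete. "
    "WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be",
]

for hit in find_lost_io(sample_lines):
    print(hit["device"], hit["cmd"], hit["cmd_sn"])
```

The function takes any iterable of lines, so it can be fed an open log file or lines extracted from a support bundle.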
If such a lost I/O is found on a host, vSAN takes the disk offline to ensure that it doesn't affect other hosts in the cluster.
We see the following alert in /var/run/log/vobd.log:
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: [vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.237Z: [vSANCorrelator] 19607829404us: [esx.problem.vob.vsan.lsom.stuckiooffline] vSAN device 5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. Marking the device as offline
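To list which vSAN devices were marked offline, the vobd.log entries can be grouped by device UUID. This is a rough Python sketch under stated assumptions: the function name `stuck_io_devices` is hypothetical, and the pattern assumes the line layout of the sample entries above, matching any event whose name starts with `vob.vsan.lsom.stuckio`.

```python
import re

# Layout assumed from the sample vobd.log entries in this article:
# <timestamp>: [vSANCorrelator] <n>us: [<event>] vSAN device <uuid> ...
STUCK_IO_RE = re.compile(
    r"(?P<timestamp>\S+): \[vSANCorrelator\] \d+us: "
    r"\[(?P<event>(?:esx\.problem\.)?vob\.vsan\.lsom\.stuckio\w+)\] "
    r"vSAN device (?P<uuid>[0-9a-f-]+) "
)


def stuck_io_devices(lines):
    """Map each vSAN device UUID to the set of stuck I/O event names seen."""
    devices = {}
    for line in lines:
        match = STUCK_IO_RE.search(line)
        if match:
            devices.setdefault(match.group("uuid"), set()).add(match.group("event"))
    return devices


# The two sample entries quoted in this article.
sample_lines = [
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: "
    "[vob.vsan.lsom.stuckiooffline] vSAN device "
    "5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. "
    "Marking the device as offline.",
    "2021-06-22T12:04:04.237Z: [vSANCorrelator] 19607829404us: "
    "[esx.problem.vob.vsan.lsom.stuckiooffline] vSAN device "
    "5296eb1f-c017-68b0-9c97-dea29ae522f8 detected stuck I/O error. "
    "Marking the device as offline",
]

for uuid, events in stuck_io_devices(sample_lines).items():
    print(uuid, sorted(events))
```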
If the cache device in a non-dedup disk group encounters stuck I/O, or if any disk in a dedup disk group encounters stuck I/O, the entire disk group is set to offline state.
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827040us: [vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607828405us: [esx.problem.vob.vsan.lsom.stuckiopropagated] vSAN device 52e9c739-e025-c001-eb29-62d02f0df0bc is under propagated stuck I/O error. Marking the device as offline.
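The rule for when a single device failure takes down the whole disk group can be written as a small predicate. This Python sketch only restates the rule described above. The function name and the role labels "cache" and "capacity" are chosen for illustration, and the capacity-disk case in a non-dedup disk group is inferred from the statement that vSAN takes the affected disk offline.

```python
def disk_group_goes_offline(device_role, dedup_enabled):
    """Return True if stuck I/O on this device takes the whole disk group offline.

    device_role:   "cache" or "capacity" (labels chosen for this sketch)
    dedup_enabled: True if the disk group has deduplication enabled
    """
    if device_role not in ("cache", "capacity"):
        raise ValueError("device_role must be 'cache' or 'capacity'")
    # Dedup disk group: stuck I/O on any disk affects the entire group.
    # Non-dedup disk group: only stuck I/O on the cache device does.
    return dedup_enabled or device_role == "cache"


print(disk_group_goes_offline("capacity", dedup_enabled=False))
print(disk_group_goes_offline("cache", dedup_enabled=False))
print(disk_group_goes_offline("capacity", dedup_enabled=True))
```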
In vCenter, the following health alert is shown if the cache device in a non-dedup disk group encounters stuck I/O. A similar health alert is shown if any disk in a dedup disk group encounters stuck I/O.
esxcli vsan health cluster get -t 'Operation health'
Operation health    red

Checks the operation health of the physical disks for all hosts in the vSAN cluster

Disks with issues
Host          Disk                                   Overall health  Metadata health  Operational health  In CMMDS/VSI  Operational State Description  Recommendation                                 UUID
10.158.64.25  Local ATA Disk (naa.55cd2e404b766b2c)  red             red              red                 Yes/Yes       Stuck I/O is detected          Migrate the workload and power cycle the host  52e9c739-e025-c001-eb29-62d02f0df0bc
10.158.64.25  Local ATA Disk (naa.55cd2e404b7733c8)  red             red              red                 Yes/Yes       Stuck I/O is detected          Migrate the workload and power cycle the host  52f4590c-149f-3e04-2e48-26249e39f8e6
10.158.64.25  Local ATA Disk (naa.55cd2e404b7736d0)  red             red              red                 Yes/Yes       Stuck I/O is detected          Migrate the workload and power cycle the host  5296eb1f-c017-68b0-9c97-dea29ae522f8
Migrate the workload and power cycle the host. After the host has been power cycled, collect a vm-support bundle along with the driver/firmware logs. These issues are caused by faulty hardware or firmware bugs. The customer needs to open a case with the hardware vendor and provide the hardware logs (storcli and/or sascli logs).
As of ESXi 7.0 U3, the disk or disk group is set to offline state when stuck I/O is detected. In earlier ESXi releases (versions before 7.0 U3, and 6.7 from 6.7 U3 onwards), the host is instead halted with a PSOD to ensure that it doesn't affect other hosts in the cluster.
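The version-dependent behavior can be summarized in a short lookup. This is an illustrative Python sketch based only on the statement above. The function name and return labels are hypothetical. Releases before 6.7 U3 are not covered by this article, so the sketch returns None for them rather than guessing.

```python
DISK_OFFLINE = "disk or disk group set offline"
HOST_PSOD = "host PSOD"


def stuck_io_response(major, minor, update):
    """Return the stuck I/O handling described in this article for an ESXi release.

    The release is given as (major, minor, update), e.g. 7.0 U3 -> (7, 0, 3).
    Returns None for releases the article does not describe.
    """
    version = (major, minor, update)
    if version >= (7, 0, 3):
        return DISK_OFFLINE
    if version >= (6, 7, 3):
        # Covers 6.7 U3 onwards and 7.0 releases before U3.
        return HOST_PSOD
    return None


for release in [(6, 7, 3), (7, 0, 2), (7, 0, 3)]:
    print(release, stuck_io_response(*release))
```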