How to handle lost or stuck I/O on a host in a vSAN cluster

Products

VMware vSAN

Issue/Introduction

Introduction:

Lost or wedged I/O is an I/O request that is stuck outside of ESXi (in the device controller or firmware): it does not complete, and it either does not respond to an abort or the abort itself never completes.

Since the I/O is stuck outside of ESXi, the only option ESXi has is to send an abort. If the device or controller does not respond to the abort within 120 seconds (the default timeout), vSAN takes the disk or disk group offline to avoid affecting the entire vSAN cluster.

Examples of Symptoms:

Check for the Skyline Health alarm "Operation health" over SSH (for example, with PuTTY) on any of the vSAN hosts:

Run the following command:

esxcli vsan health cluster get -t 'Operation health'

Example of output:

Operation health red

Host      Disk       Overall health  Metadata health  Operational health  In CMMDS/VSI  OperationalState Description  Recommendation                       UUID
HOSTNAME  Disk(xxx)  red             red              red                 Yes/Yes       Stuck I/O is detected         Migrate workload & power cycle host
HOSTNAME  Disk(xxx)  red             red              red                 Yes/Yes       Stuck I/O is detected         Migrate workload & power cycle host

Logs:

If I/O is stuck or lost on the storage controller or the storage disk, the ESXi storage stack tries to abort it with a task management request and logs console messages such as:

2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO succeeded for device naa.55cd2e404b7736d0 and the IO did not complete. WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be
2021-06-22T12:02:08.408Z cpu30:1001397101)retried.
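
To identify which device is affected, these messages can be parsed wherever they are captured. The following is a minimal sketch in Python: the regular expression is derived only from the single sample message above, so other ESXi builds may word the message differently, and the name `parse_abort_retry` is hypothetical.

```python
import re

# Pattern built from the one sample message shown above (assumption: the
# wording is stable; adjust the pattern if your build logs it differently).
ABORT_RETRY_RE = re.compile(
    r"^(?P<timestamp>\S+) cpu\d+:\d+\)ScsiDeviceIO: .*?"
    r"TaskMgmt op to cancel IO succeeded for device (?P<device>\S+) "
    r"and the IO did not complete\. WorldId (?P<world_id>\d+), "
    r"Cmd (?P<cmd>0x[0-9a-fA-F]+), CmdSN = (?P<cmd_sn>0x[0-9a-fA-F]+)"
)


def parse_abort_retry(line):
    """Return details of an abort-retry message, or None if the line differs."""
    m = ABORT_RETRY_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": m.group("timestamp"),
        "device": m.group("device"),
        "opcode": int(m.group("cmd"), 16),     # SCSI opcode, e.g. 0x28 is READ(10)
        "cmd_sn": int(m.group("cmd_sn"), 16),  # command serial number
    }


# The first physical line of the sample message from this article.
SAMPLE = (
    "2021-06-22T12:02:08.408Z cpu30:1001397101)ScsiDeviceIO: "
    "PsaScsiDeviceTimeoutHandlerFn:12834: TaskMgmt op to cancel IO succeeded "
    "for device naa.55cd2e404b7736d0 and the IO did not complete. "
    "WorldId 0, Cmd 0x28, CmdSN = 0x428.Cancelling of IO will be"
)

if __name__ == "__main__":
    print(parse_abort_retry(SAMPLE))
```

Repeated matches for the same device over a short period suggest that the abort is not taking effect and that the device is a candidate for the stuck I/O handling described in this article.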

If such a lost I/O is found on a host, vSAN takes the disk offline so that it does not affect other hosts in the cluster, as recorded in /var/run/log/vobd.log:

2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: [vob.vsan.lsom.stuckiooffline] vSAN device ########-########-####-####-####-########22f8 detected stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.237Z: [vSANCorrelator] 19607829404us: [esx.problem.vob.vsan.lsom.stuckiooffline] vSAN device ########-########-####-####-####-########22f8 detected stuck I/O error. Marking the device as offline

When Deduplication is not enabled: If the cache tier encounters stuck I/O, the entire disk group it manages is set to the offline state.

When Deduplication is enabled: If stuck I/O is detected on any disk, the entire disk group that disk belongs to is set to the offline state.

When the offline state is propagated to other devices in the disk group, vobd.log records events such as:

2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827040us: [vob.vsan.lsom.stuckiopropagated] vSAN device ########-########-####-####-####-########f0bc is under propagated stuck I/O error. Marking the device as offline.
2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607828405us: [esx.problem.vob.vsan.lsom.stuckiopropagated] vSAN device ########-########-####-####-####-########f0bc is under propagated stuck I/O error. Marking the device as offline.
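
To see at a glance which vSAN devices reported stuck I/O directly and which were taken offline by propagation, the vobd.log entries can be grouped by event type. This is a minimal sketch in Python under the assumption that the entries follow the format of the samples above; the name `stuck_io_events` is hypothetical. Only the `vob.vsan.lsom.*` entries are matched, because each event is logged a second time under `esx.problem.vob.vsan.lsom.*`.

```python
import re
from collections import defaultdict

# Matches only the "[vob.vsan.lsom...]" form so that the duplicate
# "[esx.problem.vob.vsan.lsom...]" entry is not counted twice.
# Format is taken from the samples in this article (assumption).
VOB_RE = re.compile(
    r"^(?P<timestamp>\S+): \[vSANCorrelator\] \d+us: "
    r"\[vob\.vsan\.lsom\.(?P<event>stuckiooffline|stuckiopropagated)\] "
    r"vSAN device (?P<uuid>\S+) "
)


def stuck_io_events(lines):
    """Group vSAN device UUIDs by stuck I/O event type, without duplicates."""
    events = defaultdict(list)
    for line in lines:
        m = VOB_RE.match(line)
        if m and m.group("uuid") not in events[m.group("event")]:
            events[m.group("event")].append(m.group("uuid"))
    return dict(events)


# Sample entries copied from this article (UUIDs are masked in the source).
SAMPLE_LINES = [
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827057us: [vob.vsan.lsom.stuckiooffline] vSAN device ########-########-####-####-####-########22f8 detected stuck I/O error. Marking the device as offline.",
    "2021-06-22T12:04:04.237Z: [vSANCorrelator] 19607829404us: [esx.problem.vob.vsan.lsom.stuckiooffline] vSAN device ########-########-####-####-####-########22f8 detected stuck I/O error. Marking the device as offline",
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607827040us: [vob.vsan.lsom.stuckiopropagated] vSAN device ########-########-####-####-####-########f0bc is under propagated stuck I/O error. Marking the device as offline.",
    "2021-06-22T12:04:04.236Z: [vSANCorrelator] 19607828405us: [esx.problem.vob.vsan.lsom.stuckiopropagated] vSAN device ########-########-####-####-####-########f0bc is under propagated stuck I/O error. Marking the device as offline.",
]

if __name__ == "__main__":
    for event, uuids in stuck_io_events(SAMPLE_LINES).items():
        print(event, uuids)
```

Against a real log, the same function can be given an open file handle, for example `stuck_io_events(open("/var/run/log/vobd.log"))`. The device listed under `stuckiooffline` is the one to report to the hardware vendor first, since the `stuckiopropagated` devices were taken offline as a consequence.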

Environment

VMware vSAN 6.7.x
VMware vSAN 7.0.x
VMware vSAN 8.0.x

Resolution

Migrate the workload and power cycle the host. After the power cycle, collect a vm-support bundle along with the driver and firmware logs.

These issues are caused by faulty hardware or firmware bugs.

Open a case with the hardware vendor.

Expected System Behavior when stuck or lost I/O is detected:

Versions 7.0 U3 and later: The disk or disk group is set to the offline state.

Versions prior to 7.0 U3: The host fails with a PSOD so that it does not affect other hosts in the cluster.