Stuck I/O event reported for an NVMe disk with transient behavior

Article ID: 425445

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Stuck I/O event reported for more than one NVMe disk on an ESXi host.
  • Heartbeat timeout for the VM Namespace reported:

2025-11-24T12:22:44.197Z In(14) vobd[2097954]:  [vmfsCorrelator] 5027159999907us: [esx.problem.vmfs.heartbeat.timedout] ########-#######-####-########### ########-#######-####-###########

  • Heartbeat recovery events for the VM Namespaces reported:

2025-11-24T12:23:31.128Z In(14) vobd[2097954]:  [vmfsCorrelator] 5027206931363us: [esx.problem.vmfs.heartbeat.recovered] ########-#######-####-########### ########-#######-####-###########
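To gauge how long a namespace heartbeat was lost, the timeout and recovery events can be paired by timestamp. The following is a minimal sketch, assuming the vobd log lines have the format quoted above; the function names `heartbeat_events` and `outage_seconds` are hypothetical helpers, not part of any VMware tooling.

```python
import re
from datetime import datetime

# Matches vobd heartbeat lines in the format quoted in this article.
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\s.*"
    r"\[esx\.problem\.vmfs\.heartbeat\.(?P<event>timedout|recovered)\]"
)


def heartbeat_events(lines):
    """Return (timestamp, event) tuples sorted by time for heartbeat lines."""
    events = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
            events.append((ts, match.group("event")))
    return sorted(events)


def outage_seconds(events):
    """Seconds from the first 'timedout' to the next 'recovered', or None."""
    start = None
    for ts, event in events:
        if event == "timedout" and start is None:
            start = ts
        elif event == "recovered" and start is not None:
            return (ts - start).total_seconds()
    return None
```

Applied to the two excerpts above, the gap between the timeout (12:22:44.197Z) and the recovery (12:23:31.128Z) is roughly 47 seconds. The sketch does not distinguish between namespaces; when several objects are affected, the lines would need to be grouped by UUID first.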

Environment

VMware vSAN [All Versions]

 

Cause

The NVMe devices were in an unreliable, transient state, which increased the time taken to declare Permanent Device Loss (PDL). While PDL was pending, VM I/O remained outstanding until the VMs reached a threshold, after which they either became unresponsive or crashed.

Sequence of events:

  • The Pluggable Storage Architecture (PSA) issued aborts to controllers 256 and 257:

2025-11-24T12:22:11+00:00 vmkernel: cpu72:2098044)NVMEIO:4776 cmd2Abort 0x45de9093b200, opcode 0x2, nsid 1, lba 1877732352, lbc 127

  • The NVMe abort commands also got stuck, so abort processing was escalated to a controller reset for vmhba7:

2025-11-24T12:22:17+00:00 vmkernel: cpu49:2098044)NVMEDEV:8260 Controller 257 state changed from 5 to 8(INRESET)
2025-11-24T12:22:17+00:00 vmkernel: cpu49:2098044)NVMEDEV:8245 Resetting controller 257 (nqn.1994-11.com.samsung:nvme:#####:2.5-inch:############)
2025-11-24T12:22:17+00:00 vmkernel: cpu49:2098044)NVMEIO:4623 Ctlr 257, abort commands stuck, escalate to controller reset
 

  • Controller 256 (vmhba6) reset event:

2025-11-24T12:22:32.237Z Wa(180) vmkwarning: cpu49:6775399)WARNING: NVMEIO:4011 Controller 256 in state 8 or in recovery mode, bail out.

  • Disk repair event logged by LSOM:

2025-11-24T12:22:41.074Z In(14) vobd[2097955]:  [vSANCorrelator] 4075554822606us: [esx.problem.vob.vsan.lsom.devicerepair] Device ########-####-####-####-############ is in offline state and is getting repaired.

  • The vmware.log of the affected VM shows a write that succeeded only after a long delay, which indicates the transient condition:

2025-11-24T12:23:33.619Z No(00) Upcall-38797af - UNUSUAL: Successful write to '/vmfs/volumes/vsan:##############-###############/########-####-###-####-#########/vmware.log' took 41.631960 seconds.

  • The VM was hard reset after waiting a long time for a response to its I/Os:

2025-11-24T12:23:39.105Z In(05) vcpu-0 - Chipset: The guest has requested that the virtual machine be hard reset.

  • PDL event on vmhba6 and vmhba7:

2025-11-24T12:24:30.757Z Wa(180) vmkwarning: cpu4:2097848)WARNING: StorageDevice: 11908: PDL set on device path vmhba6:C0:T0:L0

2025-11-24T12:24:24.586Z Wa(180) vmkwarning: cpu4:2097848)WARNING: StorageDevice: 11908: PDL set on device path vmhba7:C0:T0:L0
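The sequence above can be reconstructed from collected logs by tagging key lines and sorting them by timestamp. The following is a minimal sketch, assuming the relevant lines from vmkernel.log, vmkwarning.log, vobd.log and vmware.log have already been gathered into one list; the marker strings are taken from the excerpts above, while the labels and the function names `parse_ts`, `build_timeline` and `seconds_between` are hypothetical.

```python
from datetime import datetime

# (label, substring) pairs; substrings come from the log excerpts in this article.
MARKERS = [
    ("abort issued", "cmd2Abort"),
    ("controller reset", "escalate to controller reset"),
    ("vSAN device repair", "esx.problem.vob.vsan.lsom.devicerepair"),
    ("VM hard reset", "hard reset"),
    ("PDL set", "PDL set on device path"),
]


def parse_ts(line):
    """Parse the leading timestamp; handles both '+00:00' and 'Z' suffixes."""
    token = line.split(maxsplit=1)[0]
    if token.endswith("Z"):
        token = token[:-1] + "+00:00"
    return datetime.fromisoformat(token)


def build_timeline(lines):
    """Return (timestamp, label) tuples, sorted by time, for recognised lines."""
    events = []
    for line in lines:
        for label, needle in MARKERS:
            if needle in line:
                events.append((parse_ts(line), label))
                break
    events.sort()
    return events


def seconds_between(events, first_label, last_label):
    """Seconds from the earliest first_label event to the latest last_label event.

    Raises ValueError if either label is absent from the timeline.
    """
    first = min(ts for ts, label in events if label == first_label)
    last = max(ts for ts, label in events if label == last_label)
    return (last - first).total_seconds()
```

With the excerpts in this article, the first abort (12:22:11) and the last PDL event (12:24:30.757Z) are about 140 seconds apart, while the VM hard reset (12:23:39.105Z) occurs about 88 seconds after the first abort, that is, before PDL was declared on either path.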

Resolution

Engage the hardware vendor to investigate the transient issue with the NVMe drives for potential hardware or firmware causes, as such errors often originate from an underlying hardware issue.