Disk group latency observed on a host in a vSAN cluster due to a stuck I/O event reported on an NVMe disk

Article ID: 413903


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Disk group latency observed on a host in a vSAN cluster.
  • Latency spikes are observed at the host/cluster level for a moment and then return to normal.
  • An I/O timeout event is observed for an NVMe disk, with stuck I/O reported on a host in the vmkwarning.log file (a log-scan sketch follows this list):

2025-09-08T11:37:15.079Z Wa(180) vmkwarning: cpu6:2098027)WARNING: PLOG: PLOG_DeviceHandleIOTimeOut:8792: vSAN device ########-####-####-####-############ detected I/O timeout error. This may lead to stuck I/O.

  • Heartbeat timeout for the VM Namespace reported:

2025-09-08T11:37:15.514Z In(14) vobd[2098145]:  [vmfsCorrelator] 1968738685513us: [vob.vmfs.heartbeat.timedout] ########-########-####-############ ########-####-####-####-############

  • Heartbeat recovery events are reported for the VM namespaces. This is the point at which the driver received a response from the device after the resets it performed. Recovery of the vSAN objects and I/O redirection occurred to minimize the effect of these issues on the VMs and their services:

2025-09-08T11:37:37.098Z In(14) vobd[2098146]:  [vmfsCorrelator] 1149814230097us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume ########-########-####-############ (########-####-####-####-############): [Timeout] [HB state abcdef02 offset 3702784 gen 81 stampUS 1149814222874 uuid ##########-########-####-############ jrnl <FB 7> drv 24.82]
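
The events above can be pulled out of the host logs directly. The following is a minimal sketch and not part of the product: it assumes local copies of vmkwarning.log and vobd.log (for example, extracted from a vm-support bundle), searches them for the signatures shown above, and estimates how long each heartbeat was lost. The file names and report format are assumptions, and a fuller version would pair timeout and recovery lines on the volume UUID rather than on order alone.

#!/usr/bin/env python3
# Minimal sketch: scan local copies of vmkwarning.log and vobd.log
# (file names are assumptions, e.g. taken from a vm-support bundle)
# for the symptom signatures shown above and estimate how long each
# VM namespace heartbeat was lost.
import re
from datetime import datetime

VMKWARNING_LOG = "vmkwarning.log"   # assumed local copy
VOBD_LOG = "vobd.log"               # assumed local copy

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z")

def parse_ts(line):
    # Return the leading ISO-8601 timestamp as a datetime, or None.
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f") if m else None

def find(path, needle):
    # Return (timestamp, line) pairs for lines containing the needle.
    hits = []
    with open(path, errors="replace") as f:
        for line in f:
            if needle in line:
                hits.append((parse_ts(line), line.rstrip()))
    return hits

timeouts = find(VMKWARNING_LOG, "PLOG_DeviceHandleIOTimeOut")
hb_lost  = find(VOBD_LOG, "vob.vmfs.heartbeat.timedout")
hb_back  = find(VOBD_LOG, "vob.vmfs.heartbeat.recovered")

for ts, line in timeouts:
    print("I/O timeout:", line)

# Pair each heartbeat timeout with the first recovery logged after it.
# (A fuller version would match on the volume UUID in each message.)
for lost_ts, _ in hb_lost:
    later = [r for r in hb_back if r[0] and lost_ts and r[0] > lost_ts]
    if later:
        delta = (later[0][0] - lost_ts).total_seconds()
        print(f"Heartbeat lost at {lost_ts}, recovered after ~{delta:.1f}s")

In the excerpts above, the gap between the heartbeat timeout at 11:37:15.514 and the recovery at 11:37:37.098 is roughly 21-22 seconds, which matches the momentary latency spike seen at the host/cluster level.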

 

Environment

VMware vSAN

Cause

The stuck I/O events on the NVMe device are caused by the underlying device not responding to I/Os or aborts. This led to the driver issuing a reset, which failed, and consequently the device was marked offline by the driver. The sequence of events is visible in the log excerpts below; a timeline-extraction sketch follows them.

  • PSA issued aborts to the device:
2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu2:2098523)WARNING: NvmeUtil: 151: Error on Cmd(0x45c37f30c840) 0x2, CmdSN 0x1b3c6a89 from world 0 to component "t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________###############"  H:0x6 D:0x0 P:0x0

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45bb30fc2900 failed: ctlr 256, queue 1, psaCmd 0x45c389b7f4c0, status 0x805, opc 0x2, cid 114, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45bb30f00300 failed: ctlr 256, queue 1, psaCmd 0x45c389a0ccc0, status 0x805, opc 0x2, cid 131, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45bb30f89900 failed: ctlr 256, queue 1, psaCmd 0x45c37f2f4e40, status 0x805, opc 0x2, cid 132, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45c39b28eb40 failed: ctlr 256, queue 1, psaCmd 0x45c389aa8cc0, status 0x805, opc 0x2, cid 135, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45bb30f90300 failed: ctlr 256, queue 1, psaCmd 0x45c38a03c7c0, status 0x805, opc 0x2, cid 143, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45c39d62e140 failed: ctlr 256, queue 1, psaCmd 0x45c389ac54c0, status 0x805, opc 0x2, cid 155, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45bb30fe9700 failed: ctlr 256, queue 1, psaCmd 0x45c39d33e700, status 0x805, opc 0x2, cid 190, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu2:2098523)WARNING: NvmeUtil: 151: Error on Cmd(0x45c38c463cc0) 0x2, CmdSN 0x1b3c6a04 from world 0 to component "t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________###############"  H:0x0 D:0x371 P:0x0

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45bb30f0fd00 failed: ctlr 256, queue 1, psaCmd 0x45c38a1f9fc0, status 0x805, opc 0x2, cid 213, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45c39b320940 failed: ctlr 256, queue 1, psaCmd 0x45c389b602c0, status 0x805, opc 0x2, cid 255, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu38:2098198)WARNING: NVMEIO:2645 command 0x45c397802840 failed: ctlr 256, queue 1, psaCmd 0x45c389a306c0, status 0x805, opc 0x2, cid 288, nsid 1

2025-09-08T11:37:16.423Z Wa(180) vmkwarning: cpu2:2098523)WARNING: NvmeUtil: 151: Error on Cmd(0x45c3726f0d00) 0x2, CmdSN 0x1b3c6a87 from world 0 to component "t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________###############"  H:0x6 D:0x0 P:0x0
 
  • Controller reset:

2025-09-08T11:37:33.082Z Wa(180) vmkwarning: cpu24:10045396)WARNING: NVMEIO:4011 Controller 256 in state 8 or in recovery mode, bail out.

  • Device repair event:

2025-09-08T11:37:36.414Z Wa(180) vmkwarning: cpu0:2099871)WARNING: PLOG: PLOGHandleTransientErrorInt:5612: vSAN device ########-####-####-####-############ is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device$

2025-09-08T11:37:36.414Z In(14) vobd[2098145]:  [vSANCorrelator] 1968806263750us: [esx.problem.vob.vsan.lsom.devicerepair] Device ########-####-####-####-############ is in offline state and is getting repaired.

  • LSOM event showing the disk has gone offline:

2025-09-08T11:37:36.424Z Wa(180) vmkwarning: cpu49:2099615)WARNING: LSOM: LSOMEventNotify:9026: vSAN device ########-####-####-####-############ has gone offline.

  • Task management abort requests are also stuck:

2025-09-08T11:37:45.082Z In(182) vmkernel: cpu50:10045441)StorageDeviceIO: 5608: Task mgmt request issued to device t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________############### is stuck (WorldID 0, CmdSN 1b3c69f0). Issuing yellow notification to the application

  • Device under transient error processing:

2025-09-08T11:37:45.082Z Wa(180) vmkwarning: cpu4:2098027)WARNING: PLOG: PLOG_DeviceHandleIOTimeOut:8783: Device ########-####-####-####-############ is under transient error processing


  • APD event:

2025-09-08T11:37:55.081Z In(14) vobd[2098145]:  [psastorCorrelator] 1968778250781us: [vob.psastor.psastorpath.pathstate.dead] storagePath vmhba2:C0:T0:L0 changed state from on (device ID: t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________###############)

2025-09-08T11:37:55.081Z In(14) vobd[2098145]:  [APDCorrelator] 1968824930353us: [esx.problem.storage.apd.start] Device or filesystem with identifier [t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________###############] has entered the All Paths Down state.

  • Device in PDL state:

2025-09-08T11:37:55.082Z In(14) vobd[2098145]:  [psastorCorrelator] 1968778250844us: [vob.psastor.device.state.permanentloss] Device :t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________############### has been removed or is permanently inaccessible.


2025-09-08T11:37:55.082Z In(14) vobd[2098145]:  [psastorCorrelator] 1968824931292us: [esx.problem.psastor.device.state.permanentloss] Device: t10.NVMe____Dell_Ent_NVMe_CM6_RI_3.84TB_____________############### has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.

2025-09-08T11:37:57.091Z In(14) vobd[2098145]:  [vSANCorrelator] 1968780261049us: [vob.vsan.pdl.offline] vSAN device ########-####-####-####-############ has gone offline.

  • LSOM event indicating disk not found:

2025-09-08T11:37:57.091Z Wa(180) vmkwarning: cpu20:2099615)WARNING: LSOM: LSOMPostDiskEvent:3788: Throttled: Unable to post disk event for ########-####-####-####-############: Not found


  • Failed to enable the controller:

2025-09-08T11:38:19.975Z Wa(180) vmkwarning: cpu50:2098524)WARNING: NVMEDEV:8343 Failed to enable controller 256, status: Device is permanently unavailable
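
The sequence described in this section (aborts, controller reset, device repair, offline, APD, PDL) can be reconstructed as a single timeline from the same log copies. The sketch below is illustrative only: the file names are assumptions, and the marker strings are taken from the excerpts above and may differ between ESXi releases.

#!/usr/bin/env python3
# Minimal sketch: build a chronological timeline of the failure sequence
# described above from local copies of vmkwarning.log, vmkernel.log and
# vobd.log (file names are assumptions). The marker strings come from the
# excerpts in this article and may vary between releases.
import re

LOGS = ["vmkwarning.log", "vmkernel.log", "vobd.log"]   # assumed local copies

MARKERS = {
    "PLOG_DeviceHandleIOTimeOut":     "I/O timeout on vSAN device",
    "NvmeUtil":                       "PSA abort / command error",
    "NVMEIO":                         "NVMe command failure",
    "in recovery mode, bail out":     "controller reset / recovery",
    "PLOGHandleTransientErrorInt":    "device repair started",
    "LSOMEventNotify":                "vSAN device offline (LSOM)",
    "Task mgmt request issued":       "task management abort stuck",
    "esx.problem.storage.apd.start":  "APD start",
    "device.state.permanentloss":     "PDL",
    "Failed to enable controller":    "controller enable failed",
}

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)")

events = []
for path in LOGS:
    try:
        with open(path, errors="replace") as f:
            for line in f:
                for marker, label in MARKERS.items():
                    if marker in line:
                        m = TS_RE.match(line)
                        ts = m.group(1) if m else "unknown-timestamp"
                        events.append((ts, label, line.rstrip()))
                        break
    except FileNotFoundError:
        print(f"skipping missing log: {path}")

# ISO-8601 timestamps sort chronologically as plain strings.
for ts, label, line in sorted(events):
    print(f"{ts}  [{label}]  {line}")

A timeline that shows aborts and a controller reset that never recovers, followed by the device going offline and entering APD/PDL, is consistent with the cause described in this article.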

 
 

Resolution

Engage your hardware vendor to investigate potential hardware or firmware-related causes, as such errors often originate from an underlying hardware issue.