NVME disk report stuck I/O with PCIE hot plug event detected

Products

VMware vSAN

Issue/Introduction

Symptoms:

Stuck I/O event reported for a NVMe disk on an ESXi host.

Heartbeat timeout for the VM Namespace reported:

2026-01-01T23:43:42.982Z In(14) vobd[2098145]: [vmfsCorrelator] 1968738685513us: [vob.vmfs.heartbeat.timedout] ########-########-####-############ ########-####-####-####-############

Heartbeat recovery events for the VM Namespaces reported. This is where the driver would have got a response from the device post resets that the driver would have performed. The recovery of the vSAN objects and I/O redirection occurred to minimize effect of these issues on the VMs and their services:

2026-01-01T23:44:32.510Z  In(14) vobd[2098146]:  [vmfsCorrelator] 1149814230097us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume ########-########-####-############ (########-####-####-####-############): [Timeout] [HB state abcdef02 offset 3702784 gen 81 stampUS 1149814222874 uuid ##########-########-####-############ jrnl <FB 7> drv 24.82]

Environment

VMware VSAN [All Versions]

Cause

PCIe slot detected device insertion event indicating underlying device issue.

Sequence of events observed for controller 261 :

The disk momentarily lost contact with the PCIe bus:

2026-01-01T23:43:33.469Z Wa(180) vmkwarning: cpu0:2097948)WARNING: PCIEHP: 765: 0000:40:01.2: hotplug slot: 0xa1: Device insertion detected while prior device 0000:42:00.0 removal is still pending

PSA issued aborts:

2026-01-01T23:44:08.478Z Wa(180) vmkwarning: cpu86:2098631)WARNING: NVMEIO:2645 command 0x45c0d4fe96c0 failed: ctlr 261, queue 1, psaCmd 0x45c0debd95c0, status 0x805, opc 0x2, cid 0, nsid 1

Read command failure to the controller 261(vmhba1):

2026-01-01T23:44:08.478Z Wa(180) vmkwarning: cpu6:2098864)WARNING: HPP: HppNvmeThrottleLogForDevice:600: NVMe Cmd 0x2 (0x45c0debd95c0, 0) to dev "eui.#############################" on path "vmhba1:C0:T0:L0" Failed:

Aborts issued to the device:

2026-01-01T23:44:02+00:00 vmkernel: cpu12:2098625)NVMEIO:4776 cmd2Abort 0x45c0d4f4b2c0, opcode 0x2, nsid 1, lba 192575968, lbc 7
2026-01-01T23:44:02+00:00 vmkernel: cpu16:2098439)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x430c2162f300 CmdSN 0x48e345eb world:0 controller 261 state:5 nsid:1
2026-01-01T23:44:02+00:00 vmkernel: cpu86:2098631)NVMEIO:4770 Issuing command to cancel cmd 0x45c0d4f9fcc0 (tag 0x0) on queue 1, tracker 0x431e9734f3f8, cid 871

Aborts commands stuck escalated to controller reset:

2026-01-01T23:44:08+00:00 vmkernel: cpu0:2098629)NVMEIO:4623 Ctlr 261, abort commands stuck, escalate to controller reset

Controller reset event:

2026-01-01T23:44:08.478Z Wa(180) vmkwarning: cpu86:2098631)WARNING: NVMEIO:2645 command 0x45c0d4e1f8c0 failed: ctlr 261, queue 1, psaCmd 0x45c0dea8f5c0, status 0x805, opc 0x2, cid 196, nsid 1

Controller in recovery mode:

2026-01-01T23:44:14.476Z Wa(180) vmkwarning: cpu45:18556647)WARNING: NVMEIO:4011 Controller 261 in state 8 or in recovery mode, bail out.

Disk repair event:

2026-01-01T23:44:32.435Z Wa(180) vmkwarning: cpu5:2098864)WARNING: PLOG: PLOGHandleTransientErrorInt:5612: vSAN device ########-####-####-####-########### is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device$

Disk has gone offline event:

2026-01-01T23:44:56.480Z In(14) vobd[2098531]: [vSANCorrelator] 4287229562180us: [esx.problem.vob.vsan.pdl.offline] vSAN device ########-####-####-####-############ has gone offline.

Resolution

Engage hardware vendor to asses the health of the NVMe disk reporting PCIE hot plug controller error.