NVMe disk reports a stuck I/O event and fails with a DPC error

Article ID: 420141

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • A stuck I/O event is reported for an NVMe disk on an ESXi host.

Environment

VMware vSAN [All Versions]

Cause

The I/Os timed out first, and the driver then received a DPC (Downstream Port Containment) error event, indicating that a PCIe error had been detected on the device. Together, these events confirm that the underlying device was already in a faulty state when the I/O timeouts occurred.

Sequence of events:

  • I/Os start to time out and PSA begins aborting them:

2025-10-20T07:04:35.140Z In(182) vmkernel: cpu1:2098056)NvmeDeviceIO: 1865: Start TSC for CmdSN 402a6fc9 is 6827793373 ms

2025-10-20T07:04:35.140Z In(182) vmkernel: cpu1:2098056)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x430bb43d8700 CmdSN 0x402a6fc9 world:0 controller 260 state:5 nsid:1

2025-10-20T07:04:35.140Z In(182) vmkernel: cpu1:2098056)NVMEIO:3974 Ctlr 260, ns 1, tmReq 0x431e83b73e80, type 1, initiator 0x430bb43d8700, sn 0x402a6fc9, world id 0.

2025-10-20T07:04:35.140Z In(182) vmkernel: cpu10:2098203)NVMEIO:4654 ctlr 260, queue 1, cid 870, cap 0x3, count 0, found cmd 0x45c39a221e00 (initiator 0x430bb43d8700, serialNumber 0x402a6fc9, worldID 0)

2025-10-20T07:04:35.140Z In(182) vmkernel: cpu10:2098203)NVMEIO:4770 Issuing command to cancel cmd 0x45c39a221e00 (tag 0x0) on queue 1, tracker 0x431e83b123c0, cid 870

2025-10-20T07:04:35.140Z In(182) vmkernel: cpu10:2098203)NVMEIO:4776 cmd2Abort 0x45c39a221e00, opcode 0x2, nsid 1, lba 1653849216, lbc 127

 

  • PSA sends the stuck I/O event notification:
    2025-10-20T07:04:37.140Z In(182) vmkernel: cpu5:2098056)StorageDeviceIO: 5697: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################ 

 

  • PLOG handles the stuck I/O notification:
    2025-10-20T07:04:37.140Z Wa(180) vmkwarning: cpu106:2098027)WARNING: PLOG: PLOG_DeviceHandleIOTimeOut:8792: vSAN device 525ca049-####-####-####-190b6fa0656a detected I/O timeout error. This may lead to stuck I/O.

 

  • The driver resets the controller:
    2025-10-20T07:04:41.143Z In(182) vmkernel: cpu0:2098205)NVMEDEV:8245 Resetting controller 260 (nqn.2014-08.org.nvmexpress_8086_Dell_Express_Flash_NVMe_P4510_4TB_SFF___################)

 

  • The driver fails to bring the controller back online:
    2025-10-20T07:04:41.181Z Wa(180) vmkwarning: cpu1:2098205)WARNING: NVMEDEV:8343 Failed to enable controller 260, status: Device is permanently unavailable

 

  • The driver removes the controller and generates the PDL (Permanent Device Loss) event:
    2025-10-20T07:04:41.181Z In(182) vmkernel: cpu80:2098336)NvmeAdapter: 3051: Unregistering adapter vmhba6
    2025-10-20T07:04:41.182Z In(182) vmkernel: cpu80:2098336)StoragePsaDriver: 634: device 0x7b46430e74210533 Detach complete [status=Success]
    2025-10-20T07:04:41.182Z In(182) vmkernel: cpu80:2098336)Device: 412: storage_psa:driver->ops.detachDevice:0 ms
    2025-10-20T07:04:41.182Z In(182) vmkernel: cpu80:2098336)Device: 1721: Unregistered device: 0x430e74201220 logical#pci#p0000:c6:00.0#0#0 com.vmware.StorHBAPort
    2025-10-20T07:04:41.182Z Wa(180) vmkwarning: cpu80:2098336)WARNING: NvmeAdapter: 3155: Releasing adapter vmhba6
    ...
    2025-10-20T07:04:41.182Z Wa(180) vmkwarning: cpu80:2098336)WARNING: NVMEDEV:3236 Failed to read controller csts register, status: Device is permanently unavailable
    ...
    2025-10-20T07:04:41.182Z In(182) vmkernel: cpu80:2098336)nvme_pcie001980000:RemoveDevice:232:Device 0x62d8430e7420f3f3 removed.
    ...
    2025-10-20T07:04:41.371Z In(182) vmkernel: cpu3:2098030)HPP: HppPathGroupMovePath:688: Path "vmhba6:C0:T0:L0" state changed from "active" to "dead"

 

  • PLOG processes the PDL event and marks the disk offline:
    2025-10-20T07:04:41.371Z In(182) vmkernel: cpu3:2098030)PLOG: PLOGLogDiskEvent:4135: Disk Event unplug for MD 525ca049-####-####-####-190b6fa0656a (t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___000194F1BFE4D25C:2)
    ...
    2025-10-20T07:04:41.381Z Wa(180) vmkwarning: cpu44:2099494)WARNING: LSOM: LSOMEventNotify:9026: vSAN device 525ca049-####-####-####-190b6fa0656a has gone offline.

 

  • The PCI layer reports a DPC error on the port and vmkdevmgr requests that the driver detach the device, which is why the device is removed:

2025-10-20T07:04:41.179Z In(182) vmkernel: cpu1:2097691)PCIEDPC: 1220: 0000:c0:03.3: Port experienced DPC, reason RP PIO error
2025-10-20T07:04:41.179Z In(182) vmkernel: cpu1:2097691)PCIEErrRecov: 194: 0000:c0:03.3: Request made to remove device 0000:c6:00.0 from device layer
...
2025-10-20T07:04:41.179Z In(182) vmkernel: cpu80:2098336)nvme_pcie:ForgetDevice:411:Called with 0x6225430e7420c5b2.
2025-10-20T07:04:41.179Z In(182) vmkernel: cpu80:2098336)nvme_pcie001980000:ForgetDevice:419:Device 0x6225430e7420c5b2 forgotten.
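
The sequence above can be checked against a saved vmkernel log with a short script. The following is a minimal sketch, not a VMware tool: the stage labels and function names are hypothetical, and the patterns are taken only from the log excerpts in this article. Source line numbers such as 1345 or 8245 may differ between builds, so they are matched with `\d+`. The sample lines are copied from the excerpts above in the order the article lists them.

```python
import re

# Leading ISO-8601 timestamp as it appears in the excerpts above.
TIMESTAMP = re.compile(r"^\s*(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)")

# (hypothetical stage label, signature copied from the article's log lines)
SIGNATURES = [
    ("io_abort", re.compile(r"NVMEPSA:\d+ taskMgmt:abort")),
    ("stuck_io_event", re.compile(r"FDS_DEV_EVENT_REPORT_STUCK_IO")),
    ("plog_io_timeout", re.compile(r"PLOG_DeviceHandleIOTimeOut")),
    ("controller_reset", re.compile(r"NVMEDEV:\d+ Resetting controller")),
    ("dpc_error", re.compile(r"PCIEDPC: \d+: \S+ Port experienced DPC")),
    ("enable_failed", re.compile(r"NVMEDEV:\d+ Failed to enable controller")),
    ("path_dead", re.compile(r'state changed from "active" to "dead"')),
    ("disk_offline", re.compile(r"LSOMEventNotify:\d+: vSAN device \S+ has gone offline")),
]

DPC_DETAIL = re.compile(r"PCIEDPC: \d+: (\S+): Port experienced DPC, reason (.+)")


def build_timeline(lines):
    """Return (timestamp, stage) pairs for matching lines, ordered by time."""
    events = []
    for line in lines:
        ts = TIMESTAMP.match(line)
        if not ts:
            continue  # skip "..." markers and lines without a timestamp
        for stage, pattern in SIGNATURES:
            if pattern.search(line):
                events.append((ts.group(1), stage))
                break
    # Fixed-width UTC timestamps sort correctly as strings; the sort is stable.
    return sorted(events, key=lambda event: event[0])


def timeouts_precede_dpc(timeline):
    """True when the first I/O abort is logged before the first DPC error."""
    first_abort = next((ts for ts, stage in timeline if stage == "io_abort"), None)
    first_dpc = next((ts for ts, stage in timeline if stage == "dpc_error"), None)
    return first_abort is not None and first_dpc is not None and first_abort < first_dpc


def parse_dpc(line):
    """Return (port address, reason) from a PCIEDPC line, or None."""
    match = DPC_DETAIL.search(line.strip())
    return (match.group(1), match.group(2)) if match else None


SAMPLE_LOG = """
2025-10-20T07:04:35.140Z In(182) vmkernel: cpu1:2098056)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x430bb43d8700 CmdSN 0x402a6fc9 world:0 controller 260 state:5 nsid:1
2025-10-20T07:04:37.140Z In(182) vmkernel: cpu5:2098056)StorageDeviceIO: 5697: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################
2025-10-20T07:04:41.143Z In(182) vmkernel: cpu0:2098205)NVMEDEV:8245 Resetting controller 260 (nqn.2014-08.org.nvmexpress_8086_Dell_Express_Flash_NVMe_P4510_4TB_SFF___################)
2025-10-20T07:04:41.181Z Wa(180) vmkwarning: cpu1:2098205)WARNING: NVMEDEV:8343 Failed to enable controller 260, status: Device is permanently unavailable
2025-10-20T07:04:41.381Z Wa(180) vmkwarning: cpu44:2099494)WARNING: LSOM: LSOMEventNotify:9026: vSAN device 525ca049-####-####-####-190b6fa0656a has gone offline.
2025-10-20T07:04:41.179Z In(182) vmkernel: cpu1:2097691)PCIEDPC: 1220: 0000:c0:03.3: Port experienced DPC, reason RP PIO error
"""

if __name__ == "__main__":
    timeline = build_timeline(SAMPLE_LOG.splitlines())
    for timestamp, stage in timeline:
        print(timestamp, stage)
    print("I/O timeouts precede DPC:", timeouts_precede_dpc(timeline))
```

To run it against a real log, pass the lines of the file to `build_timeline` instead of `SAMPLE_LOG.splitlines()`. Sorting by timestamp places the DPC entry between the controller reset and the failed controller enable, which the article's grouping by component does not show directly.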

Resolution

Engage the hardware vendor to assess the health of the NVMe disk reporting the DPC error.