NVMe Disk transitions to Read-Only State following Stuck I/O and DPC Error


Article ID: 430655

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • A stuck I/O event is reported for an NVMe disk on an ESXi host.
  • vSAN reports the NVMe disk as healthy after it goes into a read-only state.

Environment

VMware vSAN 8.x

 

Cause

I/O to the device first timed out, and the driver then received a DPC error event indicating that a PCIe error had been detected on the device. This confirms that the underlying device had a problem at the time the I/O timeouts occurred. The disk then came back online, with the NVMe disk reporting a critical warning indicating that it had become read-only. This is a genuine hardware issue: the disk is marked read-only at the hardware level. vSAN fetches the device SMART statistics and reports the disk as READ-ONLY. vSAN's role is limited to reporting this read-only health warning for the NVMe disk.
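
The read-only condition described above is carried in the Critical Warning byte of the NVMe SMART / Health Information log page. The following is a minimal sketch of how that byte can be decoded, assuming the raw value has already been obtained from the device. The bit meanings follow the NVMe specification; the function and dictionary names are illustrative and are not part of vSAN or ESXi.

```python
# Bit meanings follow the Critical Warning field of the NVMe SMART / Health
# Information log page. All names below are illustrative (hypothetical).
CRITICAL_WARNING_BITS = {
    0: "available spare capacity below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the descriptions of all bits set in the critical warning byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


def is_read_only(value: int) -> bool:
    """True when bit 3 (media placed in read-only mode) is set."""
    return bool(value & (1 << 3))


if __name__ == "__main__":
    # Example raw values chosen for illustration only.
    for raw in (0x00, 0x08, 0x0C):
        print(f"0x{raw:02x}: read_only={is_read_only(raw)} {decode_critical_warning(raw)}")
```

A value with bit 3 set corresponds to the "disk has become read-only" condition that vSAN surfaces as a health warning.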

 

  • PSA issued task management aborts to controller 256:

2026-01-28T02:23:27.060Z In(182) vmkernel: cpu57:2097866)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x430a9dbdda80 CmdSN 0x1c world:0 controller 256 state:5 nsid:1

  • A yellow notification is issued to the upper layer about a possible stuck I/O condition:

2026-01-28T02:23:29.061Z In(182) vmkernel: cpu57:2097866)StorageDeviceIO: 5608: Task mgmt request issued to device eui.############################ is stuck (WorldID 0, CmdSN 1c). Issuing yellow notification to the application

  • Controller reset event:

2026-01-28T02:23:33.065Z In(182) vmkernel: cpu73:2098043)NVMEIO:4623 Ctlr 256, abort commands stuck, escalate to controller reset
2026-01-28T02:23:33.065Z In(182) vmkernel: cpu73:2098043)NVMEDEV:8245 Resetting controller 256 (nqn.2019-10.com.kioxia:##############:############)
2026-01-28T02:23:33.065Z In(182) vmkernel: cpu73:2098043)NVMEDEV:8260 Controller 256 state changed from 5 to 8(INRESET)

  • The PCI layer reports an error and requests removal of the device from the device layer. The hardware notifies ESXi that a hot-plug or error event occurred, and the PCIe port detects a fatal error from the device:

2026-01-28T02:23:33.073Z In(182) vmkernel: cpu2:2097563)VMKAcpi: 2414: \_SB_.PC01.D011: 0x4302c6f0a600 received ACPI event 0xf
2026-01-28T02:23:33.073Z In(182) vmkernel: cpu2:2097563)PCIEDPC: 1145: 0000:20:01.1: EDR event received
2026-01-28T02:23:33.074Z In(182) vmkernel: cpu2:2097563)PCIEDPC: 1220: 0000:20:01.1: Port experienced DPC, reason RP PIO error
2026-01-28T02:23:33.074Z In(182) vmkernel: cpu2:2097563)PCIEErrRecov: 194: 0000:20:01.1: Request made to remove device 0000:21:00.0 from device layer

  • The VMkernel logs above indicate a critical PCIe hardware error that triggered the protective mechanism known as Downstream Port Containment (DPC).
  • Controller admin queue reset on controller 256:

2026-01-28T02:23:33.075Z In(182) vmkernel: cpu73:2098043)NVMEDEV:8007 Reset admin queue (controller 256)

  • The controller fails to enable because the device is permanently unavailable:

2026-01-28T02:23:33.077Z Wa(180) vmkwarning: cpu73:2098043)WARNING: NVMEDEV:8343 Failed to enable controller 256, status: Device is permanently unavailable

  • PSA device detach event:

2026-01-28T02:23:33.077Z In(182) vmkernel: cpu42:2098075)NvmeAdapter: 3051: Unregistering adapter vmhba6
2026-01-28T02:23:33.077Z In(182) vmkernel: cpu42:2098075)StoragePsaDriver: 634: device 0x1158430d4ce178ee Detach complete [status=Success]
2026-01-28T02:23:33.077Z In(182) vmkernel: cpu42:2098075)Device: 412: storage_psa:driver->ops.detachDevice:0 ms

  • The PCI layer reports that error containment is complete:

2026-01-28T02:23:33.080Z In(182) vmkernel: cpu42:2098075)PCIEErrRecov: 681: 0000:20:01.1: Error containment done

  • PDL set on the NVMe disk:

2026-01-28T02:23:33.277Z Wa(180) vmkwarning: cpu81:2097839)WARNING: StorageDevice: 11908: PDL set on device path vmhba6:C0:T0:L0

  • vSAN reports that the disk is offline:

2026-01-28T02:23:33.288Z Wa(180) vmkwarning: cpu29:2099461)WARNING: LSOM: LSOMEventNotify:9026: vSAN device ########-####-####-####-############ has gone offline.

  • The NVMe disk reports a critical warning indicating that the disk has become read-only:

2026-01-28T02:26:35.677Z In(14) vobd[2097955]:  [vSANCorrelator] 1875399805us: [vob.vsan.lsom.readonlynvmediskhealthcriticalwarning] NVMe critical health warning for disk eui.############################ is: The disk has become read-only.
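
The log sequence above can be checked for programmatically. The following is a rough sketch, not a supported tool: it scans log lines for the message fragments quoted in the excerpts and reports whether a stuck I/O notification, a DPC event, and the read-only warning appear in that order. The excerpts carry vmkernel, vmkwarning, and vobd tags, so the input is assumed to be a time-ordered merge of those logs. The labels and function names are hypothetical.

```python
import re
import sys

# Message fragments copied from the log excerpts above; labels are illustrative.
SIGNATURES = [
    ("abort", re.compile(r"NVMEPSA:\d+ taskMgmt:abort")),
    ("stuck_io", re.compile(r"Task mgmt request issued to device \S+ is stuck")),
    ("controller_reset", re.compile(r"escalate to controller reset")),
    ("dpc", re.compile(r"Port experienced DPC")),
    ("device_removal", re.compile(r"Request made to remove device")),
    ("enable_failed", re.compile(r"Failed to enable controller \d+")),
    ("pdl", re.compile(r"PDL set on device path")),
    ("vsan_offline", re.compile(r"vSAN device \S+ has gone offline")),
    ("read_only", re.compile(r"readonlynvmediskhealthcriticalwarning")),
]

# ISO 8601 timestamp at the start of each line, as in the excerpts.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)")


def classify(line):
    """Return the label of the first signature found in the line, or None."""
    for label, pattern in SIGNATURES:
        if pattern.search(line):
            return label
    return None


def build_timeline(lines):
    """Collect (timestamp, label) pairs for every recognized line, in input order."""
    timeline = []
    for line in lines:
        label = classify(line)
        stamp = TIMESTAMP.match(line)
        if label and stamp:
            timeline.append((stamp.group(1), label))
    return timeline


def matches_failure_pattern(timeline):
    """True if stuck I/O, DPC, and the read-only warning occur in that order."""
    remaining = iter(label for _, label in timeline)
    # 'step in remaining' consumes the iterator, so this is a subsequence check.
    return all(step in remaining for step in ("stuck_io", "dpc", "read_only"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the path of a merged, time-ordered log file as the argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        events = build_timeline(handle)
    for stamp, label in events:
        print(stamp, label)
    print("pattern matched:", matches_failure_pattern(events))
```

A match only suggests that the host logs resemble the sequence in this article; the hardware vendor's assessment of the device remains the deciding factor.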

Resolution

Engage the hardware vendor to assess the health of the NVMe disk, investigate the device's read-only transition in detail, and replace the disk if the vendor recommends it.