NVMe: Intermittent Response during Stuck I/O Events
search cancel

NVMe: Intermittent Response during Stuck I/O Events

book

Article ID: 425451

calendar_today

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Stuck I/O event reported for a NVMe disk on an ESXi host.

Environment

VMware VSAN [All Versions]

 

Cause

This is caused by transient behavior of the NVME disk. The disk intermittently recovered and responded to I/O before becoming unresponsive again.

Sequence of events:

  • I/O timeout detected:

2025-12-13T09:05:10.747Z Wa(180) vmkwarning: cpu5:2098027)WARNING: PLOG: PLOG_DeviceHandleIOTimeOut:8792: vSAN device ########-####-####-####-############ detected I/O timeout error. This may lead to stuck I/O.

  • When NVME aborts are stuck escalated to Controller reset:

2025-12-13T09:05:14.750Z In(182) vmkernel: cpu108:2098238)NVMEDEV:8245 Resetting controller 261 (nqn.1994-11.com.samsung:nvme:PM1733:2.5-inch:##############)
2025-12-13T09:05:14.750Z In(182) vmkernel: cpu108:2098238)NVMEDEV:8260 Controller 261 state changed from 5 to 8(INRESET)

  • Event logged in the regular vSAN trace file indicates disk offline health flag 0x20:

2025-12-13T09:05:39.353231 [52536235] [cpu10] [148bf03d DOM_EVENT] DOMTraceEventHandleHealth:820: {'baseOp': 0x45bb98fb2940, 'healthOpState': 'DONE', 'errorCode': 'VMK_OK', 'uuid': ' ########-####-####-####-############', 'healthFlags': 0x20}

  • When the disk came back online and started logging the disk details again :

2025-12-13T09:05:50.603723 [52543402] [cpu24] [] PLOGTraceDevRefcountInc:1658: {'dev': 0x450182c49e68, 'diskUUID': ' ########-####-####-####-############', 'refCount': 3}

  • PDL set on vmhba4:

2025-12-13T09:07:22.258Z Wa(180) vmkwarning: cpu2:2098039)WARNING: StorageDevice: 11908: PDL set on device path vmhba4:C0:T0:L0

 

Resolution

Engage the hardware vendor to assess the health of the NVME disk which intermittently responds to I/O's.