vSAN host crashed with an NVMe disk-related event (PsaStorDeviceVirtResetAll) prior to the PSOD.

Article ID: 414989

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • The vSAN host crashed with the following PSOD backtrace:

2025-09-12T19:57:44.886Z cpu32:2098027)0x453adb59bdf0:[0x42002d520c57]PsaStorDeviceDrainOutstandingCmds@vmkernel#nover+0x9f stack: 0xffffffff00000000, 0x800000003, 0x101000000000009, 0x10, 0x430bb43d216c
2025-09-12T19:57:44.903Z cpu32:2098027)0x453adb59be50:[0x42002d52123e]PsaStorDeviceVirtResetAll@vmkernel#nover+0x247 stack: 0x300000010, 0x430bb442c370, 0x10edb5fd00d474, 0x430bb43d20c0, 0x430bb43d21f0
2025-09-12T19:57:44.921Z cpu32:2098027)0x453adb59beb0:[0x42002d518296]PsaStorDeviceAPDEventHandler@vmkernel#nover+0x3f3 stack: 0x42002d169864, 0x10edb5fd00d474, 0x0, 0x42002d13fc00, 0x430bb430be10
2025-09-12T19:57:44.938Z cpu32:2098027)0x453adb59bf10:[0x42002d5a63db]PsaStorEventHandlerHelper@vmkernel#nover+0xa8 stack: 0x10edb5fd007c40, 0x10edb9799b7324, 0x453adb59f000, 0x4303c1e01220, 0x430bb43f7400
2025-09-12T19:57:44.956Z cpu32:2098027)0x453adb59bf60:[0x42002d15ba1c]HelperQueueFunc@vmkernel#nover+0x19d stack: 0x430bb43127c8, 0x453adb59f000, 0x453ac351f000, 0x453adb59f100, 0x0
2025-09-12T19:57:44.972Z cpu32:2098027)0x453adb59bfe0:[0x42002d6dc88e]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x42002d144fb0, 0x0, 0x0, 0x0
2025-09-12T19:57:44.985Z cpu32:2098027)0x453adb59c000:[0x42002d144faf]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0

  • I/O started to time out and PSA started to abort the I/O:

2025-09-12T19:51:44.753Z cpu10:5051296)NvmeDeviceIO: 1869: Start TSC for CmdSN 71a33 is 2386857226 ms now=2386977304 ms, hardTimeout=120000
2025-09-12T19:51:44.753Z cpu10:5051296)StorageDeviceIO: 5659: Task mgmt request issued to device t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______############### is stuck (WorldID 0, CmdSN 71a33). Issuing red notification to the application
 

  • Stuck I/O event reported on the device:

2025-09-12T19:51:44.753Z cpu10:5051296)StorageDeviceIO: 5697: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______###############
2025-09-12T19:51:44.753Z cpu10:5051296)StorageDeviceIO: 5738: Releasing sync semaphore for device t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______############### (WorldID 0, CmdSN 71a33). <---------------
2025-09-12T19:51:44.753Z cpu10:5051296)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x453b1061b488 CmdSN 0x71a33 world:0 controller 261 state:9 nsid:1
2025-09-12T19:51:44.753Z cpu10:5051296)NVMEIO:3974 Ctlr 261, ns 1, tmReq 0x431e51fb0f00, type 1, initiator 0x453b1061b488, sn 0x71a33, world id 0.
 

  • Controller in recovery mode:

2025-09-12T19:51:44.753Z cpu10:5051296)WARNING: NVMEIO:4011 Controller 261 in state 9 or in recovery mode, bail out.
2025-09-12T19:51:44.753Z cpu74:2101253)WARNING: NvmeDeviceIO: 1448: Couldn't issue sync command (opcode 2) on device 't10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______###############': Failure
2025-09-12T19:51:44.753Z cpu74:2101253)NvmeDeviceIO: 2726: SMART log page request failed for device t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______############### with H:0x0 D:0x0 P:0x0 
 

  • A virt reset was triggered for device t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______############### as part of the APD-to-PDL transition. Task management aborts are issued, and the reset then waits for all outstanding I/Os to be completed or aborted.


2025-09-12T19:51:38+00:00 vm1.example.com vmkernel: cpu15:2098027)NvmeDeviceIO: 3090: Issuing TaskMgmt virt reset to device t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______###############. worldId=0 cmdId.initiator=0x453b1061b488
<182> 2025-09-12T19:51:38+00:00 vm1.example.com vmkernel: cpu15:2098027)awaited.
<182> 2025-09-12T19:51:38+00:00 vm1.example.com vmkernel: cpu15:2098027)StorageDeviceIO: 3794: Waiting for completion for all issued commands for partition t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______###############:4294967295. Already waited 5 secs. 1 completions still
2025-09-12T19:51:38+00:00 vm1.example.com vmkernel: cpu15:2098027)StorageDeviceIO: 3814: PsaNvmeDeviceIssueTMs() failed while re-issuing TMs for partition t10.NVMe____Dell_Ent_NVMe_P5500_RI_U.2_3.84TB_______###############:4294967295, returned error (Failure)
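
To check whether a host's vmkernel log shows the same sequence, the signatures above can be searched for programmatically. The following is a minimal sketch, not a supported tool: the substrings are copied from the excerpts in this article, while the label names and the functions classify, timeout_overrun_ms and scan are hypothetical names introduced for illustration. It also assumes that hardTimeout is expressed in milliseconds, like the "is" and "now" values on the same log line.

```python
import re

# Substrings copied from the log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "stuck_io_event": "FDS_DEV_EVENT_REPORT_STUCK_IO",
    "task_mgmt_stuck": "Task mgmt request issued to device",
    "controller_in_recovery": "or in recovery mode, bail out",
    "virt_reset_issued": "Issuing TaskMgmt virt reset to device",
    "tm_reissue_failed": "PsaNvmeDeviceIssueTMs() failed",
}

# Matches the NvmeDeviceIO "Start TSC for CmdSN ..." line shown above.
TIMEOUT_RE = re.compile(
    r"Start TSC for CmdSN (?P<sn>[0-9a-fA-F]+) is (?P<start>\d+) ms "
    r"now=(?P<now>\d+) ms, hardTimeout=(?P<limit>\d+)"
)


def classify(line):
    """Return the labels of every signature found in one log line."""
    return [label for label, needle in SIGNATURES.items() if needle in line]


def timeout_overrun_ms(line):
    """Return (now - start) - hardTimeout for a timeout line, else None.

    Assumption: hardTimeout uses the same unit (ms) as the other two values.
    """
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    elapsed = int(match["now"]) - int(match["start"])
    return elapsed - int(match["limit"])


def scan(lines):
    """Count signature hits and collect timeout overruns over many lines."""
    counts = {label: 0 for label in SIGNATURES}
    overruns = []
    for line in lines:
        for label in classify(line):
            counts[label] += 1
        overrun = timeout_overrun_ms(line)
        if overrun is not None:
            overruns.append(overrun)
    return counts, overruns
```

Applied to the NvmeDeviceIO line in the symptoms, timeout_overrun_ms computes 2386977304 - 2386857226 = 120078 ms elapsed, which is 78 ms past the hardTimeout of 120000. scan accepts any iterable of lines, so an open log file object can be passed to it directly. A match on these strings only indicates a similar log pattern; it does not by itself confirm this issue.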
 

 

 

Environment

VMware vSAN 8.x

Cause

When commands time out, task management aborts are issued. If these aborts also remain stuck, the process escalates to a controller reset/recovery. In this case, the controller reset did not complete within the expected time; had it done so, the PSOD would not have occurred.

Resolution

A PSOD of this kind is rare and typically indicates a problem with the NVMe disk. Because this points to a hardware fault, have the hardware vendor perform a thorough hardware diagnostic. Additionally, a fix for this PSOD will be included in the upcoming 8.0.3 P07 release.