2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bdd0:[0x42000c58b8c5]TimerInsert@vmkernel#nover+0x61 stack: 0x453a4ec1f300, 0x0, 0x42000cad4e50, 0x19fc50542d3d1, 0xffffffffffffffff
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bde0:[0x42000c58c958]Timer_AddTCWithLockDomain@vmkernel#nover+0x14d stack: 0x42000cad4e50, 0x19fc50542d3d1, 0xffffffffffffffff, 0x0, 0x41ffcc8507e0
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1be60:[0x42000c58ca37]Timer_AddTC@vmkernel#nover+0x24 stack: 0x41ffcc518a40, 0x2001, 0x453a4ec1bee0, 0x19fc4f5ad64fd, 0x453a4ec1bee0
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bea0:[0x42000cad9194]CpuSchedSleepUntilTC@vmkernel#nover+0x99 stack: 0x2001, 0x453a4ec1bee0, 0x43030640ab90, 0x1168f9a5255fb, 0x1
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bf60:[0x42000c55f0e1]IntrCookieRetireLoop@vmkernel#nover+0x1f2 stack: 0x180, 0x453a000000e0, 0x42004c000250, 0x94, 0x30
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bfe0:[0x42000cad67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x42000c544cf0, 0x0, 0x0, 0x0
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1c000:[0x42000c544cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
2025-07-29T01:55:28.645Z In(182) vmkernel: cpu0:2098150)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x########## cmdId.serialNumber=0x#####)
2025-07-29T01:55:28.645Z In(182) vmkernel: cpu0:2098150)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x########## cmdId.serialNumber=0x#####)
2025-07-29T01:55:28.645Z In(182) vmkernel: cpu0:2098150)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x########## cmdId.serialNumber=0x#####)
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu68:2097718)NvmeDeviceIO: 1853: Start TSC for CmdSN d841 is 169017056 ms
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu68:2097718)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x453a7261b488 CmdSN 0xd841 world:0 controller 260 state:5 nsid:1
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu68:2097718)NVMEIO:3971 Ctlr 260, ns 1, tmReq 0x4321d9f74d80, type 1, initiator 0x453a7261b488, sn 0xd841, world id 0.
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu57:2097907)NVMEIO:4614 ctlr 260, queue 0, cid 288, cap 0x3, count 0, found cmd 0x45daa2ac18c0 (initiator 0x453a7261b488, serialNumber 0xd841, worldID 0)
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu57:2097907)NVMEIO:4730 Issuing command to cancel cmd 0x############ (tag 0x0) on queue 0, tracker 0x4321d9d78f20, cid 288
2025-07-29T01:55:23.754Z In(182) vmkernel: cpu68:2097718)NvmeDeviceIO: 1857: Start TSC for CmdSN d841 is 169017056 ms now=169019058 ms, hardTimeout=120000
2025-07-29T01:55:23.754Z In(182) vmkernel: cpu68:2097718)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x############ CmdSN 0xd841 world:0 controller 260 state:5 nsid:1
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEIO:4583 Ctlr 260, abort commands stuck, escalate to controller reset
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:8194 Resetting controller 260 (nqn.####-##.###.nvmexpress_####_Dell_Express_Flash_NVMe_P4800X_375GB_SFFPHKE214000BC375AGN)
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:8209 Controller 260 state changed from 5 to 8(INRESET)
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:3676 Unbinding world 2098150 with interrupt cookie 0x15f for controller 260 queue 1...
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:3676 Unbinding world 2098151 with interrupt cookie 0x160 for controller 260 queue 2...
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu16:2097908)NVMEDEV:2166 Ctlr 260, deleting queue 1
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu62:2097903)NVMEDEV:2166 Ctlr 260, deleting queue 2
2025-07-29T01:55:33.768Z Wa(180) vmkwarning: cpu27:2097909)WARNING: NVMEDEV:3037 Controller cannot be disabled, status: Timeout
2025-07-29T01:55:33.768Z Wa(180) vmkwarning: cpu27:2097909)WARNING: NVMEDEV:8261 Failed to disable controller 260, status: Timeout
2025-07-29T01:55:33.768Z In(182) vmkernel: cpu27:2097909)NVMEDEV:9552 Request to start controller 260 recovery
2025-07-29T01:55:33.768Z In(182) vmkernel: cpu27:2097909)NVMEDEV:9583 Starting controller 260 recovery.
2025-07-29T01:55:40.712Z Wa(180) vmkwarning: cpu2:2097909)WARNING: NVMEDEV:3045 Controller cannot be disabled, status: Timeout
2025-07-29T01:56:31.787Z Wa(180) vmkwarning: cpu5:2097909)WARNING: NVMEDEV:3045 Controller cannot be disabled, status: Timeout
VMware vSAN 8.x
I/O timeouts triggered NVMe task-management (taskMgmt) abort processing to cancel the timed-out I/Os, but the NVMe abort command itself became stuck, so the driver escalated to a controller reset. As part of the reset process, the controller must first be disabled, but the disable action also timed out. This usually indicates a hardware issue. As a result, the NVMe controller reset fails and the driver puts the controller into the FAILED state, so all I/Os issued to the device fail.
The PSOD occurs because of improper handling of the timed-out I/Os.
This is a rare occurrence and is expected to be resolved in an upcoming release.
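The failure sequence described above (stuck abort, escalation to controller reset, disable timeout) can be checked against a host's vmkernel log with a small script. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and the match strings are copied from the log excerpt in this article, so they may differ on other driver or ESXi versions.

```python
import re

# Message fragments copied verbatim from the log excerpt in this article.
# Assumption: other driver/ESXi builds may word these messages differently.
SIGNATURES = {
    "abort_stuck": "abort commands stuck, escalate to controller reset",
    "reset_started": "Resetting controller",
    "disable_timeout": "Controller cannot be disabled, status: Timeout",
    "disable_failed": "Failed to disable controller",
}

# Matches "Ctlr 260" or "controller 260" and captures the controller number.
CTLR_RE = re.compile(r"(?:Ctlr|controller) (\d+)", re.IGNORECASE)


def scan_vmkernel_log(lines):
    """Return a dict mapping each signature name to the lines that contain it."""
    hits = {name: [] for name in SIGNATURES}
    for line in lines:
        for name, fragment in SIGNATURES.items():
            if fragment in line:
                hits[name].append(line.rstrip("\n"))
    return hits


def matches_issue(hits):
    """True when both the stuck-abort escalation and a disable timeout appear."""
    return bool(hits["abort_stuck"]) and bool(hits["disable_timeout"])


def affected_controllers(hits):
    """Collect controller numbers from the lines that name a controller."""
    ids = set()
    for line in hits["abort_stuck"] + hits["disable_failed"]:
        match = CTLR_RE.search(line)
        if match:
            ids.add(match.group(1))
    return ids
```

To use it, pass any iterable of log lines, for example an open file handle for a copy of the host's vmkernel log, to `scan_vmkernel_log`, then call `matches_issue` and `affected_controllers` on the result. A match only shows that the log signature is present; it does not replace hardware diagnostics.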
For the NVMe disk/controller on which the reset is seen, engage the hardware vendor to have the disk replaced, and run hardware diagnostics to verify the controller's health.
Contact Broadcom Technical Support for more details.