vSAN Host PSODs with generic PSOD message (TimerInsert, Timer_AddTCWithLockDomain) and vSAN disk related events were noticed prior to the PSOD.


Article ID: 407254

Products

VMware vSAN

Issue/Introduction

Symptoms:

 

  • vSAN host fails with the below backtrace:
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bdd0:[0x42000c58b8c5]TimerInsert@vmkernel#nover+0x61 stack: 0x453a4ec1f300, 0x0, 0x42000cad4e50, 0x19fc50542d3d1, 0xffffffffffffffff
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bde0:[0x42000c58c958]Timer_AddTCWithLockDomain@vmkernel#nover+0x14d stack: 0x42000cad4e50, 0x19fc50542d3d1, 0xffffffffffffffff, 0x0, 0x41ffcc8507e0
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1be60:[0x42000c58ca37]Timer_AddTC@vmkernel#nover+0x24 stack: 0x41ffcc518a40, 0x2001, 0x453a4ec1bee0, 0x19fc4f5ad64fd, 0x453a4ec1bee0
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bea0:[0x42000cad9194]CpuSchedSleepUntilTC@vmkernel#nover+0x99 stack: 0x2001, 0x453a4ec1bee0, 0x43030640ab90, 0x1168f9a5255fb, 0x1
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bf60:[0x42000c55f0e1]IntrCookieRetireLoop@vmkernel#nover+0x1f2 stack: 0x180, 0x453a000000e0, 0x42004c000250, 0x94, 0x30
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1bfe0:[0x42000cad67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x42000c544cf0, 0x0, 0x0, 0x0
2025-07-29T03:55:42.204Z cpu48:2097624)0x453a4ec1c000:[0x42000c544cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
  • The vmkernel logs may show a high volume of NVMe-related errors and warnings leading up to the PSOD. The repeated messages suggest instability with the NVMe devices.
2025-07-29T01:55:28.645Z In(182) vmkernel: cpu0:2098150)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x########## cmdId.serialNumber=0x#####)
2025-07-29T01:55:28.645Z In(182) vmkernel: cpu0:2098150)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x########## cmdId.serialNumber=0x#####)
2025-07-29T01:55:28.645Z In(182) vmkernel: cpu0:2098150)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x########## cmdId.serialNumber=0x#####)
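The rate of these retry messages is a useful indicator of how unstable the device was before the PSOD. The sketch below is a hypothetical Python illustration (not a VMware tool); the sample line and regular expression are assumptions based on the messages above, and on a live host the input would be read from the vmkernel log instead:

```python
import re
from collections import Counter

# Sample vmkernel log lines (IDs redacted, as in the excerpt above).
LOG_LINES = [
    "2025-07-29T01:55:28.645Z In(182) vmkernel: cpu0:2098150)NvmeUtil: 502: "
    "Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION "
    "because of an abort/reset before the command timed out: "
    "cmdId.initiator=0x########## cmdId.serialNumber=0x#####)",
] * 3

# Capture the timestamp, truncated to the minute, of each NvmeUtil retry message.
RETRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}).*NvmeUtil.*VMK_STORAGE_RETRY_OPERATION"
)

def retries_per_minute(lines):
    """Count VMK_STORAGE_RETRY_OPERATION messages grouped by minute."""
    counts = Counter()
    for line in lines:
        m = RETRY_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

print(retries_per_minute(LOG_LINES))  # Counter({'2025-07-29T01:55': 3})
```

A sustained, high per-minute count in the hour before the crash points at the device-instability pattern described in this article.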
  • Prior to the PSOD, the vmkernel logs show the following messages: an I/O timeout occurred, which triggered NVMe task management abort processing to cancel the timed-out command.
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu68:2097718)NvmeDeviceIO: 1853: Start TSC for CmdSN d841 is 169017056 ms
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu68:2097718)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x453a7261b488 CmdSN 0xd841 world:0 controller 260 state:5 nsid:1
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu68:2097718)NVMEIO:3971 Ctlr 260, ns 1, tmReq 0x4321d9f74d80, type 1, initiator 0x453a7261b488, sn 0xd841, world id 0.
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu57:2097907)NVMEIO:4614 ctlr 260, queue 0, cid 288, cap 0x3, count 0, found cmd 0x45daa2ac18c0 (initiator 0x453a7261b488, serialNumber 0xd841, worldID 0)
2025-07-29T01:55:21.752Z In(182) vmkernel: cpu57:2097907)NVMEIO:4730 Issuing command to cancel cmd 0x############ (tag 0x0) on queue 0, tracker 0x4321d9d78f20, cid 288
2025-07-29T01:55:23.754Z In(182) vmkernel: cpu68:2097718)NvmeDeviceIO: 1857: Start TSC for CmdSN d841 is 169017056 ms now=169019058 ms, hardTimeout=120000
2025-07-29T01:55:23.754Z In(182) vmkernel: cpu68:2097718)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x############ CmdSN 0xd841 world:0 controller 260 state:5 nsid:1
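The NvmeDeviceIO lines above carry the command's start time, the current time, and the hard timeout, so the elapsed time of the stuck command can be read directly from them. Below is a minimal parsing sketch (a hypothetical Python illustration; the field names come from the log format above):

```python
import re

# Log line as in the trace above: start TSC, current time, and the hard timeout.
line = ("2025-07-29T01:55:23.754Z In(182) vmkernel: cpu68:2097718)"
        "NvmeDeviceIO: 1857: Start TSC for CmdSN d841 is 169017056 ms "
        "now=169019058 ms, hardTimeout=120000")

m = re.search(r"is (\d+) ms now=(\d+) ms, hardTimeout=(\d+)", line)
start_ms, now_ms, hard_timeout_ms = map(int, m.groups())

# How long the command has been outstanding versus the driver's hard limit.
elapsed_ms = now_ms - start_ms
print(elapsed_ms)        # 2002 ms elapsed
print(hard_timeout_ms)   # 120000 ms hard timeout
```

Here the command had been outstanding for about 2 seconds, well under the 120-second hard timeout, when the abort processing began.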
  • However, the NVMe abort command itself also became stuck, so abort processing was escalated to a controller reset, causing the I/O queues to be deleted.
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEIO:4583 Ctlr 260, abort commands stuck, escalate to controller reset
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:8194 Resetting controller 260 (nqn.####-##.###.nvmexpress_####_Dell_Express_Flash_NVMe_P4800X_375GB_SFFPHKE214000BC375AGN)
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:8209 Controller 260 state changed from 5 to 8(INRESET)
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:3676 Unbinding world 2098150 with interrupt cookie 0x15f for controller 260 queue 1...
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu27:2097909)NVMEDEV:3676 Unbinding world 2098151 with interrupt cookie 0x160 for controller 260 queue 2...
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu16:2097908)NVMEDEV:2166 Ctlr 260, deleting queue 1
2025-07-29T01:55:27.758Z In(182) vmkernel: cpu62:2097903)NVMEDEV:2166 Ctlr 260, deleting queue 2
  • As part of the controller reset process, the controller must first be disabled, but the disable action itself also timed out. This usually indicates a hardware issue.
2025-07-29T01:55:33.768Z Wa(180) vmkwarning: cpu27:2097909)WARNING: NVMEDEV:3037 Controller cannot be disabled, status: Timeout
2025-07-29T01:55:33.768Z Wa(180) vmkwarning: cpu27:2097909)WARNING: NVMEDEV:8261 Failed to disable controller 260, status: Timeout
2025-07-29T01:55:33.768Z In(182) vmkernel: cpu27:2097909)NVMEDEV:9552 Request to start controller 260 recovery
2025-07-29T01:55:33.768Z In(182) vmkernel: cpu27:2097909)NVMEDEV:9583 Starting controller 260 recovery.
  • As the controller reset failed, the controller entered recovery mode. From then on, the recovery process kept running (retrying the controller reset), but the controller never recovered because the same failure (controller disable timeout) recurred each time.
2025-07-29T01:55:40.712Z Wa(180) vmkwarning: cpu2:2097909)WARNING: NVMEDEV:3045 Controller cannot be disabled, status: Timeout 
2025-07-29T01:56:31.787Z Wa(180) vmkwarning: cpu5:2097909)WARNING: NVMEDEV:3045 Controller cannot be disabled, status: Timeout 
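A recovery loop that never succeeds shows up as the same disable-timeout warning repeating at roughly regular intervals. The sketch below is a hypothetical Python illustration (not a VMware tool) that extracts the timestamps of those warnings and reports the gap between consecutive recovery attempts; the sample lines are the two warnings above:

```python
import re
from datetime import datetime

# Warnings as in the trace above: the same disable timeout recurring over time.
WARNINGS = [
    "2025-07-29T01:55:40.712Z Wa(180) vmkwarning: cpu2:2097909)WARNING: "
    "NVMEDEV:3045 Controller cannot be disabled, status: Timeout",
    "2025-07-29T01:56:31.787Z Wa(180) vmkwarning: cpu5:2097909)WARNING: "
    "NVMEDEV:3045 Controller cannot be disabled, status: Timeout",
]

# Capture the ISO timestamp (without the trailing Z) of each matching warning.
TS_RE = re.compile(r"^(\S+)Z .*Controller cannot be disabled, status: Timeout")

def disable_timeout_intervals(lines):
    """Return seconds between consecutive controller-disable timeout warnings."""
    stamps = [datetime.fromisoformat(m.group(1))
              for line in lines if (m := TS_RE.match(line))]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

print(disable_timeout_intervals(WARNINGS))  # [51.075]
```

A steady stream of these warnings at ~1-minute intervals, with no "recovered" message in between, matches the stuck-recovery state described in this article.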
  • Eventually the host fails with a PSOD showing the backtrace above.

Environment

VMware vSAN 8.x

Cause

I/O timeouts triggered NVMe task management abort processing to cancel the timed-out commands; however, the abort command itself became stuck. As part of the resulting controller reset, the controller must first be disabled, but the disable action also timed out, which usually indicates a hardware issue. As a result, the NVMe controller reset fails and the driver puts the controller into the FAILED state, so all I/Os issued to the device fail.

The PSOD occurs due to improper handling of the timed-out I/Os.

Resolution

This is a rare occurrence and is expected to be resolved in an upcoming release.

For the NVMe disk/controller on which the reset is seen, engage the hardware vendor and have the disk replaced. Run hardware diagnostics to verify the controller's health.

Contact Broadcom Technical Support for more details.