Symptoms:
- A stuck I/O event is detected on an NVMe disk, with no impact to the VMs.
- vSAN reports heartbeat timeout events for VM namespaces, with no impact to the running VMs:
2025-08-19T19:46:30.755Z In(14) vobd[2098144]: [vmfsCorrelator] 1536793531419us: [vob.vmfs.heartbeat.timedout] ##########-########-####-############ ##########-########-####-############
- The NVMe abort command itself also got stuck, so abort processing was escalated to a controller reset:
2025-08-19T19:46:48.789Z In(182) vmkernel: cpu16:2098241)NVMEDEV:8245 Resetting controller 261 (nqn.1994-11.com.samsung:nvme:PM1733:2.5-inch:##############)
2025-08-19T19:46:48.789Z In(182) vmkernel: cpu16:2098241)NVMEDEV:8260 Controller 261 state changed from 5 to 8(INRESET)
2025-08-19T19:46:48.801Z In(182) vmkernel: cpu85:2098267)NVMEDEV:8007 Reset admin queue (controller 261)
2025-08-19T19:46:48.801Z In(182) vmkernel: cpu85:2098267)NVMEDEV:8016 Controller 261, admin queue reset complete. Status Success
- The ESXi host has trouble communicating with its NVMe storage device. Commands are aborted or reset before they time out, so the host marks them with a transient status (VMK_STORAGE_RETRY_OPERATION) and retries the I/O:
2025-08-19T19:46:48.798Z In(182) vmkernel: cpu6:2098471)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x430bb43ef4c0 cmdId.serialNumber=0xb533e0c)
2025-08-19T19:46:48.798Z In(182) vmkernel: cpu6:2098471)NvmeUtil: 502: Transient status for command 0x2 set to VMK_STORAGE_RETRY_OPERATION because of an abort/reset before the command timed out: cmdId.initiator=0x430bb43ef4c0 cmdId.serialNumber=0xb533e2e)
- vSAN reports heartbeat recovered events for VM namespaces, with no impact to the running VMs:
2025-08-19T19:46:49.538Z In(14) vobd[2098144]: [vmfsCorrelator] 1536812314410us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume ##########-########-####-############ (##########-####-####-####-############): [Timeout] [HB state abcdef02 offset 4161536 gen 7 stampUS 1536812310470 uuid 688d617d-########-####-00620b1df240 jrnl <FB 15> drv 24.82]
- The device enters PDL (Permanent Device Loss):
2025-08-19T19:46:49.529Z In(182) vmkernel: cpu5:2098030)PLOG: PLOGLogDiskEvent:4135: Disk Event unplug for MD522330d1-####-####-####-e9f172aff543 (eui..################################::2)
2025-08-19T19:46:49.529Z Wa(180) vmkwarning: cpu5:2098030)WARNING: StorageDevice: 11908: PDL set on device path vmhba4:C0:T0:L0
2025-08-19T19:46:49.529Z In(182) vmkernel: cpu5:2098030)StorageDevice: 803: State change transition from state:All Paths Down Started, to state:Permanent device loss fordevice:eui..################################:
2025-08-19T19:46:49.529Z Wa(180) vmkwarning: cpu5:2098030)WARNING: StorageDevice: 5587: Device :eui..################################:has been removed or is permanently inaccessible.
2025-08-19T19:46:49.529Z In(182) vmkernel: cpu5:2098030)StorageDevice: 4087: Device state of eui...################################:set to APD_START; token num:1
2025-08-19T19:46:49.529Z Wa(180) vmkwarning: cpu5:2098030)WARNING: HPP: HppDeviceUpdateState:5279: Device 'eui..################################:' is changing to 'APD' from 'permanent device loss'.
2025-08-19T19:46:49.529Z In(182) vmkernel: cpu1:2098028)StorageApdHandlerEv: 106: Device or filesystem with identifier [eui..################################:] has entered the All Paths Down state.
2025-08-19T19:46:49.530Z Wa(180) vmkwarning: cpu4:2097303)WARNING: NvmeDeviceIO: 1737: Command 0x2 to device "eui..################################:" marked for PDL virtual reset completed with abort/reset: cmdId.initiator=0x430bb43ef4c0 cmdId
- LSOM event indicating the disk has gone offline:
2025-08-19T19:46:49.540Z Wa(180) vmkwarning: cpu22:2099543)WARNING: LSOM: LSOMEventNotify:9026: vSAN device 522330d1-####-####-####-e9f172aff543 has gone offline.
- I/O timeout detected:
2025-08-19T19:47:16.533Z In(182) vmkernel: cpu11:8324394)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1193: Setting devResState : dev:522330d1-####-####-####-e9f172aff543 cState: 5 nState: 11 isLSE: 0
2025-08-19T19:47:16.533Z Wa(180) vmkwarning: cpu11:8324394)WARNING: PLOG: PLOG_DeviceHandleIOTimeOut:8792: vSAN device 522330d1-####-####-####-e9f172aff543 detected I/O timeout error. This may lead to stuck I/O.
- Outstanding I/O detected:
2025-08-19T19:47:17.494Z In(182) vmkernel: cpu14:2098027)StorageDeviceIO: 3748: Number of outstanding I/Os is:54
- Outstanding I/O drops to 0 and PDL processing continues:
2025-08-19T19:47:23.546Z In(182) vmkernel: cpu11:2098027)StorageDeviceIO: 3748: Number of outstanding I/Os is:0
2025-08-19T19:47:23.546Z In(182) vmkernel: cpu11:2098027)StorageDeviceIO: 3751: All issued IOs completed for partition eui..################################::4294967295. Had to wait for 33700 msecs.
2025-08-19T19:47:23.546Z In(182) vmkernel: cpu11:2098027)PLOG: PLOGMDAPDCallback:2312: PDL already in process on MD:522330d1-####-####-####-###########
2025-08-19T19:47:23.546Z In(182) vmkernel: cpu11:2098027)StorageDevice: 10570: Device eui..################################:APD Notify PERM LOSS; token num:1
- After the queue is clear, the APD is cancelled:
2025-08-19T19:47:23.546Z In(182) vmkernel: cpu11:2098027)StorageApdHandler: 1307: APD cancelled for 0x430bb438cdf0 [eui..################################:]
- After stale handles are cleaned up, re-adding the device starts:
2025-08-19T19:47:23.661Z In(182) vmkernel: cpu9:2098888)PLOG: PLOG_RecoveredDisksInsert:420: Adding device 522330d1-####-####-####-e9f172aff543 state to recovered devices 0x4502ee436b08
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu9:2098888)PLOG: PLOGUnregisterAPDCallback:2472: Successfully unregistered APD event for vSAN device: 522330d1-####-####-####-e9f172aff543
- Stuck I/O events are unregistered:
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu9:2098888)PsaStorEvents: 477: EventSubsystem: Device Events - Internal, Event Mask: 11, Parameter: 0x450280032c08, UnRegistered!
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu9:2098888)PLOG: PLOGUnregisterEventStuckIO:9104: Successfully unregistered device event to detect stuck I/Os for vSAN device: 522330d1-####-####-####-e9f172aff543
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu9:2098888)PsaStorEvents: 477: EventSubsystem: Device Events - Internal, Event Mask: 12, Parameter: 0x450280032fd8, UnRegistered!
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu9:2098888)PLOG: PLOGUnregisterEventErrCategorization:9053: Successfully unregistered error categorization event for vSAN device: 522330d1-####-####-####-e9f172aff543
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu9:2098888)StorageDevice: 9805: Unplug request for eui..################################:
- The APD handle for the disk is freed and the device is initialized:
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu12:2098034)StorageApdHandler: 1051: Freeing APD handle 0x430bb438cdf0 [eui.################################]
2025-08-19T19:47:23.685Z In(182) vmkernel: cpu12:2098034)StorageApdHandler: 1135: APD Handle freed!
2025-08-19T19:48:17.531Z In(182) vmkernel: cpu68:2098807)StorageDevice: 1796: Adding uid for device eui..################################:= t10.NVMe____Dell_Ent_NVMe_v2_AGN_RI_U.2_3.84TB______##############______########
2025-08-19T19:48:17.534Z In(182) vmkernel: cpu68:2098807)StorageDevice: 1903: Successfully registered device "eui..################################:" from plugin "HPP" of type 0
2025-08-19T19:48:17.543Z In(182) vmkernel: cpu1:2098024)PLOG: PLOG_InitDevice:273: Initialized device M eui..################################::1 0x450283f66e08 quiesceTask 0x4502ee2729f8 on SSD 52472f6c-####-###-####-############bdeviceUUID 00000000-0000-0000-0000-00000000$
- All APD events are unregistered successfully:
2025-08-19T19:48:18.555Z In(182) vmkernel: cpu12:2098024)PLOG: PLOGUnregisterAPDCallback:2472: Successfully unregistered APD event for vSAN device: 522330d1-####-####-####-e9f172aff543
- The device is added back to vSAN and comes online:
2025-08-19T19:48:18.566Z In(182) vmkernel: cpu12:2098024)PLOG: PLOGInitAndAnnounceMD:10624: Successfully announced VSAN MD with UUID: 522330d1-####-####-####-e9f172aff543. kt 1, en 0, enC 0.
2025-08-19T19:48:18.566Z In(182) vmkernel: cpu12:2098024)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1193: Setting devResState : dev:522330d1-####-####-####-e9f172aff543 cState: 11 nState: 1 isLSE: 0
2025-08-19T19:48:18.566Z Wa(180) vmkwarning: cpu12:2098024)WARNING: PLOG: PLOGProcessHotpluggedDevice:10884: vSAN device 522330d1-####-####-####-e9f172aff543 has come online.
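
To check whether a host shows the same sequence, the log lines can be reduced to a short timeline. The following is a minimal Python sketch, not a supported tool: the marker strings are copied from the excerpts above, while the event labels, function names, and the idea of passing log file paths on the command line are assumptions made for illustration.

```python
import re
import sys
from datetime import datetime

# Substrings copied from the log excerpts above. The labels on the left
# are illustrative names, not identifiers used by ESXi or vSAN.
MARKERS = [
    ("heartbeat_timedout", "vob.vmfs.heartbeat.timedout"),
    ("controller_reset", "Resetting controller"),
    ("io_retry", "VMK_STORAGE_RETRY_OPERATION"),
    ("pdl_set", "PDL set on device path"),
    ("vsan_offline", "has gone offline"),
    ("io_timeout", "detected I/O timeout error"),
    ("apd_cancelled", "APD cancelled"),
    ("vsan_online", "has come online"),
]

# Timestamp at the start of each line, e.g. 2025-08-19T19:46:49.540Z
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s")


def parse_ts(line):
    """Return the leading timestamp as a datetime, or None if absent."""
    match = TS_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


def build_timeline(lines):
    """Return sorted (timestamp, label) pairs for lines matching a marker."""
    events = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        for label, needle in MARKERS:
            if needle in line:
                events.append((ts, label))
                break
    return sorted(events)


def offline_seconds(events):
    """Seconds between the first 'gone offline' and the next 'come online'."""
    start = next((ts for ts, label in events if label == "vsan_offline"), None)
    if start is None:
        return None
    end = next(
        (ts for ts, label in events if label == "vsan_online" and ts >= start),
        None,
    )
    if end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    # Assumption: log files copied off the host are passed as arguments.
    all_lines = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            all_lines.extend(handle)
    timeline = build_timeline(all_lines)
    for ts, label in timeline:
        print(ts.isoformat(timespec="milliseconds"), label)
    gap = offline_seconds(timeline)
    if gap is not None:
        print(f"vSAN device offline for {gap:.3f} s")
```

For the excerpts above, the device is reported offline at 19:46:49.540 and online again at 19:48:18.566, a gap of roughly 89 seconds. The sketch does not filter by device UUID, so on a host with several affected disks the timeline would need to be split per device.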