vSAN Host PSODs with a failing disk

Products

VMware vSAN

Issue/Introduction

The vSAN host fails with a PSOD showing the following backtrace:

0x0000420031dc0bf2 in DDPInitDev (idx=0, ctx=0x4314f5201220) at bora/modules/vmkernel/plog/dedup/dedup.c:9117
DDP_InitDev (h=h@entry=0x4314f5201220, dev=dev@entry=0x450200047c48) at bora/modules/vmkernel/plog/dedup/dedup.c:9372
0x0000420031d2b6d1 in PLOGDedupInitDev (elem=<optimized out>, private=<optimized out>) at bora/modules/vmkernel/plog/plog.c:10328
0x0000420031b0b7c3 in VSANUUIDTable_Iterate (table=0x450200003550, itrFn=itrFn@entry=0x420031d2b330 <PLOGDedupInitDev>, private=private@entry=0x0) at bora/modules/vmkernel/vsanutil/vsan_uuid_table.c:305
0x0000420031d2c612 in PLOGDeviceRecoveryDone (fromHelper=<optimized out>) at bora/modules/vmkernel/plog/plog.c:10797
0x0000420031d4b7be in PLOGDeviceRecoveryCompleteHelper (data=0x450204794da8) at bora/modules/vmkernel/plog/plog.c:10829
0x000042003015b8c0 in HelperProcessRequest (prevIRQL=<synthetic pointer>, helper=0x450200002b30, queue=0x450200002510) at bora/vmkernel/main/helper.c:599
HelperQueueFunc (data=0x450200002b30) at bora/vmkernel/main/helper.c:671
0x00004200306d67b3 in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:15324
0x0000420030144cf0 in ?? () at bora/vmkernel/main/debug.c:4125
0x0000000000000000 in ?? ()

In vmkernel.log, the following messages are seen just prior to the PSOD event:

2025-06-26T11:50:53.815Z Wa(180) vmkwarning: cpu1:2097696)WARNING: HPP: HppDeviceUpdateState:5269: Device 't10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################' is changing to 'APD' from 'on'.
2025-06-26T11:50:53.815Z In(182) vmkernel: cpu1:2097696)StorageDevice: 803: State change transition from state:Registered, to state:All Paths Down Started fordevice:t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################
2025-06-26T11:50:53Z In(182) vmkernel:
2025-06-26T11:50:53.815Z In(182) vmkernel: cpu1:2097696)PLOG: PLOGLogDiskEvent:4135: Disk Event unplug for MD 52e07f1c-b35a-2211-fde4-############ (t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################:2)
2025-06-26T11:50:53.815Z Wa(180) vmkwarning: cpu1:2097696)WARNING: StorageDevice: 11908: PDL set on device path vmhba#:C0:T0:L0
2025-06-26T11:50:53.815Z In(182) vmkernel: cpu1:2097696)StorageDevice: 803: State change transition from state:All Paths Down Started, to state:Permanent device loss fordevice:t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################
2025-06-26T11:50:53Z In(182) vmkernel:
2025-06-26T11:50:53.815Z Wa(180) vmkwarning: cpu1:2097696)WARNING: StorageDevice: 5587: Device :t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################ has been removed or is permanently inaccessible.

Additionally, the vmhba of the failing disk is seen being repeatedly removed:

2025-06-26T11:50:52.824Z Wa(180) vmkwarning: cpu41:2097932)WARNING: NvmeDiscover: 5489: Mark path vmhba#:C0:T0:L0 as NO_CONNECT
2025-06-26T11:50:52.824Z In(182) vmkernel: cpu41:2097932)NVMEPSA:1631 adpater: vmhba#, action: 1
2025-06-26T11:50:52.824Z In(182) vmkernel: cpu41:2097932)NvmeAdapter: 3015: Unregistering adapter vmhba#
2025-06-26T11:50:52.824Z In(182) vmkernel: cpu41:2097932)StoragePsaDriver: 634: device 0x3147430daee2544f Detach complete [status=Success]
2025-06-26T11:50:52.824Z In(182) vmkernel: cpu41:2097932)Device: 412: storage_psa:driver->ops.detachDevice:0 ms
2025-06-26T11:50:52.824Z In(182) vmkernel: cpu41:2097932)Device: 1721: Unregistered device: 0x430daee01220 logical#pci#p0000:af:00.0#0#0 com.vmware.StorHBAPort
2025-06-26T11:50:52.824Z Wa(180) vmkwarning: cpu41:2097932)WARNING: NvmeAdapter: 3119: Releasing adapter vmhba#
2025-06-26T11:50:52.824Z In(182) vmkernel: cpu41:2097932)Device: 676: nvmeBusDriver:ops->removeDevice:0 ms

WARNING: HPP: HppNvmeThrottleLogForDevice:600: NVMe Cmd 0x2 (0x45df050d8800, 0) to dev "t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################" on path "vmhba#:C0:T0:L0" Failed:
2025-06-28T15:54:40.704Z cpu36:2097280)WARNING: HPP: HppNvmeThrottleLogForDevice:608: Error status H:0xe D:0x0 P:0x0 hppAction = 2 <== (hppAction = HPP_PATH_ACTION_FAILOVER, H:VMK_NVME_HOST_STATUS_VMW_NO_CONNECT)

2025-06-28T15:54:41.215Z cpu8:2097696)HPP: HppPathGroupMovePath:688: Path "vmhba#:C0:T0:L0" state changed from "active" to "dead"
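
To check a host's vmkernel.log for the messages shown above, a small script along the following lines may help. It is a minimal sketch: the signature strings are copied from the excerpts in this article, while the names SIGNATURES, count_signatures and scan_file are hypothetical helpers, and the log path passed to scan_file is an assumption that depends on where the log or support bundle is located. A match only shows that the device-loss messages are present; it does not by itself confirm this issue.

```python
from collections import Counter

# Substrings copied from the vmkernel.log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "apd": "is changing to 'APD'",
    "pdl": "PDL set on device path",
    "unplug": "Disk Event unplug for MD",
    "device_lost": "has been removed or is permanently inaccessible",
    "no_connect": "as NO_CONNECT",
    "adapter_unregister": "Unregistering adapter",
    "path_dead": 'state changed from "active" to "dead"',
}


def count_signatures(lines):
    """Count how many lines contain each known signature substring."""
    counts = Counter()
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts


def scan_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as fh:
        return count_signatures(fh)
```

For example, scan_file("vmkernel.log") returns a Counter keyed by the labels above. A high adapter_unregister or no_connect count would be consistent with the vmhba being repeatedly removed.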

And in vobd.log:

2025-06-26T11:50:53.816Z In(14) vobd[2097811] [psastorCorrelator] 4351169685188us: [vob.psastor.device.state.permanentloss] Device :t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################ has been removed or is permanently inaccessible.
2025-06-26T11:50:53.816Z In(14) vobd[2097811] [psastorCorrelator] 4351354214140us: [esx.problem.psastor.device.state.permanentloss] Device: t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___################ has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.
2025-06-26T11:50:53.825Z In(14) vobd[2097811] [vSANCorrelator] 4351169695301us: [vob.vsan.pdl.offline] vSAN device 52e07f1c-b35a-2211-fde4-############ has gone offline.
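
The vobd.log entries above carry an event ID in square brackets (for example vob.vsan.pdl.offline) followed by the message text. The following sketch pulls both out of a line so that events can be filtered or grouped. The regular expression is an assumption based only on the three sample lines in this article, and parse_vob_event is a hypothetical helper name; other vobd.log line formats may need a different pattern.

```python
import re

# Matches the first bracketed token that starts with "vob." or "esx.",
# then captures the rest of the line as the message text.
# Pattern derived from the sample lines in this article only.
VOB_EVENT = re.compile(r"\[((?:vob|esx)\.[A-Za-z0-9_.]+)\]\s*(.*)$")


def parse_vob_event(line):
    """Return (event_id, message) for a vobd.log line, or None if no event ID is found."""
    match = VOB_EVENT.search(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Filtering the parsed results for the event IDs vob.psastor.device.state.permanentloss and vob.vsan.pdl.offline gives the same pairing of device loss and vSAN device offline events shown in the excerpt.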

Environment

VMware vSAN OSA (All Versions)

Cause

This is caused by an underlying device issue that exposes a synchronization discrepancy between the disk-group recovery thread and the transient error handling workflow. As a result, the disk-group handle is accessed after it has already been removed by the transient error handling thread, which causes the host to PSOD.

This issue is not specific to NVMe disks; it can occur with any disk type certified for vSAN.
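
The ordering problem described in this section can be illustrated with a toy model. This is a conceptual sketch only and is not vSAN code: DiskGroupTable, init_dev_unchecked and init_dev_checked are hypothetical names, and the RuntimeError merely stands in for the failure. It shows one path holding a handle reference while another path removes that handle, and how performing the lookup and the use under the same lock avoids touching a removed handle.

```python
import threading


class DiskGroupTable:
    """Toy registry of disk-group handles (hypothetical, not vSAN code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handles = {}

    def add(self, uuid):
        with self._lock:
            self._handles[uuid] = {"uuid": uuid, "removed": False}

    def lookup(self, uuid):
        # Returns a reference the caller may keep after the lock is released.
        with self._lock:
            return self._handles.get(uuid)

    def remove(self, uuid):
        # Models the transient error handling path tearing down the handle.
        with self._lock:
            handle = self._handles.pop(uuid, None)
            if handle is not None:
                handle["removed"] = True

    def init_dev_checked(self, uuid):
        # Lookup and use happen under one lock, so a removed handle is skipped.
        with self._lock:
            handle = self._handles.get(uuid)
            if handle is None:
                return None
            return "initialized " + handle["uuid"]


def init_dev_unchecked(handle):
    # Models a recovery path using a handle captured earlier without rechecking.
    if handle["removed"]:
        raise RuntimeError("disk-group handle used after removal")
    return "initialized " + handle["uuid"]
```

In this model, calling lookup, then remove, then init_dev_unchecked on the stale reference raises the error, while init_dev_checked returns None for the removed disk group.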

Resolution

This is a rare occurrence and is expected to be resolved in an upcoming release.

For the failed disk, engage the hardware vendor to have the disk replaced, and run hardware diagnostics to verify the health of the controller.