Warning: 'One of the disks is detected with PDL in vSAN ESA Cluster. Please check the host for further details' on vSAN ESA cluster after updating the BIOS and Firmware on server.

Article ID: 398881

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptom:

After a BIOS and firmware update on the server, NVMe drives begin reporting permanent device loss (PDL).
The vSAN node shows the following warning in vCenter: "One of the disks is detected with PDL in vSAN ESA Cluster. Please check the host for further details"

vSAN objects become inaccessible, causing VMs to crash.
vMotion of a VM fails with the error "Module VM power on failed".
After the vSAN node is rebooted, disks are either missing or not reported as healthy in the hardware interface.

Environment

VMware vSAN 8.x

Cause

This issue typically occurs when the storage device becomes completely inaccessible at the hardware layer, often because of stuck I/O operations or an underlying physical disk fault.

In /var/run/log/vmkernel.log, entries similar to the following appear:

YYYY-MM-DDTHH:MM:SS Wa(180) vmkwarning: cpu23:2097647)WARNING: HPP: HppDeviceUpdateState:5269: Device 't10.NVMe____######################______________________########' is changing to 'APD' from 'permanent device loss'.
YYYY-MM-DDTHH:MM:SS Wa(180) vmkwarning: cpu11:2097644)WARNING: NvmeDeviceIO: 1725: Command 0x9 to device "t10.NVMe____######################______________________########" marked for PDL virtual reset completed with abort/reset: cmdId
YYYY-MM-DDTHH:MM:SS Wa(180) vmkwarning: cpu11:2097644)WARNING: initiator=0x4309e6c2ec40 cmdId.serialNumber=0x2be7)
YYYY-MM-DDTHH:MM:SS Wa(180) vmkwarning: cpu11:2097644)WARNING: NvmeUtil: 151: Error on Cmd(0x45bf0d66cd40) 0x9, CmdSN 0x2be7 from world 0 to component "t10.NVMe____######################______________________########" H:0xe D:0x0 P:0x0
YYYY-MM-DDTHH:MM:SS Wa(180) vmkwarning: cpu2:2100570)WARNING: WOBTREE: vmkio_unmap:1334: GOTO_ON_ERROR [195887410/0xbad0132/Device is permanently unavailable]
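
As a rough aid for spotting these entries in a large log, the following Python sketch groups PDL-related lines by the device they name. It is only an illustration: the marker strings are copied from the excerpts above, and the function name find_pdl_lines and the assumption that NVMe device identifiers appear in quotes and start with "t10.NVMe" are choices made for this example, not part of any VMware tooling.

```python
import re
from collections import defaultdict

# Substrings copied from the vmkernel.log excerpts in this article.
PDL_MARKERS = (
    "permanent device loss",
    "marked for PDL",
    "Device is permanently unavailable",
)

# Assumption: the device identifier is quoted and starts with "t10.NVMe".
DEVICE_RE = re.compile(r"""['"](t10\.NVMe[^'"]*)['"]""")


def find_pdl_lines(lines):
    """Group log lines that contain a PDL marker by the device they name.

    Lines with a marker but no recognizable device go under "unknown".
    """
    hits = defaultdict(list)
    for line in lines:
        if not any(marker in line for marker in PDL_MARKERS):
            continue
        match = DEVICE_RE.search(line)
        device = match.group(1) if match else "unknown"
        hits[device].append(line.rstrip("\n"))
    return dict(hits)
```

The function accepts any iterable of lines, so an open file handle for a copy of vmkernel.log can be passed directly. Lines that are grouped under "unknown" (such as the WOBTREE entry) still need to be correlated with a device by timestamp.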

In the vobd logs, events show that vSAN marked the disk offline and started a repair because of I/O failures:

YYYY-MM-DDTHH:MM:SS In(14) vobd[2097764]: [vSANCorrelator] 3348190290658us: [vob.vsan.lsom.storagepoolrepair] vSAN device ########-####-####-####-############ is being repaired due to I/O failures and will be out of service until the repair is complete

YYYY-MM-DDTHH:MM:SS In(14) vobd[2097764]: [vSANCorrelator] 3348158977128us: [esx.problem.vob.vsan.lsom.storagepoolrepair] Device ########-####-####-####-############ is currently offline and is being repaired.

YYYY-MM-DDTHH:MM:SS In(14) vobd[2097764]: The event ([esx.problem.vob.vsan.lsom.storagepoolrepair] Device ########-####-####-####-############ is currently offline and is being repaired.) was sent immediately to hostd;

YYYY-MM-DDTHH:MM:SS In(14) vobd[2097764]: [vSANCorrelator] 3348234271599us: [vob.vsan.pdl.offline] vSAN device ########-####-####-####-############ has gone offline.
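
To list which vSAN device UUIDs were reported as under repair or offline, a small parser along the following lines may help. This is a sketch under stated assumptions: the event names (storagepoolrepair, pdl.offline) are taken from the excerpts above, the function names are hypothetical, and the event pattern deliberately tolerates stray spaces inside the bracketed event ID in case the log text was copied from a rendered page.

```python
import re

# Bracketed event ID such as [vob.vsan.pdl.offline] or
# [esx.problem.vob.vsan.lsom.storagepoolrepair]; stray spaces are tolerated.
EVENT_RE = re.compile(r"\[((?:esx\.problem\.)?\s*vob\.[\w.\s]+?)\]")

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

# Event name suffixes seen in this article's vobd excerpts.
WATCHED_SUFFIXES = ("storagepoolrepair", "pdl.offline")


def parse_vobd_line(line):
    """Return (event_id, device_uuid) for a vobd line, or None if no event ID."""
    event = EVENT_RE.search(line)
    if event is None:
        return None
    event_id = re.sub(r"\s+", "", event.group(1))
    uuid = UUID_RE.search(line)
    return event_id, (uuid.group(0) if uuid else None)


def devices_offline(lines):
    """Collect device UUIDs that have a repair or PDL-offline event."""
    found = set()
    for line in lines:
        parsed = parse_vobd_line(line)
        if parsed and parsed[0].endswith(WATCHED_SUFFIXES) and parsed[1]:
            found.add(parsed[1])
    return found
```

The resulting UUIDs can then be matched against the devices that the host reports, so the correct physical drive is identified before contacting the hardware vendor.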

In /var/run/log/vsandevicemonitord.log, the following entry indicates that stuck I/O was detected on the disk:

YYYY-MM-DDTHH:MM:SS In(14) vsandevicemonitord[2100512]: [938863764160]: Device t10.NVMe____######################__________############### state is DISK_UNDER_STUCK_IO
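
A similar check can be sketched for the device monitor log. The example below assumes the "Device <id> state is <state>" layout shown in the entry above and treats any state containing "STUCK" as a stuck I/O condition; both the function name stuck_io_devices and that matching rule are assumptions for illustration.

```python
import re

# Assumption: entries follow the "Device <id> state is <state>" layout
# seen in the vsandevicemonitord.log excerpt in this article.
STATE_RE = re.compile(r"Device\s+(?P<device>\S+)\s+state is\s+(?P<state>.+?)\s*$")


def stuck_io_devices(lines):
    """Return devices whose logged state mentions stuck I/O, in first-seen order."""
    devices = []
    for line in lines:
        match = STATE_RE.search(line)
        if match is None:
            continue
        if "STUCK" not in match.group("state").upper():
            continue
        if match.group("device") not in devices:
            devices.append(match.group("device"))
    return devices
```

A device that appears both here and in the vmkernel.log PDL entries is a strong candidate for the drive to raise with the hardware vendor.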

Resolution

Engage the hardware vendor to assess the health of the physical disk. Errors can originate from underlying hardware issues that are not fully visible at the hypervisor level. The hardware vendor can perform detailed diagnostics on the drive, controller, and firmware to identify any latent or developing faults.
