vSAN capacity disk reports SMART impending failure and permanent device error

Products

VMware vSAN

Issue/Introduction

A capacity disk in a vSAN OSA (Hybrid) disk group reports an impending failure. vSAN first marks the disk offline and then escalates it to Permanent Error (PERM).

Disk mapping output (vdq -iH):

DiskMapping[0]:
SSD: naa.50000######b800
MD: naa.50000######aba5
MD: naa.50000######ac15
MD: naa.50000######abc1
MD: naa.50000######ab51
MD: naa.50000######ac35

Log entries show the device transitioned to PERM error:

/var/log/vmkernel.log

2025-10-01T16:46:19.614Z In(182) vmkernel: cpu56:2100228)LSOM: LSOMLogDiskEvent:8418: Disk Event permanent error for MD 52###9fd-####-####-####-d7e####2fcf3 (naa.50000######abc5:2)
2025-10-01T16:46:19.614Z Wa(180) vmkwarning: cpu56:2100228)WARNING: LSOM: LSOMEventNotify:8891: vSAN device 52###9fd-####-####-####-d7e####2fcf3 is under permanent error.

SMART data confirms the disk is in an impending failure state:

esxcli storage core device smart get -d naa.50000######abc5

SMART Data for Disk : naa.50000######abc5
Parameter          Value              Threshold  Worst  Raw
-----------------------------------------------------------
Health Status      IMPENDING FAILURE  N/A        N/A    N/A
Write Error Count  0                  N/A        N/A    N/A
Read Error Count   504                N/A        N/A    N/A
Power Cycle Count  50                 N/A        N/A    N/A
Drive Temperature  19                 N/A        N/A    N/A
-----------------------------------------------------------
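
The SMART check can also be scripted when several disks need to be reviewed. The following is a minimal Python sketch, assuming the output of esxcli storage core device smart get has been captured as text and that the Value column is followed by exactly three more columns (Threshold, Worst, Raw), as in the output shown here. The function name smart_health_status is hypothetical and is not part of any VMware tool.

```python
from typing import Optional

# Sample text copied from the SMART output in this article (device ID is masked).
SAMPLE_SMART_OUTPUT = """\
SMART Data for Disk : naa.50000######abc5
Parameter Value Threshold Worst Raw
Health Status IMPENDING FAILURE N/A N/A N/A
Write Error Count 0 N/A N/A N/A
Read Error Count 504 N/A N/A N/A
"""


def smart_health_status(smart_output: str) -> Optional[str]:
    """Return the value of the 'Health Status' row, or None if the row is absent."""
    label = "Health Status"
    for raw_line in smart_output.splitlines():
        line = raw_line.strip()
        if not line.startswith(label):
            continue
        tokens = line[len(label):].split()
        # Assumption: the last three tokens are the Threshold, Worst and Raw
        # columns, so everything before them belongs to the Value column.
        value_tokens = tokens[:-3] if len(tokens) > 3 else tokens
        return " ".join(value_tokens) or None
    return None


if __name__ == "__main__":
    status = smart_health_status(SAMPLE_SMART_OUTPUT)
    if status == "IMPENDING FAILURE":
        print("Disk reports impending failure; plan a replacement.")
    else:
        print("Health Status:", status)
```

Because the parser splits on whitespace, it works on both column-aligned output and output whose spacing was collapsed when it was copied.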

Environment

VMware vSAN 8.x
VMware vSAN OSA (Hybrid)

Cause

The capacity device naa.50000######abc5 encountered repeated read errors and medium errors (bad sectors).
vSAN attempted multiple retries and repair operations, but the retry threshold was exceeded.
The device was first marked offline and then escalated to Permanent Error (PERM).
SMART monitoring reports the drive as being in an impending failure state.
vSAN automatically initiated data evacuation to protect against data loss.

Validation

Medium Errors (Read Failures)

Sense Key [0x3] MEDIUM ERROR with READ RETRIES EXHAUSTED.
This means the drive has physical problems reading certain sectors (bad blocks developing).

Command Failures & Timeouts

Cmd 0x28 … Failed: Medium Error and later Host Status [0x5] ABORT.
Commands to the device are being aborted due to repeated timeouts.

2025-09-30T16:26:12.316Z Wa(180) vmkwarning: cpu1:2098534)WARNING: HPP: HppScsiThrottleLogForDevice:585: Cmd 0x28 (0x45b######b80, 0) to dev "naa.50000######abc5" on path "vmhba0:C0:T5:L0" Failed:
2025-09-30T16:26:12.316Z Wa(180) vmkwarning: cpu1:2098534)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x1. hppAction = 1
2025-09-30T16:26:12.316Z In(182) vmkernel: cpu1:2098534)ScsiDeviceIO: 4686: Cmd(0x45b######b80) 0x28, CmdSN 0xb860211 from world 0 to dev "naa.50000######abc5" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x1 Medium Error, LBA: 102####576
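
The status fields in these entries (H:, D:, P: and the three sense bytes) can be decoded mechanically. The sketch below is a hedged illustration in Python: the lookup tables hold only a few standard SCSI codes relevant to this case, not the full SCSI tables, and the function name decode_scsi_status is hypothetical.

```python
import re
from typing import Dict, Optional

# Small subsets of the standard SCSI code tables; not exhaustive.
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY"}
SENSE_KEYS = {
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0xB: "ABORTED COMMAND",
}
ADDITIONAL_SENSE = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
}

STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+) "
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)

# Log line copied from this article (identifiers are masked with '#').
SAMPLE_LINE = (
    '2025-09-30T16:26:12.316Z In(182) vmkernel: cpu1:2098534)ScsiDeviceIO: 4686: '
    'Cmd(0x45b######b80) 0x28, CmdSN 0xb860211 from world 0 to dev '
    '"naa.50000######abc5" failed H:0x0 D:0x2 P:0x0 Valid sense data: '
    '0x3 0x11 0x1 Medium Error, LBA: 102####576'
)


def decode_scsi_status(line: str) -> Optional[Dict[str, object]]:
    """Decode host/device/plugin status and sense data from one log line."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host_status": host,
        "device_status": DEVICE_STATUS.get(device, "unknown (0x%x)" % device),
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key, "unknown (0x%x)" % key),
        "additional_sense": ADDITIONAL_SENSE.get(
            (asc, ascq), "unknown (0x%x/0x%x)" % (asc, ascq)
        ),
    }


if __name__ == "__main__":
    print(decode_scsi_status(SAMPLE_LINE))
```

For the sample line, the device status decodes to CHECK CONDITION and the sense data to MEDIUM ERROR with READ RETRIES EXHAUSTED, which matches the interpretation given under Medium Errors (Read Failures).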

I/O Errors on Partition Reads

Failed read for naa.50000######abc5: I/O error
Even the partition table (protective MBR/GPT) cannot be read properly.

2025-09-30T16:26:49.147Z In(182) vmkernel: cpu2:2116277 opID=da3a69f2)Partition: 477: Failed read for "naa.50000######abc5": I/O error
2025-09-30T16:26:49.147Z In(182) vmkernel: cpu2:2116277 opID=da3a69f2)Partition: 1205: Failed to read protective mbr on "naa.50000######abc5" : I/O error
2025-09-30T16:26:49.147Z Wa(180) vmkwarning: cpu2:2116277 opID=da3a69f2)WARNING: Partition: 1387: Partition table read from device naa.50000######abc5 failed: I/O error
2025-09-30T16:26:49.147Z In(182) vmkernel: cpu2:2116277 opID=da3a69f2)ScsiDeviceIO: 6478: Command 0x1a (CmdSN 0x36###49, World 0) to device naa.50000######abc5 timed out: expiry time occurs 3ms in the past
2025-09-30T16:26:49.147Z Wa(180) vmkwarning: cpu2:2116277 opID=da3a69f2)WARNING: ScsiDeviceIO: 6723: Failed to issue command (0x1a) on device naa.50000######abc5: Timeout

vSAN Disk Repair Process

Device will be out of service until unmount-mount operation is complete
vSAN device is being repaired due to I/O failures
vSAN detects these errors, marks the device offline, and attempts to repair it by resyncing data to other healthy devices.

2025-09-30T16:27:10.948Z In(182) vmkernel: cpu43:16728488)PLOG: PLOGHandleTransientErrorInt:5530: Throttled: Device: 52###9fd-####-####-####-d7e######cf3 will be out of service until unmount-mount operation is complete.
2025-09-30T16:27:10.948Z Wa(180) vmkwarning: cpu43:16728488)WARNING: PLOG: PLOGHandleTransientErrorInt:5612: vSAN device 52###9fd-####-####-####-d7e######cf3 is being repaired due to I/O failures, and will be out of service until the repair is complete. If the devi$
2025-09-30T16:27:10.948Z In(182) vmkernel: cpu43:16728488)LSOMCommon: IORETRYCompleteIO:469: Throttled: 0x45e#####7900 IO type 16648 (READ) isOrdered:NO isSplit:YES isEncr:YES since 60001 msec status Maximum kernel-level retries exceeded

Latency Spikes

Latency rose from about 5 ms (4994 µs) to nearly 1 second (956638 µs).
It then dropped back to about 13 ms (13441 µs), which is typical of a failing disk that responds intermittently.

2025-09-30T16:41:43.666Z In(182) vmkernel: cpu55:2100228)LSOM: LSOMNamespaceCheckLatency:398: Throttled: Latency 523bd9fd-####-####-####-d7e######cf3 1 18:26:##:##:#:#:0:0:1
2025-09-30T16:41:43.666Z In(182) vmkernel: cpu55:2100228)LSOM: LSOMNamespaceCheckLatency:428: Throttled: LatencyCum 523bd9fd-####-####-####-d7e######fcf3 1 31###81:242##93:22##989:491##96:32##23:14##6:28:1:3

2025-09-30T16:41:53.265Z Wa(180) vmkwarning: cpu35:2098536)WARNING: ScsiDeviceIO: 1780: Device naa.50000######abc5 performance has deteriorated. I/O latency increased from average value of 4994 microseconds to 956638 microseconds.
2025-09-30T16:41:53.271Z In(182) vmkernel: cpu5:2098530)ScsiDeviceIO: 1780: Device naa.50000######abc5 performance has improved. I/O latency reduced from 956638 microseconds to 13441 microseconds.
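
To quantify spikes like this across a longer log, the "performance has deteriorated" warnings can be parsed and converted to milliseconds. This is a minimal Python sketch, assuming the message wording shown in the warning entry in this section. The function name latency_spike and the returned field names are hypothetical.

```python
import re
from typing import Dict, Optional

DETERIORATED_RE = re.compile(
    r"Device (\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (\d+) microseconds "
    r"to (\d+) microseconds"
)

# Warning line copied from this article (device ID is masked with '#').
SAMPLE_LINE = (
    "2025-09-30T16:41:53.265Z Wa(180) vmkwarning: cpu35:2098536)WARNING: "
    "ScsiDeviceIO: 1780: Device naa.50000######abc5 performance has deteriorated. "
    "I/O latency increased from average value of 4994 microseconds "
    "to 956638 microseconds."
)


def latency_spike(line: str) -> Optional[Dict[str, object]]:
    """Extract device, baseline and spike latency (ms) and the increase factor."""
    match = DETERIORATED_RE.search(line)
    if match is None:
        return None
    device = match.group(1)
    baseline_us = int(match.group(2))
    spike_us = int(match.group(3))
    return {
        "device": device,
        "baseline_ms": baseline_us / 1000.0,
        "spike_ms": spike_us / 1000.0,
        # Guard against a zero baseline to avoid division by zero.
        "factor": spike_us / baseline_us if baseline_us else float("inf"),
    }


if __name__ == "__main__":
    spike = latency_spike(SAMPLE_LINE)
    print("%(device)s: %(baseline_ms).3f ms -> %(spike_ms).3f ms" % spike)
    print("increase factor: %.1f" % spike["factor"])
```

For the sample line this gives a baseline of 4.994 ms and a spike of 956.638 ms, an increase of more than 190 times.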

Permanent Device Error

The repair threshold (3) has been reached, so the device is marked as PERM error.
The device has exceeded its retry limits, and vSAN now considers it permanently failed.

2025-10-01T16:46:19.614Z In(182) vmkernel: cpu29:17011586)PLOG: PLOGHandleTransientErrorInt:5549: Repair threshold (3) for device: 52####fd-####-####-####-d7e######cf3 has been reached and will be marked as PERM error
2025-10-01T16:46:19.614Z Wa(180) vmkwarning: cpu1:2099350)WARNING: PLOG: PLOGPropagateErrorInt:4915: vSAN device 523bd9fd-####-####-####-d7e######cf3 is under permanent error.
2025-10-01T16:46:19.614Z In(182) vmkernel: cpu56:2100228)LSOM: LSOMLogDiskEvent:8418: Disk Event permanent error for MD 52####fd-####-####-####-d7e######cf3 (naa.50000######abc5:2)
2025-10-01T16:46:19.614Z Wa(180) vmkwarning: cpu56:2100228)WARNING: LSOM: LSOMEventNotify:8891: vSAN device 523bd9fd-####-####-####-d7e######cf3 is under permanent error.

SMART Impending Failure

[vSAN device naa.50000######abc5 smart health status is impending failure. It will be evacuated and unmounted, consider replacing it.]
The disk’s SMART health check predicts imminent failure.

2025-10-02T05:08:54.611Z In(14) vobd[2097763]: [vSANCorrelator] 4657######206us: [esx.problem.vob.vsan.lsom.devicewithsmartfailure] vSAN device naa.50000######abc5 smart health status is impending failure. It will be evacuated and unmounted, consider replacing it.
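
Taken together, the excerpts in this Validation section form a timeline from the first medium error to the SMART alert. The following Python sketch shows one way to pull that timeline out of a log, assuming each stage can be recognized by a message fragment that appears in the excerpts in this article. The stage names, the function first_occurrences, and the abbreviated sample lines ("..." marks omitted text) are illustrative choices, not vSAN terminology.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

# (stage name, message fragment) pairs; fragments are taken from the log
# excerpts in this article, stage names are hypothetical labels.
STAGE_MARKERS = [
    ("medium_error", "Medium Error"),
    ("repair", "is being repaired due to I/O failures"),
    ("repair_threshold", "Repair threshold"),
    ("permanent_error", "is under permanent error"),
    ("smart_failure", "smart health status is impending failure"),
]

# Abbreviated copies of lines from this article; '...' marks omitted text.
SAMPLE_LINES = [
    "2025-09-30T16:26:12.316Z In(182) vmkernel: ... Valid sense data: 0x3 0x11 0x1 Medium Error, LBA: ...",
    "2025-09-30T16:27:10.948Z Wa(180) vmkwarning: ... vSAN device ... is being repaired due to I/O failures, ...",
    "2025-10-01T16:46:19.614Z In(182) vmkernel: ... Repair threshold (3) for device: ... has been reached ...",
    "2025-10-01T16:46:19.614Z Wa(180) vmkwarning: ... vSAN device ... is under permanent error.",
    "2025-10-02T05:08:54.611Z In(14) vobd[2097763]: ... smart health status is impending failure. ...",
]


def parse_timestamp(line: str) -> datetime:
    """Parse the leading UTC timestamp of a log line."""
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%fZ")


def first_occurrences(lines: Iterable[str]) -> List[Tuple[str, datetime]]:
    """Return (stage, timestamp) for the first line matching each stage, in time order."""
    seen: Dict[str, datetime] = {}
    for line in lines:
        for stage, marker in STAGE_MARKERS:
            if marker in line and stage not in seen:
                seen[stage] = parse_timestamp(line)
    return sorted(seen.items(), key=lambda item: item[1])


if __name__ == "__main__":
    timeline = first_occurrences(SAMPLE_LINES)
    for stage, when in timeline:
        print(when.isoformat(), stage)
    stages = dict(timeline)
    print("medium error to PERM:", stages["permanent_error"] - stages["medium_error"])
```

For the sample lines, the interval between the first medium error and the permanent error is 1 day, 20 minutes, and about 7 seconds.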

Resolution

Replace the failing disk after placing the host in maintenance mode (the safest approach).

Note: Ensure data evacuation has completed before physically removing the drive.

After replacement, add the new disk to the vSAN disk group to restore capacity and redundancy.

If deduplication and compression are enabled, replace the disk by recreating the whole disk group:

Place the host in maintenance mode.
Remove the disk group that contains the failing disk.
Replace the disk.
Recreate the disk group.
Exit maintenance mode on the host.
Allow vSAN to resync.
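
The disk group procedure above is normally carried out in the vSphere Client. For reference, the sketch below assembles a possible command-line equivalent as a list of strings. It only prints the commands and does not run them. The esxcli invocations are an assumption about how the steps map to the CLI, so confirm the options with --help on the host before use. The function name, the device placeholders, and the default maintenance mode option (ensureObjectAccessibility) are hypothetical choices, not recommendations from this article.

```python
from typing import List


def disk_group_rebuild_commands(
    cache_device: str,
    capacity_devices: List[str],
    vsan_mode: str = "ensureObjectAccessibility",  # assumption; choose per policy
) -> List[str]:
    """Return the ordered commands (as text) for recreating a disk group."""
    capacity_args = " ".join("-d " + device for device in capacity_devices)
    return [
        # 1. Place the host in maintenance mode.
        "esxcli system maintenanceMode set -e true -m " + vsan_mode,
        # 2. Remove the disk group, identified by its cache device.
        "esxcli vsan storage remove -s " + cache_device,
        # 3. Manual step: no command is involved.
        "# physically replace the failed disk",
        # 4. Recreate the disk group with the cache and capacity devices.
        "esxcli vsan storage add -s " + cache_device + " " + capacity_args,
        # 5. Exit maintenance mode.
        "esxcli system maintenanceMode set -e false",
        # 6. Watch the resync until it completes.
        "esxcli vsan debug resync summary get",
    ]


if __name__ == "__main__":
    # Placeholder identifiers; substitute the real NAA IDs from `vdq -iH`.
    for command in disk_group_rebuild_commands("naa.CACHE", ["naa.MD1", "naa.MD2"]):
        print(command)
```

Generating the sequence as text keeps the sketch safe to run anywhere and lets the commands be reviewed before any of them are executed on a host.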