vSAN capacity disk reports SMART impending failure and permanent device error

Article ID: 412701

Products

VMware vSAN

Issue/Introduction

A capacity disk in a vSAN OSA (Hybrid) disk group reports an impending failure. vSAN first marked the disk offline, then escalated it to a Permanent Error (PERM).

Disk mapping output (vdq -iH):

   DiskMapping[0]:
           SSD:  naa.50000######b800
            MD:  naa.50000######aba5
            MD:  naa.50000######ac15
            MD:  naa.50000######abc1
            MD:  naa.50000######ab51
            MD:  naa.50000######ac35

Log entries show the device transitioned to PERM error:

/var/log/vmkernel.log 

2025-10-01T16:46:19.614Z In(182) vmkernel: cpu56:2100228)LSOM: LSOMLogDiskEvent:8418: Disk Event permanent error for MD 52###9fd-####-####-####-d7e####2fcf3 (naa.50000######abc5:2)
2025-10-01T16:46:19.614Z Wa(180) vmkwarning: cpu56:2100228)WARNING: LSOM: LSOMEventNotify:8891: vSAN device 52###9fd-####-####-####-d7e####2fcf3  is under permanent error.

SMART data confirms that the disk is in an impending failure state:

esxcli storage core device smart get -d naa.50000######abc5

SMART Data for Disk : naa.50000######abc5
Parameter                       Value  Threshold Worst  Raw
-----------------------------------------------------------
Health Status                    IMPENDING FAILURE       N/A     N/A     N/A
Write Error Count                0       N/A     N/A     N/A
Read Error Count                 504     N/A     N/A     N/A
Power Cycle Count                50      N/A     N/A     N/A
Drive Temperature                19      N/A     N/A     N/A
------------------------------------------------------------
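
The health check above can also be scripted. The following is a minimal Python sketch, assuming the command output has already been captured as text. The helper name parse_smart and the two-space column split are assumptions made for illustration; they are not part of any VMware tool.

```python
import re

# Sample text copied from the SMART output in this article (first rows only).
SAMPLE = """\
SMART Data for Disk : naa.50000######abc5
Parameter                       Value  Threshold Worst  Raw
-----------------------------------------------------------
Health Status                    IMPENDING FAILURE       N/A     N/A     N/A
Write Error Count                0       N/A     N/A     N/A
Read Error Count                 504     N/A     N/A     N/A
"""


def parse_smart(output):
    """Return {parameter: value} from captured SMART output (hypothetical helper)."""
    values = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the title, the column header, and separator rules.
        if not line or line.startswith(("SMART Data", "Parameter", "---")):
            continue
        # Assumption: columns are separated by two or more whitespace characters.
        fields = re.split(r"\s{2,}", line)
        if len(fields) >= 2:
            values[fields[0]] = fields[1]
    return values


smart = parse_smart(SAMPLE)
if smart.get("Health Status") == "IMPENDING FAILURE":
    print("Health Status:", smart["Health Status"])
    print("Read Error Count:", smart.get("Read Error Count"))
```

Only the Value column is kept, because the Threshold, Worst, and Raw columns are N/A for this drive.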

Environment

  • VMware vSAN 8.x

  • VMware vSAN OSA (Hybrid)

Cause

  • The capacity device naa.50000######abc5 encountered repeated read errors and medium errors (bad sectors).

  • vSAN attempted multiple retries and repair operations, but the retry threshold was exceeded.

  • The device was first marked offline, then escalated to Permanent Error (PERM).

  • SMART monitoring reports the drive in impending failure.

  • vSAN automatically initiated data evacuation to protect against data loss.

Validation

Medium Errors (Read Failures)

  • Sense Key [0x3] MEDIUM ERROR with additional sense code 0x11 0x1 (READ RETRIES EXHAUSTED).
  • This means the drive cannot physically read certain sectors, which indicates that bad blocks are developing.

Command Failures & Timeouts

  • Cmd 0x28 (SCSI READ(10)) … Failed: Medium Error, followed later by Host Status [0x5] ABORT.
  • Commands to the device are being aborted because of repeated timeouts.

2025-09-30T16:26:12.316Z Wa(180) vmkwarning: cpu1:2098534)WARNING: HPP: HppScsiThrottleLogForDevice:585: Cmd 0x28 (0x45b######b80, 0) to dev "naa.50000######abc5" on path "vmhba0:C0:T5:L0" Failed:
2025-09-30T16:26:12.316Z Wa(180) vmkwarning: cpu1:2098534)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x1. hppAction = 1
2025-09-30T16:26:12.316Z In(182) vmkernel: cpu1:2098534)ScsiDeviceIO: 4686: Cmd(0x45b######b80) 0x28, CmdSN 0xb860211 from world 0 to dev "naa.50000######abc5" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x1 Medium Error, LBA: 102####576
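
The status fields in these lines can be decoded mechanically. The sketch below is an illustration only: the lookup tables cover just the codes that appear in this article, and the function name decode_status is hypothetical. Any other code is returned in hexadecimal rather than guessed.

```python
import re

# Lookup tables limited to the codes seen in this article.
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x01): "READ RETRIES EXHAUSTED"}

# Matches "H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x1".
STATUS_RE = re.compile(
    r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+) "
    r"Valid sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+)",
    re.IGNORECASE,
)


def decode_status(line):
    """Decode host/device/plugin status and sense data from one log line."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host": host,
        "device": DEVICE_STATUS.get(device, hex(device)),
        "plugin": plugin,
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }


# Tail of the ScsiDeviceIO line shown above.
LINE = "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x1 Medium Error"
status = decode_status(LINE)
print(status)
```

A host status of 0x0 combined with device status 0x2 means the command reached the drive and the drive itself reported the error through sense data.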

I/O Errors on Partition Reads

  • Failed read for naa.50000######abc5: I/O error
  • Even the partition table (protective MBR/GPT) cannot be read properly.

2025-09-30T16:26:49.147Z In(182) vmkernel: cpu2:2116277 opID=da3a69f2)Partition: 477: Failed read for "naa.50000######abc5": I/O error
2025-09-30T16:26:49.147Z In(182) vmkernel: cpu2:2116277 opID=da3a69f2)Partition: 1205: Failed to read protective mbr on "naa.50000######abc5" : I/O error
2025-09-30T16:26:49.147Z Wa(180) vmkwarning: cpu2:2116277 opID=da3a69f2)WARNING: Partition: 1387: Partition table read from device naa.50000######abc5 failed: I/O error
2025-09-30T16:26:49.147Z In(182) vmkernel: cpu2:2116277 opID=da3a69f2)ScsiDeviceIO: 6478: Command 0x1a (CmdSN 0x36###49, World 0) to device naa.50000######abc5 timed out: expiry time occurs 3ms in the past
2025-09-30T16:26:49.147Z Wa(180) vmkwarning: cpu2:2116277 opID=da3a69f2)WARNING: ScsiDeviceIO: 6723: Failed to issue command (0x1a) on device naa.50000######abc5: Timeout

vSAN Disk Repair Process

  • vSAN detects these I/O errors, marks the device offline, and tries to repair it while resyncing data to other healthy devices.
  • The log reports that the vSAN device is being repaired due to I/O failures.
  • The device stays out of service until the unmount-mount operation is complete.

2025-09-30T16:27:10.948Z In(182) vmkernel: cpu43:16728488)PLOG: PLOGHandleTransientErrorInt:5530: Throttled: Device: 52###9fd-####-####-####-d7e######cf3 will be out of service until unmount-mount operation is complete.
2025-09-30T16:27:10.948Z Wa(180) vmkwarning: cpu43:16728488)WARNING: PLOG: PLOGHandleTransientErrorInt:5612: vSAN device 52###9fd-####-####-####-d7e######cf3 is being repaired due to I/O failures, and will be out of service until the repair is complete. If the devi$
2025-09-30T16:27:10.948Z In(182) vmkernel: cpu43:16728488)LSOMCommon: IORETRYCompleteIO:469: Throttled:  0x45e#####7900 IO type 16648 (READ) isOrdered:NO isSplit:YES isEncr:YES since 60001 msec status Maximum kernel-level retries exceeded

Latency Spikes

  • Latency rose from ~5 ms (4994 µs) to nearly 1 second (956638 µs).
  • It then dropped back to about 13 ms (13441 µs), which is typical of a failing disk with intermittent responsiveness.

2025-09-30T16:41:43.666Z In(182) vmkernel: cpu55:2100228)LSOM: LSOMNamespaceCheckLatency:398: Throttled: Latency 523bd9fd-####-####-####-d7e######cf3 1 18:26:##:##:#:#:0:0:1
2025-09-30T16:41:43.666Z In(182) vmkernel: cpu55:2100228)LSOM: LSOMNamespaceCheckLatency:428: Throttled: LatencyCum 523bd9fd-####-####-####-d7e######fcf3 1 31###81:242##93:22##989:491##96:32##23:14##6:28:1:3

2025-09-30T16:41:53.265Z Wa(180) vmkwarning: cpu35:2098536)WARNING: ScsiDeviceIO: 1780: Device naa.50000######abc5 performance has deteriorated. I/O latency increased from average value of 4994 microseconds to 956638 microseconds.
2025-09-30T16:41:53.271Z In(182) vmkernel: cpu5:2098530)ScsiDeviceIO: 1780: Device naa.50000######abc5 performance has improved. I/O latency reduced from 956638 microseconds to 13441 microseconds.
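
To put a number on the spike, the "performance has deteriorated" message can be parsed directly. This is a minimal sketch that assumes the message wording shown above; the function name latency_spike is hypothetical.

```python
import re

# Pattern based on the ScsiDeviceIO warning text quoted in this article.
DETERIORATED_RE = re.compile(
    r"Device (\S+) performance has deteriorated\. I/O latency increased "
    r"from average value of (\d+) microseconds to (\d+) microseconds"
)


def latency_spike(line):
    """Return (device, before_us, after_us, ratio), or None if no match."""
    match = DETERIORATED_RE.search(line)
    if match is None:
        return None
    device = match.group(1)
    before_us = int(match.group(2))
    after_us = int(match.group(3))
    return device, before_us, after_us, after_us / before_us


LINE = (
    "WARNING: ScsiDeviceIO: 1780: Device naa.50000######abc5 performance has "
    "deteriorated. I/O latency increased from average value of 4994 "
    "microseconds to 956638 microseconds."
)
device, before_us, after_us, ratio = latency_spike(LINE)
print(f"{device}: {before_us / 1000:.1f} ms -> {after_us / 1000:.1f} ms ({ratio:.0f}x)")
```

For the values in this article the ratio is a little over 191, so latency increased by more than two orders of magnitude.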

Permanent Device Error

  • The repair threshold (3) has been reached, so the device is marked as PERM error.
  • The device has exceeded its retry limits, and vSAN now considers it permanently failed.

2025-10-01T16:46:19.614Z In(182) vmkernel: cpu29:17011586)PLOG: PLOGHandleTransientErrorInt:5549: Repair threshold (3) for device: 52####fd-####-####-####-d7e######cf3 has been reached and will be marked as PERM error
2025-10-01T16:46:19.614Z Wa(180) vmkwarning: cpu1:2099350)WARNING: PLOG: PLOGPropagateErrorInt:4915: vSAN device 523bd9fd-####-####-####-d7e######cf3 is under permanent error.
2025-10-01T16:46:19.614Z In(182) vmkernel: cpu56:2100228)LSOM: LSOMLogDiskEvent:8418: Disk Event permanent error for MD 52####fd-####-####-####-d7e######cf3 (naa.50000######abc5:2)
2025-10-01T16:46:19.614Z Wa(180) vmkwarning: cpu56:2100228)WARNING: LSOM: LSOMEventNotify:8891: vSAN device 523bd9fd-####-####-####-d7e######cf3 is under permanent error.

SMART Impending Failure

  • [vSAN device naa.50000######abc5 smart health status is impending failure. It will be evacuated and unmounted, consider replacing it.]
  • The disk’s SMART health check predicts imminent failure.

2025-10-02T05:08:54.611Z In(14) vobd[2097763]:  [vSANCorrelator] 4657######206us: [esx.problem.vob.vsan.lsom.devicewithsmartfailure] vSAN device naa.50000######abc5 smart health status is impending failure. It will be evacuated and unmounted, consider replacing it.
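
The validation steps above amount to looking for a known sequence of messages. As a rough triage aid, the sketch below scans log lines for the message fragments quoted in this article and lists which stages were seen. The stage names and the function stages_seen are hypothetical labels, and the fragments are assumed to match only the log wording shown here.

```python
# Message fragments copied from the log excerpts in this article,
# listed in the order the stages occurred. Stage names are hypothetical.
SIGNATURES = [
    ("medium_error", "Medium Error"),
    ("partition_read_failure", "Failed to read protective mbr"),
    ("repair_attempt", "is being repaired due to I/O failures"),
    ("latency_spike", "performance has deteriorated"),
    ("permanent_error", "is under permanent error"),
    ("smart_impending_failure", "smart health status is impending failure"),
]


def stages_seen(lines):
    """Return the stages found in the given log lines, in first-seen order."""
    seen = []
    for line in lines:
        for stage, fragment in SIGNATURES:
            if fragment in line and stage not in seen:
                seen.append(stage)
    return seen


# Shortened versions of lines quoted in this article.
SAMPLE_LINES = [
    "Valid sense data: 0x3 0x11 0x1 Medium Error, LBA: 102####576",
    "vSAN device 52###9fd is being repaired due to I/O failures",
    "WARNING: LSOM: LSOMEventNotify:8891: vSAN device 52###9fd  is under permanent error.",
]
print(stages_seen(SAMPLE_LINES))
```

In practice the input would be the lines of vmkernel.log and the vobd log collected from the affected host.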

Resolution

  • Place the host into maintenance mode, then replace the failing disk (the safest approach).

Note: Ensure that data evacuation has completed before physically removing the drive.

  • After replacement, add the new disk to the vSAN disk group to restore capacity and redundancy.

If deduplication and compression are enabled:

  • Place the host into maintenance mode.
  • Remove the disk group that contains the failing disk.
  • Replace the disk.
  • Recreate the disk group.
  • Take the host out of maintenance mode.
  • Allow vSAN to resync.
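
The disk group procedure can be laid out as an ordered command list for review before anything is run. The sketch below only builds and prints the commands; it does not execute them. The function name, the default maintenance mode option (ensureObjectAccessibility), and the placeholder for the new device are assumptions, and the esxcli options should be checked against the command reference for the installed ESXi build. The same steps can also be performed from the vSphere Client.

```python
def disk_group_replacement_plan(cache_device, capacity_devices,
                                vsan_mode="ensureObjectAccessibility"):
    """Build the ordered command list for recreating a disk group.

    Hypothetical helper: it returns strings for review and runs nothing.
    """
    # One --disks option per capacity device in the recreated disk group.
    add_cmd = "esxcli vsan storage add --ssd " + cache_device
    for device in capacity_devices:
        add_cmd += " --disks " + device
    return [
        f"esxcli system maintenanceMode set --enable true --vsanmode {vsan_mode}",
        # Removing by cache device removes the whole disk group.
        f"esxcli vsan storage remove --ssd {cache_device}",
        "# replace the failed drive physically before continuing",
        add_cmd,
        "esxcli system maintenanceMode set --enable false",
        # Check resync progress after the host leaves maintenance mode.
        "esxcli vsan debug resync summary get",
    ]


plan = disk_group_replacement_plan(
    "naa.50000######b800",                            # cache device from the disk mapping
    ["naa.50000######aba5", "<new-capacity-device>"],  # placeholder for the new disk
)
for step in plan:
    print(step)
```

Printing the plan first makes it easier to confirm the device identifiers against the vdq -iH output before any disk group is removed.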