Alert for failed vSAN disk on Skyline health, but hardware monitoring software shows disk as healthy



Article ID: 417749



Products

VMware vSAN, VMware vSAN 6.x, VMware vSAN 7.x, VMware vSAN 8.x

Issue/Introduction

  • Skyline Health Check in vCenter shows one or more failed disks (e.g., Error: vSAN physical disk alarm 'Operation').
  • VMs with components on the affected disk show vCPU spikes.

  • On the ESXi host, you may see messages in the logs similar to the following:

    • [/var/log/vmkernel.log]

      cpu56:2098250)ScsiDeviceIO: 4686: Cmd(0x45bcd3e1c400) 0x28, CmdSN 0x41bc from world 0 to dev "naa.XXXXXXXXXXXXXXXX" failed H:0x7 D:0x0 P:0x0

      -or-

      cpu38:2144121)WARNING: PLOG: PLOGValidateDisk:3555: Disk naa.XXXXXXXXXXXXXXXX:1 0x4503008b61a8 is unhealthy state (1), degraded

    • [/var/log/hostd.log]

      In(166) Hostd[2107064]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 108173 : Device naa.XXXXXXXXXXXXXXXX has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.

    • [/var/run/log/vmkwarning.log]

      Wa(180) vmkwarning: cpu10:2097936)WARNING: HPP: HppDeviceUpdateState:5242: Device 'naa.XXXXXXXXXXXXXXXX' is changing to 'permanent device loss' from 'on'.

  • Hardware monitoring tools (e.g., iDRAC, iLO) report the disk status as Healthy.
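
To see which device the host-side messages point to, the log signatures listed above can be searched for programmatically. The following is a minimal Python sketch, not a supported tool: the regular expressions are derived only from the sample lines in this article, and the pattern labels and the scan function are hypothetical names chosen for illustration. Message formats may differ between ESXi builds, so treat a missing match as inconclusive.

```python
import re
from collections import defaultdict

# Patterns derived from the sample log lines in this article (assumption:
# the real messages on your build follow the same wording).
PATTERNS = {
    "scsi_io_failed": re.compile(
        r'ScsiDeviceIO: .*to dev "(?P<dev>naa\.\w+)" failed '
        r'H:0x[0-9a-fA-F]+ D:0x[0-9a-fA-F]+ P:0x[0-9a-fA-F]+'
    ),
    "plog_unhealthy": re.compile(
        r'PLOGValidateDisk:\d+: Disk (?P<dev>naa\.\w+):\d+ .* unhealthy'
    ),
    "device_removed": re.compile(
        r'Device (?P<dev>naa\.\w+) has been removed or is permanently inaccessible'
    ),
    "permanent_device_loss": re.compile(
        r"Device '(?P<dev>naa\.\w+)' is changing to 'permanent device loss'"
    ),
}


def scan(lines):
    """Count matching signatures per device across an iterable of log lines."""
    hits = defaultdict(lambda: defaultdict(int))
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits[match.group("dev")][label] += 1
    # Convert nested defaultdicts to plain dicts for easier printing.
    return {dev: dict(counts) for dev, counts in hits.items()}
```

For example, passing an open log file (such as a copy of vmkernel.log from a support bundle) to scan returns a dictionary keyed by device identifier, which can then be compared against the disk that Skyline Health reports as failed.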

Environment

  • VMware vSAN (All Versions)

Cause

  • The disk is failing or has failed, even though the hardware monitoring tool has not yet flagged it.
  • vSAN's Dying Disk Handling (DDH) monitors throughput and latency, so it often detects internal disk errors or performance degradation before the physical hardware sensors trigger a total failure alarm in the BMC (iDRAC/iLO).

Resolution

  • To confirm the hardware state and attempt recovery, schedule a cold reboot of the ESXi host:

    1. Place the ESXi host into Maintenance Mode with the Ensure Accessibility option.
    2. Once the host is in Maintenance Mode, power it down completely.
    3. Wait five minutes.
    4. Power the ESXi host back on.

    NOTE: A cold reboot forces the physical disks to reinitialize. If a disk has a hardware fault, the hardware management tool should detect it during initialization and flag the disk(s) as failed.

  • If the disk initializes successfully: Monitor the Skyline Health view after the host is back online. If the alert is gone, continue to monitor the drive closely for recurring latency or errors.

  • If the disk fails to initialize: Please contact your hardware vendor for a physical disk replacement.
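
For administrators who prefer the ESXi shell to the vSphere Client, the cold-reboot procedure above can be sketched as a short command plan. The Python below only assembles and prints the esxcli command lines; it does not run them. The function names, the default reason text, and the choice of post-boot checks are assumptions made for illustration and are not part of this article's procedure. Steps 3 and 4 (waiting five minutes and powering the host back on) happen outside the host, typically through the BMC.

```python
import shlex


def cold_reboot_commands(reason="vSAN disk check: cold reboot"):
    """Return ESXi shell commands for steps 1 and 2 as argv lists."""
    return [
        # Step 1: enter Maintenance Mode using the Ensure Accessibility vSAN mode.
        ["esxcli", "system", "maintenanceMode", "set",
         "--enable", "true", "--vsanmode", "ensureObjectAccessibility"],
        # Confirm the host reports Maintenance Mode before powering off.
        ["esxcli", "system", "maintenanceMode", "get"],
        # Step 2: power the host off completely (esxcli requires a reason).
        ["esxcli", "system", "shutdown", "poweroff", "--reason", reason],
    ]


def post_boot_checks(device):
    """Return commands to inspect the disk once the host is back online."""
    return [
        # Device state as seen by the ESXi storage stack.
        ["esxcli", "storage", "core", "device", "list", "--device", device],
        # Disks currently claimed by vSAN on this host.
        ["esxcli", "vsan", "storage", "list"],
    ]


def render(commands):
    """Render argv lists as shell-quoted command strings."""
    return [shlex.join(argv) for argv in commands]


if __name__ == "__main__":
    # Hypothetical placeholder; substitute the device from the alert.
    device = "naa.XXXXXXXXXXXXXXXX"
    for line in render(cold_reboot_commands()) + render(post_boot_checks(device)):
        print(line)
```

Printing the plan first, instead of executing it, leaves room to review each command against the environment and to confirm that entering Maintenance Mode completed before the host is powered off.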
