Alert for failed vSAN disk on Skyline health, but hardware monitoring software shows disk as healthy

Article ID: 417749


Updated On:

Products

VMware vSAN

Issue/Introduction

Skyline health check on vCenter shows one or more failed disks.

e.g. Error: vSAN physical disk alarm 'Operation'

 

However, the hardware monitoring tool (e.g. iDRAC, iLO, or similar) shows the disk as healthy.

Environment

vSAN 8.x/9.x

Cause

The disk may have failed without the failure yet being detected by the hardware monitoring tool, or the issue may not be a hardware problem.

Resolution

Check the logs on the ESXi host for error messages relating to the disk, or disks, triggering the alerts.

 

Some examples that indicate a potential hardware failure:

 

/var/run/log/vmkernel.log:

In(182) vmkernel: cpu56:2098250)ScsiDeviceIO: 4686: Cmd(0x45bcd3e1c400) 0x28, CmdSN 0x41bc from world 0 to dev "naa.XXXXXXXXXXXXXXXX" failed H:0x7 D:0x0 P:0x0

 

/var/run/log/hostd.log:

In(166) Hostd[2107064]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 108173 : Device naa.XXXXXXXXXXXXXXXX has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.

 

/var/run/log/vmkwarning.log:

Wa(180) vmkwarning: cpu10:2097936)WARNING: HPP: HppDeviceUpdateState:5242: Device 'naa.XXXXXXXXXXXXXXXX' is changing to 'permanent device loss' from 'on'.
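
The three message types above can be searched for across the log files with a short script. The following is a minimal Python sketch, not an official tool: the regular expressions are assumptions modeled only on the sample lines shown above, the function names `scan_lines` and `scan_files` are hypothetical, and real log lines may differ in wording.

```python
import re
from collections import defaultdict

# Patterns modeled on the three sample messages above. They are illustrative
# assumptions, not an exhaustive or official list of failure signatures.
PATTERNS = {
    "scsi_command_failed": re.compile(r'to dev "(?P<dev>[^"]+)" failed H:0x'),
    "device_removed": re.compile(
        r"Device (?P<dev>\S+) has been removed or is permanently inaccessible"
    ),
    "permanent_device_loss": re.compile(
        r"Device '(?P<dev>[^']+)' is changing to 'permanent device loss'"
    ),
}


def scan_lines(lines):
    """Return {device: set of matched pattern names} for the given log lines."""
    hits = defaultdict(set)
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match is not None:
                hits[match.group("dev")].add(name)
    return dict(hits)


def scan_files(paths):
    """Scan each log file, silently skipping any that cannot be opened."""
    hits = defaultdict(set)
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                for dev, names in scan_lines(handle).items():
                    hits[dev] |= names
        except OSError:
            continue
    return dict(hits)


if __name__ == "__main__":
    # Log locations as named in this article.
    logs = [
        "/var/run/log/vmkernel.log",
        "/var/run/log/hostd.log",
        "/var/run/log/vmkwarning.log",
    ]
    for dev, names in sorted(scan_files(logs).items()):
        print(dev, ", ".join(sorted(names)))
```

A device listed in the output only indicates that a similar message was logged for it; it does not by itself confirm a hardware failure.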

 

 

If any messages similar to the above are seen, schedule a cold reboot of the ESXi host.

1. Put the ESXi host into Maintenance Mode with the Ensure Accessibility option.

2. When the ESXi host is in Maintenance Mode, power it down completely.

3. Wait for five minutes.

4. Power the ESXi host back on.

 

The cold reboot forces the physical disks to be reinitialized. If a disk has a hardware failure, it will be detected during reinitialization, and the hardware management tool will flag the disk(s) as failed.

- Replacement of the failed disk(s) will then need to be scheduled.

 

If all disks initialize successfully, check the Skyline health view once the ESXi host is back up to confirm the disk is no longer flagged as failed.