Permanent disk failure in vSAN

Article ID: 391710

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • In the vCenter UI, Skyline Health (vSAN cluster > Monitor > vSAN > Skyline Health > Physical Disk > Operation health) may report a permanent disk failure for the vSAN disk, as in the figure below:

  • On the ESXi host, "/var/log/vobd.log" also reports that the disk is under permanent error:
YYYY-MM-DDTHH:MM:SSZ:[vSANCorrelator] 12219239351514us:[esx.problem.vob.vsan.pdl.offline] vSAN device ########-####-####-####-########### has gone offline.
YYYY-MM-DDTHH:MM:SSZ:[vSANCorrelator] 12219239351514us:[esx.problem.vob.vsan.lsom.devicerepair] Device e########-####-####-####-########### is in offline state and is getting repaired.
YYYY-MM-DDTHH:MM:SSZ:[vSANCorrelator] 12219239351514us:[vob.vsan.pdl.offline] vSAN device e########-####-####-####-########### has gone offline.
YYYY-MM-DDTHH:MM:SSZ:[vSANCorrelator] 12219239351514us:[esx.problem.vob.vsan.pdl.offline] vSAN device e########-####-####-####-########### has gone offline.
YYYY-MM-DDTHH:MM:SSZ:[vSANCorrelator] 12219239351514us:[vob.vsan.lsom.diskerror] vSAN device e########-####-####-####-########### is under permanent error.

(or)

YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2098148]:  [vSANCorrelator] 9883129325851us: [vob.vsan.pdl.offline] vSAN device 5####3c-#####5##d-d####3-######1 has gone offline.
YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2098148]:  [APDCorrelator] 9882990619157us: [esx.problem.storage.apd.start] Device or filesystem with identifier [##########] has entered the All Paths Down state.
YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2098148]:  [vSANCorrelator] 9882990619196us: [esx.problem.vob.vsan.pdl.offline] vSAN device 5####c-1626-####-####-f01#####1 has gone offline.
YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2098148]:  [psastorCorrelator] 9882990619971us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device #############. Path vmhba0:C0:T0:L0 is down. Affected datastores: Unknown.
YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2098148]:  [psastorCorrelator] 9883129325791us: [vob.psastor.device.state.permanentloss] Device :eui.############ has been removed or is permanently inaccessible.
YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2098148]:  [psastorCorrelator] 9882990620278us: [esx.problem.psastor.device.state.permanentloss] Device: eui.############### has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.

  • The physical disk failure can be confirmed by logging in to the server's hardware management interface.

    Example from iLO:



  • If vSAN disks are identified as unhealthy and are expected to fail eventually, "/var/run/log/vobd.log" contains entries like the following:

    YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2097812]:  [vSANCorrelator] 690804744us: [esx.problem.vob.vsan.lsom.diskunhealthy] vSAN device 528c1c3d-ddd2-7210-c721-############ is unhealthy.
    YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2097812]:  [vSANCorrelator] 690811954us: [esx.problem.vob.vsan.lsom.diskunhealthy] vSAN device 5207b712-e8a7-2b82-3283-############ is unhealthy.

  • To check whether an unhealthy disk is in an impending failure state, query its SMART data:

    [root@esxi01:~] localcli storage core device smart get -d naa.###################
    Parameter                 Value              Threshold  Worst  Raw
    ------------------------  -----------------  ---------  -----  ---
    Health Status             IMPENDING FAILURE  N/A        N/A    N/A
    Media Wearout Indicator   0                  100        N/A    N/A
    Write Error Count         0                  N/A        N/A    N/A
    Read Error Count          0                  N/A        N/A    N/A
    Power Cycle Count         0                  N/A        N/A    N/A
    Reallocated Sector Count  0                  N/A        N/A    N/A
    Drive Temperature         27                 N/A        N/A    N/A
    Write Sectors TOT Count   5239103362346      N/A        N/A    N/A
    Read Sectors TOT Count    2353051044834      N/A        N/A    N/A
    Program Fail Count        0                  N/A        N/A    N/A
    Erase Fail Count          0                  N/A        N/A    N/A
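
The SMART check above can also be scripted when several disks need to be reviewed. The following is a minimal sketch, not a VMware-provided tool: it assumes the command output has been captured as text and that columns are separated by at least two spaces, as in the example above. The function names parse_smart_table and is_impending_failure are hypothetical.

```python
import re

# Abbreviated copy of the SMART output shown above.
SAMPLE = """\
Parameter                 Value              Threshold  Worst  Raw
------------------------  -----------------  ---------  -----  ---
Health Status             IMPENDING FAILURE  N/A        N/A    N/A
Media Wearout Indicator   0                  100        N/A    N/A
Drive Temperature         27                 N/A        N/A    N/A
"""


def parse_smart_table(text):
    """Return {parameter: {column: value}} from captured SMART output.

    Assumption: cells are separated by two or more spaces, so multi-word
    cells such as "IMPENDING FAILURE" stay intact.
    """
    header, rows = None, {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or set(stripped) <= {"-", " "}:
            continue  # skip blank lines and the dashed separator row
        cells = re.split(r"\s{2,}", stripped)
        if cells[0] == "Parameter":
            header = cells
        elif header is not None and len(cells) == len(header):
            rows[cells[0]] = dict(zip(header[1:], cells[1:]))
    return rows


def is_impending_failure(rows):
    """True when the Health Status row reports IMPENDING FAILURE."""
    status = rows.get("Health Status", {}).get("Value", "")
    return status.upper() == "IMPENDING FAILURE"


print(is_impending_failure(parse_smart_table(SAMPLE)))
```

Only the Health Status row is interpreted here. The other counters are parsed but left for the hardware vendor to assess.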

Environment

VMware vSphere vSAN 

Cause

This is caused by an issue with the disk (storage device).

vSAN will mark a disk offline when it encounters I/O failure, preventing further operations on the affected disk.
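
To see which devices vSAN has marked offline, unhealthy, or under permanent error, the vobd.log entries listed under Symptoms can be scanned for their event IDs. The sketch below is illustrative only: the event IDs are taken from the log excerpts in this article, while the function name scan_vobd, the state labels, and the device identifiers in the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Event ID suffixes seen in this article's log excerpts, mapped to a label.
EVENT_LABELS = {
    "vsan.pdl.offline": "offline",
    "vsan.lsom.devicerepair": "under repair",
    "vsan.lsom.diskerror": "permanent error",
    "vsan.lsom.diskunhealthy": "unhealthy",
}

# Matches "[vob.vsan...]" or "[esx.problem.vob.vsan...]" followed by the device.
EVENT_RE = re.compile(
    r"\[(?:esx\.problem\.)?vob\.(vsan\.[a-z.]+)\]\s+(?:vSAN device|Device)\s+(\S+)"
)


def scan_vobd(lines):
    """Return {device_id: set of state labels} for the known vSAN events."""
    states = defaultdict(set)
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group(1) in EVENT_LABELS:
            states[match.group(2)].add(EVENT_LABELS[match.group(1)])
    return dict(states)


# Hypothetical sample lines in the same layout as the excerpts above.
sample = [
    "2024-01-01T00:00:00Z In(14) vobd[1]:  [vSANCorrelator] 100us: "
    "[vob.vsan.pdl.offline] vSAN device 11111111-aaaa has gone offline.",
    "2024-01-01T00:00:01Z In(14) vobd[1]:  [vSANCorrelator] 200us: "
    "[vob.vsan.lsom.diskerror] vSAN device 11111111-aaaa is under permanent error.",
    "2024-01-01T00:00:02Z In(14) vobd[1]:  [vSANCorrelator] 300us: "
    "[esx.problem.vob.vsan.lsom.diskunhealthy] vSAN device 22222222-bbbb is unhealthy.",
]

for device, labels in sorted(scan_vobd(sample).items()):
    print(device, sorted(labels))
```

On a host, the same function could be fed the lines of /var/run/log/vobd.log. Events from other correlators, such as the APD and psastor entries, are ignored by this pattern.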

Resolution

If the disk (storage device) is present and experiencing permanent errors, the controller's driver or firmware may also be the cause. Review the hardware compatibility of the controller and apply the appropriate driver and firmware versions.

If the disk still shows the issue, engage the hardware vendor to review the disk health and assist with the next steps.

  • If it is a vSAN OSA capacity disk, follow the instructions in:

Replace a Capacity Device in vSAN OSA Cluster

  • If it is a vSAN OSA cache disk, follow the instructions in:

Replace a Flash Caching Device on a Host in vSAN Cluster

  • If it is a vSAN ESA disk, follow the instructions in:

Replace a Storage Pool Device in vSAN ESA Cluster
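
The choice between the three procedures above can be expressed as a small lookup, which may be handy in a runbook script. This is only a sketch: the procedure titles come from this article, while the function name replacement_procedure and the "capacity" and "cache" role strings are hypothetical.

```python
# Procedure titles as listed in this article, keyed by architecture and disk role.
PROCEDURES = {
    ("OSA", "capacity"): "Replace a Capacity Device in vSAN OSA Cluster",
    ("OSA", "cache"): "Replace a Flash Caching Device on a Host in vSAN Cluster",
    ("ESA", None): "Replace a Storage Pool Device in vSAN ESA Cluster",
}


def replacement_procedure(architecture, role=None):
    """Return the title of the replacement procedure for a failed disk.

    architecture: "OSA" or "ESA" (case-insensitive).
    role: "capacity" or "cache" for OSA; ignored for ESA.
    """
    arch = architecture.strip().upper()
    if arch == "ESA":
        return PROCEDURES[("ESA", None)]
    key = (arch, (role or "").strip().lower())
    if key not in PROCEDURES:
        raise ValueError(f"unknown architecture/role: {architecture!r}, {role!r}")
    return PROCEDURES[key]


print(replacement_procedure("OSA", "cache"))
```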

 

Additional Information

Japanese version: Permanent disk failure in vSAN (vSANでの永続的なディスク障害)