vSAN - troubleshooting disk failure issues

Article ID: 390534

Products

  • VMware vSphere ESXi
  • VMware vSAN 6.x
  • VMware vSAN 7.x
  • VMware vSAN 8.x

Issue/Introduction

This article provides a general troubleshooting procedure to help identify whether there is a problem with a physical disk in a vSAN cluster.

Environment

  • VMware vSAN 6.7.x
  • VMware vSAN 7.x
  • VMware vSAN 8.x

Cause

  • vSAN Disk failures

Resolution

Follow the steps below to troubleshoot a disk failure in a vSAN environment:

From the Web UI:

1. Check the vSAN physical disk status:

    • Inventory > Host and Clusters > vSAN Cluster > Configure > vSAN > Disk Management

2. Select the affected host and expand the View Disks section. Check whether any disk is reported with one of the following statuses:

    • Unhealthy
    • Unmounted
    • Permanent Disk Failure
    • Disk Down
    • Disk Absent

3. Check for any disk-related alarms triggered in the vSAN Skyline Health section:

    • Inventory > Host and Clusters > vSAN Cluster > Monitor > vSAN > Skyline Health > Physical disk

4. Check disk status from the affected host's Storage Devices list:

    • Inventory > Host and Clusters > vSAN Cluster > Affected vSAN ESXi Host > Configure > Storage > Storage Devices

5. Verify whether a resync is in progress:

    • Inventory > Host and Clusters > vSAN Cluster > Monitor > vSAN > Resyncing Objects

NOTE: Resync could indicate that data is being evacuated from an affected disk or disk group. Further investigation is needed to determine if the affected disk is ready to be removed or replaced.

6. Verify the status of vSAN Objects: 

    • Inventory > Host and Clusters > vSAN Cluster > Monitor > vSAN > Skyline Health > Data > vSAN object health

 

From the CLI:

1. Connect to the affected host over SSH and run the following command:

# vdq -qH

2. Check the "IsPDL?" field (permanent device loss). A value of 1 means the disk is in a permanent device loss state and is no longer accessible to the host.


 
DiskResults:
 DiskResult[0]:
 Name: naa.600508b1001c4b820b4d80f9f8acfa95
 VSANUUID: 5294bbd8-67c4-c545-3952-7711e365f7fa
 State: In-use for VSAN
 ChecksumSupport: 0
 Reason: Non-local disk
 IsSSD?: 0
 IsCapacityFlash?: 0
 IsPDL?: 0
 <<truncated>>
 DiskResult[18]:
 Name:
 VSANUUID: 5227c17e-ec64-de76-c10e-c272102beba7
 State: In-use for VSAN
 ChecksumSupport: 0
 Reason: None
 IsSSD?: 0
 IsCapacityFlash?: 0
 IsPDL?: 1
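
On a host with many disks, scanning the full output by eye is error-prone. The following is a minimal shell sketch that filters the vdq -qH output for entries whose IsPDL? field is 1. The function name list_pdl_disks is hypothetical (it is not an ESXi command), and the parsing assumes the field layout shown in the sample output above; other builds may print it differently.

```shell
# Hypothetical helper (not part of ESXi): reads `vdq -qH` output on stdin and
# prints "<device name> <vSAN UUID>" for every disk whose "IsPDL?" field is 1.
# Assumes the field layout shown in the sample output above.
list_pdl_disks() {
    awk '
        $1 ~ /^DiskResult\[/ { name = ""; uuid = "" }   # start of a new disk entry
        $1 == "Name:"        { name = (NF > 1 ? $2 : "(no name)") }
        $1 == "VSANUUID:"    { uuid = $2 }
        $1 == "IsPDL?:" && $2 == 1 { print name, uuid }
    '
}

# Usage on the affected host:
#   vdq -qH | list_pdl_disks
```

An entry printed as "(no name)" corresponds to a disk whose Name field is empty, as in DiskResult[18] above; the vSAN UUID is then the only identifier left to track the device.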

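3. Check whether a disk is missing from the disk group. Each DiskMapping entry lists the cache device (SSD) and the capacity devices (MD) of one disk group: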

# vdq -iH

Mappings:
   DiskMapping[0]:
           SSD:  eui.6bfe4897c023247c000c2963f82a877c
            MD:  mpx.vmhba2:C0:T1:L0
            MD:  mpx.vmhba2:C0:T2:L0
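
To compare the mappings against the number of capacity disks each disk group is expected to have, the output can be summarized per disk group. This is a sketch only: count_capacity_disks is a hypothetical name, and it assumes the SSD/MD layout shown in the sample above.

```shell
# Hypothetical helper (not part of ESXi): reads `vdq -iH` output on stdin and
# prints "<cache device> <number of MD entries>" for each DiskMapping.
# Assumes each mapping starts with one "SSD:" line followed by its "MD:" lines.
count_capacity_disks() {
    awk '
        $1 == "SSD:" { if (ssd != "") print ssd, n; ssd = $2; n = 0 }
        $1 == "MD:"  { n++ }
        END          { if (ssd != "") print ssd, n }
    '
}

# Usage on the affected host:
#   vdq -iH | count_capacity_disks
```

A count lower than the number of capacity disks originally placed in the disk group suggests that a disk has dropped out of the mapping. The expected count is not reported by this command and has to come from the administrator's own records.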

4. Check the "In CMMDS" field. If it is false, communication with the disk has been lost.

# esxcli vsan storage list | grep -i cmmds

   In CMMDS: true
   In CMMDS: true
   In CMMDS: false

# esxcli vsan storage list | less

   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: 52bf19bd-1f9d-771b-ff4d-515281fee853
   VSAN Disk Group UUID: 
   VSAN Disk Group Name:
   Used by this host: false
   In CMMDS: false
   On-disk format version: 20
   Deduplication: false
   Compression: false
   Checksum: 
   Checksum OK: true
   Is Capacity Tier: false
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: true
   Creation Time: Wed Feb 12 22:53:23 2025
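
The grep above shows that an entry is out of CMMDS but not which one. As a rough sketch, the two steps can be combined so that only the affected entries are printed together with their identifiers. list_not_in_cmmds is a hypothetical name, and the parsing assumes the "Device", "VSAN UUID" and "In CMMDS" fields appear in the order shown in the sample above.

```shell
# Hypothetical helper (not part of ESXi): reads `esxcli vsan storage list`
# output on stdin and prints "<device> <vSAN UUID>" for every entry whose
# "In CMMDS" field is false.
list_not_in_cmmds() {
    awk '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }      # trim indentation and trailing blanks
        /^Device: /        { dev = substr($0, 9); uuid = "" }
        /^VSAN UUID: /     { uuid = substr($0, 12) }
        /^In CMMDS: false/ { print dev, uuid }
    '
}

# Usage on the affected host:
#   esxcli vsan storage list | list_not_in_cmmds
```

For the sample entry above this would print the device as "Unknown" followed by its VSAN UUID, which can then be matched against the UUIDs reported by vdq -qH.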

 

5. Check the physical location of the drive using the command below:

# esxcli storage core device physical get -d <disk name>

Example:

# esxcli storage core device physical get -d naa.xxxx
   Physical Location: enclosure 25564 slot 0

# esxcli storage core device physical get -d naa.xxxx
   Physical Location: enclosure 25565 slot 1
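
When several disks need to be located, the command can be run in a loop. The sketch below is an illustration under assumptions: show_physical_locations is a hypothetical name, and it assumes the command prints a "Physical Location:" line as in the example above. Devices for which no such line is returned are reported as "location not reported".

```shell
# Hypothetical helper (not part of ESXi): reads device names on stdin, one per
# line, and prints "<device>: <physical location>" for each of them.
show_physical_locations() {
    while read -r dev; do
        [ -n "$dev" ] || continue                       # skip empty lines
        loc=$(esxcli storage core device physical get -d "$dev" </dev/null 2>/dev/null |
              sed -n 's/^ *Physical Location: *//p')
        printf '%s: %s\n' "$dev" "${loc:-location not reported}"
    done
}

# Usage on the affected host, feeding it every named disk from vdq:
#   vdq -qH | awk '$1 == "Name:" && NF > 1 { print $2 }' | show_physical_locations
```

Redirecting the command's standard input from /dev/null keeps it from consuming the list of device names that the loop is reading.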

6. Check the following logs for storage-related issues:

    • /var/log/vmkernel.log 
    • /var/log/vobd.log 
    • /var/log/vsandevicemonitord.log
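
One way to narrow these logs down is to search them for the identifier of the suspect device. The following is a minimal sketch; grep_disk_logs is a hypothetical name, and the default file list is simply the three logs named above. It does not interpret the messages, it only collects the lines that mention the device.

```shell
# Hypothetical helper (not part of ESXi): prints every line that mentions the
# given device identifier, prefixed with the name of the log file it came from.
# Usage: grep_disk_logs <device-id> [log file ...]
grep_disk_logs() {
    if [ "$#" -lt 1 ]; then
        echo "usage: grep_disk_logs <device-id> [log file ...]" >&2
        return 2
    fi
    dev=$1
    shift
    # Default to the three logs listed above when no files are given.
    [ "$#" -gt 0 ] || set -- /var/log/vmkernel.log /var/log/vobd.log /var/log/vsandevicemonitord.log
    # -F treats the identifier as a fixed string; errors for missing files are hidden.
    grep -F -H -- "$dev" "$@" 2>/dev/null
}

# Usage on the affected host:
#   grep_disk_logs naa.600508b1001c4b820b4d80f9f8acfa95
```

The function returns a non-zero status when nothing matches, so an empty result can be told apart from a match in scripts.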