Identifying and replacing a failed cache or capacity disk in vSAN OSA disk group when vSAN deduplication is enabled

Article ID: 327008

Products

VMware vSAN 7.x VMware vSAN 8.x VMware vSAN 6.x

Issue/Introduction

When using VMware vSAN with deduplication enabled, any disk failure results in the failure of the entire disk group to which the disk belongs.

The related vSAN Skyline Health check for "Operation health" will report that the entire disk group is offline.


This is further demonstrated under Configure > vSAN > Disk Management:

As shown in the screenshots, one disk is marked as "Absent". Due to the nature of the event, only the disk UUID is displayed; the original device name (for example, naa.xxxxx) is no longer shown.


Environment

VMware vSAN OSA

Cause

vSAN deduplication operates at the disk group level, on every disk group in the cluster. As a result, the failure of a single disk in a disk group results in the failure of the entire disk group. The UI reflects the disk group failure but does not reveal identifying information about the device that triggered it.

Resolution

To identify the specific device that caused the failure:

  1.  Log in to the applicable ESXi host via SSH or KVM/physical console.
  2.  List vSAN disks using this command:

    # esxcli vsan storage list | less
  3.  You will see output similar to the following for a failed disk:
Unknown:
   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: ########-########-####-####-####-########226a
   VSAN Disk Group UUID:
   VSAN Disk Group Name:
   Used by this host: false
   In CMMDS: false
   On-disk format version: -1
   Deduplication: false
   Compression: false
   Checksum:
   Checksum OK: false
   Is Capacity Tier: false
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: false
   Creation Time: Unknown
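Because the full output of esxcli vsan storage list can be long, it can help to filter it down to only the failed entries. The following is a minimal sketch that works on a captured sample of the output (the UUID and file path are placeholders); on an ESXi host you would pipe esxcli vsan storage list directly into the awk filter instead:

```shell
# Captured sample of the command output (UUID is a placeholder).
# On an ESXi host: esxcli vsan storage list | awk '...'
cat > /tmp/vsan-storage.txt <<'EOF'
Unknown:
   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: 52abcdef-1234-5678-9abc-def012345678
   In CMMDS: false
EOF

# Print the vSAN UUID of every entry whose Device is Unknown (i.e. failed)
awk '/Device: Unknown/ {failed=1}
     failed && /VSAN UUID:/ {print $3; failed=0}' /tmp/vsan-storage.txt
```

The printed UUID is the value you will match against the vdq -iH output and the vmkernel log in the following steps.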

    4.  You can also use the command vdq -iH to list the disk mappings on the host and find the failed disk. If a disk is listed by its UUID rather than by its device identifier, vSAN has failed the disk out of the disk group, as shown below:

 

[root@esx01:~] vdq -iH
Mappings:
   DiskMapping[0]:
           SSD:  naa.58ce########fec5
            MD:  naa.58ce########a7f9
            MD:  naa.58ce#######bbd1
            MD:  naa.58ce#######02a5
            MD:  naa.58ce########9d69
            MD:  naa.58ce########aaf5
            MD:  naa.58ce########a7e5
            MD:  ########-########-####-####-####-########226a
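A failed disk stands out in the vdq -iH output because its entry is a hyphenated vSAN UUID rather than an naa.* device name. A minimal grep sketch over a captured sample (all identifiers below are placeholders); on a host you would pipe vdq -iH into the grep instead:

```shell
# Captured sample of vdq -iH output (identifiers are placeholders).
cat > /tmp/vdq.txt <<'EOF'
Mappings:
   DiskMapping[0]:
           SSD:  naa.58ce00000000fec5
            MD:  naa.58ce00000000a7f9
            MD:  52abcdef-1234-5678-9abc-def012345678
EOF

# A failed disk appears as a vSAN UUID (hyphenated hex) instead of naa.*
grep -E '(SSD|MD):[[:space:]]+[0-9a-f]{8}-' /tmp/vdq.txt
```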

   5.  If the failure is recent enough to still be present in the logs, you can identify the display name of the disk by running the following command:


grep ########-########-####-####-####-########226a /var/log/vmkernel.log

You should see output similar to the following:


2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD ########-########-####-####-####-########226a (naa.58cexxxxxxxxaad9:2)
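The device name appears in the trailing parenthesized field of the LSOM log line. A small sed sketch can extract it; the sample line below uses placeholder values, and on a host you would feed the grep output from the previous step into the sed command instead:

```shell
# Sample LSOM log line (UUID and device name are placeholders)
line='2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD 52abcdef-1234-5678-9abc-def012345678 (naa.58ce00000000aad9:2)'

# Strip everything except the device name in the trailing "(naa.xxxx:partition)"
echo "$line" | sed -E 's/.*\((naa\.[0-9a-f]+):[0-9]+\)$/\1/'
# prints naa.58ce00000000aad9
```

The :2 suffix after the device name is the partition number and is dropped; only the naa.* identifier is needed to locate the physical disk.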

The disk group must first be removed with the "No data migration" option. Once the disk group has been removed, shut down the host and physically replace the disk(s). Power the host back on and recreate the disk group using Disk Management in vCenter.
 
If the disk removal task fails in the vCenter vSAN Disk Management UI, reboot the host and try again. Always try entering Maintenance Mode with "Ensure accessibility" first; if the host cannot enter Maintenance Mode, it will need to be forcefully placed into Maintenance Mode with "No data migration". If you are uncertain of the impact due to object health states in vSAN Skyline Health, contact Global Support for assistance with measuring the impact.
 
 
 


Additional Information

 

If necessary, we can get the path information about the failed device to further assist with identification.
From the ESXi Shell, run this command:

# esxcfg-mpath -bd <naa identifier device>

For the example in the Resolution section, the command and its example output are:

# esxcfg-mpath -bd naa.58ce########aad9
naa.58cexxxxxxxxaad9 : VMware Serial Attached SCSI Disk (naa.58cexxxxxxxxaad9)
vmhba1:C0:T1:L0 LUN:0 state:active sas Adapter: 5005########8c11 Target: 5000########02af

The device is target #1 on vmhba1.
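If you need the adapter and target in a script, the runtime name at the start of the esxcfg-mpath output can be split on colons. A minimal sketch using the example path above (IDs are placeholders):

```shell
# Example runtime path from the esxcfg-mpath output above
runtime='vmhba1:C0:T1:L0'

# The runtime name encodes adapter:channel:target:LUN
IFS=: read -r hba chan tgt lun <<EOF
$runtime
EOF
echo "device is target ${tgt#T} on adapter $hba"
# prints: device is target 1 on adapter vmhba1
```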

We can also get the physical location of the device.
From the ESXi Shell, run these commands:

# esxcli storage core device physical get -d <naa identifier device>
# esxcli storage core device raid list -d <naa identifier device>

The command and example output are:

# esxcli storage core device physical get -d naa.58ce########aad9
 Physical Location: enclosure 2, slot 5

Or 

# esxcli storage core device raid list -d naa.58ce########aad9
 Physical Location: enclosure 2, slot 5




Note: The above commands may not work with certain drivers, as the vSAN disk serviceability plugin is not coded for all drivers. The currently supported drivers are:
hpsa
nhpsa
iavmd
nvme_pcie
lsi_mr3
lsi_msgpt3
lsi_msgpt35
smartpqi 
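You can check your host's storage driver against this list before relying on the physical-location commands. A minimal sketch; the driver name below is a placeholder, and on an ESXi host you would look it up with esxcli storage core adapter list:

```shell
# Serviceability plugin drivers, taken from the supported list above
supported="hpsa nhpsa iavmd nvme_pcie lsi_mr3 lsi_msgpt3 lsi_msgpt35 smartpqi"

# Placeholder; on ESXi, find the real driver via: esxcli storage core adapter list
driver="lsi_mr3"

case " $supported " in
  *" $driver "*) echo "$driver: serviceability plugin supported" ;;
  *)             echo "$driver: not supported, contact your hardware vendor" ;;
esac
# prints: lsi_mr3: serviceability plugin supported
```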

If your driver is not listed, work with your hardware vendor to open an engineering-to-engineering case so the plugin code can be updated to interface with that driver.

If your driver is not listed above, you may see one of the following errors when running these commands, indicating that the plugin cannot interact with the device to retrieve the required information or to turn the locator LED on or off:

esxcli storage core device physical get -d naa.6589cfc########93491
Unable to get location for device naa.6589cfc########93491: No LSU plugin can manage this device.

esxcli storage core device raid list -d naa.5000########4c2f
Unable to get location for device naa.5000########4c2f: Can not manage device!