Identifying and replacing a failed cache or capacity disk in vSAN OSA disk group when vSAN deduplication is enabled

Article ID: 327008

Products

VMware vSAN 7.x VMware vSAN 8.x VMware vSAN 6.x

Issue/Introduction

When using VMware vSAN with deduplication enabled, any disk failure results in the failure of the entire disk group to which the disk belongs.

The related vSAN Skyline Health check for "Operation health" will report that the entire disk group is offline.


This is further demonstrated under Configure > vSAN > Disk Management:

As shown in the screenshots, one disk is marked as "Absent". Due to the nature of the event, only the disk UUID is displayed; the original device name (for example, naa.xxxxx) is no longer shown.


Environment

VMware vSAN OSA

Cause

vSAN deduplication operates at the disk group level, on every disk group in the cluster. As a result, the failure of a single disk in a disk group results in the failure of the entire disk group. The UI reflects the disk group failure but does not reveal identifying information about the device that triggered it.

Resolution

To identify the specific device that caused the failure:

  1.  Log in to the applicable ESXi host via SSH or KVM/physical console.
  2.  List vSAN disks using this command:

    # esxcli vsan storage list | less
  3.  You will see output similar to the following for a failed disk:
Unknown:
   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: ########-########-####-####-####-########226a
   VSAN Disk Group UUID:
   VSAN Disk Group Name:
   Used by this host: false
   In CMMDS: false
   On-disk format version: -1
   Deduplication: false
   Compression: false
   Checksum:
   Checksum OK: false
   Is Capacity Tier: false
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: false
   Creation Time: Unknown
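Because the full output of esxcli vsan storage list can be long, it can help to filter it down to only the failed entries. The following is a minimal sketch that works on a captured sample of the output (the UUID and file path are placeholders); on an ESXi host you would pipe esxcli vsan storage list directly into the awk filter instead:

```shell
# Captured sample of the command output (UUID is a placeholder).
# On an ESXi host: esxcli vsan storage list | awk '...'
cat > /tmp/vsan-storage.txt <<'EOF'
Unknown:
   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: 52abcdef-1234-5678-9abc-def012345678
   In CMMDS: false
EOF

# Print the vSAN UUID of every entry whose Device is Unknown (i.e. failed)
awk '/Device: Unknown/ {failed=1}
     failed && /VSAN UUID:/ {print $3; failed=0}' /tmp/vsan-storage.txt
```

The printed UUID is the value you will match against the vdq -iH output and the vmkernel log in the following steps.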

    4.  You can also use the command vdq -iH to list the disk mappings on the host and find the failed disk. If a disk is listed by its UUID rather than by its device identifier, vSAN has failed the disk out of the disk group, as shown below:

 

[root@esx01:~] vdq -iH
Mappings:
   DiskMapping[0]:
           SSD:  naa.58ce########fec5
            MD:  naa.58ce########a7f9
            MD:  naa.58ce#######bbd1
            MD:  naa.58ce#######02a5
            MD:  naa.58ce########9d69
            MD:  naa.58ce########aaf5
            MD:  naa.58ce########a7e5
            MD:  ########-########-####-####-####-########226a
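A failed disk stands out in the vdq -iH output because its entry is a hyphenated vSAN UUID rather than an naa.* device name. A minimal grep sketch over a captured sample (all identifiers below are placeholders); on a host you would pipe vdq -iH into the grep instead:

```shell
# Captured sample of vdq -iH output (identifiers are placeholders).
cat > /tmp/vdq.txt <<'EOF'
Mappings:
   DiskMapping[0]:
           SSD:  naa.58ce00000000fec5
            MD:  naa.58ce00000000a7f9
            MD:  52abcdef-1234-5678-9abc-def012345678
EOF

# A failed disk appears as a vSAN UUID (hyphenated hex) instead of naa.*
grep -E '(SSD|MD):[[:space:]]+[0-9a-f]{8}-' /tmp/vdq.txt
```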

   5.  If the failure is recent enough to still be present in the logs, you can identify the display name of the disk by running the following command:


grep ########-########-####-####-####-########226a /var/log/vmkernel.log

You should see output similar to the following:


2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD ########-########-####-####-####-########226a (naa.58cexxxxxxxxaad9:2)
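The device name appears in the trailing parenthesized field of the LSOM log line. A small sed sketch can extract it; the sample line below uses placeholder values, and on a host you would feed the grep output from the previous step into the sed command instead:

```shell
# Sample LSOM log line (UUID and device name are placeholders)
line='2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD 52abcdef-1234-5678-9abc-def012345678 (naa.58ce00000000aad9:2)'

# Strip everything except the device name in the trailing "(naa.xxxx:partition)"
echo "$line" | sed -E 's/.*\((naa\.[0-9a-f]+):[0-9]+\)$/\1/'
# prints naa.58ce00000000aad9
```

The :2 suffix after the device name is the partition number and is dropped; only the naa.* identifier is needed to locate the physical disk.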

The disk group must first be removed with the "No data migration" option. Once the disk group has been removed, shut down the host and physically replace the disk(s). Power the host back on and recreate the disk group using Disk Management in vCenter.
 
If the disk removal task fails in the vCenter vSAN Disk Management UI, reboot the host and try again. Always try entering Maintenance Mode with "Ensure accessibility" first; if the host cannot enter Maintenance Mode, it will need to be forcefully placed into Maintenance Mode with "No data migration". If you are uncertain of the impact due to object health states in vSAN Skyline Health, contact Global Support for assistance with measuring the impact.
 
 
 


Additional Information

 

If necessary, we can get the path information about the failed device to further assist with identification.
From the ESXi Shell, run this command:

# esxcfg-mpath -bd <naa identifier device>

For the example in the Resolution section, the command and its example output are:

# esxcfg-mpath -bd naa.58ce########aad9
naa.58cexxxxxxxxaad9 : VMware Serial Attached SCSI Disk (naa.58cexxxxxxxxaad9)
vmhba1:C0:T1:L0 LUN:0 state:active sas Adapter: 5005########8c11 Target: 5000########02af

The device is target #1 on vmhba1.
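If you need the adapter and target in a script, the runtime name at the start of the esxcfg-mpath output can be split on colons. A minimal sketch using the example path above (IDs are placeholders):

```shell
# Example runtime path from the esxcfg-mpath output above
runtime='vmhba1:C0:T1:L0'

# The runtime name encodes adapter:channel:target:LUN
IFS=: read -r hba chan tgt lun <<EOF
$runtime
EOF
echo "device is target ${tgt#T} on adapter $hba"
# prints: device is target 1 on adapter vmhba1
```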

We can also get the physical location of the device.
From the ESXi Shell, run these commands:

# esxcli storage core device physical get -d <naa identifier device>
# esxcli storage core device raid list -d <naa identifier device>

The command and example output are:

# esxcli storage core device physical get -d naa.58ce########aad9
 Physical Location: enclosure 2, slot 5

Or 

# esxcli storage core device raid list -d naa.58ce########aad9
 Physical Location: enclosure 2, slot 5




Note: The above commands may not work with certain drivers, as the vSAN disk serviceability plugin is not coded for all drivers. The currently supported drivers are:
hpsa
nhpsa
iavmd
nvme_pcie
lsi_mr3
lsi_msgpt3
lsi_msgpt35
smartpqi 
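You can check your host's storage driver against this list before relying on the physical-location commands. A minimal sketch; the driver name below is a placeholder, and on an ESXi host you would look it up with esxcli storage core adapter list:

```shell
# Serviceability plugin drivers, taken from the supported list above
supported="hpsa nhpsa iavmd nvme_pcie lsi_mr3 lsi_msgpt3 lsi_msgpt35 smartpqi"

# Placeholder; on ESXi, find the real driver via: esxcli storage core adapter list
driver="lsi_mr3"

case " $supported " in
  *" $driver "*) echo "$driver: serviceability plugin supported" ;;
  *)             echo "$driver: not supported, contact your hardware vendor" ;;
esac
# prints: lsi_mr3: serviceability plugin supported
```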

If your driver is not listed, work with your hardware vendor to open an engineering-to-engineering case so the plugin code can be updated to interface with that driver.

If your driver is not listed above, you may see one of the following errors when running these commands, indicating that the plugin cannot interact with the device to retrieve the required information or to turn the locator LED on or off:

esxcli storage core device physical get -d naa.6589cfc########93491
Unable to get location for device naa.6589cfc########93491: No LSU plugin can manage this device.

esxcli storage core device raid list -d naa.5000########4c2f
Unable to get location for device naa.5000########4c2f: Can not manage device!