vSAN Deduplication enabled -- Identifying Failed Disk

Products

VMware vSAN

Issue/Introduction

When using VMware vSAN with Deduplication enabled, any Disk failure will result in the failure of the entire Disk group it belongs to.
The related vSAN Healthcheck "Operation health" will reflect that the entire Disk Group is offline.

This is also visible under Configure > vSAN > Disk Management, where one Disk is marked as "Absent".
Due to the nature of the event, only the Disk UUID is displayed, not the original Disk name (e.g. naa.xxxxx).

See here for additional information:
Using Deduplication and Compression

Environment

VMware vSAN (All Versions)

Cause

vSAN deduplication is enabled cluster-wide but is performed at the disk group level. As a result, the failure of a single disk in a disk group results in the failure of the entire disk group. The UI reflects this disk group failure but does not show identifying information for the device that triggered it.

Resolution

To identify the specific device that caused the failure:

1. Log in to the applicable ESXi host via SSH or KVM/physical console.
2. List vSAN disks using this command:

# esxcli vsan storage list|less

3. You will see output like this for a failed disk:

Unknown:
   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: ########-########-####-####-####-########226a
   VSAN Disk Group UUID:
   VSAN Disk Group Name:
   Used by this host: false
   In CMMDS: false
   On-disk format version: -1
   Deduplication: false
   Compression: false
   Checksum:
   Checksum OK: false
   Is Capacity Tier: false
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: false
   Creation Time: Unknown

4. You can also use the command vdq -iH to list the disk mappings on the host and find the failed disk. If a disk is listed by UUID instead of its disk identifier, vSAN has failed the disk out, as seen below:
[root@esx01:~] vdq -iH
Mappings:
   DiskMapping[0]:
           SSD: naa.58ce########fec5
            MD: naa.58ce########a7f9
            MD: naa.58ce#######bbd1
            MD: naa.58ce#######02a5
            MD: naa.58ce########9d69
            MD: naa.58ce########aaf5
            MD: naa.58ce########a7e5
            MD: ########-########-####-####-####-########226a

5. If the failure is recent enough to still be in the current log, run the following command to identify the device name of the disk:
grep ########-########-####-####-####-########226a /var/log/vmkernel.log
You should see output similar to the following:
2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD ########-########-####-####-####-########226a (naa.58cexxxxxxxxaad9:2)

Note: The Disk Group must be removed first with the option "No Data migration" (as the Disk Group is effectively lost), then replace the failed disk and re-create the Disk Group.
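
Steps 3 to 5 above can be chained with a few small shell helpers. The following is a minimal sketch, not part of ESXi: the function names are hypothetical, the field names are taken from the example output above, and the UUID pattern (hexadecimal groups joined by hyphens) is an assumption. Each function reads command output on stdin, so it can be checked against saved output before being used on a host.

```shell
# Sketch only: hypothetical helper functions, none are shipped with ESXi.

# Step 3: read `esxcli vsan storage list` output on stdin and print the
# vSAN UUID of every entry whose Device field is "Unknown".
list_unknown_vsan_disks() {
    awk '
        /^ *Device:/    { unknown = ($2 == "Unknown") }
        /^ *VSAN UUID:/ { if (unknown) print $3 }
    '
}

# Step 4: read `vdq -iH` output on stdin and print SSD/MD members that are
# listed by a UUID-like value instead of a device identifier.
# Assumption: a UUID is made of hex groups joined by hyphens.
list_uuid_only_members() {
    awk '
        ($1 == "SSD:" || $1 == "MD:") && $2 ~ /^[0-9a-fA-F]+(-[0-9a-fA-F]+)+$/ {
            print $1, $2
        }
    '
}

# Step 5: read vmkernel log lines on stdin and, for lines containing the
# given UUID, print the device name from the trailing "(name:N)" part.
device_for_uuid() {
    grep -F -- "$1" | sed -n 's/.*(\([^():]*\):[0-9]*).*/\1/p' | sort -u
}

# Possible usage on the affected host:
#   esxcli vsan storage list | list_unknown_vsan_disks
#   vdq -iH | list_uuid_only_members
#   device_for_uuid <vSAN UUID> < /var/log/vmkernel.log
```

If device_for_uuid prints nothing, the failure event is probably no longer in the current vmkernel.log.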

Additional Information

If necessary, we can get the path information about the failed device to further assist with identification.
From the ESXi Shell, run this command:

# esxcfg-mpath -bd <naa identifier device>

For the example in the Resolution section, the command and example output are:

# esxcfg-mpath -bd naa.58ce########aad9
naa.58cexxxxxxxxaad9 : VMware Serial Attached SCSI Disk (naa.58cexxxxxxxxaad9)
vmhba1:C0:T1:L0 LUN:0 state:active sas Adapter: 5005########8c11 Target: 5000########02af

The device is target #1 on vmhba1.
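
The runtime name in the path output (vmhba1:C0:T1:L0) encodes adapter, channel, target, and LUN, which is how the target number above is read. As a small illustration, the sketch below splits such a name into its parts; the function name is hypothetical.

```shell
# Hypothetical helper: split an ESXi runtime name (vmhbaN:C#:T#:L#) into
# adapter, channel, target and LUN. Prints nothing if the format differs.
parse_runtime_name() {
    printf '%s\n' "$1" |
        sed -n 's/^\(vmhba[0-9][0-9]*\):C\([0-9][0-9]*\):T\([0-9][0-9]*\):L\([0-9][0-9]*\)$/adapter=\1 channel=\2 target=\3 lun=\4/p'
}

parse_runtime_name vmhba1:C0:T1:L0
# prints: adapter=vmhba1 channel=0 target=1 lun=0
```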

We can also get the physical location of the device.
From the ESXi Shell, run these commands:

# esxcli storage core device physical get -d <naa identifier device>
# esxcli storage core device raid list -d <naa identifier device>

The commands and example output are:

# esxcli storage core device physical get -d naa.58ce########aad9
Physical Location: enclosure 2, slot 5

Or

# esxcli storage core device raid list -d naa.58ce########aad9
Physical Location: enclosure 2, slot 5

Another option is to turn on the locator LED of the disk; see Turn Locator LEDs on vSAN storage devices on/off.

Note: The above commands may not work with certain drivers, as the vSAN disk serviceability plugin does not support all drivers. The currently supported drivers are:
hpsa
nhpsa
iavmd
nvme_pcie
lsi_mr3
lsi_msgpt3
lsi_msgpt35
smartpqi

If your driver is not listed, work with your hardware vendor to open an engineering-to-engineering case so that we can work together to update the plugin code to interface with that driver.

If your driver is not listed above, these commands may return one of the errors below, which indicate that the plugin cannot interact with the device, either to retrieve the required information or to turn the LED on or off:

# esxcli storage core device physical get -d naa.6589cfc########93491
Unable to get location for device naa.6589cfc########93491: No LSU plugin can manage this device.

# esxcli storage core device raid list -d naa.5000########4c2f
Unable to get location for device naa.5000########4c2f: Can not manage device!
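
When checking several devices, it can help to separate a real location from the "unsupported driver" errors shown above. The sketch below assumes the output formats from the examples in this article; the function name is hypothetical. It reads the command output on stdin.

```shell
# Hypothetical helper: classify the output of the two location commands.
# Assumes the "Physical Location:" line and the error texts shown above.
classify_location_output() {
    awk '
        /^Physical Location:/ {
            sub(/^Physical Location: */, "")
            print "location: " $0
            seen = 1
        }
        /No LSU plugin can manage this device|Can not manage device/ {
            print "unsupported: driver not handled by the plugin"
            seen = 1
        }
        END { if (!seen) print "unknown: no location or known error in output" }
    '
}

# Possible usage (2>&1 in case the error is written to stderr):
#   esxcli storage core device physical get -d <naa identifier device> 2>&1 | classify_location_output
```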

See here for additional information:
With Deduplication & Compression enabled: Adding or Removing Disks
Remove Disk Groups or Devices from vSAN
Working with Individual Devices