Identifying and replacing a failed cache or capacity disk in vSAN OSA disk group when vSAN deduplication is enabled



Article ID: 327008


Products

VMware vSAN 6.x, VMware vSAN 7.x, VMware vSAN 8.x

Issue/Introduction

When VMware vSAN is configured with Deduplication and Compression enabled, a failure of any single disk results in the failure of the entire Disk Group to which the disk belongs.

The associated vSAN Skyline Health test for Operation health reports the entire Disk Group as Offline.

 



This status is also reflected in Configure > vSAN > Disk Management.



In the example shown, one disk is marked as Absent. Due to the nature of the failure, only the disk UUID is displayed; the original disk name (e.g., naa.#####) is no longer visible.

Environment

VMware vSAN OSA

Cause

vSAN deduplication occurs at the Disk Group level across the cluster. As a result, if a single disk in the Disk Group fails, the entire Disk Group fails. The UI reflects the Disk Group failure, but it does not display the identifying information of the device that triggered the failure.
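
On the host, you can see this relationship directly by filtering the disk listing so that group membership and the deduplication flag appear side by side. This is a minimal check only (the field names are taken from the esxcli vsan storage list output shown in the Resolution section); on a healthy host, every member of a deduplication-enabled Disk Group typically reports Deduplication: true and shares the same VSAN Disk Group UUID:

    # esxcli vsan storage list | grep -E "Device:|VSAN Disk Group UUID:|Deduplication:"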

Resolution

To identify the specific device that caused the failure, follow these steps:

  1.  Log in to the applicable ESXi host via SSH or KVM/physical console.
  2.  List the vSAN disks on the host using this command (a consolidated shell sketch covering steps 2 through 6 follows this list):

    # esxcli vsan storage list | less

  3. For a failed disk, the output will resemble the following:
    Unknown:
       Device: Unknown
       Display Name: Unknown
       Is SSD: false
       VSAN UUID: ########-########-####-####-####-########226a
       VSAN Disk Group UUID:
       VSAN Disk Group Name:
       Used by this host: false
       In CMMDS: false
       On-disk format version: -1
       Deduplication: false
       Compression: false
       Checksum:
       Checksum OK: false
       Is Capacity Tier: false
       Encryption Metadata Checksum OK: true
       Encryption: false
       DiskKeyLoaded: false
       Is Mounted: false
       Creation Time: Unknown
  4. You can also list the disk mappings on the host with the following command. If a disk is listed as a UUID rather than its device identifier (naa.###), vSAN has failed the disk out of the Disk Group, as shown below:
    [root@esx01:~] vdq -iH
    Mappings:
       DiskMapping[0]:
               SSD:  naa.58ce########fec5
                MD:  naa.58ce########a7f9
                MD:  naa.58ce#######bbd1
                MD:  naa.58ce#######02a5
                MD:  naa.58ce########9d69
                MD:  naa.58ce########aaf5
                MD:  naa.58ce########a7e5
                MD:  ########-########-####-####-####-########226a (This is a masked UUID)
  5. If the failure is recent enough to still appear in the logs, you can identify the display name of the disk by searching for its UUID:

    grep <UUID> /var/log/vmkernel.log

    You should see output similar to the following:


    2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD ########-########-####-####-####-########226a (naa.58ce######aad9:2)

  6. From the CLI of the ESXi host, remove the Disk Group using the No Data Migration option with the following command:

    esxcli vsan storage remove -u <Disk-UUID> -m noAction

    flags:

    -u <Disk-UUID>: Specifies the UUID of the disk group.

    -m noAction: This is the CLI equivalent of "No Data Migration." It tells vSAN to delete the group immediately without evacuating the components to other hosts.

    *It is critical that the correct UUID is removed; removing the wrong UUID can cause data loss.

  7. Once the Disk Group has been removed, place the host into Maintenance Mode with Ensure Accessibility, shut down the host, and physically replace the disk(s). Power the host back on and use vSAN Disk Management in vCenter to recreate the Disk Group.

  8. If the disk removal task fails, reboot the host and try again. Always attempt to place the ESXi host into Maintenance Mode with Ensure Accessibility first. If the ESXi host cannot enter Maintenance Mode normally, you must force it into Maintenance Mode using the No Data Migration option.

  9. If you are uncertain about the impact or risk to VM object health when reviewing the state of the data in vSAN Skyline Health, contact Global Support for assistance in assessing the potential impact.
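
The identification and removal steps above can also be run in a single pass from the ESXi shell. The following is a minimal sketch only, based on the commands in steps 2 through 6: FAILED_UUID is a placeholder that must be replaced with the UUID identified above, the rotated log path /var/run/log/vmkernel.*.gz is an assumption that may differ on your build, and the destructive remove command is left commented out so the UUID can be verified first.

    # Placeholder: replace with the vSAN UUID reported as Absent/Unknown (see steps 3 and 4)
    FAILED_UUID="<vSAN-UUID-of-failed-disk>"

    # Steps 2 and 3: confirm how vSAN currently reports the device
    esxcli vsan storage list | grep -A 20 "VSAN UUID: ${FAILED_UUID}"

    # Step 4: check the disk mappings for a bare UUID entry
    vdq -iH | grep "${FAILED_UUID}"

    # Step 5: map the UUID back to its original naa. device name, including rotated logs
    # (rotated vmkernel logs are assumed to be under /var/run/log on this host)
    grep "${FAILED_UUID}" /var/log/vmkernel.log
    zcat /var/run/log/vmkernel.*.gz 2>/dev/null | grep "${FAILED_UUID}"

    # Step 6: remove the failed Disk Group with No Data Migration
    # (destructive - verify the UUID first, then uncomment)
    # esxcli vsan storage remove -u "${FAILED_UUID}" -m noAction

If any of these commands returns no output, fall back to the individual steps above and review vSAN Skyline Health before removing anything.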

Additional Information

If necessary, retrieve the path information for the failed device to assist with identification.

  1. From the ESXi Shell, run the following command:

    # esxcfg-mpath -bd <naa.disk identifier device>

    For the example in the Resolution section, the command and example output are:

    # esxcfg-mpath -bd naa.58ce########aad9
    naa.58ce########d9 : VMware Serial Attached SCSI Disk (naa.58ce########aad9)
    vmhba1:C0:T1:L0 LUN:0 state:active sas Adapter: 5005########8c11 Target: 5000########02af

    The device is target #1 on vmhba1.

  2. You can also retrieve the physical location of the device (a sketch that loops over every vSAN-claimed device follows this list).
    From the ESXi Shell, run either of these commands:

    # esxcli storage core device physical get -d <naa identifier device>
    # esxcli storage core device raid list -d <naa identifier device>

    For example, the first command and its output:

    # esxcli storage core device physical get -d naa.58ce########aad9
     Physical Location: enclosure 2, slot 5

  3. Or run the second command:

    # esxcli storage core device raid list -d naa.58ce########aad9
     Physical Location: enclosure 2, slot 5

    Note: The above commands may not work with certain drivers as the vSAN disk serviceability plugin is not coded for all drivers. The current supported list is below:
    hpsa
    nhpsa
    iavmd
    nvme_pcie
    lsi_mr3
    lsi_msgpt3
    lsi_msgpt35
    smartpqi

  4. If your driver is not listed, work with your hardware vendor to open an engineering-to-engineering case so you can collaborate on updating the plugin code to interface with those drivers.

    If your driver is not listed, you may encounter one of the following errors when running these commands. This indicates that the system cannot interact with the device to retrieve the required information or control the LED:

    ~ # esxcli storage core device physical get -d naa.6589cfc########93491
    Unable to get location for device naa.6589cfc########93491: No LSU plugin can manage this device.

    ~ # esxcli storage core device raid list -d naa.5000########4c2f
    Unable to get location for device naa.5000########4c2f: Can not manage device!
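
When several devices need to be checked, the per-device commands above can be looped over every disk that vSAN has claimed on the host. This is a minimal sketch for the ESXi (busybox) shell: the awk field position assumes the "Device: naa.###" layout shown in the Resolution section, the failed device (reported as Unknown) is skipped, and any device whose driver lacks a serviceability plugin simply returns one of the errors shown above.

    # Print the reported physical location of every device claimed by vSAN on this host
    for DEV in $(esxcli vsan storage list | grep "Device:" | awk '{print $2}'); do
        # The failed device is reported as "Device: Unknown"; skip it
        [ "${DEV}" = "Unknown" ] && continue
        echo "=== ${DEV} ==="
        esxcli storage core device physical get -d "${DEV}"
    done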

 

 
