When VMware vSAN is configured with Deduplication and Compression enabled, a failure of any single disk results in the failure of the entire Disk Group to which the disk belongs.
The associated vSAN Skyline Health "Operation health" check reports the entire Disk Group as Offline.
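If vCenter is not available, a similar health summary can often be retrieved directly from the host CLI. The command below is a sketch and assumes the vSAN health checks are present on the host (vSAN 6.6 or later); verify availability on your build:
# esxcli vsan health cluster list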
The affected device may also be reported with an event similar to "(naa.#####) is no longer visible."
vSAN deduplication operates at the Disk Group level: data is deduplicated within each Disk Group, not across the cluster. As a result, if a single disk in the Disk Group fails, the entire Disk Group fails. The UI reflects the Disk Group failure, but it does not display the identifying information of the device that triggered the failure.
To identify the specific device that caused the failure, follow these steps:
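The listing below matches the format of the per-host vSAN storage listing. Assuming a standard ESXi build, it can be generated on the affected host with:
# esxcli vsan storage list
The failed device appears with most fields reported as Unknown and with In CMMDS: false, similar to the following: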
Unknown:
   Device: Unknown
   Display Name: Unknown
   Is SSD: false
   VSAN UUID: ########-########-####-####-####-########226a
   VSAN Disk Group UUID:
   VSAN Disk Group Name:
   Used by this host: false
   In CMMDS: false
   On-disk format version: -1
   Deduplication: false
   Compression: false
   Checksum:
   Checksum OK: false
   Is Capacity Tier: false
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: false
   Creation Time: Unknown
The failed disk no longer reports a device name; it is identified only by its vSAN UUID. The same UUID appears in place of a device name in the disk mappings reported by vdq:

[root@esx01:~] vdq -iH
Mappings:
   DiskMapping[0]:
      SSD: naa.58ce########fec5
       MD: naa.58ce########a7f9
       MD: naa.58ce#######bbd1
       MD: naa.58ce#######02a5
       MD: naa.58ce########9d69
       MD: naa.58ce########aaf5
       MD: naa.58ce########a7e5
       MD: ########-########-####-####-####-########226a   (This is a masked UUID)

If the failure is recent enough, the display name of the disk can be identified by searching the vmkernel log:
# grep <UUID> /var/log/vmkernel.log
You should see output similar to the following:
2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD ########-########-####-####-####-########226a (naa.58ce######aad9:2)
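If the failure happened long enough ago that the entry has rotated out of the live log, the rotated, compressed vmkernel logs can be searched as well. The path and file naming below are assumptions based on the default ESXi log layout and may differ on hosts with a custom syslog configuration:
# zcat /var/run/log/vmkernel.*.gz | grep <UUID>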
From the CLI of the ESXi host, remove the Disk Group using the No Data Migration option with the following command:
# esxcli vsan storage remove -u <Disk-UUID> -m noAction
Flags:
-u <Disk-UUID>: The vSAN UUID of the disk to remove (the failed disk identified above). Because Deduplication and Compression are enabled, removing this disk removes the entire Disk Group.
-m noAction: The CLI equivalent of "No Data Migration." It tells vSAN to delete the group immediately without evacuating the components to other hosts.
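For example, using the masked UUID identified earlier in this article, the command takes the following form (substitute the actual UUID reported on your host):
# esxcli vsan storage remove -u ########-########-####-####-####-########226a -m noAction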
Note: It is critical that the correct UUID is removed. Removing the wrong UUID can cause data loss.
Once the Disk Group is removed, place the host into Maintenance Mode with Ensure Accessibility, shut down the host, and physically replace the failed disk(s). Power the host back on and use vSAN Disk Management in vCenter to recreate the Disk Group.
If the disk removal task fails, reboot the host and try again. Always attempt to place the ESXi host into Maintenance Mode with Ensure Accessibility first. If the ESXi host cannot enter Maintenance Mode normally, you must force it into Maintenance Mode using the No Data Migration option.
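If Maintenance Mode must be entered from the host CLI, the following is a sketch; the --vsanmode values shown correspond to Ensure Accessibility and No Data Migration, and the exact option names and values should be verified with esxcli system maintenanceMode set --help on your build:
# esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility
or, to force No Data Migration:
# esxcli system maintenanceMode set --enable true --vsanmode noAction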
If you are uncertain about the impact or risk to VM object health when reviewing the state of the data in vSAN Skyline Health, contact Global Support for assistance in assessing the potential impact.
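Before removing a Disk Group or forcing Maintenance Mode, object health can also be summarized from the host CLI. The command below comes from the vSAN debug namespace and is assumed to be available on vSAN 6.6 and later:
# esxcli vsan debug object health summary get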
If necessary, retrieve the path information for the failed device to assist with identification.
From the ESXi Shell, run the following command:
# esxcfg-mpath -bd <naa device identifier>
For the example in the Resolution section, the command and example output are:
# esxcfg-mpath -bd naa.58ce########aad9
naa.58ce########aad9 : VMware Serial Attached SCSI Disk (naa.58ce########aad9)
   vmhba1:C0:T1:L0 LUN:0 state:active sas Adapter: 5005########8c11 Target: 5000########02af
The device is target #1 on vmhba1.
You can also retrieve the physical location of the device.
From the ESXi Shell, run these commands:
# esxcli storage core device physical get -d <naa device identifier>
# esxcli storage core device raid list -d <naa device identifier>
The command and example output is:
# esxcli storage core device physical get -d naa.58ce########aad9
   Physical Location: enclosure 2, slot 5
Or run:
# esxcli storage core device raid list -d naa.58ce########aad9
   Physical Location: enclosure 2, slot 5
Note: The above commands may not work with certain drivers, because the vSAN disk serviceability plugin does not support all drivers. The currently supported drivers are listed below; to confirm which driver your storage adapter uses, see the example after the list:
hpsa
nhpsa
iavmd
nvme_pcie
lsi_mr3
lsi_msgpt3
lsi_msgpt35
smartpqi
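To confirm which driver a given storage adapter is using, list the host's storage adapters; the driver in use for each vmhba appears in the output:
# esxcli storage core adapter list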
If your driver is not listed, work with your hardware vendor to open an engineering-to-engineering case so you can collaborate on updating the plugin code to interface with those drivers.
If your driver is not listed, you may encounter one of the following errors when running these commands. This indicates that the system cannot interact with the device to retrieve the required information or control the LED:
~ # esxcli storage core device physical get -d naa.6589cfc########93491
Unable to get location for device naa.6589cfc########93491: No LSU plugin can manage this device.
~ # esxcli storage core device raid list -d naa.5000########4c2f
Unable to get location for device naa.5000########4c2f: Can not manage device!