Identifying and replacing a failed cache or capacity disk in vSAN OSA disk group when vSAN deduplication is enabled



Article ID: 327008


Products

VMware vSAN 6.x, VMware vSAN 7.x, VMware vSAN 8.x

Issue/Introduction

When VMware vSAN is configured with Deduplication and Compression enabled, a failure of any single disk results in the failure of the entire Disk Group to which the disk belongs.

The associated vSAN Skyline Health test for Operation health reports the entire Disk Group as Offline.

 



This status is also reflected in Configure > vSAN > Disk Management.



In the example shown, one disk is marked as Absent. Due to the nature of the failure, only the disk UUID is displayed; the original disk name (e.g., naa.#####) is no longer visible.

Environment

VMware vSAN OSA

Cause

vSAN deduplication occurs at the Disk Group level across the cluster. As a result, if a single disk in the Disk Group fails, the entire Disk Group fails. The UI reflects the Disk Group failure, but it does not display the identifying information of the device that triggered the failure.
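
On the host, you can see this relationship directly by filtering the disk listing so that group membership and the deduplication flag appear side by side. This is a minimal check only (the field names are taken from the esxcli vsan storage list output shown in the Resolution section); on a healthy host, every member of a deduplication-enabled Disk Group typically reports Deduplication: true and shares the same VSAN Disk Group UUID:

    # esxcli vsan storage list | grep -E "Device:|VSAN Disk Group UUID:|Deduplication:"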

Resolution

To identify the specific device that caused the failure, follow these steps:

  1.  Log in to the applicable ESXi host via SSH or KVM/physical console.
  2.  List the vSAN disks on the host using this command (a consolidated shell sketch covering steps 2 through 6 follows this list):

    # esxcli vsan storage list | less

  3. For a failed disk, the output will resemble the following:
    Unknown:
       Device: Unknown
       Display Name: Unknown
       Is SSD: false
       VSAN UUID: ########-########-####-####-####-########226a
       VSAN Disk Group UUID:
       VSAN Disk Group Name:
       Used by this host: false
       In CMMDS: false
       On-disk format version: -1
       Deduplication: false
       Compression: false
       Checksum:
       Checksum OK: false
       Is Capacity Tier: false
       Encryption Metadata Checksum OK: true
       Encryption: false
       DiskKeyLoaded: false
       Is Mounted: false
       Creation Time: Unknown
  4. You can also list the disk mappings on the host with the following command. If a disk is listed as a UUID rather than its device identifier (naa.###), vSAN has failed the disk out of the Disk Group, as shown below:
    [root@esx01:~] vdq -iH
    Mappings:
       DiskMapping[0]:
               SSD:  naa.58ce########fec5
                MD:  naa.58ce########a7f9
                MD:  naa.58ce#######bbd1
                MD:  naa.58ce#######02a5
                MD:  naa.58ce########9d69
                MD:  naa.58ce########aaf5
                MD:  naa.58ce########a7e5
                MD:  ########-########-####-####-####-########226a (This is a masked UUID)
  5. If the failure is recent enough to still appear in the logs, you can identify the display name of the disk by searching for its UUID:

    grep <UUID> /var/log/vmkernel.log

    You should see output similar to the following:


    2021-01-09T05:45:41.638Z cpu0:7053521)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD ########-########-####-####-####-########226a (naa.58ce######aad9:2)

  6. From the CLI of the ESXi host, remove the Disk Group using the No Data Migration option with the following command:

    esxcli vsan storage remove -u <Disk-UUID> -m noAction

    flags:

    -u <Disk-UUID>: Specifies the UUID of the disk group.

    -m noAction: This is the CLI equivalent of "No Data Migration." It tells vSAN to delete the group immediately without evacuating the components to other hosts.

    *It is critical that the correct UUID is removed; removing the wrong UUID can cause data loss.

  7. Once the Disk Group has been removed, place the host into Maintenance Mode with Ensure Accessibility, shut down the host, and physically replace the disk(s). Power the host back on and use vSAN Disk Management in vCenter to recreate the Disk Group.

  8. If the disk removal task fails, reboot the host and try again. Always attempt to place the ESXi host into Maintenance Mode with Ensure Accessibility first. If the ESXi host cannot enter Maintenance Mode normally, you must force it into Maintenance Mode using the No Data Migration option.

  9. If you are uncertain about the impact or risk to VM object health when reviewing the state of the data in vSAN Skyline Health, contact Global Support for assistance in assessing the potential impact.
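
The identification and removal steps above can also be run in a single pass from the ESXi shell. The following is a minimal sketch only, based on the commands in steps 2 through 6: FAILED_UUID is a placeholder that must be replaced with the UUID identified above, the rotated log path /var/run/log/vmkernel.*.gz is an assumption that may differ on your build, and the destructive remove command is left commented out so the UUID can be verified first.

    # Placeholder: replace with the vSAN UUID reported as Absent/Unknown (see steps 3 and 4)
    FAILED_UUID="<vSAN-UUID-of-failed-disk>"

    # Steps 2 and 3: confirm how vSAN currently reports the device
    esxcli vsan storage list | grep -A 20 "VSAN UUID: ${FAILED_UUID}"

    # Step 4: check the disk mappings for a bare UUID entry
    vdq -iH | grep "${FAILED_UUID}"

    # Step 5: map the UUID back to its original naa. device name, including rotated logs
    # (rotated vmkernel logs are assumed to be under /var/run/log on this host)
    grep "${FAILED_UUID}" /var/log/vmkernel.log
    zcat /var/run/log/vmkernel.*.gz 2>/dev/null | grep "${FAILED_UUID}"

    # Step 6: remove the failed Disk Group with No Data Migration
    # (destructive - verify the UUID first, then uncomment)
    # esxcli vsan storage remove -u "${FAILED_UUID}" -m noAction

If any of these commands returns no output, fall back to the individual steps above and review vSAN Skyline Health before removing anything.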

Additional Information

If necessary, retrieve the path information for the failed device to assist with identification.

  1. From the ESXi Shell, run the following command:

    # esxcfg-mpath -bd <naa.disk identifier device>

    For the example in the Resolution section, the command and example output are:

    # esxcfg-mpath -bd naa.58ce########aad9
    naa.58ce########d9 : VMware Serial Attached SCSI Disk (naa.58ce########aad9)
    vmhba1:C0:T1:L0 LUN:0 state:active sas Adapter: 5005########8c11 Target: 5000########02af

    The device is target #1 on vmhba1.

  2. You can also retrieve the physical location of the device (a sketch that loops over every vSAN-claimed device follows this list).
    From the ESXi Shell, run either of these commands:

    # esxcli storage core device physical get -d <naa identifier device>
    # esxcli storage core device raid list -d <naa identifier device>

    For example, the first command and its output:

    # esxcli storage core device physical get -d naa.58ce########aad9
     Physical Location: enclosure 2, slot 5

  3. Or run the second command:

    # esxcli storage core device raid list -d naa.58ce########aad9
     Physical Location: enclosure 2, slot 5

    Note: The above commands may not work with certain drivers as the vSAN disk serviceability plugin is not coded for all drivers. The current supported list is below:
    hpsa
    nhpsa
    iavmd
    nvme_pcie
    lsi_mr3
    lsi_msgpt3
    lsi_msgpt35
    smartpqi

  4. If your driver is not listed, work with your hardware vendor to open an engineering-to-engineering case so you can collaborate on updating the plugin code to interface with those drivers.

    If your driver is not listed, you may encounter one of the following errors when running these commands. This indicates that the system cannot interact with the device to retrieve the required information or control the LED:

    ~ # esxcli storage core device physical get -d naa.6589cfc########93491
    Unable to get location for device naa.6589cfc########93491: No LSU plugin can manage this device.

    ~ # esxcli storage core device raid list -d naa.5000########4c2f
    Unable to get location for device naa.5000########4c2f: Can not manage device!
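
When several devices need to be checked, the per-device commands above can be looped over every disk that vSAN has claimed on the host. This is a minimal sketch for the ESXi (busybox) shell: the awk field position assumes the "Device: naa.###" layout shown in the Resolution section, the failed device (reported as Unknown) is skipped, and any device whose driver lacks a serviceability plugin simply returns one of the errors shown above.

    # Print the reported physical location of every device claimed by vSAN on this host
    for DEV in $(esxcli vsan storage list | grep "Device:" | awk '{print $2}'); do
        # The failed device is reported as "Device: Unknown"; skip it
        [ "${DEV}" = "Unknown" ] && continue
        echo "=== ${DEV} ==="
        esxcli storage core device physical get -d "${DEV}"
    done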

 

 
