Cannot delete a vSAN disk group or remove the disk after detecting it as faulty

Products

VMware vSAN

Issue/Introduction

Symptoms:

After a disk failure, the disk or disk group cannot be deleted.
An Operation Health alert is shown in Skyline Health.

Steps: Select vSAN cluster > Monitor > Skyline Health > Operation Health.
Error message while deleting disk group:

General vSAN error. vSAN disk data evacuation resource check has failed for disk or disk-group naa.58c###### (527c3109-####-####-####-########) with mode noAction on host hostname.com. Go to vSAN Data Migration Pre-Check page for more details.
Error message while deleting disk:

A general system error occurred: Failed to decommission disk naa.50#######: Failed to update diskState metadata for disk naa.5050####### with exception: Failed to write partition.

The disk group may show as mounted or unmounted in the UI. The following warning may also be displayed:

Disk(s) 52a5e6e0-####-####-####-########, 52299652-####-####-####-########, 52108e7e-####-####-####-########, 52a8efb7-####-####-####-########, 527eee1c-####-####-####-######## are unmounted, but are part of a mounted disk group.
Manually removing the disk group may fail with the below errors:

Unable to remove device: Failed to get VsanInfo operation lock for diskOpLock, an operation is currently in progress(locked pid: 0), error: /tmp/.vsanDiskOpLock.lock.LOCK: timeout waiting for lock after 30 seconds. Lock is currently held by process 2167523 (vsanesxcmd: /usr/lib/vmware/vsan/bin/vsanesxcmd storage diskgroup mount -s naa.58ce#####)
Unable to remove device: Disk naa.58ce##### is not writable: Failed to write partition
Trying to read the partition may fail with the below error:

Error: Could not stat device /vmfs/devices/disks//vmfs/devices/disks/naa.58ce##### - No such file or directory.
Unable to get device /vmfs/devices/disks//vmfs/devices/disks/naa.58ce#####

Environment

VMware vSAN 7.x

VMware vSAN 8.x

Cause

The disk has failed in such a way that the ESXi host can still see the device but can no longer communicate with it.
Log entries similar to the following may be seen in /var/run/log/vmkernel.log:
YYYY-MM-DDTHH:MM:SSZ cpu4:2098069) ScsiDeviceIO: 4176: Cmd (0x45b9fc931fc6) 0x65, CmdSN 0x1352 from world 2102614 to dev "naa.50#######" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x7 0x27 0x0
- The sense data 0x7 0x27 0x0 decodes as:
  [0x7] --> DATA PROTECT (sense key)
  27/00 --> WRITE PROTECTED (ASC/ASCQ)
- This means vSAN is unable to write to the device.
Medium errors on the disk may also be logged in /var/run/log/vmkernel.log on the host, for example:

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu88:2098812) ScsiDeviceIO: 4686: Cmd (0x45df8dec7e80) 0x28, CmdSN 0x8989ca4 from world 0 to dev "naa.#######" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x1 Medium Error, LBA: 1038451712
- In this state as well, it may not be possible to remove the disk from the vSphere Client.
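
When many such entries need to be triaged, the sense data can be decoded with a short script. The following is a minimal sketch in Python, not a VMware tool: the function name `decode_sense` and the regular expression are illustrative assumptions based only on the two log formats shown above, and the lookup tables contain just a small excerpt of the standard SCSI sense key and ASC/ASCQ codes.

```python
import re
from typing import NamedTuple, Optional

# Small excerpts of the SCSI sense key and ASC/ASCQ tables (not exhaustive).
SENSE_KEYS = {
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x7: "DATA PROTECT",
}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
    (0x27, 0x00): "WRITE PROTECTED",
}

HEX = r"0x[0-9a-fA-F]+"
# Assumed layout, taken from the sample vmkernel.log lines in this article.
# Optional whitespace after "D:" and "P:" is tolerated.
FAILED_CMD_RE = re.compile(
    r'to dev "(?P<dev>[^"]+)" failed\s+'
    rf"H:\s*(?P<host>{HEX})\s+D:\s*(?P<device>{HEX})\s+P:\s*(?P<plugin>{HEX})\s+"
    rf"Valid sense data:\s*(?P<key>{HEX})\s+(?P<asc>{HEX})\s+(?P<ascq>{HEX})"
)


class SenseInfo(NamedTuple):
    device: str
    sense_key: str
    additional: str


def decode_sense(line: str) -> Optional[SenseInfo]:
    """Return the decoded sense data of a failed SCSI command, or None."""
    match = FAILED_CMD_RE.search(line)
    if match is None:
        return None
    key = int(match["key"], 16)
    asc = int(match["asc"], 16)
    ascq = int(match["ascq"], 16)
    return SenseInfo(
        device=match["dev"],
        sense_key=SENSE_KEYS.get(key, f"unknown sense key 0x{key:x}"),
        additional=ASC_ASCQ.get(
            (asc, ascq), f"unknown ASC/ASCQ 0x{asc:02x}/0x{ascq:02x}"
        ),
    )
```

Applied to the first sample entry, the function reports DATA PROTECT with WRITE PROTECTED. For the medium error entry it reports MEDIUM ERROR with READ RETRIES EXHAUSTED. Codes missing from the excerpted tables are returned as "unknown" rather than guessed.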
At the same time, /var/run/log/vobd.log contains entries similar to the following:

YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2097762]: [vSANCorrelator] 6114756719521us: [vob.vsan.lsom.metadataURE] vSAN device 5269a55d-####-####-####-######## encountered unrecoverable read error. This disk will be evacuated and rebuilt. If the device is part of a dedup disk group, the entire disk group will be evacuated and rebuilt.

YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2097762]: [vSANCorrelator] 6114791656665us: [esx.problem.vob.vsan.lsom.metadataURE] Device 5269a55d-####-####-####-######## encountered an unrecoverable read error. It is in an unhealthy state and will get evacuated and rebuilt. If this device is part of a dedup diskgroup, the entire disk group will be evacuated and rebuilt.

YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2097762]: [vSANCorrelator] 6114791656785us: [esx.problem.vob.vsan.lsom.diskunhealthy] vSAN device 5269a55d-####-####-####-######## is unhealthy.

YYYY-MM-DDTHH:MM:SSZ In(14) vobd[2097762]: [vSANCorrelator] 6115042983153us: [esx.audit.vob.vsan.lsom.diskgrouprebuild] Diskgroup eui.01######### is rebuilt successfully after MEDIUM error. Old UUID 5269a55d-####-####-####-######## New UUID 52254c5f-####-####-####-########.
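
To see which vSAN devices were affected and in what order the events arrived, the vobd.log entries can be grouped by device UUID. The sketch below is an illustration only: the function name `lsom_events_by_device` and the regular expressions are assumptions derived from the sample entries above, and the script expects unmasked UUIDs as they appear in a real log.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed entry layout: "[vSANCorrelator] <n>us: [<event id>] <message>".
EVENT_RE = re.compile(
    r"\[vSANCorrelator\]\s+\d+us:\s+\[(?P<event>[\w.]+)\]\s+(?P<msg>.*)"
)
# vSAN device UUIDs in the 8-4-4-4-12 lowercase hexadecimal form.
UUID_RE = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b")


def lsom_events_by_device(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each device UUID to the vSAN LSOM event IDs that mention it."""
    events: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None or "vsan.lsom." not in match["event"]:
            continue  # not a vSAN LSOM event
        for uuid in UUID_RE.findall(match["msg"]):
            events[uuid].append(match["event"])
    return dict(events)
```

A device that accumulates metadataURE and diskunhealthy events matches the failure pattern described in this article. The diskgrouprebuild entry mentions both the old and the new UUID, so it is recorded under each.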
The physical disk has failed. To confirm this, check the server's hardware management interface.

Example:
If iDRAC is in use, the Dashboard shows a warning stating "SYSTEM HAS CRITICAL ISSUES".

The status of the disk indicates that it has failed.

Steps: Log in to iDRAC > On the Dashboard page, select the "Storage" option > Select Physical Disks > Check the status of the physical disks.

Resolution

To remove the faulty disk from the disk group in this state, perform the following steps:

1. Place the host with the faulty disk into maintenance mode with the "Ensure accessibility" option.
2. Attempt to remove the faulty disk from the CLI by following the article "How to manually remove and recreate a vSAN disk group using esxcli".
3. If this does not work, physically remove the faulty disk from the server, replace it with a new disk, and then reattempt step 1.
4. If this fails as well, reboot the host and then attempt the removal from the CLI once more.
5. Add the new disk to the disk group.
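
As a rough illustration of the first two steps, the sketch below only builds the esxcli command lines for review and does not run anything. The helper name `removal_commands` is hypothetical. The esxcli namespaces and options used here (`system maintenanceMode set --vsanmode`, `vsan storage list`, `vsan storage remove --disk/--ssd`) are assumptions that are not taken from this article, so verify them against the referenced esxcli article and the `--help` output on the host before use.

```python
import shlex
from typing import List


def removal_commands(device: str, is_cache_disk: bool = False) -> List[str]:
    """Build (but do not run) esxcli commands for removing a faulty vSAN disk.

    Assumption: an OSA disk group, where removing the cache device (--ssd)
    removes the entire disk group and --disk removes one capacity device.
    """
    disk_flag = "--ssd" if is_cache_disk else "--disk"
    commands = [
        # Step 1: maintenance mode with "Ensure accessibility".
        ["esxcli", "system", "maintenanceMode", "set",
         "--enable", "true", "--vsanmode", "ensureObjectAccessibility"],
        # Confirm the device name and disk group membership first.
        ["esxcli", "vsan", "storage", "list"],
        # Step 2: remove the faulty device from the CLI.
        ["esxcli", "vsan", "storage", "remove", disk_flag, device],
    ]
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    # Placeholder device name; substitute the real naa identifier.
    for command in removal_commands("naa.58ce0000"):
        print(command)
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere and leaves the decision to proceed with the administrator.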

Additional Information

How to remove a disk from a vSAN disk group/host

Article ID: 386140
