One or more failed disks on a vSAN OSA cluster might cause VM creation/power-on failures and unmount failures



Article ID: 379553


Updated On:

Products

VMware vSAN

VMware vSAN 7.x

VMware vSAN 8.x

Issue/Introduction

Users might encounter one or more of the following problems on vSAN OSA clusters running vSphere 8.0 U3 or earlier:

  1. One or more VMs fail to power on, and new VM creation fails on the vSAN OSA datastore.
  2. Failed disks in one or more disk groups on a host cannot be unmounted.
  3. The vmkernel log is flooded with errors such as the following:
2024-05-16T08:41:35.652Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle

2024-05-16T08:41:35.653Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle

2024-05-16T08:41:35.653Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle

2024-05-16T08:41:35.653Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle
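To gauge how widespread the errors are, the log lines above can be summarized per disk UUID. The following is a minimal sketch, not an official VMware tool: the function name `bad_disk_summary` is hypothetical, and it assumes the log lines keep the `Disk <UUID>, Bad disk state ...` layout shown above.

```shell
# Summarize "Bad disk state" errors per disk UUID from a vmkernel log.
# bad_disk_summary is a hypothetical helper name, not a VMware command.
bad_disk_summary() {
    awk '/Bad disk state/ {
        # The UUID is the field right after the literal word "Disk".
        for (i = 1; i < NF; i++) {
            if ($i == "Disk") {
                uuid = $(i + 1)
                sub(/,$/, "", uuid)   # drop the trailing comma
                count[uuid]++
            }
        }
    }
    END {
        # One line per disk: "<uuid> <number of matching log lines>"
        for (uuid in count) print uuid, count[uuid]
    }' "$1"
}
```

On an ESXi host this could be called as `bad_disk_summary /var/run/log/vmkernel.log`. Each output line pairs a disk UUID with the number of matching log entries.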

Environment

vSphere 7.x

vSphere 8.0 U1.x

vSphere 8.0 U2.x

vSphere 8.0 U3

Cause

In rare cases, a disk failure stalls a background operation in the vSAN backend subsystem. The stalled task in turn hangs any attempt to unmount the vSAN disk group, place the ESXi host into maintenance mode, or reboot the host.

Resolution

VMware has addressed this issue in ESXi 8.0 Patch 04. If you have already encountered the issue, follow the workaround below or contact VMware Support for assistance.

Workaround

Identify the UUID of the disk group (DG) that contains the failed disk, then perform the steps below.

1. Set an exclusion list for the affected DG on all hosts. This allows component creation and VM power-on to proceed on the other DGs.
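One possible way to find the disk group UUID is to look up the failed disk in the output of `esxcli vsan storage list`, which prints a `VSAN UUID` and a `VSAN Disk Group UUID` line for each claimed device. The sketch below makes several assumptions: the UUID in the "Bad disk state" log line is the disk's VSAN UUID, the `VSAN UUID` line precedes the `VSAN Disk Group UUID` line within each device stanza, and the helper name `dg_uuid_for_disk` is hypothetical.

```shell
# Print the disk group UUID for a given vSAN disk UUID.
# Reads `esxcli vsan storage list` style output on stdin.
# dg_uuid_for_disk is a hypothetical helper name.
dg_uuid_for_disk() {
    awk -v want="$1" '
        # Remember the disk UUID of the stanza currently being read.
        /^ *VSAN UUID:/            { disk = $NF }
        # Emit the group UUID only when the stanza belongs to the wanted disk.
        /^ *VSAN Disk Group UUID:/ { if (disk == want) print $NF }
    '
}
```

A typical call on the affected host would be `esxcli vsan storage list | dg_uuid_for_disk <DISK_UUID>`. Empty output means the disk UUID was not found in the listing.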

/usr/lib/vmware/vsan/bin/clom-tool set-global-exclusion-list --exclusion-list=<DG_UUID>

2. Once this is set, check the vmkernel log on the host where the disk failed and confirm that the "Bad disk state 16" errors have stopped.

3. Attempt to vMotion the VMs from the affected host.

4. Place the host into maintenance mode and reboot.

5. Once the host is back up, unmount the DG and replace the failed disk(s).

6. Recreate the disk group after replacing the failed disk(s), then take the host out of maintenance mode.

7. Restart the clomd service on all hosts (for example, with /etc/init.d/clomd restart) to clear the exclusion list.
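Steps 1 and 2 lend themselves to light scripting. The sketch below is illustrative only: `print_exclusion_cmds` and `spew_stopped` are hypothetical helper names, the host names are placeholders, and running the command over SSH as root assumes SSH is enabled on each host. The first helper only prints the commands (a dry run) so they can be reviewed before anything is executed.

```shell
# Step 1 (dry run): print the exclusion-list command for every host.
# Usage: print_exclusion_cmds <DG_UUID> host1 host2 ...
print_exclusion_cmds() {
    dg_uuid="$1"
    shift
    for host in "$@"; do
        echo "ssh root@$host /usr/lib/vmware/vsan/bin/clom-tool set-global-exclusion-list --exclusion-list=$dg_uuid"
    done
}

# Step 2: succeed only if none of the last N lines of the log
# (default 200) still contain the "Bad disk state 16" error.
# Usage: spew_stopped /var/run/log/vmkernel.log [N]
spew_stopped() {
    log="$1"
    n="${2:-200}"
    ! tail -n "$n" "$log" | grep -q 'Bad disk state 16'
}
```

For example, `print_exclusion_cmds <DG_UUID> esx01 esx02` lists one command per host. On the affected host, `spew_stopped /var/run/log/vmkernel.log && echo "errors stopped"` reports whether recent log lines are clean. The tail length is an arbitrary choice and may need adjusting on a busy host.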