On vSAN OSA clusters running vSphere 8.0 U3 or earlier, users might encounter one or more of the problems listed below:
2024-05-16T08:41:35.652Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle
2024-05-16T08:41:35.653Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle
2024-05-16T08:41:35.653Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle
2024-05-16T08:41:35.653Z In(182) vmkernel: cpu17:2100610)LSOM: LSOM_GetComponentHandle:2550: Disk 529773e4-cf50-0b45-71b3-46501b8cea6f, Bad disk state 16, failing get handle
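To gauge how often a host logs these messages and when the most recent one appeared, the log can be summarized per disk UUID. The following is a minimal sketch, not an official tool: the function name `bad_disk_summary` is hypothetical, the log path in the usage comment is the usual vmkernel log location and may differ on your host, and the parsing assumes the exact message layout shown above.

```shell
# Hypothetical helper: summarize "Bad disk state" messages in a vmkernel log.
# Prints one line per disk UUID: <uuid> <message count> <timestamp of last message>
# Assumes the message layout shown in the excerpt above.
bad_disk_summary() {
    awk '
        /Bad disk state/ {
            # Find the "Disk" token; the next field is the UUID plus a trailing comma.
            for (i = 1; i < NF; i++) {
                if ($i == "Disk") {
                    uuid = $(i + 1)
                    sub(/,$/, "", uuid)
                    count[uuid]++
                    last[uuid] = $1   # first field is the timestamp
                    break
                }
            }
        }
        END {
            for (u in count) print u, count[u], last[u]
        }
    ' "$1"
}

# Example (path is an assumption; adjust to where vmkernel.log lives on the host):
#   bad_disk_summary /var/run/log/vmkernel.log
```

Running the helper twice a few minutes apart and comparing the last timestamp is one way to tell whether the messages are still being logged.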
vSphere 7.X
vSphere 8.0.U1.x
vSphere 8.0.U2.x
vSphere 8.0.U3
A rare issue during disk failures was found to stall certain background operations in the vSAN backend subsystem. The stalled task also hangs any attempt to unmount the vSAN disk group, place the ESX host into maintenance mode, or reboot the host.
VMware has addressed this issue in ESXi 8.0 Patch 04. If you have already encountered the issue, follow the workaround below or seek assistance from VMware support.
Workaround
Identify the UUID of the disk group (DG) that contains the failed/bad disk, then execute the steps below.
1. Set the exclusion list for the affected DG on all hosts. This allows component creation and VM power-on to proceed on other DGs.
/usr/lib/vmware/vsan/bin/clom-tool set-global-exclusion-list --exclusion-list=<DG_UUID>
2. Once this is set, check the vmkernel log on the host where the disk has failed and confirm that the "Bad disk state 16" errors have stopped.
3. Attempt to vMotion the VMs from the affected host.
4. Place the host into maintenance mode and reboot.
5. Once the host is up, attempt to unmount the DG and proceed to replace the failed disk(s).
6. Recreate the disk group after replacing the failed disk, then take the host out of maintenance mode.
7. Restart the CLOMD service on all hosts to clear the exclusion list.
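The workaround starts from the disk group UUID, while the log messages only name a disk UUID. One possible way to map one to the other is to parse the output of `esxcli vsan storage list`. The sketch below rests on several assumptions: that the UUID in the log is the disk's vSAN UUID, that the failed disk still appears in the listing, and that the listing reports each device with a `VSAN UUID:` line followed by a `VSAN Disk Group UUID:` line. The function name `dg_uuid_for_disk` is hypothetical.

```shell
# Hypothetical helper: read `esxcli vsan storage list` output on stdin and
# print the disk group UUID of the device whose vSAN UUID matches $1.
# Assumes each device block has a "VSAN UUID:" line followed by a
# "VSAN Disk Group UUID:" line. Prints nothing if the disk is not found.
dg_uuid_for_disk() {
    awk -v disk="$1" '
        /VSAN UUID:/            { cur = $NF }                 # remember this device UUID
        /VSAN Disk Group UUID:/ { if (cur == disk) { print $NF; exit } }
    '
}

# Example, using the disk UUID from the log excerpt:
#   esxcli vsan storage list | dg_uuid_for_disk 529773e4-cf50-0b45-71b3-46501b8cea6f
```

If the helper prints nothing, the disk may no longer be listed on that host, and the disk group UUID has to be identified another way, for example with the help of VMware support.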