ESXi host in a vSAN cluster times out while entering maintenance mode using Ensure accessibility or Full data migration

Article ID: 367172


Updated On:

Products

VMware vSAN

Issue/Introduction

  • When the user attempts to place an ESXi host in a vSAN cluster into maintenance mode using either the "Ensure accessibility" or the "Full data migration" option, the task times out after approximately 60 minutes.

  • When the maintenance mode status is inspected in the UI, the task appears frozen at 100% completion, and the accompanying notification indicates "Objects Evacuated: 708 of 709" (the ESXi host name and the object counts will vary depending on the environment).

Environment

ESXi 7.0 Update 1 and above.

Cause

  • The maintenance mode timeout occurs when decommissioning progress has not changed for 60 minutes and affected objects remain in the OBJECT_STATE_PENDING_RESYNC state. The clomd logs for the impacted object will show the following (the date, time, and object UUID will vary depending on the environment):

2024-04-23T11:13:16.942Z info clomd[67698762] [Originator@6876] CLOMDecomUpdateObjState: Changed 8d965263-9ec6-2fc3-99e5-043f72f60646 state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_PENDING_RESYNC

  • For the rest of the healthy objects, we expect to see the following log messages in the clomd logs (the UUIDs will vary depending on the environment):

2024-04-23T11:13:16.944Z info clomd[67698762] [Originator@6876] CLOMDecomUpdateObjState: Changed f8216b64-daae-96b3-2d29-b8599fdd53c4 state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_AFFECTED
2024-04-23T11:13:16.946Z info clomd[67698762] [Originator@6876] CLOMDecomUpdateObjState: Changed 8eab5264-b29a-8bb4-fcde-b8599fe887fc state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_AFFECTED

  • No resync activity is visible when checking the resync status, and after 60 minutes the clomd logs will eventually display the following messages:

2024-04-23T12:14:14.886Z warning clomd[67698762] [Originator@6876] CLOMDecomIsDecommissioningStuck: No Decommissioning progress made in last 3601 sec
2024-04-23T12:14:14.887Z error clomd[67698762] [Originator@6876] DecomProgressUpdate: Failing decommissioning. Stuck for more than 60 mins
2024-04-23T12:14:14.887Z warning clomd[67698762] [Originator@6876] CLOMDecomCleanupDecommissioning: Decom failed, start cleaning up resyncs
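
To confirm that no resync is in progress and to locate the messages above, the resync summary and the clomd log can be checked from the ESXi shell. A minimal sketch using standard ESXi commands (adjust the search pattern as needed):

# Summary of objects currently resyncing (expected to show no active resync in this scenario)
esxcli vsan debug resync summary get

# Search the clomd log for the pending-resync transition and the stuck-decommission messages
grep -E "OBJECT_STATE_PENDING_RESYNC|CLOMDecomIsDecommissioningStuck|DecomProgressUpdate" /var/log/clomd.log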

  • The vSAN traces will show the following for the affected object (the timestamp and object UUID will vary):

2024-04-23T09:24:05.498423 [103725855] [cpu0] [] CLOMTraceDecomDeltaOverlapCheck:3651: {'objUuid': '8d965263-9ec6-2fc3-99e5-043f72f60646', 'hasOverlap': True, 'hasResyncDelta': True}
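
The vSAN traces are written as compressed binary files under /var/log/vsantraces/ and must be decoded before entries such as CLOMTraceDecomDeltaOverlapCheck can be read. A sketch, assuming the vsanTraceReader utility at its usual ESXi path and an illustrative trace file name:

# Decode a vSAN trace file and filter for the decommission delta-overlap check (file name is an example)
zcat /var/log/vsantraces/vsantraces--<timestamp>.gz | /usr/lib/vmware/vsan/bin/vsanTraceReader | grep CLOMTraceDecomDeltaOverlapCheck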

  • Resync is not progressing for the impacted objects because the objects have durability components and were set to pending resync owing to the delta overlap, so they must wait for the delta resync.
     
  • When we check the object's configuration, we see RAID_D (Durability) components even though there is no reason for them to be present: when all hosts are online and none are failing or entering maintenance mode, we should never see RAID_D (Durability) components. Durability components are introduced when an ESXi host goes into maintenance mode; a new durability component is created for each component stored on that host, which allows all new VM I/O to be committed to both the existing component and the durability component. The layout can be inspected with the command sketch below.
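
The object layout, including any RAID_D (Durability) branches, can be inspected from the ESXi shell. A minimal sketch, using the example object UUID from this article (substitute the UUID of the affected object):

# Print the full layout of a single vSAN object, including durability components
esxcli vsan debug object list --uuid 8d965263-9ec6-2fc3-99e5-043f72f60646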

Example object configuration with durability components while all hosts are online and participating in the vSAN cluster (parts of the output are trimmed for readability):

Object UUID: 8d965263-9ec6-2fc3-99e5-043f72f60646
   Version: 15
   Health: healthy
.
.
   Configuration: 
      
      RAID_1
         RAID_D
            Component: 039b5263-5a69-a50c-7840-043f72f60646
              Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 523ed497-19a2-be53-5e17-980b2d36dc9b,  Disk Name: naa.xxx
              Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx01-r17.p01.xxx
            Component: ce352766-26f1-da3f-944a-b8cef603283c
              Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 522dc47e-89c4-992d-e844-aaf60edf82c4,  Disk Name: naa.xxx
              Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx03-r15.p01.xxx
         Component: 039b5263-d6f5-ab0c-76a3-043f72f60646
           Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52b791a5-458b-867b-36da-7f85d2b3e517,  Disk Name: naa.xxx
           Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx02-r02.p01.xxx
      Witness: 6cb2a565-3a00-a65e-2fc5-b8cef6568f32
        Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52316ded-bb84-9075-9d92-d9d2824bd6e1,  Disk Name: naa.xxx
        Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx06-r18.p01.xxx

Resolution

Once we have verified that there should not be any durability components for the specific object(s), the durability components can be merged into the data components using either of the following two approaches:

  • Owner abdicate the object. This elects a new owner for the object, refreshes the object's state, and merges the durability components into the data components.
  • Storage vMotion the VM to another vSAN cluster. This also triggers the merge of the durability components into the data components.

Use the following command to owner abdicate the object:

vsish -e set /vmkModules/vsan/dom/ownerAbdicate <Affected_Object_UUID>

Example: vsish -e set /vmkModules/vsan/dom/ownerAbdicate 8d965263-9ec6-2fc3-99e5-043f72f60646
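
If more than one object is stuck in the pending-resync state, the same command can be repeated for each UUID. A minimal sketch; the placeholders must be replaced with the UUIDs of the affected objects:

# Abdicate ownership for each affected object (replace the placeholders with real UUIDs)
for uuid in <Affected_Object_UUID_1> <Affected_Object_UUID_2>; do
   vsish -e set /vmkModules/vsan/dom/ownerAbdicate "$uuid"
done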

After running the owner abdicate command, wait a short while and then examine the object configuration again; the durability components will have merged into the data components, and the RAID_D (Durability) component will no longer be present for the impacted object.

Object UUID: 8d965263-9ec6-2fc3-99e5-043f72f60646
   Version: 15
   Health: healthy
.
.
   Configuration: 
      
      RAID_1
         Component: 039b5263-5a69-a50c-7840-043f72f60646
           Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 523ed497-19a2-be53-5e17-980b2d36dc9b,  Disk Name: naa.xxx
           Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx01-r17.p01.xxx
         Component: 039b5263-d6f5-ab0c-76a3-043f72f60646
           Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52b791a5-458b-867b-36da-7f85d2b3e517,  Disk Name: naa.xxx
           Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx02-r02.p01.xxx
      Witness: 6cb2a565-3a00-a65e-2fc5-b8cef6568f32
        Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52316ded-bb84-9075-9d92-d9d2824bd6e1,  Disk Name: naa.xxx
        Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx06-r18.p01.xxx

After the durability components are merged, the host will be able to enter maintenance mode using Ensure accessibility or Full data migration without any issues.
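
Maintenance mode can then be retried from the vSphere Client or, alternatively, from the ESXi shell. A sketch assuming the standard esxcli maintenance mode options:

# Re-attempt maintenance mode with the "Ensure accessibility" vSAN data migration option
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# Confirm the maintenance mode state
esxcli system maintenanceMode get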