ESXi host in a vSAN cluster times out while entering maintenance mode using Ensure accessibility or Full data migration

Article ID: 367172


Updated On:

Products

VMware vSAN

Issue/Introduction

  • When the user attempts to place an ESXi host in a vSAN cluster into maintenance mode using either the "Ensure accessibility" or the "Full data migration" option, the task times out after approximately 60 minutes.

  • When the maintenance mode status is inspected in the UI, the task appears frozen at 100% completion, and the accompanying notification indicates "Objects Evacuated: 708 of 709" (the ESXi host name and the object counts will vary depending on the environment).

Environment

ESXi 7.0 Update 1 and above.

Cause

  • The maintenance mode timeout occurs when decommissioning progress has not changed for 60 minutes and affected objects remain in the OBJECT_STATE_PENDING_RESYNC state. The clomd logs for the impacted object will show the following (the date, time, and object UUID will vary depending on the environment):

2024-04-23T11:13:16.942Z info clomd[67698762] [Originator@6876] CLOMDecomUpdateObjState: Changed 8d965263-9ec6-2fc3-99e5-043f72f60646 state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_PENDING_RESYNC

  • For the rest of the healthy objects, we expect to see the following log messages in the clomd logs (the UUIDs will vary depending on the environment):

2024-04-23T11:13:16.944Z info clomd[67698762] [Originator@6876] CLOMDecomUpdateObjState: Changed f8216b64-daae-96b3-2d29-b8599fdd53c4 state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_AFFECTED
2024-04-23T11:13:16.946Z info clomd[67698762] [Originator@6876] CLOMDecomUpdateObjState: Changed 8eab5264-b29a-8bb4-fcde-b8599fe887fc state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_AFFECTED

  • No resync activity is visible when checking the resync status, and after 60 minutes the clomd logs will eventually display the following messages:

2024-04-23T12:14:14.886Z warning clomd[67698762] [Originator@6876] CLOMDecomIsDecommissioningStuck: No Decommissioning progress made in last 3601 sec
2024-04-23T12:14:14.887Z error clomd[67698762] [Originator@6876] DecomProgressUpdate: Failing decommissioning. Stuck for more than 60 mins
2024-04-23T12:14:14.887Z warning clomd[67698762] [Originator@6876] CLOMDecomCleanupDecommissioning: Decom failed, start cleaning up resyncs
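
To confirm that no resync is in progress and to locate the messages above, the resync summary and the clomd log can be checked from the ESXi shell. A minimal sketch using standard ESXi commands (adjust the search pattern as needed):

# Summary of objects currently resyncing (expected to show no active resync in this scenario)
esxcli vsan debug resync summary get

# Search the clomd log for the pending-resync transition and the stuck-decommission messages
grep -E "OBJECT_STATE_PENDING_RESYNC|CLOMDecomIsDecommissioningStuck|DecomProgressUpdate" /var/log/clomd.log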

  • The vSAN traces will show the following for the affected object (the timestamp and object UUID will vary):

2024-04-23T09:24:05.498423 [103725855] [cpu0] [] CLOMTraceDecomDeltaOverlapCheck:3651: {'objUuid': '8d965263-9ec6-2fc3-99e5-043f72f60646', 'hasOverlap': True, 'hasResyncDelta': True}
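
The vSAN traces are written as compressed binary files under /var/log/vsantraces/ and must be decoded before entries such as CLOMTraceDecomDeltaOverlapCheck can be read. A sketch, assuming the vsanTraceReader utility at its usual ESXi path and an illustrative trace file name:

# Decode a vSAN trace file and filter for the decommission delta-overlap check (file name is an example)
zcat /var/log/vsantraces/vsantraces--<timestamp>.gz | /usr/lib/vmware/vsan/bin/vsanTraceReader | grep CLOMTraceDecomDeltaOverlapCheck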

  • Resync is not progressing for the impacted objects because the objects have durability components and were set to pending resync owing to the delta overlap, so they must wait for the delta resync.
     
  • When we check the object's configuration, we see RAID_D (Durability) components even though there is no reason for them to be present: when all hosts are online and none are failing or entering maintenance mode, we should never see RAID_D (Durability) components. Durability components are introduced when an ESXi host goes into maintenance mode; a new durability component is created for each component stored on that host, which allows all new VM I/O to be committed to both the existing component and the durability component. The layout can be inspected with the command sketch below.
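
The object layout, including any RAID_D (Durability) branches, can be inspected from the ESXi shell. A minimal sketch, using the example object UUID from this article (substitute the UUID of the affected object):

# Print the full layout of a single vSAN object, including durability components
esxcli vsan debug object list --uuid 8d965263-9ec6-2fc3-99e5-043f72f60646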

Example object configuration with durability components while all hosts are online and participating in the vSAN cluster (parts of the output are trimmed for readability):

Object UUID: 8d965263-9ec6-2fc3-99e5-043f72f60646
   Version: 15
   Health: healthy
.
.
   Configuration: 
      
      RAID_1
         RAID_D
            Component: 039b5263-5a69-a50c-7840-043f72f60646
              Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 523ed497-19a2-be53-5e17-980b2d36dc9b,  Disk Name: naa.xxx
              Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx01-r17.p01.xxx
            Component: ce352766-26f1-da3f-944a-b8cef603283c
              Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 522dc47e-89c4-992d-e844-aaf60edf82c4,  Disk Name: naa.xxx
              Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx03-r15.p01.xxx
         Component: 039b5263-d6f5-ab0c-76a3-043f72f60646
           Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52b791a5-458b-867b-36da-7f85d2b3e517,  Disk Name: naa.xxx
           Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx02-r02.p01.xxx
      Witness: 6cb2a565-3a00-a65e-2fc5-b8cef6568f32
        Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52316ded-bb84-9075-9d92-d9d2824bd6e1,  Disk Name: naa.xxx
        Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx06-r18.p01.xxx

Resolution

Once we have verified that there should not be any durability components for the specific object(s), the durability components can be merged into the data components using either of the following two approaches:

  • Owner abdicate the object. This elects a new owner for the object, refreshes the object's state, and merges the durability components into the data components.
  • Storage vMotion the VM to another vSAN cluster. This also triggers the merge of the durability components into the data components.

Use the following command to owner abdicate the object:

vsish -e set /vmkModules/vsan/dom/ownerAbdicate <Affected_Object_UUID>

Example: vsish -e set /vmkModules/vsan/dom/ownerAbdicate 8d965263-9ec6-2fc3-99e5-043f72f60646
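
If more than one object is stuck in the pending-resync state, the same command can be repeated for each UUID. A minimal sketch; the placeholders must be replaced with the UUIDs of the affected objects:

# Abdicate ownership for each affected object (replace the placeholders with real UUIDs)
for uuid in <Affected_Object_UUID_1> <Affected_Object_UUID_2>; do
   vsish -e set /vmkModules/vsan/dom/ownerAbdicate "$uuid"
done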

After running the owner abdicate command, wait a short while and then examine the object configuration again; the durability components will have merged into the data components, and the RAID_D (Durability) component will no longer be present for the impacted object.

Object UUID: 8d965263-9ec6-2fc3-99e5-043f72f60646
   Version: 15
   Health: healthy
.
.
   Configuration: 
      
      RAID_1
         Component: 039b5263-5a69-a50c-7840-043f72f60646
           Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 523ed497-19a2-be53-5e17-980b2d36dc9b,  Disk Name: naa.xxx
           Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx01-r17.p01.xxx
         Component: 039b5263-d6f5-ab0c-76a3-043f72f60646
           Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52b791a5-458b-867b-36da-7f85d2b3e517,  Disk Name: naa.xxx
           Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx02-r02.p01.xxx
      Witness: 6cb2a565-3a00-a65e-2fc5-b8cef6568f32
        Component State: ACTIVE,  Address Space(B): 0 (0.00GB),  Disk UUID: 52316ded-bb84-9075-9d92-d9d2824bd6e1,  Disk Name: naa.xxx
        Votes: 1,  Capacity Used(B): 12582912 (0.01GB),  Physical Capacity Used(B): 4194304 (0.00GB),  Host Name: esx06-r18.p01.xxx

After the durability components are merged, the host will be able to enter maintenance mode using Ensure accessibility or Full data migration without any issues.
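
Maintenance mode can then be retried from the vSphere Client or, alternatively, from the ESXi shell. A sketch assuming the standard esxcli maintenance mode options:

# Re-attempt maintenance mode with the "Ensure accessibility" vSAN data migration option
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# Confirm the maintenance mode state
esxcli system maintenanceMode get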