vSAN host fails to enter maintenance mode with Ensure Accessibility mode in a stretched cluster setup


Article ID: 326822


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:
  • When you place a host into maintenance mode with the "Ensure Accessibility" option, the task may remain at 100% for a long time or fail after 60 minutes.
  • This issue is specific to storage policies with Site disaster tolerance set to "None - stretched cluster" and Failures to tolerate set to RAID-1/5/6.
  • The issue occurs only with the Ensure Accessibility maintenance mode option.
  • vSAN Skyline Health and "esxcli vsan debug resync summary get" may not report any objects in resync.
  • All vSAN objects may report as "healthy" in vSAN Skyline Health and in "esxcli vsan debug object health summary get". (A consolidated check sketch follows the log excerpts below.)

 

  • In this example, the vCenter UI and CLI reported "Objects Evacuated" as 447 of 448, with "Data evacuated" as 5457292 MB of 5457292 MB. (The object count and the amount of data vary by environment.)
  • In /var/run/log/clomd.log on the host in question, one of the objects is stuck in "OBJECT_STATE_PENDING_RESYNC":
2021-10-13T01:43:20.786Z info clomd[2104445] [Originator@6876] CLOMDecomUpdateObjState: Changed 1ef3a55f-a16b-48de-28e7-0617df39b688 state from OBJECT_STATE_PENDING_RESYNC to OBJECT_STATE_DONE
2021-10-13T00:42:24.094Z info clomd[2104445] [Originator@6876] CLOMDecomUpdateObjState: Changed 1ef3a55f-a16b-48de-28e7-0617df39b688 state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_PENDING_RESYNC
  • The above-mentioned object had both its "Base" and "Delta" components in the ACTIVE state, and the CLEANUP work item for this object kept failing:
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOMCleanup_Object: Refinalizing 1ef3a55f-a16b-48de-28e7-0617df39b688
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_VerifyPolicyCompliance: Obj 1ef3a55f-a16b-48de-28e7-0617df39b688 is not compliant, reason:0x40
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_VerifyPolicyCompliance: Current Policy
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOMLogConfigurationPolicy: Object size 273804165120 bytes with policy: (("stripeWidth" i4) ("capacity" (l0 l273804165120)) ("proportionalCapacity" 
(i0 i100)) ("affinity" [ 52954f6a-9cd2-4508-2e3f-ebf2c6eb3fc9 524a053a-5ed3-7a91-f6a8-a442917c53a2 52fe16f7-253b-b298-3a7d-1a18a6623956 5221b912-e4f8-2c5c-d01a-30f06f0e56ed 5286cd85-f57e-ff8e-13b1-41070dd7e55d]) 
("storageType" "AllFlash") ("replicaPreference" "Raid5Lower"))

2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_VerifyPolicyCompliance: Target Policy
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOMLogConfigurationPolicy: Object size 273804165120 bytes with policy: (("stripeWidth" i4) ("cacheReservation" i0) ("proportionalCapacity" i0) 
("hostFailuresToTolerate" i0) ("forceProvisioning" i0) ("spbmProfileId" "e8c34147-59da-4a18-8491-b97cb7acb242") ("spbmProfileGenerationNumber" l+2) ("objectVersion" i13) ("replicaPreference" "Capacity") ("iopsLimit" i0) 
("checksumDisabled" i0) ("subFailuresToTolerate" i1) ("CSN" l8548) ("SCSN" l8548) ("spbmProfileName" "Default VMC VM Storage Policy") ("locality" "None"))

2021-10-13T00:47:44.576Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_IsHFTPreserved: preConfigActive: 2, postActiveAndTransient: 2, preUFT/LFT: 0/1, postUFT/LFT: 0/0
2021-10-13T00:47:44.576Z warning clomd[2098747] [Originator@6876 opID=1804290700] CLOM_ComplianceSanityCheck: Config is non-compliant. Failed to fix the config.
2021-10-13T00:47:44.576Z error clomd[2098747] [Originator@6876 opID=1804290700] CLOM_FixObjectWrapper: StepFix failed for object 1ef3a55f-a16b-48de-28e7-0617df39b688: Failure

2021-10-13T00:47:44.576Z error clomd[2098747] [Originator@6876 opID=1804290700] CLOMReconfigure: exit: obj 1ef3a55f-a16b-48de-28e7-0617df39b688 transiantCapGenerated - total: 0, site1: 0, site2: 0, 
workItem type CLEANUP configDelay 0 newConfigGenerated 0 status Failure
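
The checks referenced in the symptoms above can be run from the shell of the affected host. This is a minimal sketch using the commands referenced in this article and the example object UUID from the log excerpts; replace the UUID with the one seen in your environment.

# Resync summary may show no resyncing objects even though the evacuation is stuck
esxcli vsan debug resync summary get

# Object health may still report all objects as healthy
esxcli vsan debug object health summary get

# Find objects that CLOM reported in OBJECT_STATE_PENDING_RESYNC
grep OBJECT_STATE_PENDING_RESYNC /var/run/log/clomd.log

# Review the repeated CLEANUP failures for a specific object (example UUID from the excerpts above)
grep 1ef3a55f-a16b-48de-28e7-0617df39b688 /var/run/log/clomd.log | grep -E "CLOMCleanup_Object|CLOM_ComplianceSanityCheck|CLOM_FixObjectWrapper"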


Environment

VMware vSAN 7.0.x
VMware vSAN 8.0.x

Cause

The object has a DELTA subtree with both Base and Delta components in the ACTIVE state. When CLOM tries to clean up such a configuration, the cleanup fails and logs:

    2021-10-13T00:47:44.576Z warning clomd[2098747] [Originator@6876 opID=1804290700] CLOM_ComplianceSanityCheck: Config is non-compliant. Failed to fix the config.

CLOM fails the cleanup operation because it determines that SubFailuresToTolerate is being reduced (preUFT/LFT: 0/1 versus postUFT/LFT: 0/0 in the entry below):
2021-10-13T00:47:44.576Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_IsHFTPreserved: preConfigActive: 2, postActiveAndTransient: 2, preUFT/LFT: 0/1, postUFT/LFT: 0/0
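
To confirm this state for a specific object, its current policy and component layout can be dumped on the host. This is a minimal sketch using the example UUID from the log excerpts; the exact output layout varies by vSAN release.

# Dump the stuck object's policy and component tree; in this condition the delta
# subtree shows both its Base and Delta components as ACTIVE
esxcli vsan debug object list --uuid=1ef3a55f-a16b-48de-28e7-0617df39b688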


Resolution

Do not use a storage policy in a stretched cluster with Site disaster tolerance set to either "None - standard cluster" or "None - stretched cluster" combined with Failures to tolerate set to RAID-1/5/6.
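
To check whether any existing objects still carry this policy combination, the per-object policy can be reviewed from any host in the cluster. This is a minimal sketch; it assumes the policy attributes appear in the output under the same names seen in the clomd.log excerpts ("locality", "subFailuresToTolerate"), which may vary by vSAN release.

# List every vSAN object with its policy; objects affected by this issue combine
# locality "None" (no site disaster tolerance) with a non-zero subFailuresToTolerate
esxcli vsan debug object list --all | grep -E "Object UUID|locality|subFailuresToTolerate"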

Steps to remediate if you have already hit this issue

  1. Check the storage/capacity usage on each site/fault domain of the stretched cluster by selecting the vSAN cluster > Configure > vSAN > Fault Domains.
  2. Add a host to the site/fault domain that has lower usage if the size of the affected object(s) is greater than the space remaining.
  3. Identify the VMs/VMDKs that reported "OBJECT_STATE_PENDING_RESYNC" in clomd.log by running: cat /var/log/clomd.log | grep OBJECT_STATE_PENDING_RESYNC | less (a consolidated sketch follows this list).
  4. Create a new policy from "VM Storage Policies" with Site disaster tolerance set to "None - keep data on Preferred" (if the Preferred site has more capacity) and apply it to the affected VMDK/object.
    1. Example: If the object/VMDK has Site disaster tolerance set to "None - stretched cluster" and Failures to tolerate set to "1 failure - RAID-1 (Mirroring)", create a new policy (or change the existing one) with Site disaster tolerance set to "None - keep data on Preferred (stretched cluster)" and Failures to tolerate set to "1 failure - RAID-1 (Mirroring)", and apply it to the affected VMDK/object.
  5. Wait for the resync to complete.
  6. Repeat steps 1-4, in batches, for all the objects that reported "OBJECT_STATE_PENDING_RESYNC" in clomd.log.
  7. Re-attempt putting the host into maintenance mode after the resync has completed on all reported objects.
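
The identification and verification parts of the steps above can be run from the shell of a host in the cluster. This is a minimal sketch under the assumptions already used in this article (log path, example log format); <object-uuid> is a placeholder for each UUID found in step 3, and the policy change itself (step 4) is done from the vSphere Client.

# Step 3: collect the unique object UUIDs reported stuck in PENDING_RESYNC
grep OBJECT_STATE_PENDING_RESYNC /var/log/clomd.log | grep -oE "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}" | sort -u

# Map a UUID back to its VM/VMDK and policy before changing the policy in the vSphere Client
esxcli vsan debug object list --uuid=<object-uuid>

# Step 5: monitor the resync triggered by the policy change until nothing is left to resync
esxcli vsan debug resync summary get

# Step 7: once resync has completed on all reported objects, re-attempt maintenance mode
# with Ensure Accessibility (this can also be done from the vSphere Client)
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility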

Notes:

Changing the policy on all affected objects/VMDKs in a cluster at once would result in a huge cluster-wide resync; therefore, it is recommended to identify the affected objects and apply the fix above in batches.