vSAN host fails to enter maintenance mode with Ensure Accessibility mode in a stretched cluster setup

Article ID: 326822


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:
  • When you place a host into maintenance mode with the "Ensure Accessibility" option, the task may stay at 100% for a long time or fail after 60 minutes.
  • This issue is specific to storage policies with Site disaster tolerance set to "None - stretched cluster" and Failures to tolerate set to RAID-1/5/6.
  • The issue occurs only with the Ensure Accessibility maintenance mode.
  • The vSAN Skyline Health UI and "esxcli vsan debug resync summary get" may not report any objects in resync.
  • All vSAN objects may report as "healthy" in vSAN Skyline Health and in "esxcli vsan debug object health summary get" (a command-line sketch for confirming these symptoms follows the log excerpts below).

 

  • In this example, the vCenter UI and CLI reported "Objects Evacuated" as 447 of 448, with "Data evacuated" as 5457292 MB of 5457292 MB. (The object count and data amount will vary per environment.)
  • In /var/run/log/clomd.log on the host in question, one object can be seen stuck in "OBJECT_STATE_PENDING_RESYNC":
2021-10-13T01:43:20.786Z info clomd[2104445] [Originator@6876] CLOMDecomUpdateObjState: Changed 1ef3a55f-####-####-####-########688 state from OBJECT_STATE_PENDING_RESYNC to OBJECT_STATE_DONE
2021-10-13T00:42:24.094Z info clomd[2104445] [Originator@6876] CLOMDecomUpdateObjState: Changed 1ef3a55f-####-####-####-########688 state from OBJECT_STATE_LIKELY_AFFECTED to OBJECT_STATE_PENDING_RESYNC
  • The above object had both its "Base" and "Delta" components in ACTIVE state, and the CLEANUP work item for this object kept failing:
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOMCleanup_Object: Refinalizing 1ef3a55f-####-####-####-########688
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_VerifyPolicyCompliance: Obj 1ef3a55f-####-####-####-########688 is not compliant, reason:0x40
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_VerifyPolicyCompliance: Current Policy
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOMLogConfigurationPolicy: Object size 273804165120 bytes with policy: (("stripeWidth" i4) ("capacity" (l0 l273804165120)) ("proportionalCapacity" 
(i0 i100)) ("affinity" [ 52954f6a-####-####-####-########fc9 524a053a-####-####-####-########3a2 52fe16f7-####-####-####-########956 5221b912-####-####-####-########6ed 5286cd85-####-####-####-########55d]) 
("storageType" "AllFlash") ("replicaPreference" "Raid5Lower"))

2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_VerifyPolicyCompliance: Target Policy
2021-10-13T00:47:44.575Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOMLogConfigurationPolicy: Object size 273804165120 bytes with policy: (("stripeWidth" i4) ("cacheReservation" i0) ("proportionalCapacity" i0) 
("hostFailuresToTolerate" i0) ("forceProvisioning" i0) ("spbmProfileId" "e8c34147-####-####-####-########242") ("spbmProfileGenerationNumber" l+2) ("objectVersion" i13) ("replicaPreference" "Capacity") ("iopsLimit" i0) 
("checksumDisabled" i0) ("subFailuresToTolerate" i1) ("CSN" l8548) ("SCSN" l8548) ("spbmProfileName" "Default VMC VM Storage Policy") ("locality" "None"))

2021-10-13T00:47:44.576Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_IsHFTPreserved: preConfigActive: 2, postActiveAndTransient: 2, preUFT/LFT: 0/1, postUFT/LFT: 0/0
2021-10-13T00:47:44.576Z warning clomd[2098747] [Originator@6876 opID=1804290700] CLOM_ComplianceSanityCheck: Config is non-compliant. Failed to fix the config.
2021-10-13T00:47:44.576Z error clomd[2098747] [Originator@6876 opID=1804290700] CLOM_FixObjectWrapper: StepFix failed for object ########-####-####-####-########b688: Failure

2021-10-13T00:47:44.576Z error clomd[2098747] [Originator@6876 opID=1804290700] CLOMReconfigure: exit: obj ########-####-####-####-########b688 transiantCapGenerated - total: 0, site1: 0, site2: 0, 
workItem type CLEANUP configDelay 0 newConfigGenerated 0 status Failure
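
To confirm these symptoms from the ESXi shell of the affected host, the checks below can be used. They rely on the commands referenced in the symptoms above; the output format varies by vSAN release.

    # Verify that no resyncing objects are reported
    esxcli vsan debug resync summary get

    # Verify overall object health
    esxcli vsan debug object health summary get

    # Look for objects stuck in PENDING_RESYNC in clomd.log
    grep OBJECT_STATE_PENDING_RESYNC /var/run/log/clomd.log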


Environment

VMware vSAN 7.0.x
VMware vSAN 8.0.x

Cause

The object has a DELTA subtree with both the Base and Delta components in ACTIVE state. When CLOM tries to clean up such a configuration, it fails and logs:

    2021-10-13T00:47:44.576Z warning clomd[2098747] [Originator@6876 opID=1804290700] CLOM_ComplianceSanityCheck: Config is non-compliant. Failed to fix the config.

CLOM fails the cleanup operation because it determines that SubFailuresToTolerate would be reduced:

    2021-10-13T00:47:44.576Z info clomd[2098747] [Originator@6876 opID=1804290700] CLOM_IsHFTPreserved: preConfigActive: 2, postActiveAndTransient: 2, preUFT/LFT: 0/1, postUFT/LFT: 0/0
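
To inspect the component layout of the affected object (for example, to confirm that both the Base and Delta components are in ACTIVE state), the object UUID reported in clomd.log can be passed to the vSAN debug command below. This is a minimal sketch; verify the exact option name with "esxcli vsan debug object list --help" on your release.

    # Show owner, policy, path, and per-component state for one object
    # (replace <object-uuid> with the UUID reported in clomd.log)
    esxcli vsan debug object list --uuid=<object-uuid>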


Resolution

Do not use a storage policy in a stretched cluster with Site disaster tolerance set to either "None - standard cluster" or "None - stretched cluster" and Failures to tolerate set to RAID-1/5/6.

Steps to remediate if you have already hit this issue

  1. Check the storage/capacity usage on each site/fault domain of the stretched cluster by selecting the vSAN cluster > Configure > vSAN > Fault Domains.
  2. If the affected object(s) size is greater than the space remaining, add a host to the site/fault domain that has less usage.
  3. Identify the VMs/VMDKs that reported "OBJECT_STATE_PENDING_RESYNC" in clomd.log by running: cat /var/log/clomd.log | grep OBJECT_STATE_PENDING_RESYNC | less
  4. Create a new policy from "VM Storage Policies" with "Site disaster tolerance = None - keep data on Preferred" (if the preferred site has more capacity) and apply it to the affected VMDK/object.
    1. Example: If the object/VMDK has Site disaster tolerance set to "None - stretched cluster" and Failures to tolerate set to "1 failure - RAID-1 (Mirroring)", create a new policy with Site disaster tolerance set to "None - keep data on Preferred (stretched cluster)" and Failures to tolerate set to "1 failure - RAID-1 (Mirroring)", and apply it to the affected VMDK/object.
  5. Wait for the resync to complete.
  6. Repeat steps 1-4 in batches for all objects that reported "OBJECT_STATE_PENDING_RESYNC" in clomd.log.
  7. Re-attempt placing the host into maintenance mode after the resync completes for all reported objects (a command-line sketch follows this list).
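
If you prefer to retry maintenance mode from the ESXi shell instead of the vSphere Client, a minimal sketch is shown below. It assumes the Ensure Accessibility behavior is selected through the vsanmode option; confirm the available values with "esxcli system maintenanceMode set --help" on your release.

    # Re-attempt maintenance mode with the Ensure Accessibility vSAN mode
    esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

    # Confirm the host's maintenance mode state
    esxcli system maintenanceMode get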

Notes:

Changing policies on all affected objects/VMDKs in the cluster at once would trigger a very large cluster-wide resync; therefore, it is recommended to identify the culprit objects and apply the fix above to them in batches.
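
Between batches, resync progress can be watched from the ESXi shell; a minimal sketch is shown below (the 60-second polling interval is an arbitrary choice).

    # Poll the resync summary once per minute until the current batch completes
    while true; do esxcli vsan debug resync summary get; sleep 60; done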