DRS functionality impacted by unhealthy state of the vSphere Cluster Services (vCLS)

Products

VMware vCenter Server

Issue/Introduction

This article provides information on resolving vCLS health issues so that DRS functions correctly in the cluster.

Starting with vSphere 7.0 Update 1, vSphere DRS for a cluster depends on the health of vSphere Cluster Services (vCLS). vCLS configures a quorum of vCLS system VMs on the cluster, and these VMs are necessary to maintain the health of the cluster services. If vCLS health is impacted because these VMs are unavailable in a cluster, vSphere DRS will not be functional in that cluster until the vCLS VMs are brought back up.

The operations listed below could fail if performed while DRS is not functional. Note also that on a new DRS-enabled cluster, these operations are not available until the first vCLS VM is deployed and powered on in that cluster.

A new workload VM placement/power-on.
Host selection for a VM that is migrated from another cluster/host within the vCenter.
A migrated VM could get powered on on a host that was not selected by DRS.
Placing a host into maintenance mode might get stuck if the host has any powered-on VMs.
Invocation of DRS APIs such as ClusterComputeResource.placeVm() and ClusterComputeResource.enterMaintenanceMode() will fail with InvalidState.
Configuration of Workload Management, Supervisor Cluster and Tanzu Kubernetes Cluster will fail.
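
Because the DRS APIs listed above return InvalidState while vCLS is unhealthy, automation that calls them may want to wait and retry rather than fail immediately. The following is a minimal sketch of such a retry wrapper, not part of any VMware SDK: the function name call_when_drs_ready, its parameters, and the retry counts are hypothetical, and the fault class is passed in by the caller so that the example stays self-contained.

```python
import time


def call_when_drs_ready(operation, invalid_state_exc, attempts=5,
                        delay_s=30.0, sleep=time.sleep):
    """Run `operation`, retrying while it raises `invalid_state_exc`.

    Hypothetical helper: `operation` is any zero-argument callable that
    wraps a DRS API call, and `invalid_state_exc` is the exception class
    the SDK in use raises for an InvalidState fault.
    Returns the operation's result, or re-raises the fault once all
    `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except invalid_state_exc:
            if attempt == attempts:
                raise  # vCLS is still unhealthy; let the caller decide
            sleep(delay_s)  # give the vCLS service time to restore its VMs
```

With pyVmomi, a caller would typically pass vim.fault.InvalidState as the fault class and a lambda wrapping the cluster call, assuming those names match the SDK version in use. Retrying only helps when vCLS recovers on its own; the causes listed under Cause still need to be resolved.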

Note: If DRS is not enabled on such a cluster, the vSphere Cluster health will be in the Degraded state. In the vSphere Client UI, an error similar to the following message will be visible:

vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.

For more information, see vSphere Cluster Services (vCLS) in vSphere 7.0 Update 1.

Environment

VMware vCenter Server 7.0.x

VMware vCenter Server 8.0.x

Cause

Several conditions can result in this error:

A user has powered off or deleted vCLS VMs from a DRS-enabled cluster.
Deployment of the vCLS VMs failed.
Power-on of the vCLS VMs failed.
vCLS was disabled on the cluster using Retreat Mode.
HA was unable to fail over the vCLS VMs after a host or storage failure.

Resolution

The workarounds below apply to vSphere 7.0 Update 1 and later. For any additional issues, please file a Support Request.

Additional Information

vCLS VMs are automatically powered on or recreated by the vCLS service. In a greenfield (fresh) deployment, these VMs are deployed before any workload VMs. In an upgrade scenario, they are deployed before vSphere DRS is configured to run on the clusters. When all the vCLS VMs are powered off or deleted, the vSphere Cluster status for that cluster turns to 'Degraded (Yellow)'. vSphere DRS needs at least one of the vCLS VMs to be running in a vSphere cluster to be healthy. If DRS runs before these VMs are brought back up, the cluster service will be 'Unhealthy (Red)' until the vCLS VMs are brought back up.
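
The status rules in the paragraph above can be summarized as a small decision function. This is only an illustrative sketch of the described behavior, not how vCenter computes the status internally; the function name, its parameters, and the "Healthy" label are assumptions made for this example.

```python
def vcls_cluster_status(running_vcls_vms, drs_ran_without_vcls):
    """Map vCLS VM availability to the cluster status described above.

    running_vcls_vms: number of vCLS VMs currently powered on.
    drs_ran_without_vcls: True if DRS has run while no vCLS VM was up.
    Both inputs are hypothetical simplifications for illustration.
    """
    if running_vcls_vms < 0:
        raise ValueError("running_vcls_vms cannot be negative")
    if running_vcls_vms >= 1:
        # DRS needs at least one running vCLS VM to be healthy.
        return "Healthy"
    if drs_ran_without_vcls:
        # DRS ran before the vCLS VMs were brought back up.
        return "Unhealthy (Red)"
    # All vCLS VMs are powered off or deleted.
    return "Degraded (Yellow)"
```

For example, vcls_cluster_status(0, False) corresponds to the state right after all vCLS VMs have been powered off, and vcls_cluster_status(0, True) to the state after DRS has attempted to run without them.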

Scenarios in which vCLS VM deployment could fail, with their resolutions:

Not enough free resources in the cluster - vCLS requires 400 MHz of CPU, 400 MB of memory, and 2 GB of storage space on a cluster with more than 3 hosts. For more information on the resource requirements for these VMs, see the vCLS VM Resource Allocation section of the vSphere Resource Management Guide. vCLS reserves slots equal to the quorum size of the VMs + 1 per cluster, so the cluster needs this much extra capacity for the vCLS VMs to deploy successfully.
Deployment failures in 1-node and 2-node vSAN clusters - vCLS VMs fail to deploy on a 1-node or 2-node vSAN cluster with the error: Can't provision VM for ClusterAgent due to lack of suitable datastore. vCLS uses the datastore default policy for datastore selection, and if vSAN is the only available datastore within the cluster, the default policy requires a 3-node vSAN cluster. The deployment of these VMs will fail in such a cluster. If a 2-node vSAN cluster has a witness node, deployment of the vCLS VMs succeeds. The workaround is to increase the size of the vSAN cluster or to change the datastore default policy.
Orphaned VM cases - If there are orphaned vCLS VMs in the vCenter Server because hosts were disconnected and reconnected, deployment of new vCLS VMs in such a cluster after adding the host might fail. The suggested workaround is to clean up any stale/orphaned vCLS VMs from the inventory.
EAM is unable to validate the STS token - "lack of suitable datastore"
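
The resource and vSAN conditions above can be checked before troubleshooting a failed deployment. The sketch below encodes only the figures quoted in this article (400 MHz, 400 MB, 2 GB, and the 3-node vSAN requirement of the default policy). The ClusterSnapshot type, its field names, and the function name are hypothetical, and the values would have to be collected by the reader from vCenter.

```python
from dataclasses import dataclass

# Figures quoted in this article for a cluster with more than 3 hosts.
REQUIRED_CPU_MHZ = 400
REQUIRED_MEMORY_MB = 400
REQUIRED_STORAGE_GB = 2


@dataclass
class ClusterSnapshot:
    """Hypothetical summary of a cluster, filled in by the caller."""
    free_cpu_mhz: int
    free_memory_mb: int
    free_storage_gb: float
    vsan_is_only_datastore: bool = False
    vsan_nodes: int = 0
    vsan_has_witness: bool = False


def vcls_deployment_blockers(cluster):
    """Return a list of reasons vCLS VM deployment could fail."""
    blockers = []
    if cluster.free_cpu_mhz < REQUIRED_CPU_MHZ:
        blockers.append("insufficient free CPU")
    if cluster.free_memory_mb < REQUIRED_MEMORY_MB:
        blockers.append("insufficient free memory")
    if cluster.free_storage_gb < REQUIRED_STORAGE_GB:
        blockers.append("insufficient free storage")
    # Assumes the datastore default policy is unchanged: it needs 3 vSAN
    # nodes, or a 2-node cluster that has a witness node.
    if cluster.vsan_is_only_datastore and cluster.vsan_nodes < 3:
        two_node_with_witness = (cluster.vsan_nodes == 2
                                 and cluster.vsan_has_witness)
        if not two_node_with_witness:
            blockers.append("vSAN cluster too small for default policy")
    return blockers
```

An empty list means none of these particular conditions apply; orphaned VMs and the other scenarios in this list still need to be checked separately.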

Scenarios in which vCLS VM power-on could fail, with their resolutions:

Not enough free resources in the cluster.
Power-on of disconnected/orphaned vCLS VMs could fail - If there are orphaned vCLS VMs in vCenter because hosts were disconnected and reconnected, power-on of these VMs could fail because they are disconnected. The workaround is to manually delete these VMs so that new vCLS VMs are deployed automatically on properly connected hosts/datastores.
Power-on failure due to changes to the configuration of the VMs - If a user changes the configuration of a vCLS VM, power-on of that VM could fail. Users are not supposed to change any configuration of these VMs.
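
To find candidates for the manual cleanup described above, an exported VM inventory can be filtered for vCLS VMs that are orphaned or disconnected. This is a minimal sketch over plain dictionaries rather than a real vCenter query. The record keys ("name", "connection_state"), the state strings, and the name prefix match are assumptions for illustration and should be adapted to however the inventory is exported.

```python
def stale_vcls_vms(inventory):
    """Return names of vCLS VMs that look orphaned or disconnected.

    `inventory` is a list of dicts with hypothetical keys "name" and
    "connection_state"; nothing is deleted or modified here.
    """
    stale_states = {"orphaned", "disconnected"}
    return [
        vm["name"]
        for vm in inventory
        # Assumes vCLS VM names begin with "vCLS".
        if vm["name"].startswith("vCLS")
        and vm["connection_state"] in stale_states
    ]
```

The function only reports names. Deleting the stale VMs remains a manual step, after which the vCLS service deploys replacements automatically.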

ESXi is stuck in maintenance mode at 32% without any Progress due to Stale/Orphaned vCLS VM