In a cluster that contains both supported hosts (ESXi 8.0 U3 or later) and unsupported hosts (older than ESXi 8.0 U3), vCenter will attempt to deploy Embedded vCLS. However, if all supported hosts become unavailable, it will attempt to fall back to the original implementation, External vCLS. This process is not instantaneous and is subject to the known issues and edge cases of External vCLS, so DRS may be unavailable for a period of time.
The most likely trigger occurs just after a "vCLS upgrade". For example, during a rolling upgrade, when the first 8.0 U3 host exits Maintenance Mode, it starts running a single Embedded vCLS VM and the cluster destroys the previous External vCLS VM. The expectation is that more hosts will follow in the upgrade, allowing redundant Embedded vCLS VMs to be deployed. However, if the 8.0 U3 host is put back into Maintenance Mode, becomes Not Responding, or is otherwise made unavailable, then vCLS will become unavailable. vCenter will try to resolve this state by downgrading to External vCLS.
Detection
A downgrade is evident from the absence of Embedded vCLS VMs and the possible presence of External vCLS VMs in a cluster that contains hosts which support Embedded vCLS but are currently unavailable.
A downgrade emits an event of type com.vmware.vc.vcls.TransitionedToExternalEvent on the cluster. Conversely, an upgrade emits a com.vmware.vc.vcls.TransitionedToEmbeddedEvent event.
A vCLS outage, whether related to a downgrade or another cause, emits a com.vmware.vc.vcls.DegradedEvent or com.vmware.vc.vcls.NonHealthyEvent event.
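As an illustration of how these event types might be used when reviewing a cluster's event history, the following minimal Python sketch sorts event type IDs into the three categories described above. It assumes the event type IDs have already been exported as plain strings by whatever tooling is in use; the function names classify_vcls_event and summarize_vcls_events are hypothetical helpers, not part of any VMware API.

```python
from collections import Counter

# Event type IDs quoted in this article.
DOWNGRADE_EVENT = "com.vmware.vc.vcls.TransitionedToExternalEvent"
UPGRADE_EVENT = "com.vmware.vc.vcls.TransitionedToEmbeddedEvent"
OUTAGE_EVENTS = {
    "com.vmware.vc.vcls.DegradedEvent",
    "com.vmware.vc.vcls.NonHealthyEvent",
}


def classify_vcls_event(event_type_id):
    """Return 'downgrade', 'upgrade', 'outage', or None for unrelated events.

    Hypothetical helper: it only compares strings and does not contact vCenter.
    """
    if event_type_id == DOWNGRADE_EVENT:
        return "downgrade"
    if event_type_id == UPGRADE_EVENT:
        return "upgrade"
    if event_type_id in OUTAGE_EVENTS:
        return "outage"
    return None


def summarize_vcls_events(event_type_ids):
    """Count vCLS-related events per category, ignoring unrelated events."""
    counts = Counter()
    for event_type_id in event_type_ids:
        category = classify_vcls_event(event_type_id)
        if category is not None:
            counts[category] += 1
    return dict(counts)
```

A downgrade count above zero together with one or more outage events in the same time window would be consistent with the scenario described in this article, although outage events on their own can also have other causes.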
vCenter Server 8.0 U3
ESXi 8.0 U3
vCenter prefers to deploy Embedded vCLS whenever any available host in a cluster supports it (running ESXi 8.0 U3 or later). For clusters that contain only unsupported hosts (older than ESXi 8.0 U3), it instead deploys the original "External" version of vCLS. As a result, a cluster that contains both supported and unsupported hosts can switch between vCLS versions depending on which hosts are available. This behavior is intended to support seamlessly "upgrading" from External to Embedded vCLS: in that case, vCenter can wait until the first Embedded vCLS VM becomes available before it deactivates External vCLS and destroys those VMs. However, if all supported hosts become unavailable and only unsupported hosts remain, vCenter has to "downgrade" to External vCLS, and it cannot provide the same seamless transition. Because the downgrade is triggered by a host becoming unavailable (entering Maintenance Mode, Standby Mode, or the Disconnected or Not Responding state, or being removed from the cluster or inventory), vCenter cannot block for an indeterminate amount of time while it waits for the External vCLS deployment to complete. This leads to a period during which no version of vCLS is running, which causes DRS to be unavailable.
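The selection rule described above can be modeled with a short Python sketch. This is a simplified illustration of the behavior described in this article, not vCenter's actual implementation: the Host class, the (major, minor, update) version tuple, and the expected_vcls_mode function are all hypothetical, and the model ignores deployment delays and resource checks.

```python
from dataclasses import dataclass

# ESXi 8.0 U3 expressed as (major, minor, update); an illustrative encoding only.
EMBEDDED_MIN_VERSION = (8, 0, 3)


@dataclass
class Host:
    name: str
    version: tuple   # (major, minor, update)
    available: bool  # False for Maintenance Mode, Standby, Disconnected, etc.


def expected_vcls_mode(hosts):
    """Return 'embedded', 'external', or None if no host is available.

    Embedded vCLS is preferred when any available host supports it;
    otherwise the cluster falls back to External vCLS.
    """
    available = [h for h in hosts if h.available]
    if any(h.version >= EMBEDDED_MIN_VERSION for h in available):
        return "embedded"
    if available:
        return "external"
    return None


# Rolling upgrade: only the first host has reached 8.0 U3 so far.
cluster = [
    Host("esx-01", (8, 0, 3), available=True),
    Host("esx-02", (8, 0, 2), available=True),
    Host("esx-03", (8, 0, 2), available=True),
]
mode_before = expected_vcls_mode(cluster)  # "embedded"

# The only supported host goes back into Maintenance Mode: a downgrade follows.
cluster[0].available = False
mode_after = expected_vcls_mode(cluster)   # "external"
```

In this model the change from "embedded" to "external" happens instantly. In a real cluster, the gap between those two states is the period during which no vCLS VM is running and DRS is unavailable.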
In the common case, the outage is only temporary and should resolve itself within a short time.
The worst case occurs when a downgrade happens while the cluster is in a state where it cannot support External vCLS. This should be rare. If it does occur, the recommended action is to ensure that the hosts have enough resources to deploy External vCLS. Alternatively, the host that supports Embedded vCLS can be made available again, which will re-upgrade the cluster and deploy a new Embedded vCLS VM.
Concerned administrators wishing to proactively minimize the chances of disruption can perform one of the following mitigation steps.