Certain Workflows on ESXi 8.0 U3 may cause DRS Downtime in Mixed-Version Clusters

Products

VMware vSphere ESXi 8.0

VMware vCenter Server 8.0

Issue/Introduction

In a cluster that contains both supported hosts (ESXi 8.0 U3 or later) and unsupported hosts (older than ESXi 8.0 U3), vCenter will attempt to deploy Embedded vCLS. However, if all supported hosts become unavailable, it will attempt to fall back to the original implementation, External vCLS. This process is not instantaneous and is subject to the known issues and edge cases of External vCLS, so DRS may be unavailable for a period of time.

The most likely trigger occurs just after a "vCLS upgrade" scenario. For example, during a rolling upgrade, when the first 8.0 U3 host exits Maintenance Mode, it will start running a single Embedded vCLS VM and the cluster will destroy the previous External vCLS VMs. The expectation is that more hosts will follow in the upgrade, allowing redundant Embedded vCLS VMs to be deployed. However, if the 8.0 U3 host is put back into Maintenance Mode, becomes Not Responding, or is otherwise made unavailable, then vCLS will become unavailable. vCenter will try to resolve this state by downgrading to External vCLS.

Detection

A downgrade is evidenced by the absence of Embedded vCLS VMs, and the possible presence of External vCLS VMs, in a cluster whose hosts that would support Embedded vCLS are unavailable.

A downgrade emits an event of type com.vmware.vc.vcls.TransitionedToExternalEvent on the cluster. Conversely, an upgrade emits a com.vmware.vc.vcls.TransitionedToEmbeddedEvent event.

A vCLS outage, whether related to a downgrade or other cause, emits a com.vmware.vc.vcls.DegradedEvent or com.vmware.vc.vcls.NonHealthyEvent event.
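As an illustration only, the event types listed above can be grouped when reviewing an exported list of cluster events. The sketch below assumes the event type IDs have already been collected as plain strings; the function names classify_vcls_event and summarize are hypothetical and are not part of any VMware API or tool.

```python
from typing import Dict, Iterable, Optional

# Event type IDs as quoted in this article.
DOWNGRADE_EVENT = "com.vmware.vc.vcls.TransitionedToExternalEvent"
UPGRADE_EVENT = "com.vmware.vc.vcls.TransitionedToEmbeddedEvent"
OUTAGE_EVENTS = {
    "com.vmware.vc.vcls.DegradedEvent",
    "com.vmware.vc.vcls.NonHealthyEvent",
}


def classify_vcls_event(event_type_id: str) -> Optional[str]:
    """Map an event type ID to 'downgrade', 'upgrade', 'outage', or None."""
    if event_type_id == DOWNGRADE_EVENT:
        return "downgrade"
    if event_type_id == UPGRADE_EVENT:
        return "upgrade"
    if event_type_id in OUTAGE_EVENTS:
        return "outage"
    return None  # Not one of the vCLS events discussed here.


def summarize(event_type_ids: Iterable[str]) -> Dict[str, int]:
    """Count vCLS-related events by category, ignoring unrelated events."""
    counts = {"downgrade": 0, "upgrade": 0, "outage": 0}
    for event_type_id in event_type_ids:
        label = classify_vcls_event(event_type_id)
        if label is not None:
            counts[label] += 1
    return counts
```

A downgrade count greater than zero, together with one or more outage events, would be consistent with the scenario described in this article.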

Environment

vCenter Server 8.0 U3

ESXi 8.0 U3

Cause

vCenter prefers to deploy Embedded vCLS whenever any available host in a cluster supports it (running ESXi 8.0 U3 or later). For clusters of unsupported hosts (older than ESXi 8.0 U3), it will instead deploy the original "External" version of vCLS. As a result, a cluster that consists of both supported and unsupported hosts can switch versions of vCLS depending on which hosts are available. This behavior is intended to support seamlessly "upgrading" from External to Embedded vCLS. During an upgrade, vCenter is able to wait until the first Embedded vCLS VM becomes available before it deactivates External vCLS and destroys those VMs.

However, if all supported hosts become unavailable and only unsupported hosts remain, vCenter must perform a "downgrade" to External vCLS, which cannot provide the same seamless transition. Because this step is triggered by a host becoming unavailable (entering Maintenance Mode, Standby Mode, or the Disconnected or Not Responding state, or being removed from the cluster or inventory), vCenter cannot block for an indeterminate amount of time to wait for External vCLS deployment to occur. This leads to a period where no version of vCLS is running, which causes DRS to be unavailable.
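The selection rule described above can be modeled with a minimal sketch. This is a simplified illustration of the behavior as this article describes it, not vCenter's actual implementation; the Host type, the version tuples, and the function names are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: versions are modeled as (major, minor, update) tuples,
# so ESXi 8.0 U3 is (8, 0, 3).
EMBEDDED_MIN_VERSION: Tuple[int, int, int] = (8, 0, 3)


@dataclass
class Host:
    name: str
    version: Tuple[int, int, int]
    available: bool  # False for Maintenance Mode, Standby, Not Responding, etc.


def supports_embedded(host: Host) -> bool:
    """A host supports Embedded vCLS if it runs ESXi 8.0 U3 or later."""
    return host.version >= EMBEDDED_MIN_VERSION


def preferred_vcls_mode(hosts: List[Host]) -> Optional[str]:
    """Return 'embedded' if any available host supports it, else 'external'.

    Returns None when no host is available at all.
    """
    available = [h for h in hosts if h.available]
    if not available:
        return None
    if any(supports_embedded(h) for h in available):
        return "embedded"
    return "external"


# One upgraded host plus two older hosts: Embedded vCLS is preferred.
cluster = [
    Host("esx-01", (8, 0, 3), available=True),
    Host("esx-02", (8, 0, 2), available=True),
    Host("esx-03", (8, 0, 2), available=True),
]
mode_before = preferred_vcls_mode(cluster)

# The only supported host becomes unavailable: the model selects External
# vCLS, which corresponds to the "downgrade" described above.
cluster[0].available = False
mode_after = preferred_vcls_mode(cluster)
```

In this model the switch is instantaneous. In a real cluster, the gap between the two modes is where the DRS outage occurs.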

Resolution

In the common case, the outage is only temporary and should resolve itself after a short amount of time.

The worst case occurs when a downgrade is triggered while the cluster is in a state where it cannot support External vCLS. This should be rare. If it does occur, the recommended action is to ensure that hosts have enough resources to deploy External vCLS. Alternatively, the host that supports Embedded vCLS can be made available again, which will re-upgrade the cluster and deploy a new Embedded vCLS VM.

Concerned administrators wishing to proactively minimize the chances of disruption can perform one of the following mitigation steps.

Use Retreat Mode to bring down vCLS and DRS explicitly before a rolling upgrade, and only exit it once enough (or all) hosts have finished upgrading. This will minimize the time spent switching between versions during the process.
When the first host finishes upgrading to a supported version, keep it in Maintenance Mode until a second host has been upgraded, then take both out of Maintenance Mode together. This provides redundancy immediately, reducing the likelihood of needing to initiate a downgrade.
If the potential of a brief downgrade outage is acceptable but a longer one is not, ensure that hosts have sufficient available resources and bandwidth to vCenter so that External vCLS can be quickly deployed if needed.
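Before placing a host into Maintenance Mode during a rolling upgrade, the risk described in this article can be reasoned about with a simple pre-check. The sketch below is hypothetical and works only on host counts that the administrator supplies; the function name and parameters are assumptions and do not correspond to any vCenter API.

```python
def maintenance_would_trigger_downgrade(
    available_supported: int,
    available_unsupported: int,
    host_is_supported: bool,
) -> bool:
    """Return True if removing one host would leave only unsupported hosts.

    available_supported: available hosts running ESXi 8.0 U3 or later,
        including the host about to be made unavailable if it is supported.
    available_unsupported: available hosts older than ESXi 8.0 U3.
    host_is_supported: whether the host about to be made unavailable
        supports Embedded vCLS.
    """
    if not host_is_supported:
        # Removing an unsupported host does not reduce Embedded vCLS coverage.
        return False
    remaining_supported = available_supported - 1
    # A downgrade applies only when no supported host remains available
    # but at least one unsupported host is still available.
    return remaining_supported == 0 and available_unsupported > 0


# A single upgraded host in a cluster with two older hosts: at risk.
at_risk = maintenance_would_trigger_downgrade(1, 2, host_is_supported=True)

# Two upgraded hosts available: one can be removed without a downgrade.
safe = maintenance_would_trigger_downgrade(2, 2, host_is_supported=True)
```

The second case mirrors the mitigation of bringing two upgraded hosts out of Maintenance Mode together.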