The lifetime of Embedded vCLS VMs is generally managed automatically. However, these mechanisms can fail, leaving a VM on an ESXi host where it is no longer wanted. In such cases, an administrator may need to use the ESXi Shell to remove the unwanted VM manually. This article explains the process.
This article applies to the following versions:
vCenter Server 8.0 U3
ESXi 8.0 U3
In most cases, Embedded vCLS VMs are destroyed automatically whenever they are no longer wanted on a host. This includes:

Putting the host into Maintenance Mode or Standby Mode
Disconnecting the host
Removing the host from the cluster or the inventory
Setting an Anti-Affinity rule for the host (as long as at least one other host without such a rule is available)
Enabling Retreat Mode on the cluster

The VMs may not be cleaned up if a host is removed while it is Not Responding, or if a cluster is destroyed directly while Embedded vCLS VMs are present. However, re-adding an affected host to a supported vCenter as a Standalone host should clean up the lingering VM.
Despite these safeguards, an Embedded vCLS VM may occasionally fail to be destroyed. A host-side problem may cause the destroy operations to fail, or the host may be added to a vCenter that isn't vCLS-aware. In these cases, the VM can be stuck in place, and an administrator needs to destroy it manually on the affected host.
Attempting to power off an Embedded vCLS VM via inventory operations will usually destroy the current instance of the VM, which can clear up some transient issues. However, this does not change the desired state, which still calls for the VM to be running, so the host will quickly deploy a new instance in its place.
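To observe this reconciliation, the pod listing command used later in this article can be run on the host shortly after a power-off attempt. The session below is illustrative (the pod name is reused from the examples further down; actual names will differ) and assumes the host still considers the VM desired:

[root@localhost:~] inf-cli get pods -n vcls
NAMESPACE   NAME                                        STATUS    REASON   IP_ADDRESS
vcls        vcls-420fb029-6faa-4319-14d4-58b5a10954c2   RUNNING   N/A      N/A

A new instance, under a new name, is expected to appear within moments of the old one being destroyed.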
The following steps can be used to manually destroy an Embedded vCLS VM on ESXi 8.0 Update 3. These actions are performed in the ESXi Shell, typically over SSH.
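Before starting, it can be useful to confirm that the host is actually running 8.0 Update 3. A quick check with the standard vmware command (build number shown as a placeholder; your output will include the real build):

[root@localhost:~] vmware -vl
VMware ESXi 8.0.3 build-xxxxxxxx
VMware ESXi 8.0 Update 3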
Warning: This guide does not apply to Embedded vCLS VMs that are running in an intended state, meaning on a host that is supported, connected, available, not entering Maintenance Mode, and connected to a supported vCenter in a cluster where Retreat Mode and Anti-Affinity rules aren't excluding the host from running the VM. In such cases, destroying the VM with this method will cause it to be re-deployed shortly, or cause vCenter and ESXi to fall out of sync regarding the VM's state.
These steps are safe to perform if the host is stuck entering Maintenance Mode according to vCenter. However, performing them while the host is already in Maintenance Mode is not recommended. A state where an Embedded vCLS VM is running on a host in Maintenance Mode is difficult to reach but possible; if that is the case, take the host out of Maintenance Mode first.
Verify that the infravisor service is running.
Check the status.
[root@localhost:~] /etc/init.d/infravisor status
infravisor is running

# If the output was "infravisor is not running", start the service
[root@localhost:~] /etc/init.d/infravisor start
Read the configuration.
[root@localhost:~] configstorecli config current get -c esx -g infravisor_pods -k vcls
{
   "pod_settings": {
      "enabled": true,
      ...
   }
}
If the value of pod_settings.enabled is set to true, update it to false.
[root@localhost:~] configstorecli config current set -c esx -g infravisor_pods -k vcls -p /pod_settings/enabled -v false
Set: completed successfully
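To verify the change took effect, re-run the get command from the earlier step. Assuming the set completed successfully, the output should now show the flag disabled:

[root@localhost:~] configstorecli config current get -c esx -g infravisor_pods -k vcls
{
   "pod_settings": {
      "enabled": false,
      ...
   }
}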
Confirm that the pod is no longer running.
# Success
[root@localhost:~] inf-cli get pods -n vcls
No Pods found in namespace: vcls

# Failure
[root@localhost:~] inf-cli get pods -n vcls
NAMESPACE   NAME                                        STATUS    REASON   IP_ADDRESS
vcls        vcls-420fb029-6faa-4319-14d4-58b5a10954c2   RUNNING   N/A      N/A
If a pod was found, kill it.
[root@localhost:~] inf-cli kill -p /etc/vmware/infravisor/manifests/vcls.yaml
Killed podVM for pod vcls/vcls-420fb029-6faa-4319-14d4-58b5a10954c2
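As a final check, re-run the listing command from the previous step. With the pod killed and pod_settings.enabled set to false, the expected result is:

[root@localhost:~] inf-cli get pods -n vcls
No Pods found in namespace: vcls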