Supervisor Cluster Stuck in Configuring State and the Control Plane marked as Orphaned in vSphere UI
Article ID: 429813

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • A Supervisor cluster remains stuck in a Configuring state, and one of the three Supervisor control plane nodes appears as Orphaned in the vSphere Client.

  • When accessing the Supervisor cluster via SSH and checking node status, only two of the three expected nodes are listed. Furthermore, etcd health checks confirm that the cluster is running with reduced redundancy, missing one member.

  • The following etcdctl output highlights the missing third node:

    root@<Node UUID> [ ~ ]# etcdctl member list -w table
    +------------------+---------+----------------------------------+---------------------------+---------------------------+------------+
    |        ID        | STATUS  |               NAME               |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
    +------------------+---------+----------------------------------+---------------------------+---------------------------+------------+
    | ################ | started | <Node ID>                        | https://<FQDN>:2380       | https://<FQDN>:2379       |      false |
    | ################ | started | <Node ID>                        | https://<FQDN>:2380       | https://<FQDN>:2379       |      false |
    +------------------+---------+----------------------------------+---------------------------+---------------------------+------------+
    
    root@<Node UUID> [ ~ ]# etcdctl endpoint health --cluster -w table
    +---------------------------+--------+------------+-------+
    |          ENDPOINT         | HEALTH |    TOOK    | ERROR |
    +---------------------------+--------+------------+-------+
    | https://<FQDN>:2379       |   true | 4.131421ms |       |
    | https://<FQDN>:2379       |   true | 6.279624ms |       |
    +---------------------------+--------+------------+-------+

Environment

VMware vSphere Kubernetes Service

Cause

This issue occurs when an ESXi host, together with the backend storage that hosted one of the Supervisor control plane VMs, is decommissioned from the cluster without first migrating (vMotioning) the VM. This leaves the associated ESXi Agent Manager (EAM) agency in an unhealthy state: because the underlying host and storage are gone, EAM can neither recover nor restart the orphaned VM, which prevents the Supervisor cluster from reaching a Ready state.

Resolution

To resolve this issue, you must manually delete the Agency associated with the orphaned VM through the vSphere UI. This allows the Supervisor controller to trigger a fresh deployment of the missing node.

IMPORTANT: Please ensure a fresh backup of the vCenter Server is available before proceeding.

Detailed Step-by-Step Instructions:

  1. Identify the Orphaned VM: In the vSphere Client inventory, locate the Supervisor control plane VM marked as (orphaned).

    • Note the exact name of the orphaned Supervisor control plane VM.

  2. Navigate to the ESXi Agent Manager:

    • In the vSphere Client, go to Administration > Solutions > vCenter Server Extensions.

    • Select vSphere ESXi Agent Manager (EAM).

    • Click on the Agencies tab.

  3. Delete the Stale Agency:

    • Browse the list of agencies to find the one associated with the orphaned Supervisor node. Agencies are typically named based on the solution they support (e.g., vSphere with Tanzu).

    • Look for the agency showing an Error or Yellow/Red status corresponding to the missing host/VM.

    • Select the agency, click More Actions, and choose Delete Agency.

    • Note: Ensure you are only deleting the specific agency for the failed node, not the entire Supervisor deployment agency.

  4. Remove the Orphaned VM from Inventory:

    • Return to the Hosts and Clusters view.

    • Right-click the orphaned Supervisor control plane VM and select Remove from Inventory.

  5. Trigger Re-configuration:

    • Once the stale agency and orphaned VM are removed, the Supervisor Cluster controller will detect the discrepancy.

    • vCenter will automatically initiate a "Reconfigure" task to deploy a new control plane VM on an available host and datastore within the cluster.

  6. Monitor Progress:

    • Monitor the Tasks and Events tab.

    • Verify that a new VM is successfully deployed and that the Supervisor Cluster status transitions back to Ready.

    • Re-run the etcdctl member list command via SSH to confirm that all three nodes are present and healthy.
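The final verification in step 6 can be scripted in the same style as the diagnostics above. This is a hedged sketch: the `kubectl get nodes --no-headers` column layout and the sample listing of a recovered cluster are assumptions for illustration:

```shell
#!/bin/sh
# After remediation, a node listing on the Supervisor control plane should
# show three Ready nodes. The sample below models a recovered cluster in
# `kubectl get nodes --no-headers` format (NAME STATUS ROLES AGE VERSION).
nodes='node-a   Ready   control-plane   10d   v1.26.4
node-b   Ready   control-plane   10d   v1.26.4
node-c   Ready   control-plane   2m    v1.26.4'

# Count lines whose STATUS column is the word "Ready".
ready=$(printf '%s\n' "$nodes" | grep -cw 'Ready')
if [ "$ready" -eq 3 ]; then
  echo "Supervisor control plane fully recovered ($ready/3 Ready)"
else
  echo "Still degraded: $ready/3 nodes Ready"
fi
```

On the live cluster the `nodes` variable would instead capture the real output, e.g. `nodes=$(kubectl get nodes --no-headers)`.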

Additional Information

Always ensure that all System VMs (including Supervisor Control Plane VMs) are migrated to active hosts and datastores before performing host decommissioning or storage maintenance.