WCP Supervisor Cluster stuck in "Removing" showing EAM errors

Products

VMware vSphere ESXi, VMware vSphere Kubernetes Service

Issue/Introduction

Symptoms:

Supervisor Cluster VMs no longer exist in the vSphere inventory.
WCP service logs show the below errors:

# cat /var/log/vmware/wcp/wcpsvc.log

...

2023-06-08T17:31:12.064Z error wcp [licensemonitor/license_event_monitor.go:251] [opID=licenseRefreshMonitor] Supervisor control plane failed: No connectivity to API Master: connectivity , config status REMOVING

2023-06-08T17:31:12.064Z error wcp [common/k8sdeploymentutil.go:38] [opID=#########] Unable to get deployment status of vmware-system-netop/vmware-system-netop-controller-manager. Err: Resource Type ClusterComputeResource, Identifier domain-c8 is not found.

2023-06-08T17:31:12.064Z debug wcp [kubelifecycle/eam_monitor.go:99] [opID=######-#######-####-####-####-##########] Supervisor ######-###-####-####-########## has eam issues [[{178002 *types.Issue {vcenter.wcp.eam.issue.clusterVmNotDeployed Master EAM Agent with identifier ######-###-####-####-########## could not deployed. See ESX Agent Manager logs for more details.

...

EAM service logs show the below errors:

# cat /var/log/vmware/eam/eam.log

...

java.lang.IllegalStateException: Duplicate key VirtualMachine:vm-######

at java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133) ~[?:1.8.0_345]

at java.util.HashMap.merge(HashMap.java:1255) ~[?:1.8.0_345]

at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320) ~[?:1.8.0_345]

at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[?:1.8.0_345]

at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_345]

at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[?:1.8.0_345]

at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_345]

...

EAM agencies of the Supervisor Cluster VMs might still be present. This can be checked in the vSphere Client under:

Menu -> Administration -> Server Extensions -> vSphere ESX Agent Manager -> Configure
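
The two log signatures above can also be counted from the vCenter Server shell rather than by reading the logs manually. The following is only a minimal sketch: the function names count_signature and report_symptoms are hypothetical helpers, not part of any VMware tooling, and the default paths are the ones quoted in this article.

```shell
#!/bin/sh
# Hypothetical helpers (not VMware tooling): count the log signatures
# quoted in this article. Default paths are the ones shown above.

count_signature() {
    # $1 = fixed string to search for, $2 = log file
    # grep -c exits non-zero when the count is 0, hence "|| true".
    # If the file cannot be read, nothing is printed.
    grep -cF -- "$1" "$2" 2>/dev/null || true
}

report_symptoms() {
    # Optional arguments let the check run against copies of the logs.
    wcp_log="${1:-/var/log/vmware/wcp/wcpsvc.log}"
    eam_log="${2:-/var/log/vmware/eam/eam.log}"
    echo "wcp clusterVmNotDeployed: $(count_signature 'clusterVmNotDeployed' "$wcp_log")"
    echo "eam duplicate key: $(count_signature 'Duplicate key VirtualMachine:' "$eam_log")"
}
```

Running report_symptoms with no arguments reads the default log locations. Non-zero counts on both lines are consistent with the symptoms described here, but they do not replace the checks above.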

Environment

VMware vSphere 7.0 with Tanzu

Cause

The vCLS VMs are causing the EAM service to malfunction and therefore the removal cannot be completed.

Resolution

By placing the vSphere Cluster in "Retreat Mode", vCLS VMs will get removed and the deletion will proceed successfully.

Workaround:

IMPORTANT NOTE: The following workaround will affect DRS and HA functionality on the vSphere Cluster. Do not proceed until the customer confirms that it is okay to continue. More details can be found in this KB.

To work around the issue, place the vSphere Cluster in "Retreat Mode" by following the steps below:

1. Identify the cluster domain ID:

# dcli +i com vmware vcenter namespacemanagement software clusters list

|---------|------------|----------------------------------------|
|cluster  |cluster_name|desired_version                         |
|---------|------------|----------------------------------------|
|domain-c8|            |v1.23.12+vmware.wcp.1-vsc0.0.22-21450060|
|---------|------------|----------------------------------------|

2. Log in to the vSphere Client and navigate to the cluster on which vCLS must be deactivated.

3. Navigate to the vCenter Server Configure tab. Under Advanced Settings, click the Edit Settings button.

4. Add the following entry and set the value to "False":

config.vcls.clusters.domain-c(number).enabled

## NOTE: Use the domain ID gathered in Step 1

5. Restart the EAM service:

root@vcenter_lab [ ~ ]# service-control --restart eam

Successfully restarted service eam

6. Verify that all the vCLS VMs are no longer present in the inventory.

7. Exit Retreat Mode by setting the value of the entry added in Step 4 to "True". Finally, restart the EAM service again.
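
Step 4 builds the advanced setting name from the domain ID gathered in Step 1. As an illustration only, the sketch below shows that substitution in shell. The function name vcls_setting_name is a hypothetical helper; it only prints the key to type into the vSphere Client and does not change any vCenter Server setting.

```shell
#!/bin/sh
# Hypothetical helper (not VMware tooling): print the advanced setting
# name for a given cluster domain ID, e.g. domain-c8.

vcls_setting_name() {
    domain_id="$1"
    # Strip the expected "domain-c" prefix; if nothing was stripped,
    # the input did not start with that prefix.
    num="${domain_id#domain-c}"
    if [ "$num" = "$domain_id" ]; then
        echo "expected domain-c<number>, got: $domain_id" >&2
        return 1
    fi
    # The remainder must be a non-empty run of digits.
    case "$num" in
        ''|*[!0-9]*)
            echo "expected domain-c<number>, got: $domain_id" >&2
            return 1
            ;;
    esac
    printf 'config.vcls.clusters.%s.enabled\n' "$domain_id"
}
```

For the example output in Step 1, vcls_setting_name domain-c8 prints config.vcls.clusters.domain-c8.enabled, which is the entry to set to "False" in Step 4 and back to "True" in Step 7.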

Additional Information

Impact/Risks:

The Supervisor Cluster will get stuck in "Removing".