If a TKC is still undergoing migration, deleting it will get stuck.
After upgrading vCenter Server 7.x or 8.x, all existing TKCs undergo the documented CAPW to CAPV migration. See Understanding the Rolling Update Model for TKG Service Clusters.
Migration of a TKC can get stuck if the TKC is using an incompatible TKR or if there is an issue rolling out the update.
NOTE: TKCs with incompatible or unsupported TKRs cannot be upgraded; the only course of action is to delete them and recreate them with supported TKRs.
The following example shows TKCs that were left stranded and stuck in migration while upgrading vCenter Server:
Pre-upgrade environment: TKCs running TKR v1.24.11.
Upgrade to: vCenter Server 8.0 Update 3.
The vCenter Server upgrade to version 8.0 Update 3 triggered an automatic upgrade of the Supervisor Cluster to version 1.26.
However, because Supervisor Cluster versions 1.26, 1.27, and 1.28 do not support TKR v1.24.11, the existing TKCs with TKR v1.24.11 become unsupported and are left stranded. These TKCs cannot be upgraded to a supported version, and the only course of action is to delete them.
vSphere Kubernetes Service
vCenter Server 7.x
vCenter Server 8.x
vCenter Server was upgraded without validating TKR compatibility.
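One way to validate TKR compatibility before (or after) the upgrade is to review the TKRs known to the Supervisor. A minimal sketch, assuming jq is available and using <TKR_Name> as a placeholder for the release being checked; the READY and COMPATIBLE columns, and the status conditions, indicate whether a TKR can run on the current Supervisor version:
# kubectl get tkr
# kubectl get tkr <TKR_Name> -o json | jq '.status.conditions'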
To determine whether a TKC has not migrated, check for the label "run.tanzu.vmware.com/migrate-tkc" on the TKC object.
If the label exists, the TKC is considered not migrated and the cluster needs to be deleted.
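To check a single cluster directly, the same label can be queried with jq (a minimal sketch in the style of the checks below; a result of true means the TKC has not migrated):
# kubectl get tkc -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.metadata.labels | has("run.tanzu.vmware.com/migrate-tkc")'
true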
When describing a TKC object that is stuck upgrading or deleting, the following message appears under Conditions:
Message: error computing the desired state of the Cluster topology: failed to apply patches: failed to generate patches for patch "nodeLabels": failed to generate JSON patches for "KubeadmConfigTemplate": failed to calculate value for template: failed to render template: "run.tanzu.vmware.com/tkr={{ index (index .TKR_DATA .builtin.machineDeployment.version).labels \"run.tanzu.vmware.com/tkr\" }},run.tanzu.vmware.com/kubernetesDistributionVersion={{ index (index .TKR_DATA .builtin.machineDeployment.version).labels \"run.tanzu.vmware.com/tkr\" }},{{- range .nodePoolLabels }}{{ .key }}={{ .value }},{{- end }}\n": template: tpl:1:28: executing "tpl" at <index (index .TKR_DATA .builtin.machineDeployment.version).labels "run.tanzu.vmware.com/tkr">: error calling index: index of untyped nil
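The same condition message can also be extracted with jq instead of paging through the full describe output, assuming the conditions are populated on the TKC status:
# kubectl get tkc -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.status.conditions[] | {type, status, message}'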
To verify, check whether the TKC has the migrate label and whether the KubeadmControlPlane (KCP) and MachineDeployment (MD) objects are still on the older Kubernetes version:
# k get tkc -n <Supervisor_Namespace> -l 'run.tanzu.vmware.com/migrate-tkc'
NAME                   CONTROL PLANE   WORKER   TKR NAME                                  AGE     READY   TKR COMPATIBLE   UPDATES AVAILABLE
<Guest_Cluster_Name>   1               1        v1.25.13---vmware.1-fips.1-tkg.1.ubuntu   6d12h   False   True             [v1.26.5+vmware.2-fips.1-tkg.1 v1.26.10+vmware.1-fips.1-tkg.1.ubuntu v1.26.12+vmware.2-fips.1-tkg.2.ubuntu v1.26.13+vmware.1-fips.1-tkg.3]
# k get kcp,md -n <Supervisor_Namespace>
NAME                                                                                   CLUSTER                INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE     VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/<Guest_Cluster_Name>-control-plane   <Guest_Cluster_Name>   true          true                   1          1       1         0             6d12h   v1.24.11+vmware.1-fips.1
NAME                                                                        CLUSTER                REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE     VERSION
machinedeployment.cluster.x-k8s.io/<Guest_Cluster_Name>-np1-worker-w4qdp   <Guest_Cluster_Name>   1          1       1         0             Running   6d12h   v1.24.11+vmware.1-fips.1
machinedeployment.cluster.x-k8s.io/<Guest_Cluster_Name>-np2-worker-zl87p   <Guest_Cluster_Name>                                              Running   6d12h   v1.24.11+vmware.1-fips.1
machinedeployment.cluster.x-k8s.io/<Guest_Cluster_Name>-np3-worker-gf8xm   <Guest_Cluster_Name>                                              Running   6d12h   v1.24.11+vmware.1-fips.1
Delete the stuck TKC and the corresponding Cluster object:
kubectl delete tkc <TKC_Name> -n <Supervisor_Namespace>
kubectl delete cluster <TKC_Name> -n <Supervisor_Namespace>
The delete command against the cluster may hang. Use Ctrl+C to terminate the delete operation if required.
Validate that there is a deletion timestamp on both the TKC and the cluster.
# k get tkc -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.metadata | has("deletionTimestamp")'
true
# k get cluster -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.metadata | has("deletionTimestamp")'
true
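The deletion does not progress while the Cluster API objects remain paused from the interrupted migration. Optionally, confirm the paused state before unpausing; a minimal check, assuming spec.paused is set on the Cluster and the cluster.x-k8s.io/paused annotation is present on the MD and KCP objects:
# kubectl get cluster -n <Supervisor_Namespace> <Guest_Cluster_Name> -o jsonpath='{.spec.paused}'
true
# kubectl get md,kcp -n <Supervisor_Namespace> -o json | jq '.items[] | {kind, name: .metadata.name, paused: (.metadata.annotations["cluster.x-k8s.io/paused"] != null)}'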
Finally, unpause the Cluster, MachineDeployment, and KubeadmControlPlane objects for this cluster.
To unpause the cluster, run:
# kubectl patch cluster -n <Supervisor_Namespace> <Guest_Cluster_Name> --type merge -p '{"spec":{"paused": false}}'
cluster.cluster.x-k8s.io/<Guest_Cluster_Name> patched
To unpause the MachineDeployment (md) and the KubeadmControlPlane (kcp), remove the paused annotation with the annotate command.
Note: There may be multiple MDs stuck in the deleting state within the cluster.
# kubectl annotate md -n <Supervisor_Namespace> <Machine_deployment_stuck_deleting> cluster.x-k8s.io/paused-
# kubectl annotate kcp -n <Supervisor_Namespace> <KCP_stuck_deleting> cluster.x-k8s.io/paused-
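After the annotations are removed, the Cluster API controllers resume reconciling these objects and the pending deletion should complete. To confirm cleanup is progressing, re-list the remaining objects until they are gone, for example:
# kubectl get tkc,cluster,md,kcp -n <Supervisor_Namespace>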