If a TKC is still undergoing migration, deleting it will get stuck.
After upgrading vCenter Server 7.x or 8.x, all existing TKCs undergo the documented CAPW to CAPV migration. See Understanding the Rolling Update Model for TKG Service Clusters.
Migration of a TKC can get stuck if the TKC is using an incompatible TKR or if there is an issue rolling out the update.
NOTE: TKCs with incompatible or unsupported TKRs cannot be upgraded; the only course of action is to delete them and recreate them with supported TKRs.
The following example shows TKCs that were left stranded and stuck in migration while upgrading vCenter Server:
Pre-upgrade environment: TKCs running TKR v1.24.11.
Upgrade to: vCenter Server 8.0 Update 3.
The vCenter Server upgrade to version 8.0 Update 3 triggered an automatic upgrade of the Supervisor Cluster to version 1.26.
However, because Supervisor Cluster versions 1.26, 1.27, and 1.28 do not support TKR v1.24.11, the existing TKCs with TKR v1.24.11 become unsupported and are left stranded. These TKCs cannot be upgraded to a supported version, and the only course of action is to delete them.
vSphere Kubernetes Service
vCenter Server 7.x
vCenter Server 8.x
vCenter Server was upgraded without validating TKR compatibility.
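One way to validate TKR compatibility before (or after) the upgrade is to review the TKRs known to the Supervisor. A minimal sketch, assuming jq is available and using <TKR_Name> as a placeholder for the release being checked; the READY and COMPATIBLE columns, and the status conditions, indicate whether a TKR can run on the current Supervisor version:
# kubectl get tkr
# kubectl get tkr <TKR_Name> -o json | jq '.status.conditions'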
To determine whether a TKC has not migrated, check for the label "run.tanzu.vmware.com/migrate-tkc" on the TKC object.
If the label exists, the TKC is considered not migrated and the cluster needs to be deleted.
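To check a single cluster directly, the same label can be queried with jq (a minimal sketch in the style of the checks below; a result of true means the TKC has not migrated):
# kubectl get tkc -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.metadata.labels | has("run.tanzu.vmware.com/migrate-tkc")'
true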
When describing a TKC object that is stuck upgrading or deleting, the following message appears under Conditions:
Message: error computing the desired state of the Cluster topology: failed to apply patches: failed to generate patches for patch "nodeLabels": failed to generate JSON patches for "KubeadmConfigTemplate": failed to calculate value for template: failed to render template: "run.tanzu.vmware.com/tkr={{ index (index .TKR_DATA .builtin.machineDeployment.version).labels \"run.tanzu.vmware.com/tkr\" }},run.tanzu.vmware.com/kubernetesDistributionVersion={{ index (index .TKR_DATA .builtin.machineDeployment.version).labels \"run.tanzu.vmware.com/tkr\" }},{{- range .nodePoolLabels }}{{ .key }}={{ .value }},{{- end }}\n": template: tpl:1:28: executing "tpl" at <index (index .TKR_DATA .builtin.machineDeployment.version).labels "run.tanzu.vmware.com/tkr">: error calling index: index of untyped nil
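The same condition message can also be extracted with jq instead of paging through the full describe output, assuming the conditions are populated on the TKC status:
# kubectl get tkc -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.status.conditions[] | {type, status, message}'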
To verify, check whether the TKC has the migrate label and whether the KubeadmControlPlane (KCP) and MachineDeployment (MD) objects are still on the older Kubernetes version:
# k get tkc -n <Supervisor_Namespace> -l 'run.tanzu.vmware.com/migrate-tkc'
NAME                   CONTROL PLANE   WORKER   TKR NAME                                  AGE     READY   TKR COMPATIBLE   UPDATES AVAILABLE
<Guest_Cluster_Name>   1               1        v1.25.13---vmware.1-fips.1-tkg.1.ubuntu   6d12h   False   True             [v1.26.5+vmware.2-fips.1-tkg.1 v1.26.10+vmware.1-fips.1-tkg.1.ubuntu v1.26.12+vmware.2-fips.1-tkg.2.ubuntu v1.26.13+vmware.1-fips.1-tkg.3]
# k get kcp,md -n <Supervisor_Namespace>
NAME                                                                                   CLUSTER                INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE     VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/<Guest_Cluster_Name>-control-plane   <Guest_Cluster_Name>   true          true                   1          1       1         0             6d12h   v1.24.11+vmware.1-fips.1
NAME                                                                        CLUSTER                REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE     VERSION
machinedeployment.cluster.x-k8s.io/<Guest_Cluster_Name>-np1-worker-w4qdp   <Guest_Cluster_Name>   1          1       1         0             Running   6d12h   v1.24.11+vmware.1-fips.1
machinedeployment.cluster.x-k8s.io/<Guest_Cluster_Name>-np2-worker-zl87p   <Guest_Cluster_Name>                                              Running   6d12h   v1.24.11+vmware.1-fips.1
machinedeployment.cluster.x-k8s.io/<Guest_Cluster_Name>-np3-worker-gf8xm   <Guest_Cluster_Name>                                              Running   6d12h   v1.24.11+vmware.1-fips.1
Delete the stuck TKC and the corresponding Cluster object:
kubectl delete tkc <TKC_Name> -n <Supervisor_Namespace>
kubectl delete cluster <TKC_Name> -n <Supervisor_Namespace>
The delete command against the cluster may hang. Use Ctrl+C to terminate the delete operation if required.
Validate that there is a deletion timestamp on both the TKC and the cluster.
# k get tkc -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.metadata | has("deletionTimestamp")'
true
# k get cluster -n <Supervisor_Namespace> <Guest_Cluster_Name> -o json | jq '.metadata | has("deletionTimestamp")'
true
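The deletion does not progress while the Cluster API objects remain paused from the interrupted migration. Optionally, confirm the paused state before unpausing; a minimal check, assuming spec.paused is set on the Cluster and the cluster.x-k8s.io/paused annotation is present on the MD and KCP objects:
# kubectl get cluster -n <Supervisor_Namespace> <Guest_Cluster_Name> -o jsonpath='{.spec.paused}'
true
# kubectl get md,kcp -n <Supervisor_Namespace> -o json | jq '.items[] | {kind, name: .metadata.name, paused: (.metadata.annotations["cluster.x-k8s.io/paused"] != null)}'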
Finally, unpause the Cluster, MachineDeployment, and KubeadmControlPlane objects for this cluster.
To unpause the cluster, run:
# kubectl patch cluster -n <Supervisor_Namespace> <Guest_Cluster_Name> --type merge -p '{"spec":{"paused": false}}'
cluster.cluster.x-k8s.io/<Guest_Cluster_Name> patched
To unpause the MachineDeployment (md) and the KubeadmControlPlane (kcp), remove the paused annotation with the annotate command.
Note: There may be multiple MDs stuck in the deleting state within the cluster.
# kubectl annotate md -n <Supervisor_Namespace> <Machine_deployment_stuck_deleting> cluster.x-k8s.io/paused-
# kubectl annotate kcp -n <Supervisor_Namespace> <KCP_stuck_deleting> cluster.x-k8s.io/paused-
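After the annotations are removed, the Cluster API controllers resume reconciling these objects and the pending deletion should complete. To confirm cleanup is progressing, re-list the remaining objects until they are gone, for example:
# kubectl get tkc,cluster,md,kcp -n <Supervisor_Namespace>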