Control plane upgrade times out with custom resources stuck in "Provisioning"

Article ID: 325374


Updated On: 10-28-2024

Products

VMware Telco Cloud Automation

Issue/Introduction

This article provides a workaround for the control plane becoming stuck during the upgrade from TCA version 2.2.0 to 2.3.0.

  • When upgrading a TCA workload cluster to v1.24.10+vmware.1, the control plane upgrade times out.
  • When the kubectl get node command is run from the master node of the workload cluster, a newly created control plane node (workload-control-plane-stm1a below, AGE 5m) is listed with the old version:

capv@workload-control-plane-blg5r [ ~ ]$ kubectl get node

NAME                                    STATUS   ROLES           AGE   VERSION
workload-np1-#####-rvhq8                Ready    <none>          27h   v1.23.16+vmware.1
workload-np2-#####-7qglk                Ready    <none>          27h   v1.23.16+vmware.1
workload-control-plane-4tvtz            Ready    control-plane   27h   v1.23.16+vmware.1
workload-control-plane-98z9f            Ready    control-plane   27h   v1.23.16+vmware.1
workload-control-plane-blg5r            Ready    control-plane   27h   v1.23.16+vmware.1
workload-control-plane-stm1a            Ready    control-plane   5m    v1.23.16+vmware.1



Environment

VMware Telco Cloud Automation 2.3

Cause

Since TCA 2.1, due to a known upstream issue, control plane upgrades must be performed in two steps.
If the first upgrade step fails, an upgrade targeting Kubernetes version v1.24.10+vmware.1 becomes blocked.
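
To confirm this blocked state from the management cluster, the Cluster API Machine objects of the workload cluster can be listed. The command below is a minimal check, not part of the workaround itself; replace the namespace placeholder with the actual workload cluster name:

    kubectl get machines -n <WORKLOAD-CLUSTER-NAME>

    The VERSION column shows which control plane Machines are still reporting the old Kubernetes version (v1.23.16+vmware.1 in the example above).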

Resolution

This issue is addressed in TCA 3.0.

The following workaround permanently addresses the issue in TCA 2.3:

  1. Log in to the master node of the management cluster that manages the failed workload cluster.
  2. Find the capi-kubeadm-control-plane-controller-manager pod by running the following command:
    kubectl get pod -n capi-kubeadm-control-plane-system
  3. Run the following command to find the problematic Machine name:
    kubectl logs -n capi-kubeadm-control-plane-system <POD-NAME> |grep <WORKLOAD-CLUSTER-NAME>

    The expected output appears as follows:
    kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-#####-djshh |grep workload
    failed to reconcile certificate expiry for Machine/workload-control-plane-stm1a
  4. Run the following command to delete the machine:
    kubectl delete machine -n <WORKLOAD-CLUSTER-NAME> <MACHINE-NAME>

    Sample:
    kubectl delete machine -n workload workload-control-plane-stm1a

  5. Retrieve the TcaKubeControlPlane custom resource and check whether the following error message exists in its status, using the command below:
    kubectl get tkcp -n <WORKLOAD-CLUSTER-NAME> <TKCP-NAME> -oyaml

    Sample error message:
    The kubernetes version of cluster workload is inconsistent, need upgrade TcaNodePool workload-np1 from [v1.23.16+vmware.1] to [v1.24.10+vmware.1].
  6. Run the following command to correct the inconsistent state:
    kubectl patch tkcp -n <WORKLOAD-CLUSTER-NAME> <TKCP-NAME> -p '{"status":{"kubernetesVersion":"<The previous kubernetes version>"}}' --type=merge --subresource status

    Sample:
    kubectl patch tkcp -n workload workload-control-plane  -p '{"status":{"kubernetesVersion":"v1.23.16+vmware.1"}}' --type=merge --subresource status
  7. Run the following two commands to delete the TCA cluster operator pod so that the update takes effect immediately:
    kubectl get pod -n tca-system |grep tca-kubecluster-operator
    kubectl delete pod -n tca-system <POD-NAME>


    Sample:
    $ kubectl get pod -n tca-system |grep tca-kubecluster-operator
    tca-kubecluster-operator-#####-cmdqf   1/1     Running   0          24m
    $ kubectl delete pod -n tca-system tca-kubecluster-operator-#####-cmdqf
    pod "tca-kubecluster-operator-#####-cmdqf" deleted