Telco Cloud Automation (2.3) control plane upgrade times out with custom resources stuck in Provisioning.


Article ID: 325374


Updated On:


VMware Telco Cloud Automation


This article provides a workaround for a control plane that becomes stuck during a workload cluster upgrade in TCA 2.3.

  1. Telco Cloud Automation (TCA) was upgraded from version 2.2.0 to 2.3.0.
  2. When upgrading a TCA workload cluster to v1.24.10+vmware.1, the control plane upgrade times out.
  3. When the kubectl get node command is run from the master node of the workload cluster, a newly created control plane node (workload-control-plane-stm1a, age 5m) still reports the old Kubernetes version:

    capv@workload-control-plane-blg5r [ ~ ]$ kubectl get node
    NAME                                    STATUS   ROLES            AGE   VERSION
    workload-np1-66b84fc879-rvhq8           Ready    <none>           27h   v1.23.16+vmware.1
    workload-np2-5479c5d85c-7qglk           Ready    <none>           27h   v1.23.16+vmware.1
    workload-control-plane-4tvtz            Ready    control-plane    27h   v1.23.16+vmware.1
    workload-control-plane-98z9f            Ready    control-plane    27h   v1.23.16+vmware.1
    workload-control-plane-blg5r            Ready    control-plane    27h   v1.23.16+vmware.1
    workload-control-plane-stm1a            Ready    control-plane    5m    v1.23.16+vmware.1


VMware Telco Cloud Automation 2.3


Since TCA 2.1, due to a known upstream issue, control plane upgrades must be performed in two steps.
If the first step fails, subsequent upgrade attempts targeting Kubernetes version v1.24.10+vmware.1 remain blocked.
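The blocked state can be spotted from `kubectl get node` output alone: after a successful upgrade, every control-plane node should report the target version. The following is a minimal sketch (the function name and the inline sample listing are illustrative, not part of TCA) that flags control-plane nodes still on an older version:

```shell
#!/bin/sh
# Print control-plane nodes whose VERSION column does not match the target.
# Usage: flag_stale_nodes <target-version> < node-listing
flag_stale_nodes() {
  awk -v target="$1" \
    'NR > 1 && $3 ~ /control-plane/ && $5 != target { print $1 }'
}

# Example using rows from the symptom section's listing:
flag_stale_nodes "v1.24.10+vmware.1" <<'EOF'
NAME                            STATUS   ROLES           AGE   VERSION
workload-np1-66b84fc879-rvhq8   Ready    <none>          27h   v1.23.16+vmware.1
workload-control-plane-stm1a    Ready    control-plane   5m    v1.23.16+vmware.1
EOF
# prints: workload-control-plane-stm1a
```

In practice the listing would be piped in directly, e.g. `kubectl get node | flag_stale_nodes "v1.24.10+vmware.1"`.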


This issue is fixed in TCA 3.0.
The following workaround permanently resolves the issue in TCA 2.3.

  1. Log in to the master node of the management cluster that manages the failed workload cluster.
  2. Find the capi-kubeadm-control-plane-controller-manager pod by running the following command:
    kubectl get pod -n capi-kubeadm-control-plane-system
  3. Run the following command to find the problematic Machine name:
    kubectl logs -n capi-kubeadm-control-plane-system <POD-NAME> |grep <WORKLOAD-CLUSTER-NAME>

    The expected output appears as follows:
    kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-779c9fbb79-djshh |grep workload
    failed to reconcile certificate expiry for Machine/workload-control-plane-stm1a
  4. Run the following command to delete the machine:
    kubectl delete machine -n <WORKLOAD-CLUSTER-NAME> <MACHINE-NAME>

    kubectl delete machine -n workload workload-control-plane-stm1a

  5. Obtain the TcaKubeControlPlane custom resource and check whether the following error message exists in its status, using this command:
    kubectl get tkcp -n <WORKLOAD-CLUSTER-NAME> <TKCP-NAME> -oyaml

    Sample error message:
    The kubernetes version of cluster workload is inconsistent, need upgrade TcaNodePool workload-np1 from [v1.23.16+vmware.1] to [v1.24.10+vmware.1].
  6. Run the following command to correct the stale status:
    kubectl patch tkcp -n <WORKLOAD-CLUSTER-NAME> <TKCP-NAME> -p '{"status":{"kubernetesVersion":"<The previous kubernetes version>"}}' --type=merge --subresource status

    kubectl patch tkcp -n workload workload-control-plane  -p '{"status":{"kubernetesVersion":"v1.23.16+vmware.1"}}' --type=merge --subresource status
  7. Run the following two commands to delete the TCA cluster operator pod so that the update takes effect immediately:
    kubectl get pod -n tca-system |grep tca-kubecluster-operator
    kubectl delete pod -n tca-system <POD-NAME>

    $ kubectl get pod -n tca-system |grep tca-kubecluster-operator
    tca-kubecluster-operator-6dc9d5d58f-cmdqf   1/1     Running   0          24m
    $ kubectl delete pod -n tca-system tca-kubecluster-operator-6dc9d5d58f-cmdqf
    pod "tca-kubecluster-operator-6dc9d5d58f-cmdqf" deleted
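The grep in step 3 can surface many log lines; the Machine name itself can be extracted mechanically from the reconcile-failure message. A small helper sketch (the function name is illustrative; in practice the input comes from the `kubectl logs` command shown in step 3):

```shell
#!/bin/sh
# Extract the Machine name from a "failed to reconcile certificate expiry" log line.
# Usage: stale_machine_name < controller-log-text
stale_machine_name() {
  sed -n 's/.*failed to reconcile certificate expiry for Machine\/\([^ ]*\).*/\1/p'
}

# Example using the log line from step 3:
echo 'failed to reconcile certificate expiry for Machine/workload-control-plane-stm1a' \
  | stale_machine_name
# prints: workload-control-plane-stm1a
```

The printed name is the `<MACHINE-NAME>` to pass to `kubectl delete machine` in step 4.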