Control plane upgrade times out with custom resources stuck in "Provisioning"

Article ID: 325374


Updated On: 10-28-2024

Products

VMware Telco Cloud Automation

Issue/Introduction

This article provides a workaround for the control plane becoming stuck during the upgrade from TCA version 2.2.0 to 2.3.0.

  • When upgrading a TCA workload cluster to v1.24.10+vmware.1, the control plane upgrade times out.
  • When the kubectl get node command is run from the master node of the workload cluster, a newly created control plane node (workload-control-plane-stm1a below, AGE 5m) is listed with the old version:

capv@workload-control-plane-blg5r [ ~ ]$ kubectl get node

NAME                                    STATUS   ROLES           AGE   VERSION
workload-np1-#####-rvhq8                Ready    <none>          27h   v1.23.16+vmware.1
workload-np2-#####-7qglk                Ready    <none>          27h   v1.23.16+vmware.1
workload-control-plane-4tvtz            Ready    control-plane   27h   v1.23.16+vmware.1
workload-control-plane-98z9f            Ready    control-plane   27h   v1.23.16+vmware.1
workload-control-plane-blg5r            Ready    control-plane   27h   v1.23.16+vmware.1
workload-control-plane-stm1a            Ready    control-plane   5m    v1.23.16+vmware.1



Environment

VMware Telco Cloud Automation 2.3

Cause

Since TCA 2.1, due to a known upstream issue, control plane upgrades must be performed in two steps.
If the first upgrade step fails, an upgrade targeting Kubernetes version v1.24.10+vmware.1 becomes blocked.
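
To confirm this blocked state from the management cluster, the Cluster API Machine objects of the workload cluster can be listed. The command below is a minimal check, not part of the workaround itself; replace the namespace placeholder with the actual workload cluster name:

    kubectl get machines -n <WORKLOAD-CLUSTER-NAME>

    The VERSION column shows which control plane Machines are still reporting the old Kubernetes version (v1.23.16+vmware.1 in the example above).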

Resolution

This issue is addressed in TCA 3.0.

The following workaround permanently addresses the issue in TCA 2.3:

  1. Log in to the master node of the management cluster that manages the failed workload cluster.
  2. Find the capi-kubeadm-control-plane-controller-manager pod by running the following command:
    kubectl get pod -n capi-kubeadm-control-plane-system
  3. Run the following command to find the problematic Machine name:
    kubectl logs -n capi-kubeadm-control-plane-system <POD-NAME> |grep <WORKLOAD-CLUSTER-NAME>

    The expected output appears as follows:
    kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-#####-djshh |grep workload
    failed to reconcile certificate expiry for Machine/workload-control-plane-stm1a
  4. Run the following command to delete the machine:
    kubectl delete machine -n <WORKLOAD-CLUSTER-NAME> <MACHINE-NAME>

    Sample:
    kubectl delete machine -n workload workload-control-plane-stm1a

  5. Retrieve the TcaKubeControlPlane custom resource and check whether the following error message exists in its status, using the command below:
    kubectl get tkcp -n <WORKLOAD-CLUSTER-NAME> <TKCP-NAME> -oyaml

    Sample error message:
    The kubernetes version of cluster workload is inconsistent, need upgrade TcaNodePool workload-np1 from [v1.23.16+vmware.1] to [v1.24.10+vmware.1].
  6. Run the following command to correct the inconsistent state:
    kubectl patch tkcp -n <WORKLOAD-CLUSTER-NAME> <TKCP-NAME> -p '{"status":{"kubernetesVersion":"<The previous kubernetes version>"}}' --type=merge --subresource status

    Sample:
    kubectl patch tkcp -n workload workload-control-plane  -p '{"status":{"kubernetesVersion":"v1.23.16+vmware.1"}}' --type=merge --subresource status
  7. Run the following two commands to delete the TCA cluster operator pod so that the update takes effect immediately:
    kubectl get pod -n tca-system |grep tca-kubecluster-operator
    kubectl delete pod -n tca-system <POD-NAME>


    Sample:
    $ kubectl get pod -n tca-system |grep tca-kubecluster-operator
    tca-kubecluster-operator-#####-cmdqf   1/1     Running   0          24m
    $ kubectl delete pod -n tca-system tca-kubecluster-operator-#####-cmdqf
    pod "tca-kubecluster-operator-#####-cmdqf" deleted