TKG Workload cluster upgrade failed as "updateStalled"

Article ID: 369405


Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

  • The TKG upgrade became stuck, and after waiting a few hours the CAPI/CAPV pods were restarted. This resulted in the control plane nodes being upgraded while the worker nodes remained on the older version.
  • Restarting the CAPI/CAPV controllers did not provision new worker machines, and the upgrade remained stuck.

  • The cluster is marked as "updateStalled" by the Tanzu CLI:

$ tanzu cluster list -A
NAME                 NAMESPACE   STATUS         CONTROLPLANE  WORKERS  KUBERNETES
workload-cluster-A   default     updateStalled  3/3           3/3      v1.28.7+vmware.1

  • The control plane nodes are upgraded; however, the worker nodes are still on the older version:

$ kubectl get node  
NAME                    STATUS   ROLES           AGE    VERSION
tanzu-control-plane-A   Ready    control-plane   21h    v1.28.7+vmware.1
tanzu-control-plane-B   Ready    control-plane   21h    v1.28.7+vmware.1
tanzu-control-plane-C   Ready    control-plane   21h    v1.28.7+vmware.1
tanzu-worker-node-A     Ready    <none>          167d   v1.27.5+vmware.1
tanzu-worker-node-B     Ready    <none>          167d   v1.27.5+vmware.1
tanzu-worker-node-C     Ready    <none>          167d   v1.27.5+vmware.1

  • Error found in the CAPI controller logs:

$ kubectl -n capi-system logs deployments/capi-controller-manager | grep $WORKLOAD_CLUSTER_NAME
MMDD: 1 machineset_controller.go:439] "MachineSet is scaling up to 1 replicas by creating 1 machines" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="default/<Machine-set name>" namespace="default" name="<Machine-set name>" reconcileID=########-####-####-####-############ MachineDeployment="default/<MachineDeployment-name>" Cluster="default/tanzu" replicas=1 machineCount=0
MMDD: 1 machineset_preflight.go:140] "Performing \"Scale up\" on hold because MachineSet version (1.27.5+vmware.1) and ControlPlane version (1.28.7+vmware.1) do not conform to kubeadm version skew policy as kubeadm only supports joining with the same major+minor version as the control plane (\"KubeadmVersionSkew\" preflight failed). The operation will continue after the preflight check(s) pass" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="default/<Machine-set name>" namespace="default" 
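
  • To confirm the version skew that the preflight check reports, you can compare the MachineSet version with the control plane version directly (a quick check using standard Cluster API fields; resource names and the namespace will differ in your environment):

$ kubectl get machinesets -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.template.spec.version
$ kubectl get kubeadmcontrolplane -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.version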

Environment

  • Tanzu Kubernetes Grid 2.3.0
  • Tanzu Kubernetes Grid 2.4.1
  • Tanzu Kubernetes Grid 2.5.1

Cause

  • During the upgrade, the control plane upgrade can take considerable time to complete for a number of reasons (no IP addresses left, network connectivity issues). This results in the updateStalled status, with only the control plane nodes upgraded to the higher version:

$ tanzu cluster list -A
NAME                 NAMESPACE   STATUS         CONTROLPLANE  WORKERS  KUBERNETES
workload-cluster-A   default     updateStalled  3/3           2/3      v1.28.7+vmware.1

  • As a result of the stalled update, the TKR version label on the cluster has already been updated to the next version. This prevents CAPI from successfully reconciling the remaining worker nodes:

$ kubectl get cluster ${WORKLOAD_CLUSTER_NAME} -oyaml | yq .metadata
...  
labels:
    tanzuKubernetesRelease: v1.28.7+vmware.1-tkg.3
    tkg.tanzu.vmware.com/cluster-name: workload-cluster-A
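
  • You can confirm the mismatch by comparing the version the worker MachineDeployment still carries against the version the KubeadmControlPlane has already reached (a sketch using standard Cluster API fields):

$ kubectl get kcp -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.version
$ kubectl get md -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.template.spec.version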

Resolution

This workaround is verified only for legacy (plan-based) clusters, not class-based clusters.
If you encounter this issue with a class-based cluster, please raise a new support case.
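
To check whether a cluster is class-based or legacy, inspect spec.topology on the Cluster object: class-based clusters set spec.topology.class, while legacy (plan-based) clusters leave it empty (an informal check for convenience; empty output means legacy):

    kubectl get cluster ${WORKLOAD_CLUSTER_NAME} -n <Workload Cluster Namespace> -o jsonpath='{.spec.topology.class}'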

 

  1. List the MachineDeployments and set parameters

    kubectl get md -A    # Identify the stalled MachineDeployment
    export CNS='default' # Workload Cluster Namespace
    export MD='workload-slot35rp25-md-0' # Stalled MachineDeployment name
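
    Before proceeding, you can sanity-check that $MD points at the stalled MachineDeployment by confirming it still carries the old version (the jsonpath below is a standard Cluster API field):

    kubectl get md $MD -n $CNS -o jsonpath='{.spec.template.spec.version}'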

  2. Create a new VSphereMachineTemplate that references the correct version and template path

    VMT_NAME=$(kubectl get machinedeployments $MD -n $CNS -o jsonpath='{.spec.template.spec.infrastructureRef.name}')
    kubectl get vspheremachinetemplates $VMT_NAME -n $CNS -o yaml > vmt_${VMT_NAME}.yaml

    cp vmt_${VMT_NAME}.yaml vmt_${VMT_NAME}-new.yaml

    vim vmt_${VMT_NAME}-new.yaml
        [Modifications]
            metadata.annotations -> [delete]
            metadata.creationTimestamp -> [delete]
            metadata.generation -> [delete]
            metadata.resourceVersion -> [delete]
            metadata.uid -> [delete]
            metadata.name = ${VMT_NAME}-new
            spec.template.spec.template = /Datacenter/vm/photon-5-kube-v1.28.7+vmware.1 # Set target path

    kubectl apply -f vmt_${VMT_NAME}-new.yaml
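
    Optionally, verify that the new template carries the intended path (this reads back the CAPV field edited above):

    kubectl get vspheremachinetemplates ${VMT_NAME}-new -n $CNS -o jsonpath='{.spec.template.spec.template}'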

  3. Create a new KubeadmConfigTemplate that has the expected config

    KCT_NAME=$(kubectl get machinedeployments $MD -n $CNS -o jsonpath='{.spec.template.spec.bootstrap.configRef.name}')
    kubectl get kubeadmconfigtemplates $KCT_NAME -n $CNS -o yaml > kct_${KCT_NAME}.yaml
    cp kct_${KCT_NAME}.yaml kct_${KCT_NAME}-new.yaml

    vim kct_${KCT_NAME}-new.yaml
        [Modifications]
            metadata.annotations -> [delete]
            metadata.creationTimestamp -> [delete]
            metadata.generation -> [delete]
            metadata.resourceVersion -> [delete]
            metadata.uid -> [delete]
            metadata.name = ${KCT_NAME}-new

    kubectl apply -f kct_${KCT_NAME}-new.yaml
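
    Optionally, confirm the new KubeadmConfigTemplate exists before referencing it from the MachineDeployment:

    kubectl get kubeadmconfigtemplates ${KCT_NAME}-new -n $CNS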

  4. Create a new MachineDeployment

    Make sure it references the new templates created above and retains the correct ownerReferences to the target Cluster.

    kubectl get machinedeployments ${MD} -n $CNS -o yaml > md_${MD}.yaml
    cp md_${MD}.yaml md_${MD}-new.yaml

    vim md_${MD}-new.yaml
        [Modifications]
            metadata.annotations -> [delete]
            metadata.creationTimestamp -> [delete]
            metadata.generation -> [delete]
            metadata.resourceVersion -> [delete]
            metadata.uid -> [delete]
            metadata.name = ${MD}-new
            spec.rolloutAfter -> [delete] *If set.
            spec.selector.matchLabels.cluster.x-k8s.io/deployment-name = ${MD}-new
            spec.template.metadata.labels.cluster.x-k8s.io/deployment-name = ${MD}-new
            spec.template.metadata.labels.node-pool = Append -new as a suffix to the existing value
            spec.template.spec.bootstrap.configRef.name = ${KCT_NAME}-new
            spec.template.spec.infrastructureRef.name = ${VMT_NAME}-new
            spec.template.spec.version = v1.28.7+vmware.1 # Set target version
            status -> [delete]

    kubectl apply -f md_${MD}-new.yaml
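
    You can watch the new worker machines being provisioned as the MachineDeployment reconciles (the label below matches the selector set in the modifications above):

    kubectl get machines -n $CNS -l cluster.x-k8s.io/deployment-name=${MD}-new -w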

  5. Observe the rollout after the new MachineDeployment is created

    Confirm that the new MachineDeployment scales up and provisions new worker machines with the target version.

    kubectl get md -n $CNS
    tanzu cluster list
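
    While the new MachineDeployment scales up, the cluster typically reports an "updating" status before returning to "running" (illustrative output; names and counts are examples only):

    NAME                 NAMESPACE   STATUS    CONTROLPLANE  WORKERS  KUBERNETES
    workload-cluster-A   default     updating  3/3           3/6      v1.28.7+vmware.1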

  6. Drain the old worker nodes, then delete the old problematic MachineDeployment and its VSphereMachineTemplate and KubeadmConfigTemplate

    kubectl config use-context <Context of the target cluster>
    kubectl drain <Nodes in the old node pool> --ignore-daemonsets --delete-emptydir-data

    kubectl config use-context <mgmt-cluster>
    kubectl delete machinedeployments ${MD} -n $CNS
    kubectl delete vspheremachinetemplates ${VMT_NAME} -n $CNS
    kubectl delete kubeadmconfigtemplates ${KCT_NAME} -n $CNS
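
    Finally, confirm from the workload cluster context that all nodes now run the target version and the old worker nodes are gone:

    kubectl config use-context <Context of the target cluster>
    kubectl get nodes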