TKG Workload cluster upgrade failed as "updateStalled"

Article ID: 369405


Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

  • The TKG upgrade became stuck, and after waiting a few hours the CAPI/CAPV pods were restarted. This resulted in the control plane nodes being upgraded while the worker nodes remained on the older version.
  • Restarting the CAPI/CAPV controllers did not provision new worker machines, and the upgrade remained stuck.

  • The cluster is marked as "updateStalled" by the Tanzu CLI:

$ tanzu cluster list -A
NAME                 NAMESPACE   STATUS         CONTROLPLANE  WORKERS  KUBERNETES
workload-cluster-A   default     updateStalled  3/3           3/3      v1.28.7+vmware.1

  • The control plane nodes are upgraded; however, the worker nodes are still on the older version:

$ kubectl get node  
NAME                    STATUS   ROLES           AGE    VERSION
tanzu-control-plane-A   Ready    control-plane   21h    v1.28.7+vmware.1
tanzu-control-plane-B   Ready    control-plane   21h    v1.28.7+vmware.1
tanzu-control-plane-C   Ready    control-plane   21h    v1.28.7+vmware.1
tanzu-worker-node-A     Ready    <none>          167d   v1.27.5+vmware.1
tanzu-worker-node-B     Ready    <none>          167d   v1.27.5+vmware.1
tanzu-worker-node-C     Ready    <none>          167d   v1.27.5+vmware.1

  • Error found in the CAPI controller logs:

$ kubectl -n capi-system logs deployments/capi-controller-manager | grep $WORKLOAD_CLUSTER_NAME
MMDD: 1 machineset_controller.go:439] "MachineSet is scaling up to 1 replicas by creating 1 machines" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="default/<Machine-set name>" namespace="default" name="<Machine-set name>" reconcileID=########-####-####-####-############ MachineDeployment="default/<MachineDeployment-name>" Cluster="default/tanzu" replicas=1 machineCount=0
MMDD: 1 machineset_preflight.go:140] "Performing \"Scale up\" on hold because MachineSet version (1.27.5+vmware.1) and ControlPlane version (1.28.7+vmware.1) do not conform to kubeadm version skew policy as kubeadm only supports joining with the same major+minor version as the control plane (\"KubeadmVersionSkew\" preflight failed). The operation will continue after the preflight check(s) pass" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="default/<Machine-set name>" namespace="default" 
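
  • To confirm the version skew that the preflight check reports, you can compare the MachineSet version with the control plane version directly (a quick check using standard Cluster API fields; resource names and the namespace will differ in your environment):

$ kubectl get machinesets -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.template.spec.version
$ kubectl get kubeadmcontrolplane -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.version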

Environment

  • Tanzu Kubernetes Grid 2.3.0
  • Tanzu Kubernetes Grid 2.4.1
  • Tanzu Kubernetes Grid 2.5.1

Cause

  • During the upgrade, the control plane upgrade can take considerable time to complete for a number of reasons (no IP addresses left, network connectivity issues). This results in the updateStalled status, with only the control plane nodes upgraded to the higher version:

$ tanzu cluster list -A
NAME                 NAMESPACE   STATUS         CONTROLPLANE  WORKERS  KUBERNETES
workload-cluster-A   default     updateStalled  3/3           2/3      v1.28.7+vmware.1

  • As a result of the stalled update, the TKR version label on the cluster has already been updated to the next version. This prevents CAPI from successfully reconciling the remaining worker nodes:

$ kubectl get cluster ${WORKLOAD_CLUSTER_NAME} -oyaml | yq .metadata
...  
labels:
    tanzuKubernetesRelease: v1.28.7+vmware.1-tkg.3
    tkg.tanzu.vmware.com/cluster-name: workload-cluster-A
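
  • You can confirm the mismatch by comparing the version the worker MachineDeployment still carries against the version the KubeadmControlPlane has already reached (a sketch using standard Cluster API fields):

$ kubectl get kcp -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.version
$ kubectl get md -n default -o custom-columns=NAME:.metadata.name,VERSION:.spec.template.spec.version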

Resolution

This workaround is verified only for legacy (plan-based) clusters, not class-based clusters.
If you encounter this issue with a class-based cluster, please raise a new support case.
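
To check whether a cluster is class-based or legacy, inspect spec.topology on the Cluster object: class-based clusters set spec.topology.class, while legacy (plan-based) clusters leave it empty (an informal check for convenience; empty output means legacy):

    kubectl get cluster ${WORKLOAD_CLUSTER_NAME} -n <Workload Cluster Namespace> -o jsonpath='{.spec.topology.class}'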

 

  1. List the MachineDeployments and set parameters

    kubectl get md -A    # Identify the stalled MachineDeployment
    export CNS='default' # Workload Cluster Namespace
    export MD='workload-slot35rp25-md-0' # Stalled MachineDeployment name
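
    Before proceeding, you can sanity-check that $MD points at the stalled MachineDeployment by confirming it still carries the old version (the jsonpath below is a standard Cluster API field):

    kubectl get md $MD -n $CNS -o jsonpath='{.spec.template.spec.version}'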

  2. Create a new VSphereMachineTemplate that references the correct version and template path

    VMT_NAME=$(kubectl get machinedeployments $MD -n $CNS -o jsonpath='{.spec.template.spec.infrastructureRef.name}')
    kubectl get vspheremachinetemplates $VMT_NAME -n $CNS -o yaml > vmt_${VMT_NAME}.yaml

    cp vmt_${VMT_NAME}.yaml vmt_${VMT_NAME}-new.yaml

    vim vmt_${VMT_NAME}-new.yaml
        [Modifications]
            metadata.annotations -> [delete]
            metadata.creationTimestamp -> [delete]
            metadata.generation -> [delete]
            metadata.resourceVersion -> [delete]
            metadata.uid -> [delete]
            metadata.name = ${VMT_NAME}-new
            spec.template.spec.template = /Datacenter/vm/photon-5-kube-v1.28.7+vmware.1 # Set target path

    kubectl apply -f vmt_${VMT_NAME}-new.yaml
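
    Optionally, verify that the new template carries the intended path (this reads back the CAPV field edited above):

    kubectl get vspheremachinetemplates ${VMT_NAME}-new -n $CNS -o jsonpath='{.spec.template.spec.template}'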

  3. Create a new KubeadmConfigTemplate that has the expected config

    KCT_NAME=$(kubectl get machinedeployments $MD -n $CNS -o jsonpath='{.spec.template.spec.bootstrap.configRef.name}')
    kubectl get kubeadmconfigtemplates $KCT_NAME -n $CNS -o yaml > kct_${KCT_NAME}.yaml
    cp kct_${KCT_NAME}.yaml kct_${KCT_NAME}-new.yaml

    vim kct_${KCT_NAME}-new.yaml
        [Modifications]
            metadata.annotations -> [delete]
            metadata.creationTimestamp -> [delete]
            metadata.generation -> [delete]
            metadata.resourceVersion -> [delete]
            metadata.uid -> [delete]
            metadata.name = ${KCT_NAME}-new

    kubectl apply -f kct_${KCT_NAME}-new.yaml
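
    Optionally, confirm the new KubeadmConfigTemplate exists before referencing it from the MachineDeployment:

    kubectl get kubeadmconfigtemplates ${KCT_NAME}-new -n $CNS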

  4. Create a new MachineDeployment

    Make sure it references the new templates created above and retains the correct ownerReferences to the target Cluster.

    kubectl get machinedeployments ${MD} -n $CNS -o yaml > md_${MD}.yaml
    cp md_${MD}.yaml md_${MD}-new.yaml

    vim md_${MD}-new.yaml
        [Modifications]
            metadata.annotations -> [delete]
            metadata.creationTimestamp -> [delete]
            metadata.generation -> [delete]
            metadata.resourceVersion -> [delete]
            metadata.uid -> [delete]
            metadata.name = ${MD}-new
            spec.rolloutAfter -> [delete] *If set.
            spec.selector.matchLabels.cluster.x-k8s.io/deployment-name = ${MD}-new
            spec.template.metadata.labels.cluster.x-k8s.io/deployment-name = ${MD}-new
            spec.template.metadata.labels.node-pool = Append -new as a suffix to the existing value
            spec.template.spec.bootstrap.configRef.name = ${KCT_NAME}-new
            spec.template.spec.infrastructureRef.name = ${VMT_NAME}-new
            spec.template.spec.version = v1.28.7+vmware.1 # Set target version
            status -> [delete]

    kubectl apply -f md_${MD}-new.yaml
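
    You can watch the new worker machines being provisioned as the MachineDeployment reconciles (the label below matches the selector set in the modifications above):

    kubectl get machines -n $CNS -l cluster.x-k8s.io/deployment-name=${MD}-new -w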

  5. Observe the rollout after the new MachineDeployment is created

    Confirm that the new MachineDeployment scales up and provisions new worker machines with the target version.

    kubectl get md -n $CNS
    tanzu cluster list
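
    While the new MachineDeployment scales up, the cluster typically reports an "updating" status before returning to "running" (illustrative output; names and counts are examples only):

    NAME                 NAMESPACE   STATUS    CONTROLPLANE  WORKERS  KUBERNETES
    workload-cluster-A   default     updating  3/3           3/6      v1.28.7+vmware.1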

  6. Drain the old worker nodes, then delete the old problematic MachineDeployment and its VSphereMachineTemplate and KubeadmConfigTemplate

    kubectl config use-context <Context of the target cluster>
    kubectl drain <Nodes in the old node pool> --ignore-daemonsets --delete-emptydir-data

    kubectl config use-context <mgmt-cluster>
    kubectl delete machinedeployments ${MD} -n $CNS
    kubectl delete vspheremachinetemplates ${VMT_NAME} -n $CNS
    kubectl delete kubeadmconfigtemplates ${KCT_NAME} -n $CNS
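
    Finally, confirm from the workload cluster context that all nodes now run the target version and the old worker nodes are gone:

    kubectl config use-context <Context of the target cluster>
    kubectl get nodes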