vSphere Kubernetes Cluster Upgrade Stuck with Control Planes Upgraded but Worker Nodes Stuck Upgrading due to MachineDeployment Version




Article ID: 376919



Products

VMware vSphere Kubernetes Service

Issue/Introduction

A vSphere Kubernetes cluster upgrade is stuck and does not progress.

While connected to the Supervisor context, the following symptoms are present:

- All control plane nodes were successfully upgraded to the desired version.

- Worker nodes are stuck upgrading and still show the previous version.

- A worker node is recreated on the previous version every 15 minutes.

- Describing the cluster object shows that the machinedeployments upgrade to the desired version is on hold and that the machinedeployments are rolling out.

- The machinedeployment for the recreating worker nodepool shows the previous version.

- Describing the vspheremachinetemplate associated with the machinedeployment for the recreating worker nodepool shows the previous version's image.
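
The machinedeployment check above can be sketched as a small shell helper. This is only an illustration: the function name `lagging_deployments` is hypothetical, and the usage comment assumes the machinedeployment version is exposed at `.spec.template.spec.version`, as in upstream Cluster API.

```shell
# Hypothetical helper: reads "NAME VERSION" lines on stdin and prints the
# names whose version differs from the desired version passed as $1.
lagging_deployments() {
  awk -v want="$1" 'NF >= 2 && $2 != want { print $1 }'
}

# Possible usage from the Supervisor context (not executed here):
#   kubectl get machinedeployment -n <affected workload cluster namespace> \
#     --no-headers \
#     -o custom-columns=NAME:.metadata.name,VERSION:.spec.template.spec.version \
#     | lagging_deployments "<desired version>"
```

Any name printed by the helper is a machinedeployment that has not yet moved to the desired version.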

While connected to the affected cluster's context, the following symptoms are present:

- The recreating worker node is in NotReady state on the previous version.

- Pods for antrea, kube-proxy and vsphere-csi-node are in ImagePullBackOff state because the images for the desired upgrade version are missing.
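
A minimal sketch for spotting the failing pods, assuming the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, ...). The helper name `image_pull_failures` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on
# stdin and prints namespace/name for pods whose STATUS column (field 4)
# mentions ImagePullBackOff or ErrImagePull, including init containers.
image_pull_failures() {
  awk '$4 ~ /ImagePullBackOff|ErrImagePull/ { print $1 "/" $2 }'
}

# Possible usage from the affected cluster's context (not executed here):
#   kubectl get pods -A --no-headers | image_pull_failures
```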

While connected to the recreating worker node on the previous version, the following symptoms are present:

- Containerd and kubelet are healthy and Running.

- Containerd and kubelet logs show repeated "failed to pull and unpack image" and "failed to resolve reference" errors for image versions associated with the desired upgrade version.

- "crictl images list" shows image versions for the previous TKR version only, with no images for the desired upgrade version.
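
The image check on the node could be sketched as follows. It assumes the usual `crictl images` table, where the first line is a header and the second column is the tag. The helper name `images_matching_tag` and the tag substring are hypothetical placeholders.

```shell
# Hypothetical helper: reads `crictl images` output on stdin, skips the
# header line, and prints image:tag for every tag containing the
# substring passed as $1.
images_matching_tag() {
  awk -v want="$1" 'NR > 1 && index($2, want) > 0 { print $1 ":" $2 }'
}

# Possible usage on the worker node (not executed here):
#   crictl images | images_matching_tag "<tag substring for the desired version>"
```

Empty output for the desired version's tags, together with matches for the previous version's tags, is consistent with the symptom described above.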

The VMware Tanzu Kubernetes Release Notes contain tables listing each TKR version and its expected package image versions.

Environment

vSphere with Tanzu 8.0U2 and earlier

This can occur on a vSphere Kubernetes cluster whether or not it is managed by Tanzu Mission Control (TMC).

Cause

Worker nodes are continuously recreated and fail health checks because no containers are able to reach Running state. The containers cannot start because the images for the desired upgrade version are not present on the worker nodes. These images are missing because the worker nodes are still on the older version.

Machinehealthchecks and machinedeployments cannot reconcile properly because of the following Cluster API issue: https://github.com/kubernetes-sigs/cluster-api/issues/7533

This issue can occur when an upgrade is initiated and completes for the control plane nodes while a previous rolling redeployment has not yet finished. The machinedeployments and vspheremachinetemplates are then left on the older version. The previous rolling redeployment could have been caused by an incomplete upgrade or by a change that had not finished rolling out to all nodes in the cluster.

Resolution

If this is a workload cluster with a TKC, confirm that both the TKC and cluster object are on the same desired TKR version:

Connect to the Supervisor Cluster context.

Compare the versions of the TKC and cluster object:

kubectl get tkc,cluster -n <affected workload cluster namespace>

If the TKC is still on the previous TKR version, update the TKC to the desired version.
- The TKC version must be updated to initiate a workload cluster upgrade.
- The cluster object's version is then updated automatically to match the TKC.
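
As a rough illustration of the comparison, the two versions read from the `kubectl get tkc,cluster` output can be passed to a small helper. The function name `versions_aligned` is hypothetical, and it does a plain string comparison. If the TKC and cluster object report the version in different formats, the strings would need to be normalized first.

```shell
# Hypothetical helper: compares the TKC version ($1) with the cluster
# object version ($2) and reports whether they match exactly.
versions_aligned() {
  if [ "$1" = "$2" ]; then
    echo "aligned"
  else
    echo "mismatch: TKC=$1 cluster=$2"
    return 1
  fi
}

# Possible usage with values copied from the kubectl output (not executed here):
#   versions_aligned "<TKC version>" "<cluster version>"
```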

Otherwise, open a ticket with VMware by Broadcom Technical Support, referencing this KB article, for assistance in creating a vspheremachinetemplate object and updating the necessary components to the desired upgrade version.

Additional Information

A fix for this issue was made available in vSphere 8.0U3.
