In a vSphere 7.X environment, a vSphere Kubernetes Cluster upgrade is stuck and not progressing.
While connected to the Supervisor context, the following symptoms are present:
kubectl get machines -n <affected cluster namespace>
kubectl describe vm -n <affected cluster namespace> <worker node vm name>
kubectl describe cluster -n <affected cluster namespace> <cluster name>
kubectl get md -n <affected cluster namespace>
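As a minimal check (a sketch assuming the standard Cluster API fields, readable with kubectl's custom-columns output), the phase and Kubernetes version of each machine can be listed to identify the worker stuck in Provisioning:
kubectl get machines -n <affected cluster namespace> -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,VERSION:.spec.version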
While connected to the affected cluster's context, the following symptoms are present:
kubectl get nodes
kubectl get pods -A | grep -v Run
kubectl describe pod -n <pod namespace> <pod name>
Failed to pull image "localhost:5000/vmware.io/<vmware-image>:<vmware-version>": rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/vmware.io/<vmware-image>:<vmware-version>": failed to resolve reference "localhost:5000/vmware.io/<vmware-image>:<vmware-version>": localhost:5000/vmware.io/<vmware-image>:<vmware-version>: not found
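As a hedged sketch, the image reference each pod is attempting to pull can be listed with jsonpath and filtered for the local registry, to confirm which pods expect the upgraded image versions:
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep 'localhost:5000'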
While connected to the recreating worker node that is stuck in the Provisioning state on the previous version, the following symptoms are present:
systemctl status containerd
systemctl status kubelet
"failed to pull and unpack image":"failed to resolve reference" "localhost:5000/vmware.io/<vmware-image>:<vmware-version>": localhost:5000/vmware.io/<vmware-image>:<vmware-version>: not found
crictl images
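The images actually present on the node can then be filtered for the VMware system images and compared against the versions the failing pods reference, for example:
crictl images | grep 'vmware.io'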
The VMware vSphere Kubernetes Releases Release Notes contain tables listing each TKR version and its expected package image versions:
vSphere Supervisor Services and Standalone Components
vSphere with Tanzu 7.X
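A sketch for comparing what the nodes report against the Release Notes tables, assuming the standard node status fields:
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,OS-IMAGE:.status.nodeInfo.osImage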
This can occur on a vSphere Kubernetes cluster regardless of whether it is managed by Tanzu Mission Control (TMC).
Worker nodes are continuously recreated and fail health checks because no containers reach the Running state. The containers cannot start because the images for the desired upgrade version are not present on the worker nodes, which are still running the previous version and only carry that version's images.
This issue can occur when an upgrade is initiated while a previously requested change to the cluster has not yet completed. Upgrades begin with a rolling redeployment of the control plane nodes, but in this scenario the control plane nodes are waiting for the worker node change to complete. That change cannot complete because the recreating worker node, stuck in the Provisioning state, references images that are not present on the node: the upgrade process looks for the desired upgrade version of the images needed to deploy the required system pods, but only the previous version's images are available.
The same scenario can occur if any change requiring redeployment, or any node deletion, is performed on the cluster's nodes after an upgrade is initiated and before the upgrade has finished upgrading the control plane nodes.
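As an illustration of this state (the field paths are standard Cluster API fields; this is a sketch, not output from the affected environment), the desired versions of the control plane and of the MachineDeployment can be compared with what is actually ready:
kubectl get kcp -n <affected cluster namespace> -o custom-columns=NAME:.metadata.name,VERSION:.spec.version,READY:.status.readyReplicas
kubectl get md -n <affected cluster namespace> -o custom-columns=NAME:.metadata.name,VERSION:.spec.template.spec.version,READY:.status.readyReplicas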
Note that this article applies only to a vSphere 7.X environment and to clusters where none of the nodes have upgraded. If all control plane nodes have already upgraded, please see the following KB:
Confirm that both the TKC and cluster object are on the same desired TKR version:
kubectl get tkc,cluster -n <affected workload cluster namespace>
The TKC version must be updated to initiate a workload cluster upgrade.
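A hedged way to pull the two versions directly; the jsonpath below assumes the TanzuKubernetesCluster schema used in vSphere 7.X (spec.distribution.fullVersion) and may differ in other API versions:
kubectl get tkc <cluster name> -n <affected workload cluster namespace> -o jsonpath='{.spec.distribution.fullVersion}{"\n"}'
kubectl get cluster <cluster name> -n <affected workload cluster namespace> -o yaml | grep -i version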
If there are third-party webhooks in the affected workload cluster, see the following KB article:
Otherwise, please open a ticket with VMware by Broadcom Technical Support, referencing this KB, for assistance.
The steps in this KB article involve administrator privileges for interacting with critical system services and should not be performed outside of a Technical Support engagement.
Provide the output of the following:
kubectl get tkc,cluster,kcp,md,machine -n <SUPERVISOR-NAMESPACE>
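To capture this for the ticket, the full object definitions can be saved to a file (the file name here is only an example):
kubectl get tkc,cluster,kcp,md,machine -n <SUPERVISOR-NAMESPACE> -o yaml > stuck-upgrade-objects.yaml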
This process is intended to stabilize the cluster and then retry the upgrade once the cluster is in a healthy state with no queued changes requiring redeployment.