vSphere 8.0 Supervisor Workload Cluster Upgrade Stuck with No Nodes on Desired Upgraded Version

Article ID: 378212


Products

VMware vSphere Kubernetes Service
vSphere with Tanzu
Tanzu Kubernetes Runtime

Issue/Introduction

In a vSphere 8.0 environment, a vSphere Workload Cluster upgrade is stuck and not progressing.

While connected to the Supervisor context, one or more of the following symptoms are present:

  • All of the affected cluster's control plane machines are in Healthy, Running state on the previous TKR version

  • All of the affected cluster's worker machines are on the previous TKR version

  • A worker node is continuously getting recreated every 5-15 minutes and remains in Provisioning state on the previous version:
    kubectl get machines -n <affected workload cluster namespace>
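    For illustration only, the machine output in this state will look similar to the following, with columns abbreviated and all names and versions shown as placeholders:

    NAME                               CLUSTER          ...   PHASE          ...   VERSION
    <control plane machine name>       <cluster name>   ...   Running        ...   <previous TKR version>
    <recreating worker machine name>   <cluster name>   ...   Provisioning   ...   <previous TKR version>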

     

  • Describing the worker node's corresponding VM does not show HA failover resource issues:
    kubectl describe vm -n <affected workload cluster namespace> <worker node vm name>

     

  • Describing the cluster object shows that the MachineDeployment rollout to the desired version is on hold while the control plane upgrades:
    kubectl describe cluster -n <affected workload cluster namespace> <cluster name>
    
    MachineDeployment(s) <cluster machinedeployment> rollout and upgrade to version <desired TKR version> on hold. Control plane is upgrading to version <desired TKR version>

    This is expected behavior. Worker nodes and nodepools will not upgrade to the new version until all control plane nodes are on the desired version.

  • The MachineDeployments (md) for each worker nodepool show the previous version:
    kubectl get md -n <affected workload cluster namespace>
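    For illustration only, the MachineDeployment output will look similar to the following, with columns abbreviated and names and versions shown as placeholders:

    NAME                          CLUSTER          ...   VERSION
    <cluster machinedeployment>   <cluster name>   ...   <previous TKR version>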

    This is expected behavior. Worker machinedeployments and nodepools will not upgrade to the new version until all control plane nodes are on the desired version.

     

While connected to the affected workload cluster's context, the following symptoms are present:

  • The recreating worker node shows NotReady state on the previous version:
    kubectl get nodes
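    For illustration only, the node list will look similar to the following, with names and versions shown as placeholders:

    NAME                            STATUS     ROLES           AGE   VERSION
    <control plane node name>       Ready      control-plane   ...   <previous TKR version>
    <recreating worker node name>   NotReady   <none>          ...   <previous TKR version>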

     

  • Pods for antrea, kube-proxy and/or vsphere-csi-node are in Init:ImagePullBackOff state:
    kubectl get pods -A | grep -v Run
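    For illustration only, the filtered pod list will look similar to the following; the pod names, namespaces, and container counts are placeholders and may differ by release:

    NAMESPACE           NAME                        READY   STATUS                  RESTARTS   AGE
    kube-system         antrea-agent-<suffix>       0/2     Init:ImagePullBackOff   0          ...
    kube-system         kube-proxy-<suffix>         0/1     Init:ImagePullBackOff   0          ...
    vmware-system-csi   vsphere-csi-node-<suffix>   0/3     Init:ImagePullBackOff   0          ...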

     

  • Describing one of the pods in Init:ImagePullBackOff state shows an image pull error similar to the one below, where the missing image's version is the version expected by the cluster's desired upgrade TKR:
    kubectl describe pod -n <pod namespace> <pod name>
    
    
    Failed to pull image "localhost:5000/vmware.io/<vmware-image>:<image version>": rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/vmware.io/<vmware-image>:<image version>": failed to resolve reference "localhost:5000/vmware.io/<vmware-image>:<image version>": localhost:5000/vmware.io/<vmware-image>:<image version>: not found

     

While SSH'd into the recreating worker node that is stuck in Provisioning state on the previous version, the following symptoms are present:

  • Containerd and kubelet are healthy and show an active (running) state:
    systemctl status containerd
    
    systemctl status kubelet
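    For illustration only, both services report output similar to the following (trimmed):

    containerd.service
         Active: active (running) since <timestamp>

    kubelet.service
         Active: active (running) since <timestamp>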

     

  • Containerd and kubelet logs show repeated error messages similar to the following, referencing images expected by the cluster's desired upgrade TKR version:
    "failed to pull and unpack image":"failed to resolve reference" "localhost:5000/vmware.io/<vmware-image>:<image version>": localhost:5000/vmware.io/<vmware-image>:<image version>: not found

     

  • The images present on the node are for the previous TKR version; no images at the versions expected by the desired upgrade TKR are present:
    crictl images
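    For illustration only, the image list will only contain previous-version tags, similar to the following with placeholders:

    IMAGE                                     TAG                        IMAGE ID     SIZE
    localhost:5000/vmware.io/<vmware-image>   <previous image version>   <image id>   <size>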

     

The VMware Tanzu Kubernetes Release Notes for 8.X contain tables for each TKR version and its expected package image versions:

TKR Release Notes

Environment

vSphere Supervisor 8.0

VKS/TKG Supervisor Service version 3.3 or lower is installed, or the VKS/TKG Supervisor Service is not yet installed.

This can occur on a vSphere workload cluster regardless of whether or not it is managed by Tanzu Mission Control (TMC).

Cause

Worker nodes are continuously recreating and failing health checks because no containers are able to reach Running state. The containers cannot start up properly because the images for the desired upgrade version are not present on the worker nodes. These images are not present because the worker nodes are still on the older TKR version.

This issue can occur when an upgrade is initiated while a change made to the cluster has not yet completed. Upgrades begin with rolling redeployments of the control plane nodes, but in this scenario the control plane nodes are waiting for the worker node change to complete. The worker node change cannot complete because the recreating worker node stuck in Provisioning state references images that are not present on the node: the upgrade process looks for the desired upgrade version of the images to deploy the necessary system pods, but only the previous version's images are available.

The same scenario can occur if any change requiring redeployment or deletion is performed on the cluster's nodes after an upgrade is initiated and before all control plane nodes have finished upgrading.

Manually deleting nodes during a stuck workload cluster upgrade can also result in this scenario, even if the deleted node returns on the desired version, because the system may not have been able to initiate the workload cluster upgrade due to other issues in the environment. Manually deleting nodes is not an appropriate troubleshooting step.

 

This issue can also occur if a workload cluster upgrade was initiated before the system-initiated, mandatory migration from vSphere 7 to vSphere 8.

Similar symptoms can occur if there is a third-party webhook in the environment.

Invalid upgrade paths can also lead to this scenario.

Resolution

Note: This article applies only to a vSphere 8.0 environment and to clusters where none of the nodes have upgraded to the desired version.

 

If this is a workload cluster with a TKC, confirm that both the TKC and cluster object are on the same desired TKR version:

  1. Connect to the Supervisor cluster context

  2. Compare the versions of the TKC and cluster object:
    kubectl get tkc,cluster -n <affected workload cluster namespace>
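    For illustration only, a version mismatch would look similar to the following, with columns abbreviated (the exact columns depend on the API versions in use) and names and versions shown as placeholders:

    NAME                                                          ...   TKR NAME                 ...
    tanzukubernetescluster.run.tanzu.vmware.com/<cluster name>    ...   <previous TKR version>   ...

    NAME                                      PHASE         ...   VERSION
    cluster.cluster.x-k8s.io/<cluster name>   Provisioned   ...   <desired TKR version>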

     

  3. If the TKC is still on the previous TKR version, update the TKC to the correct, desired version (see the sketch after this list).
    • The TKC version must be updated to initiate a workload cluster upgrade

    • The cluster object's version will be updated automatically to match the TKC.
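A minimal sketch of updating the TKC version, assuming the v1alpha3 TanzuKubernetesCluster API where the TKR reference typically sits under spec.topology (for example, spec.topology.controlPlane.tkr.reference.name and the matching entries under spec.topology.nodePools); verify the exact fields for the API version in use:

    kubectl edit tkc -n <affected workload cluster namespace> <cluster name>

After the TKC is saved with the desired TKR version, the control plane machines should begin their rolling redeployment to the desired version, followed by the worker MachineDeployments.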

 

If there are third-party webhooks in the affected workload cluster, see the below KB article:

 

Otherwise, please open a ticket with VMware by Broadcom Technical Support referencing this KB article for assistance.

The steps for this issue involve administrator privileges for interacting with critical system services and should not be performed outside of Technical Support.

When opening the ticket, provide information on the following:

Additional Information

See the Upgrade Path Interoperability Matrix for vSphere Kubernetes Releases (VKR) regarding valid upgrade paths.