A vSphere Kubernetes cluster upgrade is stuck and not progressing, and no nodes are on the desired upgrade version.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get tkc -n <affected cluster namespace>
kubectl describe cluster -n <affected cluster namespace> <affected cluster name>
kubectl get machines -n <affected cluster namespace>
kubectl get kcp -n <affected cluster namespace>
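The machine listing above can be filtered to show which machines have not yet reached the desired version by comparing each Machine object's `.spec.version` against the target. A minimal sketch (the namespace and desired version below are placeholders, not values from this article):

```shell
NS="my-cluster-ns"            # placeholder: affected cluster namespace
DESIRED="v1.26.5+vmware.1"    # placeholder: desired TKR Kubernetes version

# List each Machine with its Kubernetes version, then keep only the
# machines that are NOT yet on the desired version.
kubectl get machines -n "$NS" --no-headers \
  -o custom-columns='NAME:.metadata.name,VERSION:.spec.version' \
  | awk -v want="$DESIRED" '$2 != want { print $1, $2 }'
```

During a healthy rolling upgrade, the control plane machines should be the first to appear on the desired version, with worker machines following.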
Depending on how the control plane node is unhealthy, one or more of the following symptoms may be present:
kubectl get kcp -n <affected cluster namespace>
kubectl describe kcp -n <affected cluster namespace> <affected cluster's kcp object name>
Waiting for control plane to pass preflight checks to continue reconciliation [machine my-control-plane-node-abc1 reports ControllerManagerPodHealthy condition is unknown (Failed to get pod status)]
machine my-control-plane-node-abc1 reports EtcdPodHealthy condition is false
could not establish a connection to any etcd node: unable to create etcd client
kubectl get nodes
vSphere 7.0 with Tanzu
vSphere 8.0 with Tanzu
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
vSphere Kubernetes cluster upgrades perform rolling redeployments, beginning with the control plane nodes. An upgrade will not proceed while any control plane node in the cluster is detected as unhealthy. If there are no other issues in the environment, the upgrade will resume upgrading the control plane nodes once all of them are restored to a healthy state.
After all control plane nodes are successfully upgraded and in a healthy state, the worker node pools will prepare to upgrade to the desired version.
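Control plane rollout progress can be observed from the KCP object's replica counts. A sketch, assuming the standard Cluster API KubeadmControlPlane status fields (`readyReplicas`, `updatedReplicas`) and a placeholder namespace:

```shell
NS="my-cluster-ns"   # placeholder: affected cluster namespace

# Show each KCP's desired vs. ready vs. updated replica counts and flag
# any KCP whose rollout has not yet converged.
kubectl get kcp -n "$NS" --no-headers \
  -o custom-columns='NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas,UPDATED:.status.updatedReplicas' \
  | awk '$2 != $3 || $2 != $4 { print $1 ": rollout not complete (desired=" $2 " ready=" $3 " updated=" $4 ")" }'
```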
Documentation on Rolling Updates: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/8-0/using-tkg-service-with-vsphere-supervisor/updating-tkg-service-clusters/understanding-the-rolling-update-model-for-tkg-service-clusters.html
IMPORTANT: Deleting nodes in an attempt to progress an upgrade is not a recommended troubleshooting step. Doing so may lead to an image conflict on the recreated nodes, leaving them in an unhealthy and inoperable state. This image conflict occurs because the new node searches for images from the desired upgrade version, but since the upgrade has not yet progressed to that node, only the previous version's images are available.
In an unhealthy environment, a deleted node may not be recreated at all, worsening the situation and potentially rendering the whole cluster inoperable.
If any worker nodes have been deleted during a stuck upgrade and are found to have been recreated on the older TKR version, please reach out to VMware by Broadcom Technical Support, referencing this KB article, for help in progressing the upgrade.
Note: Upgrades must be performed sequentially. Skipping a major version is not supported. Documentation: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/8-0/updating-vsphere-supervisor/updating-the-vsphere-with-tanzu-environment/how-vsphere-iaas-contro-plane-updates-work.html
Because the upgrade will not proceed while at least one control plane node is in an unhealthy state, the unhealthy control plane node(s) must be investigated and restored to a healthy state before the upgrade can resume. The steps below help diagnose the cause of the unhealthy control plane node(s) in the affected cluster:
kubectl get kcp -n <affected cluster namespace>
kubectl describe kcp -n <affected cluster namespace>
machine my-control-plane-node-abc1 reports EtcdPodHealthy condition is false
could not establish a connection to any etcd node: unable to create etcd client
./certmgr tkc certificates list -n <affected cluster namespace> <affected cluster name>
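If certificate expiry is suspected as the cause of the etcd connectivity errors, expiry dates can also be checked directly on the affected control plane node. This is a sketch only, assuming SSH access to the node and the standard kubeadm certificate paths:

```shell
# On the affected control plane node (assumes standard kubeadm PKI paths):
# print the expiry (notAfter) date of the etcd and API server certificates.
for crt in /etc/kubernetes/pki/etcd/*.crt /etc/kubernetes/pki/apiserver.crt; do
  printf '%s: ' "$crt"
  openssl x509 -enddate -noout -in "$crt"
done
```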
kubectl get nodes
kubectl get pods -A -o wide | grep <NotReady control plane node name>
kubectl describe pod <unhealthy pod> -n <unhealthy pod's namespace>
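To quickly surface every pod that is not healthy (rather than grepping for one node at a time), the pod listing can be filtered on the STATUS column. A minimal sketch:

```shell
# List pods whose STATUS is neither Running nor Completed.
# With "kubectl get pods -A", STATUS is the 4th column.
kubectl get pods -A --no-headers -o wide \
  | awk '$4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
```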
kubectl get pkgi -A
kubectl describe pkgi <unhealthy pkgi> -n <unhealthy pkgi namespace>
kubectl get pods -A | grep <antrea/calico>
kubectl get ds -A | grep <antrea/calico>
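The DaemonSet check above can be summarized by comparing the DESIRED and READY columns for every DaemonSet (including the antrea or calico CNI agents). A sketch:

```shell
# Flag any DaemonSet whose READY count does not match DESIRED.
# With "kubectl get ds -A", DESIRED is column 3 and READY is column 5.
kubectl get ds -A --no-headers \
  | awk '$3 != $5 { print $1 "/" $2 ": desired=" $3 " ready=" $5 }'
```

A CNI DaemonSet with fewer ready pods than desired often corresponds to the NotReady node(s) identified earlier.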