vSphere Kubernetes Cluster Upgrade Stuck due to Unhealthy Control Plane Nodes - Waiting to pass preflight checks to continue reconciliation


Article ID: 386651



Products

VMware vSphere with Tanzu

Issue/Introduction

vSphere Kubernetes cluster upgrade is stuck and not progressing. There are no nodes that are on the desired upgrade version.

 

Depending on the unhealthy state of the control plane node(s), one or more of the following symptoms may be observed while connected to the Supervisor cluster context:

  • Performing a describe on the kubeadm control plane object (kcp) returns one or more error messages similar to the below:
    • kubectl get kcp -n <affected cluster namespace>

      kubectl describe kcp -n <affected cluster namespace> <affected cluster's kcp object name>
    • Waiting for control plane to pass preflight checks to continue reconciliation [machine my-control-plane-node-abc1 reports ControllerManagerPodHealthy condition is unknown (Failed to get pod status)
    • machine my-control-plane-node-abc1 reports EtcdPodHealthy condition is false
    • could not establish a connection to any etcd node: unable to create etcd client
    • The above error messages indicate that there is an issue with at least one instance of kube-apiserver or etcd within the affected cluster. One instance of kube-apiserver and etcd runs on each control plane node in the cluster. A control plane node will be marked unhealthy if its kube-apiserver or etcd is down or crashing. An example of inspecting these health conditions is provided after this list.

  • From within the affected cluster's context, one or more control plane nodes show NotReady state:
    • kubectl get nodes
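
For reference, the commands below show one way to pull only the health conditions out of the kcp object while connected to the Supervisor cluster context. This is a minimal sketch: the namespace (my-namespace) and object name (my-cluster-control-plane) are placeholders and should be replaced with the values returned by kubectl get kcp in your environment.

# List the kubeadm control plane (kcp) objects in the affected namespace
kubectl get kcp -n my-namespace

# Print each condition's type, status, and message on its own line for the affected kcp object
kubectl get kcp -n my-namespace my-cluster-control-plane \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'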

 

 

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).

Cause

vSphere Kubernetes Cluster upgrades perform rolling redeployments, beginning with the control plane nodes. However, an upgrade will not proceed if any control plane node in the cluster is detected as unhealthy. If there are no other issues in the environment, the upgrade will proceed through the control plane nodes once all of them are restored to a healthy state.

After all control plane nodes are successfully upgraded and in a healthy state, the worker node pools will prepare to upgrade to the desired version.

Documentation on Rolling Updates: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/8-0/using-tkg-service-with-vsphere-supervisor/updating-tkg-service-clusters/understanding-the-rolling-update-model-for-tkg-service-clusters.html
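
As a quick way to see how far the rolling update has progressed, the cluster's Cluster API Machine objects can be listed from the Supervisor cluster context and their versions compared against the desired version. This is a minimal sketch; my-namespace is a placeholder, and the jsonpath expression assumes the standard Machine spec.version field.

# List the cluster's Machine objects (recent Cluster API releases also print a VERSION column)
kubectl get machine -n my-namespace

# Alternatively, print each machine name alongside the version recorded in its spec
kubectl get machine -n my-namespace \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.version}{"\n"}{end}'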

Resolution

IMPORTANT: Deleting nodes in an attempt to progress an upgrade is not a recommended troubleshooting step. Doing so may lead to an image conflict on the recreated nodes, leaving them in an unhealthy and inoperable state. This image conflict occurs because the new node searches for images from the desired upgrade version, but because the upgrade has not yet progressed to that node, only the previous version's images are available.

In an unhealthy environment, a deleted node may not be recreated at all, worsening the situation and potentially rendering the entire cluster inoperable.

If any worker nodes have been deleted during a stuck upgrade and found to be recreated on the older TKR version, please reach out to VMware by Broadcom Technical Support referencing this KB article for help in progressing the upgrade.

Note: Upgrades must be performed sequentially. Skipping a major version is not supported. Documentation: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/8-0/updating-vsphere-supervisor/updating-the-vsphere-with-tanzu-environment/how-vsphere-iaas-contro-plane-updates-work.html

 

Because the upgrade will not proceed while at least one control plane node is in an unhealthy state, the unhealthy control plane node(s) must be investigated and restored to a healthy state in order to resume the upgrade process. The steps below provide information on how to diagnose the cause of the unhealthy control plane node(s) in the affected cluster:

  1. Confirm that the kubeadm control plane object (kcp) shows the desired upgrade version (an example version check is provided after these steps):
    • kubectl get kcp -n <affected cluster namespace>
  2. Describe the kubeadm control plane object (kcp) for details:
    • kubectl describe kcp -n <affected cluster namespace> <affected cluster's kcp object name>
    • If describing the kcp object shows error messages similar to the following, there is an issue with the etcd instance running on one or more of the control plane nodes:
      • machine my-control-plane-node-abc1 reports EtcdPodHealthy condition is false
      • could not establish a connection to any etcd node: unable to create etcd client
  3. Check whether the certificates for the affected cluster have expired (an example check is provided after these steps).
  4. Connect into the affected cluster's context and confirm the status of the nodes*:
    • kubectl get nodes
    • *If kubectl commands are failing in the affected cluster, this indicates an issue with the kube-apiserver and etcd processes within the affected cluster.

  5. If any control plane node is in NotReady state, check the status of the pods and packages (pkgi) running on the unhealthy control plane node (an example is provided after these steps):
    • kubectl get pods -A -o wide | grep <NotReady control plane node name>
    • kubectl describe pod <unhealthy pod> -n <unhealthy pod's namespace>
    • kubectl get pkgi -A
    • kubectl describe pkgi <unhealthy pkgi> -n <unhealthy pkgi namespace>
  6. For any control plane node in NotReady state, confirm the health of the CNI (antrea or calico); an example is provided after these steps:
    • kubectl get pods -A | grep <antrea/calico>
    • kubectl get ds -A | grep <antrea/calico>
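
For steps 1 and 2, the desired version and the version currently reported by the control plane machines can also be read directly from the kcp object. This is a minimal sketch using placeholder names (my-namespace, my-cluster-control-plane); spec.version is the version being reconciled toward, and status.version, if populated, reflects the lowest version among the control plane machines.

# Version the control plane is being upgraded/reconciled toward
kubectl get kcp -n my-namespace my-cluster-control-plane -o jsonpath='{.spec.version}{"\n"}'

# Lowest Kubernetes version currently reported by the control plane machines
kubectl get kcp -n my-namespace my-cluster-control-plane -o jsonpath='{.status.version}{"\n"}'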
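
For step 3, certificate expiry can be checked from the affected control plane node itself. The sketch below assumes SSH access to the node and the standard kubeadm certificate layout under /etc/kubernetes/pki; on older Kubernetes releases the first command is kubeadm alpha certs check-expiration.

# Show the expiration date of every kubeadm-managed certificate on this control plane node
kubeadm certs check-expiration

# Or inspect individual certificates directly with openssl
openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt
openssl x509 -noout -enddate -in /etc/kubernetes/pki/etcd/server.crt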
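
For steps 4 and 5, the sketch below narrows the output to a single unhealthy node from within the affected cluster's context. The node, package, and namespace names are placeholders; --field-selector is an alternative to the grep shown in step 5.

# List nodes and note any control plane node in NotReady state
kubectl get nodes

# Show only the pods scheduled on the unhealthy control plane node
kubectl get pods -A -o wide --field-selector spec.nodeName=my-control-plane-node-abc1

# Review the package installs (pkgi) and describe any that are not reconciled successfully
kubectl get pkgi -A
kubectl describe pkgi my-unhealthy-pkgi -n my-pkgi-namespace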
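
For step 6, the following sketch checks the CNI from within the affected cluster's context. The daemonset names antrea-agent and calico-node are the defaults shipped with Antrea and Calico and typically run in kube-system; adjust to whichever CNI the cluster uses.

# Confirm the CNI pods are Running on every node and the daemonsets report the expected number of ready pods
kubectl get pods -A -o wide | grep -e antrea -e calico
kubectl get ds -A | grep -e antrea -e calico

# Describe the CNI daemonset to surface scheduling or image pull errors
kubectl describe ds antrea-agent -n kube-system
# or, for Calico-based clusters:
kubectl describe ds calico-node -n kube-system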