vSphere Kubernetes Cluster shows Ready False, Clusterbootstrap Reconciling state due to Antrea-agent Pods Not All Up To Date


Article ID: 386093


Updated On:

Products

VMware vSphere with Tanzu

Issue/Introduction

The TKC or Cluster object of the affected vSphere Kubernetes cluster shows that the cluster is not healthy, even though all nodes are Ready and all vmware-system and kube-system pods are Running and healthy.

 

While connected to the Supervisor cluster context, the following symptoms are present:

  • For non-classy clusters, the TKC object will show a Ready state of False, and describing the TKC will return Conditions and Events indicating that the ClusterBootstrap is still reconciling:
    • kubectl get tkc <tkc name> -n <affected cluster namespace>
    • kubectl describe tkc <tkc name> -n <affected cluster namespace>


      Conditions:
      ...
      Status: "ClusterBootstrap conditions indicate it is still reconciling"
      Type: ClusterBootstrapReady

       

While connected to the affected vSphere Kubernetes Cluster context, the following symptoms are present:

  • All vmware-system and kube-system pods are in a healthy Running state.
  • The antrea packageinstall (pkgi) shows a ReconcileFailed state, and describing the pkgi shows that it timed out:
    • kubectl get pkgi -A
    • kubectl describe pkgi <antrea pkgi> -n <antrea pkgi namespace>
  • The antrea daemonset shows that the UP-TO-DATE count does not match the DESIRED and READY counts for the antrea-agent pods. In the output below, X matches the total number of nodes in the cluster and Z is lower than X, indicating that not all antrea-agent pods are considered up to date:
    • kubectl get ds antrea-agent -n kube-system

      NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR
      antrea-agent   X         X         X       Z            X           kubernetes.io/os=<linux/ubuntu>
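The mismatch in the output above can also be spotted mechanically. Below is a minimal sketch (the `check_ds_up_to_date` helper name is illustrative, not part of any VMware tooling) that reads captured `kubectl get ds` output on stdin and flags any daemonset whose UP-TO-DATE column trails DESIRED:

```shell
# check_ds_up_to_date: read `kubectl get ds` output on stdin and report any
# daemonset whose UP-TO-DATE column (field 5) does not match DESIRED
# (field 2). Returns non-zero if a mismatch is found.
check_ds_up_to_date() {
    awk 'NR > 1 && $2 != $5 {
        printf "%s: only %s of %s pods up to date\n", $1, $5, $2
        bad = 1
    }
    END { exit bad }'
}

# Typical use (assumed cluster context):
#   kubectl get ds antrea-agent -n kube-system | check_ds_up_to_date
```

The field positions match the default `kubectl get ds` column order shown above; custom output formats would need different field numbers.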

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

This issue can occur regardless of whether the cluster is managed by TMC (Tanzu Mission Control).

Cause

vSphere Kubernetes cluster node health depends on the Container Network Interface (CNI) being in a healthy, up-to-date state on all nodes in the cluster.

In this scenario, the daemonset responsible for managing the CNI instance on each node detects that not all CNI pods are up to date.

This occurs regardless of the actual state of the CNI pods: they can all be in Running state and functioning, yet the daemonset detects that they have not picked up the latest changes to the CNI in the cluster. Such an out-of-date state can be triggered by a change in the cluster, such as a cluster upgrade.

In this KB, we assume that the CNI in the cluster is Antrea and that the affected CNI pods are antrea-agent pods.
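One way the daemonset tracks this: each daemonset-managed pod carries a `controller-revision-hash` label, and pods whose hash differs from the daemonset's current ControllerRevision are counted as not up to date. A hedged sketch (the `find_stale_agents` helper is illustrative, not part of any VMware tooling) that picks such pods out of captured `kubectl get pods --show-labels` output:

```shell
# find_stale_agents: given the daemonset's current revision hash ($1) and
# `kubectl get pods --show-labels` output on stdin, print the names of pods
# whose controller-revision-hash label differs -- these are the pods the
# daemonset counts as not up to date.
find_stale_agents() {
    awk -v cur="$1" '
        match($0, /controller-revision-hash=[^,[:space:]]+/) {
            # "controller-revision-hash=" is 25 characters long.
            hash = substr($0, RSTART + 25, RLENGTH - 25)
            if (hash != cur) print $1
        }'
}

# Typical use (assumed cluster context). The newest ControllerRevision for a
# daemonset is named <ds-name>-<hash>; list revisions with:
#   kubectl get controllerrevisions -n kube-system --sort-by=.revision
# then:
#   kubectl get pods -n kube-system --show-labels | grep antrea-agent \
#     | find_stale_agents <current-hash>
```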

Resolution

Locate and restart the antrea-agent pods that the antrea daemonset detects as out of date.

  1. Connect to the affected vSphere Kubernetes cluster.
  2. Check the antrea-agent daemonset and confirm that the UP-TO-DATE count does not match the DESIRED and READY counts:
    • kubectl get ds antrea-agent -n kube-system
  3. Confirm that all antrea-agent pods are in Running state and that there is a pod for each node in the affected cluster:
    • kubectl get pods -n kube-system -o wide | grep antrea-agent
  4. Check if there are any version differences in the labels for any of the antrea-agent pods:
    • kubectl get pods -n kube-system --show-labels | grep antrea-agent
  5. If there are no version differences, check the logs for each antrea-agent pod to determine if there are any stale entries:
    • kubectl logs -n kube-system <antrea-agent pod> | less
  6. Restart the antrea-agent pod(s) with a label difference or stale logs:
    • Note: If there are multiple antrea-agent pods affected, please restart one at a time, waiting for the previous pod to reach Running state before moving onto the next.
    • kubectl delete pod -n kube-system <affected antrea-agent pod>
  7. Confirm that the antrea-agent daemonset UP-TO-DATE count now matches the DESIRED and READY counts:
    • kubectl get ds antrea-agent -n kube-system
  8. The kapp-controller will automatically and periodically reconcile all packageinstalls (pkgi) in the cluster, including antrea.
    • Once it reconciles after the restarts, the pkgi should show Reconcile succeeded state:
      • kubectl get pkgi -A
    • From the Supervisor cluster context, the affected cluster should no longer show Ready False or ClusterBootstrap reconciling:
      • kubectl get tkc <tkc name> -n <affected cluster namespace>
      • kubectl describe tkc <tkc name> -n <affected cluster namespace>
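The restart step above (one pod at a time, waiting for the daemonset to settle before moving on) can be sketched as a small shell function. This is an illustrative helper (the `restart_agents_serially` name is not part of any VMware tooling); it uses the real `kubectl delete pod` and `kubectl rollout status` subcommands:

```shell
# restart_agents_serially: delete each affected antrea-agent pod in turn,
# waiting for the daemonset rollout to report complete before moving to the
# next pod, so the CNI is never restarting on more than one node at once.
restart_agents_serially() {
    for pod in "$@"; do
        kubectl delete pod -n kube-system "$pod" || return 1
        # Block until the daemonset has a Ready replacement for the deleted pod.
        kubectl rollout status ds/antrea-agent -n kube-system --timeout=5m || return 1
    done
}

# Usage, passing the pods identified in steps 4-5:
#   restart_agents_serially antrea-agent-xxxxx antrea-agent-yyyyy
```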