The TKC or Cluster object of the affected vSphere Kubernetes cluster reports that the cluster is not healthy, even though all nodes and all VMware system and kube-system pods are Running and healthy.
While connected to the Supervisor cluster context, the following symptoms are present:
kubectl get tkc <tkc name> -n <affected cluster namespace>
kubectl describe tkc <tkc name> -n <affected cluster namespace>
Conditions:
...
Status: "ClusterBootstrap conditions indicate it is still reconciling"
Type: ClusterBootstrapReady
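The condition shown above can also be read directly instead of scanning the full describe output. The following is a minimal sketch, not part of the original procedure: the function name tkc_bootstrap_condition is hypothetical, and it assumes the condition is exposed under .status.conditions of the TKC object, as the describe output suggests.

```shell
#!/bin/sh
# Hypothetical helper: print the status and message of the
# ClusterBootstrapReady condition of a TKC, separated by a tab.
# Usage: tkc_bootstrap_condition <tkc name> <affected cluster namespace>
tkc_bootstrap_condition() {
    tkc_name="$1"
    tkc_namespace="$2"
    # Assumption: the conditions are stored under .status.conditions.
    kubectl get tkc "$tkc_name" -n "$tkc_namespace" \
        -o jsonpath='{range .status.conditions[?(@.type=="ClusterBootstrapReady")]}{.status}{"\t"}{.message}{"\n"}{end}'
}
```

Run it from the Supervisor cluster context with the TKC name and namespace as arguments. An empty result would mean that no condition of that type was found on the object.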
While connected to the affected vSphere Kubernetes cluster context, the following symptoms are present:
kubectl get pkgi -A
kubectl describe pkgi <antrea pkgi> -n <antrea pkgi namespace>
kubectl get ds antrea-agent -n kube-system
NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
antrea-agent   X         X         X       Z            X           kubernetes.io/os=<linux/ubuntu>   #d
In this output, the UP-TO-DATE count (Z) does not match the DESIRED count (X).
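The DESIRED and UP-TO-DATE columns correspond to the desiredNumberScheduled and updatedNumberScheduled fields in the daemonset status, so the gap between them can be computed rather than read by eye. The following is a minimal sketch; the function name ds_rollout_gap and its output format are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report how many antrea-agent pods the daemonset
# considers out of date (DESIRED minus UP-TO-DATE).
ds_rollout_gap() {
    counts=$(kubectl get ds antrea-agent -n kube-system \
        -o jsonpath='{.status.desiredNumberScheduled} {.status.updatedNumberScheduled}')
    desired=${counts%% *}
    updated=${counts##* }
    # Stop if the daemonset status could not be read.
    [ -n "$desired" ] || return 1
    # updatedNumberScheduled can be absent from the status when it is zero.
    updated=${updated:-0}
    echo "desired=$desired updated=$updated stale=$((desired - updated))"
}
```

A stale value greater than zero matches the symptom described here, where UP-TO-DATE is lower than DESIRED.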
vSphere 7.0 with Tanzu
vSphere 8.0 with Tanzu
This issue can occur regardless of whether or not this cluster is managed by TMC.
vSphere Kubernetes cluster node health depends on the Container Network Interface (CNI) being in a healthy, up-to-date state on every node in the cluster.
In this scenario, the daemonset that manages the per-node CNI pods reports that not all CNI pods are up to date.
This occurs regardless of the actual state of the CNI pods. The CNI pods may all be Running and functional, but the daemonset detects that some of them have not been updated with the latest changes to the CNI in the cluster. This out-of-date condition can be triggered by a change to the cluster, such as a cluster upgrade.
This KB assumes that the CNI in the cluster is Antrea and that the affected CNI pods are the antrea-agent pods.
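One way to see which pods a daemonset treats as out of date is the controller-revision-hash label that Kubernetes sets on daemonset pods: pods carrying a hash other than the current revision have not picked up the latest template. The following is a minimal sketch under that assumption. The function name stale_agent_pods is hypothetical, and the current revision hash must be supplied by the caller (for example, after inspecting the controllerrevisions in kube-system); it is not derived here.

```shell
#!/bin/sh
# Hypothetical helper: list antrea-agent pods whose controller-revision-hash
# label differs from the revision hash passed as the first argument.
# Usage: stale_agent_pods <current revision hash>
stale_agent_pods() {
    current_hash="$1"
    # Print "<pod name><TAB><hash>" for every pod in kube-system, then keep
    # only antrea-agent pods whose hash is not the current one.
    kubectl get pods -n kube-system \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.controller-revision-hash}{"\n"}{end}' |
        awk -F '\t' -v current="$current_hash" \
            '$1 ~ /^antrea-agent-/ && $2 != current { print $1 }'
}
```

The same label is visible in the --show-labels output, so the result can be cross-checked by hand before any pod is restarted.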
Locate and restart the antrea-agent pods that the antrea-agent daemonset reports as not up to date.
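While connected to the affected vSphere Kubernetes cluster context, check the antrea-agent daemonset and list the antrea-agent pods with their nodes and labels:
kubectl get ds antrea-agent -n kube-system
kubectl get pods -n kube-system -o wide | grep antrea-agent
kubectl get pods -n kube-system --show-labels | grep antrea-agent
Review the logs of the antrea-agent pods:
kubectl logs -n kube-system <antrea-agent pod> | less
Delete each affected antrea-agent pod so that the daemonset recreates it:
kubectl delete pod -n kube-system <affected antrea-agent pod>
Confirm that the daemonset and the package installs are healthy:
kubectl get ds antrea-agent -n kube-system
kubectl get pkgi -A
While connected to the Supervisor cluster context, confirm that the TKC is healthy:
kubectl get tkc <tkc name> -n <affected cluster namespace>
kubectl describe tkc <tkc name> -n <affected cluster namespace>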
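When several antrea-agent pods are affected, the delete step can be repeated from a list of pod names. The following is a minimal sketch, not a supported tool: the function name restart_agent_pods is hypothetical, it expects one pod name per line on standard input, and it does not wait for each replacement pod to become Ready before it deletes the next one.

```shell
#!/bin/sh
# Hypothetical helper: delete the antrea-agent pods named on stdin, one at a
# time, so that the daemonset recreates each of them.
restart_agent_pods() {
    while IFS= read -r pod_name; do
        # Skip blank lines in the input.
        [ -n "$pod_name" ] || continue
        # Redirect stdin so the command cannot consume the remaining names.
        kubectl delete pod -n kube-system "$pod_name" </dev/null || return 1
    done
    # Show the daemonset counters again so UP-TO-DATE can be compared to DESIRED.
    kubectl get ds antrea-agent -n kube-system
}
```

Run it from the affected cluster context, for example by piping a reviewed list of pod names into restart_agent_pods. Deleting agents one node at a time limits how many nodes are without a running CNI agent at once.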