Nodes in a vSphere Kubernetes cluster are recreating in a loop every 10 to 15 minutes.
This occurs because the Container Network Interface (CNI) fails to start on the affected node, causing the system to delete and recreate that node.
In this scenario, the CNI is unable to initialize because a third party application that uses admission webhooks was installed in the affected vSphere Kubernetes cluster.
NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get tkc -n <affected cluster namespace>
NAMESPACE      NAME         CONTROL PLANE   WORKER   READY
my-namespace   my-cluster   X               X        False
kubectl get machine -n <affected cluster namespace>
While connected to the affected vSphere Kubernetes cluster context, the following symptoms are observed:
kubectl get nodes
kubectl describe node <NotReady recreating node name>
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
------- ------ ----------------- ------------------ ------ ------
...
Ready False DAY, DD MON YYYY HH:MM:SS DAY, DD MON YYYY HH:MM:SS KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
kubectl get pods -A -o wide | egrep "antrea|calico"
kubectl get pods -A -o wide | grep -v Run
While SSH'd directly into the affected recreating node, the following symptoms are observed:
crictl ps -a
etcd
kube-apiserver
docker-registry
kube-controller-manager
kube-scheduler
docker-registry
crictl images
systemctl status kubelet
systemctl status containerd
Kubelet logs show error messages similar to the following, where the values enclosed in <> vary by environment:
journalctl -xeu kubelet
"Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
"Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"<webhook service>\": failed to call webhook: Post \"https://<webhook service address>:<port>\": dial tcp:<IP address>:<port>: connect: connection refused" pod="kube-system/docker-registry-<my-worker-node-abc12-def34>"
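The name of the failing webhook can be pulled out of the kubelet log line above and cross-referenced against the cluster's webhook configurations. A minimal sketch against a sample log line (the webhook name, service address, and pod name below are placeholders, not from any real product):

```shell
# Sample kubelet log line mirroring the format above; all names are
# placeholders for illustration only.
LOG='"Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"validate.example-app.io\": failed to call webhook: Post \"https://example-app-webhook-svc.example-app.svc:443\": dial tcp 10.96.0.20:443: connect: connection refused" pod="kube-system/docker-registry-my-worker-node-abc12-def34"'

# Pull out the webhook name between the escaped quotes after
# "failed calling webhook"; this is the name to look for in the
# validatingwebhookconfiguration / mutatingwebhookconfiguration entries.
echo "$LOG" | sed -n 's/.*failed calling webhook \\"\([^\\]*\)\\".*/\1/p'
# prints: validate.example-app.io
```

On a live node, the same extraction could be fed from the `journalctl -xeu kubelet` output instead of the sample variable.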
vSphere 7.0 with Tanzu
vSphere 8.0 with Tanzu
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
The vSphere Kubernetes system routinely performs health checks on all nodes in vSphere Kubernetes clusters.
Although the node's machine object shows a Running state from the Supervisor cluster context, the node shows a NotReady state within the cluster's own context due to the uninitialized Container Network Interface (CNI).
If the system health checks detect that the CNI has been unhealthy or not running for roughly 15 minutes, the system deletes the node and attempts to recreate it.
When a webhook installed in the affected cluster requires that pods be validated against the third party application webhook's service before they can be created, this check can prevent the CNI pod from starting on the affected node.
If the third party application webhook's service is unavailable, the webhook check will fail as per the error message in the Issue/Introduction above:
failed calling webhook \"<webhook service>\": failed to call webhook: Post \"https://<webhook service address>:<port>\"
This leads to a recreation loop until the webhook's service is available or until the requirement that pods must be checked against the webhook's service is removed.
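Whether a down webhook service blocks pod creation depends on the webhook's failurePolicy. A minimal sketch of what such a third party ValidatingWebhookConfiguration can look like (every name and the service reference below are placeholders, not taken from any real product):

```shell
# Hypothetical example of a third party ValidatingWebhookConfiguration;
# all names and the service reference are placeholders.
cat > sample-webhook.yaml <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-app-validating-webhook
webhooks:
  - name: validate.example-app.io
    # failurePolicy: Fail rejects matching requests whenever the
    # webhook service cannot be reached, which is what blocks the
    # CNI pod here; failurePolicy: Ignore would let them through.
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: example-app-webhook-svc
        namespace: example-app
        port: 443
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
EOF

# Surface the fields that determine the blocking behavior.
grep -E '^[[:space:]]*(failurePolicy|operations|resources):' sample-webhook.yaml
```

A webhook whose rules match pod CREATE operations and whose failurePolicy is Fail will block every in-scope pod, including the CNI pods, for as long as its service is unreachable.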
In this scenario, the cause of the recreation loop is the third party application that installed the webhook and its webhook service.
The webhook service could be unavailable due to a networking issue or unhealthy third party application pods which are responsible for the webhook service in the affected cluster.
This could be due to an outage or issue that caused all nodes running the third party application pods to become unreachable, effectively bringing down the third party application and associated webhook service.
The system will attempt to recover the affected nodes by recreating them, but cannot bring up the CNI because of the failing webhook service and downed third party application pods.
Without a functioning CNI, the nodes cannot run the third party application pods responsible for the webhook service.
Due to the above factors, the nodes will continue to recreate in a loop every 10 to 15 minutes.
This scenario will occur when all nodes which originally ran the third party application's pods are stuck in this recreation loop.
NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.
If it is not possible to restore the webhook service to a healthy state, the validatingwebhookconfiguration and/or mutatingwebhookconfiguration that require checks against the webhook service will need to be temporarily removed. Temporarily removing the webhookconfiguration(s) lifts the requirement that all pods be checked against the failing webhook service.
The following steps back up all validatingwebhookconfigurations and mutatingwebhookconfigurations related to the third party application, then delete them so that the Container Network Interface (CNI) can start on the affected node(s). Run these steps while connected to the affected vSphere Kubernetes cluster context.

1. Identify pods that are not in a Running state:
kubectl get pods -A -o wide | grep -v Run

2. List the webhook configurations and identify those belonging to the third party application (these resources are cluster-scoped):
kubectl get validatingwebhookconfiguration
kubectl get mutatingwebhookconfiguration

3. Back up each of the third party application's webhook configurations to a file:
kubectl get validatingwebhookconfiguration <third party application validatingwebhookconfiguration> -o yaml > <third party application validatingwebhookconfiguration>-backup.yaml
kubectl get mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration> -o yaml > <third party application mutatingwebhookconfiguration>-backup.yaml

4. Verify that the backup files contain the expected configuration:
less <third party application validatingwebhookconfiguration>-backup.yaml
less <third party application mutatingwebhookconfiguration>-backup.yaml

5. Delete the third party application's webhook configurations:
kubectl delete validatingwebhookconfiguration <third party application validatingwebhookconfiguration>
kubectl delete mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration>

6. Confirm that the CNI pods start, the affected node(s) reach Ready state, and pods return to a Running state:
kubectl get pods -A -o wide | egrep "antrea|calico"
kubectl get nodes
kubectl get pods -A | grep -v Run

7. Confirm that the third party application's webhook configurations are no longer present:
kubectl get validatingwebhookconfiguration
kubectl get mutatingwebhookconfiguration

8. Once the third party application and its webhook service are healthy again, restore the webhook configurations from the backups:
kubectl apply -f <third party application validatingwebhookconfiguration>-backup.yaml
kubectl apply -f <third party application mutatingwebhookconfiguration>-backup.yaml
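If re-applying a backup is rejected because the file still carries cluster-generated metadata (for example resourceVersion or uid), those fields can be stripped before the apply. A minimal sketch using a hypothetical backup file (the name, uid, and resourceVersion below are placeholders):

```shell
# Hypothetical backup file as produced by "kubectl get ... -o yaml";
# the name, uid, and resourceVersion are placeholders.
cat > example-validatingwebhookconfiguration-backup.yaml <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-app-validating-webhook
  creationTimestamp: "2024-01-01T00:00:00Z"
  resourceVersion: "123456"
  uid: 0f0e0d0c-0b0a-0908-0706-050403020100
webhooks: []
EOF

# Strip cluster-generated metadata so the object can be cleanly
# re-created; re-applying with resourceVersion/uid set can be rejected
# by the API server.
grep -vE '^[[:space:]]*(creationTimestamp|resourceVersion|uid):' \
  example-validatingwebhookconfiguration-backup.yaml \
  > example-validatingwebhookconfiguration-restore.yaml

cat example-validatingwebhookconfiguration-restore.yaml
```

The cleaned file would then be re-applied in place of the raw backup, e.g. kubectl apply -f example-validatingwebhookconfiguration-restore.yaml.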