Nodes in a vSphere Kubernetes cluster are recreating in a loop every 10 to 15 minutes.
This occurs because the Container Network Interface (CNI) fails to start on the affected node, causing the system to delete and recreate that node.
In this scenario, the CNI is unable to initialize because a third party application that uses admission webhooks was installed in the affected vSphere Kubernetes cluster.
NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get tkc -n <affected cluster namespace>
NAMESPACE      NAME         CONTROL PLANE   WORKER   READY
my-namespace   my-cluster   X               X        False
kubectl get machine -n <affected cluster namespace>
While connected to the affected vSphere Kubernetes cluster context, the following symptoms are observed:
kubectl get nodes
kubectl describe node <NotReady recreating node name>
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
------- ------ ----------------- ------------------ ------ ------
...
Ready False DAY, DD MON YYYY HH:MM:SS DAY, DD MON YYYY HH:MM:SS KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
kubectl get pods -A -o wide | egrep "antrea|calico"
kubectl get pods -A -o wide | grep -v Run
While SSH'd directly into the affected recreating node, the following symptoms are observed:
crictl ps -a
etcd
kube-apiserver
docker-registry
kube-controller-manager
kube-scheduler
docker-registry
crictl images
systemctl status kubelet
systemctl status containerd
Kubelet logs show error messages similar to the following, where the values enclosed in <> vary by environment:
journalctl -xeu kubelet
"Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
"Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"<webhook service>\": failed to call webhook: Post \"https://<webhook service address>:<port>\": dial tcp:<IP address>:<port>: connect: connection refused" pod="kube-system/docker-registry-<my-worker-node-abc12-def34>"
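The name of the failing webhook can be pulled out of the kubelet log line above and cross-referenced against the cluster's webhook configurations. A minimal sketch against a sample log line (the webhook name, service address, and pod name below are placeholders, not from any real product):

```shell
# Sample kubelet log line mirroring the format above; all names are
# placeholders for illustration only.
LOG='"Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"validate.example-app.io\": failed to call webhook: Post \"https://example-app-webhook-svc.example-app.svc:443\": dial tcp 10.96.0.20:443: connect: connection refused" pod="kube-system/docker-registry-my-worker-node-abc12-def34"'

# Pull out the webhook name between the escaped quotes after
# "failed calling webhook"; this is the name to look for in the
# validatingwebhookconfiguration / mutatingwebhookconfiguration entries.
echo "$LOG" | sed -n 's/.*failed calling webhook \\"\([^\\]*\)\\".*/\1/p'
# prints: validate.example-app.io
```

On a live node, the same extraction could be fed from the `journalctl -xeu kubelet` output instead of the sample variable.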
vSphere 7.0 with Tanzu
vSphere 8.0 with Tanzu
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
The vSphere Kubernetes system routinely performs health checks on all nodes in vSphere Kubernetes clusters.
Although the node's machine object shows a Running state from the Supervisor cluster context, the node shows a NotReady state within the cluster's own context due to the uninitialized Container Network Interface (CNI).
If the system health checks detect that the CNI has been unhealthy or not running for roughly 15 minutes, the system deletes the node and attempts to recreate it.
When a webhook installed in the affected cluster requires that pods be validated against the third party application webhook's service before they can be created, this check can prevent the CNI pod from starting on the affected node.
If the third party application webhook's service is unavailable, the webhook check will fail as per the error message in the Issue/Introduction above:
failed calling webhook \"<webhook service>\": failed to call webhook: Post \"https://<webhook service address>:<port>\"
This leads to a recreation loop until the webhook's service is available or until the requirement that pods must be checked against the webhook's service is removed.
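Whether a down webhook service blocks pod creation depends on the webhook's failurePolicy. A minimal sketch of what such a third party ValidatingWebhookConfiguration can look like (every name and the service reference below are placeholders, not taken from any real product):

```shell
# Hypothetical example of a third party ValidatingWebhookConfiguration;
# all names and the service reference are placeholders.
cat > sample-webhook.yaml <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-app-validating-webhook
webhooks:
  - name: validate.example-app.io
    # failurePolicy: Fail rejects matching requests whenever the
    # webhook service cannot be reached, which is what blocks the
    # CNI pod here; failurePolicy: Ignore would let them through.
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: example-app-webhook-svc
        namespace: example-app
        port: 443
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
EOF

# Surface the fields that determine the blocking behavior.
grep -E '^[[:space:]]*(failurePolicy|operations|resources):' sample-webhook.yaml
```

A webhook whose rules match pod CREATE operations and whose failurePolicy is Fail will block every in-scope pod, including the CNI pods, for as long as its service is unreachable.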
In this scenario, the cause of the recreation loop is the third party application that installed the webhook and its webhook service.
The webhook service could be unavailable due to a networking issue or unhealthy third party application pods which are responsible for the webhook service in the affected cluster.
This could be due to an outage or issue that caused all nodes running the third party application pods to become unreachable, effectively bringing down the third party application and associated webhook service.
The system will attempt to recover the affected nodes by recreating them, but cannot bring up the CNI because of the failing webhook service and downed third party application pods.
Without a functioning CNI, the nodes cannot run the third party application pods responsible for the webhook service.
Due to the above factors, the nodes will continue to recreate in a loop every 10 to 15 minutes.
This scenario will occur when all nodes which originally ran the third party application's pods are stuck in this recreation loop.
NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.
If it is not possible to restore the webhook service to a healthy state, the validatingwebhookconfiguration and/or mutatingwebhookconfiguration that require checks against the webhook service will need to be temporarily removed. Temporarily removing the webhookconfiguration(s) lifts the requirement that all pods be checked against the failing webhook service.
The following steps back up all validatingwebhookconfigurations and mutatingwebhookconfigurations related to the third party application, then delete them so that the Container Network Interface (CNI) can start on the affected node(s). Run these steps while connected to the affected vSphere Kubernetes cluster context.

1. Identify pods that are not in a Running state:
kubectl get pods -A -o wide | grep -v Run

2. List the webhook configurations and identify those belonging to the third party application (these resources are cluster-scoped):
kubectl get validatingwebhookconfiguration
kubectl get mutatingwebhookconfiguration

3. Back up each of the third party application's webhook configurations to a file:
kubectl get validatingwebhookconfiguration <third party application validatingwebhookconfiguration> -o yaml > <third party application validatingwebhookconfiguration>-backup.yaml
kubectl get mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration> -o yaml > <third party application mutatingwebhookconfiguration>-backup.yaml

4. Verify that the backup files contain the expected configuration:
less <third party application validatingwebhookconfiguration>-backup.yaml
less <third party application mutatingwebhookconfiguration>-backup.yaml

5. Delete the third party application's webhook configurations:
kubectl delete validatingwebhookconfiguration <third party application validatingwebhookconfiguration>
kubectl delete mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration>

6. Confirm that the CNI pods start, the affected node(s) reach Ready state, and pods return to a Running state:
kubectl get pods -A -o wide | egrep "antrea|calico"
kubectl get nodes
kubectl get pods -A | grep -v Run

7. Confirm that the third party application's webhook configurations are no longer present:
kubectl get validatingwebhookconfiguration
kubectl get mutatingwebhookconfiguration

8. Once the third party application and its webhook service are healthy again, restore the webhook configurations from the backups:
kubectl apply -f <third party application validatingwebhookconfiguration>-backup.yaml
kubectl apply -f <third party application mutatingwebhookconfiguration>-backup.yaml
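If re-applying a backup is rejected because the file still carries cluster-generated metadata (for example resourceVersion or uid), those fields can be stripped before the apply. A minimal sketch using a hypothetical backup file (the name, uid, and resourceVersion below are placeholders):

```shell
# Hypothetical backup file as produced by "kubectl get ... -o yaml";
# the name, uid, and resourceVersion are placeholders.
cat > example-validatingwebhookconfiguration-backup.yaml <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-app-validating-webhook
  creationTimestamp: "2024-01-01T00:00:00Z"
  resourceVersion: "123456"
  uid: 0f0e0d0c-0b0a-0908-0706-050403020100
webhooks: []
EOF

# Strip cluster-generated metadata so the object can be cleanly
# re-created; re-applying with resourceVersion/uid set can be rejected
# by the API server.
grep -vE '^[[:space:]]*(creationTimestamp|resourceVersion|uid):' \
  example-validatingwebhookconfiguration-backup.yaml \
  > example-validatingwebhookconfiguration-restore.yaml

cat example-validatingwebhookconfiguration-restore.yaml
```

The cleaned file would then be re-applied in place of the raw backup, e.g. kubectl apply -f example-validatingwebhookconfiguration-restore.yaml.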