In a vSphere Supervisor environment, nodes in a workload cluster are recreating in a loop every 10 to 15 minutes.
As a result, a workload cluster upgrade can become stuck or fail to progress.
This is due to the Container Network Interface (CNI) failing to start on the affected node, which causes the system to recreate the node.
In this scenario, the CNI is unable to initialize because of webhooks installed by a third party application in the affected workload cluster.
NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.
While connected to the Supervisor cluster context, one or more of the following symptoms are observed:
kubectl get tkc -n <affected cluster namespace>
NAMESPACE      NAME         CONTROL PLANE   WORKER   READY
my-namespace   my-cluster   X               X        False
kubectl get machine -n <affected cluster namespace>
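The machine objects for the affected nodes may show a low AGE, indicating that they were recently recreated. Output similar to the following may be observed (this sample is illustrative; names, phases, and versions vary by environment):
NAME                            CLUSTER      NODENAME                        PHASE     AGE   VERSION
my-cluster-workers-xxxx-xxxxx   my-cluster   my-cluster-workers-xxxx-xxxxx   Running   7m    v1.xx.x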
While connected to the affected workload cluster context, the following symptoms are observed:
kubectl get nodes
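Output similar to the following may be observed, where the recreating node shows NotReady and a low AGE (names are illustrative):
NAME                            STATUS     ROLES           AGE   VERSION
my-cluster-control-plane-xxxx   Ready      control-plane   30d   v1.xx.x
my-cluster-workers-xxxx-xxxxx   NotReady   <none>          9m    v1.xx.x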
kubectl describe node <NotReady recreating node name>
Conditions:
  Type    Status   LastHeartbeatTime           LastTransitionTime          Reason            Message
  ----    ------   -----------------           ------------------          ------            -------
  ...
  Ready   False    DAY, DD MON YYYY HH:MM:SS   DAY, DD MON YYYY HH:MM:SS   KubeletNotReady   container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
kubectl get pods -A -o wide | egrep "antrea|calico"
kubectl describe pod -n <cni namespace> <cni pod name>
Failed to pull image "localhost:5000/tkg/packages/core/<cni>@sha256:<hash>": rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/tkg/packages/core/<cni>@sha256:<hash>": failed to resolve reference "localhost:5000/tkg/packages/core/<cni>@sha256:<hash>": localhost:5000/tkg/packages/core/<cni>@sha256:<hash>: not found
kubectl get replicaset,daemonset -n kube-system
NAME                                          DESIRED   CURRENT   READY
replicaset.apps/<CNI controller replicaset>   X         X         X

NAME                                          DESIRED   CURRENT   READY
daemonset.apps/<CNI node daemonset>           X         X         X
kubectl describe replicaset -n kube-system <CNI replicaset-name>
kubectl describe daemonset -n kube-system <CNI daemonset-name>
Internal error occurred: failed calling webhook "<webhook service>": failed to call webhook: Post "https://<webhook service address>:<port>/<action>/fail?timeout=10s": dial tcp <webhook service address>:443: connect: connection refused
While connected via SSH directly to the affected recreating node, the following symptoms may be observed:
crictl ps
The output may show only core system containers, with no CNI container present, for example:
etcd
kube-apiserver
docker-registry
kube-controller-manager
kube-scheduler
crictl images
The CNI image may be missing from the node's local image list because the image pull is failing.
systemctl status kubelet
systemctl status containerd
Kubelet logs may show error messages similar to the following, where the values enclosed in <> vary by environment:
journalctl -xeu kubelet
"Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
"Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"<webhook service>\": failed to call webhook: Post \"https://<webhook service address>:<port>\" pod="kube-system/docker-registry-<worker-node-id>"
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
The vSphere Kubernetes system routinely performs health checks on all nodes in workload clusters.
Although the node's machine object shows a Running state from the Supervisor cluster context, the node shows a NotReady state within the cluster's own context due to the uninitialized Container Network Interface (CNI).
If the system health checks detect that the CNI has remained unhealthy or not running for roughly 15 minutes, the system drains pods from the node, deletes the node, and attempts to recreate it.
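In Supervisor environments these health checks are typically driven by Cluster API MachineHealthCheck objects. As a hedged example (resource names vary by environment), the configured unhealthy conditions and timeout can be inspected from the Supervisor cluster context:
kubectl get machinehealthcheck -n <affected cluster namespace>
kubectl describe machinehealthcheck -n <affected cluster namespace> <machinehealthcheck name>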
If a webhook installed in the affected cluster requires that pods be checked against the third party application's webhook service before they can be created, the CNI pod can be prevented from starting on the affected node.
This can lead to the following scenarios:
Scenario 1 - Unavailable Webhook Service: If the third party application webhook's service is unavailable, the webhook check will fail as per the error message in the Issue/Introduction above:
Internal error occurred: failed calling webhook "<webhook service>": failed to call webhook: Post "https://<webhook service address>:<port>/<action>/fail?timeout=10s": dial tcp <webhook service address>:443: connect: connection refused
This leads to a recreation loop until the webhook's service is available or until the requirement that CNI pods must be checked against the webhook's service is removed.
In this scenario, the cause of the recreation loop is the third party application that installed the webhook and its webhook service.
The webhook service could be unavailable due to a networking issue or unhealthy third party application pods which are responsible for the webhook service in the affected cluster.
This could be due to an outage or issue that caused all nodes running the third party application pods to become unreachable, effectively bringing down the third party application and associated webhook service.
The system will attempt to recover the affected nodes by recreating them, but cannot bring up the CNI because of the failing webhook service and downed third party application pods.
Without a functioning CNI, the nodes cannot run the third party application pods responsible for the webhook service.
Due to the above factors, the nodes will continue to recreate in a loop every 10 to 15 minutes.
This scenario occurs when all of the nodes that originally ran the third party application's pods are stuck in this recreation loop.
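To help confirm this scenario, you can check whether the webhook is configured to reject requests when its service is unreachable (failurePolicy: Fail). The configuration name below is a placeholder:
kubectl get validatingwebhookconfiguration <third party application validatingwebhookconfiguration> -o jsonpath='{range .webhooks[*]}{.name}{": failurePolicy="}{.failurePolicy}{"\n"}{end}'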
Scenario 2 - Third Party Webhook Configuration Issues: The third party application's webhook service is healthy, but it is configured to check and block the creation of certain Kubernetes resources in the workload cluster.
If this webhook is set to prevent certain resources from starting in certain namespaces, new pods can fail to start because the webhook service denies the resource.
This leads to a recreation loop in which the CNI pod cannot start because it does not meet the third party webhook service's requirements.
Most frequently, the third party application webhook service is preventing pods from starting in namespaces other than its own.
As a result, the CNI pod fails to pull its necessary image and remains in ImagePullBackOff state until the webhook requirements are relaxed or removed.
This scenario occurs whenever a newly created pod is denied by the webhook service, for example when a new pod attempts to start on a newly created node.
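For illustration only, a validating webhook configured like the following hypothetical example (failurePolicy: Fail, matching pod creation in every namespace except the application's own) would block CNI pods in kube-system. All names in this sketch are placeholders, not the actual third party configuration:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: thirdparty-policy                  # hypothetical name
webhooks:
  - name: pods.thirdparty.example.com      # hypothetical webhook
    failurePolicy: Fail                    # requests are denied if the check cannot complete
    clientConfig:
      service:
        name: thirdparty-webhook-svc       # hypothetical service backing the webhook
        namespace: thirdparty-ns
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    namespaceSelector:                     # exempts only the application's own namespace
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["thirdparty-ns"]
    sideEffects: None
    admissionReviewVersions: ["v1"]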
As a workaround, the validatingwebhookconfiguration and/or mutatingwebhookconfiguration that correspond to and require checks against the webhook service can be temporarily removed. Deleting the webhookconfiguration(s) temporarily lifts the requirement that all pods be checked against the third party application's webhook service.
NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.
The following steps, run while connected to the affected workload cluster context, take a backup of all validatingwebhookconfigurations and mutatingwebhookconfigurations related to the third party application, then delete them to allow the Container Network Interface (CNI) to start up on the affected node(s).
Identify all pods that are not in a Running state, including the CNI pods:
kubectl get pods -A -o wide | grep -v Run
List all webhook configurations in the cluster and identify those belonging to the third party application:
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration -A
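Output similar to the following may be returned, where expected system webhooks (for example, Antrea's) appear alongside the third party application's entries (the third party names below are placeholders):
NAME                                                                                  WEBHOOKS   AGE
validatingwebhookconfiguration.admissionregistration.k8s.io/crdvalidator.antrea.io    1          30d
validatingwebhookconfiguration.admissionregistration.k8s.io/<third party name>        1          10d
mutatingwebhookconfiguration.admissionregistration.k8s.io/crdmutator.antrea.io        1          30d
mutatingwebhookconfiguration.admissionregistration.k8s.io/<third party name>          1          10d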
Back up each of the third party application's webhook configurations:
kubectl get validatingwebhookconfiguration <third party application validatingwebhookconfiguration> -o yaml > <third party application validatingwebhookconfiguration>-backup.yaml
kubectl get mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration> -o yaml > <third party application mutatingwebhookconfiguration>-backup.yaml
Review the backup files to confirm that they are complete before deleting anything:
less <third party application validatingwebhookconfiguration>-backup.yaml
less <third party application mutatingwebhookconfiguration>-backup.yaml
Delete the third party application's webhook configurations so that the CNI pods can start:
kubectl delete validatingwebhookconfiguration <third party application validatingwebhookconfiguration>
kubectl delete mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration>
Confirm that the CNI pods start successfully, the affected nodes reach Ready state, and no pods remain stuck:
kubectl get pods -A -o wide | egrep "antrea|calico"
kubectl get nodes
kubectl get pods -A | grep -v Run
Once the cluster is healthy, restore the original state. Note the third party application deployment's current replica count, then scale the deployment down:
kubectl get deployment -n <third party application namespace>
kubectl scale deployment -n <third party application namespace> <third party application deployment name> --replicas=0
Confirm which webhook configurations are currently present, then re-apply the backed-up configurations:
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration -A
kubectl apply -f <third party application validatingwebhookconfiguration>-backup.yaml
kubectl apply -f <third party application mutatingwebhookconfiguration>-backup.yaml
Scale the third party application deployment back up to its original replica count:
kubectl scale deploy -n <third party application namespace> <third party application deployment name> --replicas=<original replica count>
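Finally, verify that the third party application's pods return to a Running state and that its webhook configurations are back in place (names and namespace are placeholders):
kubectl get pods -n <third party application namespace> -o wide
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration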
Broadcom is not responsible for and cannot provide guidance on the configuration of third party applications.
Any issues with webhooks installed by a third party application should be escalated to the third party application owner.
Third party webhooks known to cause workload cluster upgrade issues:
Expected system webhooks in the environment would be related to the CNI or any installed packages (PKGI) in the workload cluster.
For example, the expected system Antrea webhook configurations are crdvalidator.antrea.io and crdmutator.antrea.io.
Future Considerations
For webhooks that prevent image pulls and pod creation based on namespace, allow the namespaces that are integral to VKS cluster lifecycle events, including kube-system and the service namespaces returned by the following command:
kubectl get ns | grep svc-tkg
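As a hedged sketch only (the label-based exemption mechanism is standard Kubernetes; the namespace values are placeholders to be replaced with the output above), such namespaces can be exempted in a webhook configuration with a namespaceSelector:
namespaceSelector:
  matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
        - kube-system
        - <svc-tkg namespace from the command above>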