vSphere Workload Cluster Nodes Recreating or Cluster Upgrade Stuck - Antrea/Calico CNI not initialized due to Third Party Webhook

Article ID: 387563


Products

  • VMware vSphere Kubernetes Service
  • vSphere with Tanzu

Issue/Introduction

In a vSphere Supervisor environment, nodes in a workload cluster are recreating in a loop every 10 to 15 minutes.

As a result, a workload cluster upgrade can become stuck or fail to progress.

This occurs because the Container Network Interface (CNI) fails to start on the affected node, which causes the system to recreate the node.

  • Supported Container Network Interfaces are Antrea and Calico.

In this scenario, the CNI is unable to initialize because a third party application that uses webhooks was installed in the affected workload cluster.

NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.

 

While connected to the Supervisor cluster context, one or more of the following symptoms are observed:

  • If the affected cluster is a non-classy cluster, the TKC object shows Ready False state:
    kubectl get tkc -n <affected cluster namespace>
    
    NAME           CONTROL PLANE     WORKER     READY
    my-cluster           X              X       False
  • Machines for the affected cluster reach Running state but recreate after 10 to 15 minutes, as shown in the example output after this list:
    kubectl get machine -n <affected cluster namespace>
  • If a workload cluster upgrade is in progress, it may be stuck at the control plane node upgrade phase due to the constant recreations described above.
    • Worker nodepools do not upgrade until all control plane nodes have upgraded successfully and stabilized as healthy.
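
For the machine check above, the AGE column is the clearest indicator of a recreation loop: affected machines never age beyond roughly 15 minutes before being replaced. A hypothetical example of what this can look like (names, columns, and values vary by environment and version):

    kubectl get machine -n my-namespace
    
    NAME                               CLUSTER      NODENAME                         PROVIDERID       PHASE          AGE   VERSION
    my-cluster-control-plane-abcde     my-cluster   my-cluster-control-plane-abcde   vsphere://<id>   Running        12m   v1.XX.X
    my-cluster-nodepool-1-xxxx-fghij   my-cluster                                    vsphere://<id>   Provisioning   2m    v1.XX.X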

 

While connected to the affected workload cluster context, the following symptoms are observed:

  • The recreating nodes are in NotReady status:
    kubectl get nodes


  • Performing a describe on the NotReady recreating node returns an error message similar to the following under Conditions:
    kubectl describe node <NotReady recreating node name>
    
    Conditions:
    Type      Status   LastHeartbeatTime          LastTransitionTime           Reason            Message
    -------   ------   -----------------          ------------------           ------            ------
    ...
    Ready     False    DAY, DD MON YYYY HH:MM:SS  DAY, DD MON YYYY HH:MM:SS    KubeletNotReady   container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  • A container network interface (CNI) is not Running on the affected node(s). Antrea or Calico CNI are supported:
    kubectl get pods -A -o wide | egrep "antrea|calico"
  • The antrea or calico CNI pods on the new node are failing in ImagePullBackOff state, with error messages similar to the following when described:
    kubectl describe pod -n <cni namespace> <cni pod name>
    
    Failed to pull image "localhost:5000/tkg/packages/core/<cni>@sha256:<hash>": rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/tkg/packages/core/<cni>@sha256:<hash>": failed to resolve reference "localhost:5000/tkg/packages/core/<cni>@sha256:<hash>": localhost:5000/tkg/packages/core/<cni>@sha256:<hash>: not found
  • Checking the corresponding antrea or calico CNI replicaset and daemonset shows that they are not healthy on one or more nodes:
    kubectl get replicaset,daemonset -n kube-system
    
    NAME                                          DESIRED   CURRENT   READY
    replicaset.apps/<CNI controller replicaset>      X         X        X
    
    NAME                                  DESIRED   CURRENT   READY
    daemonset.apps/<CNI node daemonset>      X         X        X


  • Describing the antrea/calico CNI replicaset or daemonset shows an error message similar to the following:
    kubectl describe replicaset -n kube-system <CNI replicaset-name>
    
    kubectl describe daemonset -n kube-system <CNI daemonset-name>
    
    
    Internal error occurred: failed calling webhook "<webhook service>": failed to call webhook: Post "https://<webhook service address>:<port>/<action>/fail?timeout=10s": dial tcp <webhook service address>:443: connect: connection refused

 

While connected via SSH directly to the affected recreating node, the following symptoms may be observed:

  • The following containers are running on the affected node, but no antrea or calico CNI container is Running:
    crictl ps
    • If the affected recreating node is a control plane node, check for the below containers:
      • etcd
      • kube-apiserver
      • docker-registry
      • kube-controller-manager
      • kube-scheduler
    • If the affected recreating node is a worker node, check for the below container:
      • docker-registry
  • The antrea or calico CNI image is not present on the affected node:
    crictl images
  • Kubelet and Containerd system processes are running:
    systemctl status kubelet
    
    systemctl status containerd
  • Kubelet logs may show error messages similar to the following, where values enclosed by <> vary by environment:

    journalctl -xeu kubelet
    
    "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
    
    "Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"<webhook service>\": failed to call webhook: Post \"https://<webhook service address>:<port>\" pod="kube-system/docker-registry-<worker-node-id>"

Environment

vSphere Supervisor

This issue can occur regardless of whether the cluster is managed by VMware Tanzu Mission Control (TMC).

Cause

The vSphere Kubernetes system routinely performs health checks on all nodes in workload clusters.

Although the node's machine object will show as Running state from the Supervisor cluster context, the node shows NotReady state within the cluster's context due to the uninitialized Container Network Interface (CNI).

If the system health checks detect that the CNI has been unhealthy or not running for approximately 15 minutes, the system will drain pods from the node, delete the node, and attempt to recreate it.

When a webhook installed in the affected cluster requires that pods be checked against the third party application's webhook service before they are allowed to be created, it can prevent the CNI pod from starting on the affected node.
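
Whether a given webhook can block pod creation in this way is visible in its configuration. The following is an illustrative excerpt only, using placeholder names; the failurePolicy and rules fields follow the standard admissionregistration.k8s.io/v1 schema:

kubectl get validatingwebhookconfiguration <third party webhook name> -o yaml

webhooks:
- name: <webhook service>
  failurePolicy: Fail        # requests are rejected when the webhook service cannot be reached
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]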

 

This can lead to the following scenarios:

Scenario 1 - Unavailable Webhook Service: If the third party application webhook's service is unavailable, the webhook check will fail as per the error message in the Issue/Introduction above:

Internal error occurred: failed calling webhook "<webhook service>": failed to call webhook: Post "https://<webhook service address>:<port>/<action>/fail?timeout=10s": dial tcp <webhook service address>:443: connect: connection refused

This leads to a recreation loop until the webhook's service is available or until the requirement that CNI pods must be checked against the webhook's service is removed.

In this scenario, the cause of the recreation loop is the third party application that installed the webhook and its webhook service.

The webhook service could be unavailable due to a networking issue or unhealthy third party application pods which are responsible for the webhook service in the affected cluster.

This could be due to an outage or issue that caused all nodes running the third party application pods to become unreachable, effectively bringing down the third party application and associated webhook service.

The system will attempt to recover the affected nodes by recreating them, but cannot bring up the CNI because of the failing webhook service and downed third party application pods.

Without a functioning CNI, the nodes cannot run the third party application pods responsible for the webhook service.

Due to the above factors, the nodes will continue to recreate in a loop every 10 to 15 minutes.

This scenario will occur when all nodes which originally ran the third party application's pods are stuck in this recreation loop.
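
To confirm this scenario from the affected cluster context, check whether the webhook's Service still has healthy backing endpoints (the namespace is a placeholder and varies by product):

kubectl get service,endpoints -n <third party application namespace>

An Endpoints object with no addresses listed indicates that no healthy pods back the webhook service, which matches the "connection refused" error above.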

 

Scenario 2 - Third Party Webhook Configuration Issues: The third party application's webhook service is healthy, but it uses webhooks to validate and block the creation of Kubernetes resources in the workload cluster.

If the webhook is configured to prevent certain resources from starting in certain namespaces, new pods can fail to start because the webhook service denies them.

This leads to a recreation loop in which the CNI pod cannot start because it does not meet the requirements enforced by the third party webhook service.

Most frequently, the third party application webhook service is preventing pods from starting in namespaces other than the third party webhook service's namespace.

As a result, the CNI pod fails to pull its necessary image and remains in ImagePullBackOff state until the webhook requirements are relaxed or removed. 

This scenario will occur when a newly created pod is denied by this webhook service, for example when the new pod is trying to start up on a newly created node.
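
For this scenario, the webhook configuration itself shows which namespaces and resources it applies to. A hypothetical check, using placeholder names:

kubectl get validatingwebhookconfiguration <third party webhook name> -o yaml | grep -A 10 namespaceSelector

If the namespaceSelector does not exempt system namespaces such as kube-system, newly created CNI and registry pods in those namespaces remain subject to the webhook's policy.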

Resolution

As a workaround, the validatingwebhookconfiguration and/or mutatingwebhookconfiguration that require checks against the webhook service can be temporarily removed. Removing the webhookconfiguration(s) temporarily lifts the requirement that all pods must be checked against the third party application's webhook service.

NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications. Any issues with webhooks installed by a third party application should be escalated to the third party application owner.

 

The following steps detail taking a backup of all validatingwebhookconfigurations and mutatingwebhookconfigurations related to the third party application, then deleting them so that the Container Network Interface (CNI) can start on the affected node(s).

  1. Connect to the affected workload cluster context.

  2. Check the status of the webhook service pods in the cluster. These pods will likely be in an unhealthy or Pending state:
    Pending state indicates that the pod has not been scheduled on a node and has not started.
    kubectl get pods -A -o wide | grep -v Run

     

  3. Locate the validatingwebhookconfigurations and mutatingwebhookconfigurations for the third party application and webhook service:
    kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration -A
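    
    A hypothetical example of what this can return when Gatekeeper is installed (names vary by product and version):
    
    NAME                                                                                                      WEBHOOKS   AGE
    validatingwebhookconfiguration.admissionregistration.k8s.io/gatekeeper-validating-webhook-configuration   1          XXd
    
    NAME                                                                                                      WEBHOOKS   AGE
    mutatingwebhookconfiguration.admissionregistration.k8s.io/gatekeeper-mutating-webhook-configuration       1          XXd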

     

  4. Take backups of each validatingwebhookconfiguration and mutatingwebhookconfiguration for the third party application and webhook service:
    IMPORTANT: Only touch the webhookconfigurations associated with the third party application and webhook service.
    kubectl get validatingwebhookconfiguration <third party application validatingwebhookconfiguration> -o yaml > <third party application validatingwebhookconfiguration>-backup.yaml
    
    kubectl get mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration> -o yaml > <third party application mutatingwebhookconfiguration>-backup.yaml

     

  5. Confirm that the created backup yamls contain the expected validatingwebhookconfiguration and mutatingwebhookconfiguration information:
    less <third party application validatingwebhookconfiguration>-backup.yaml
    
    less <third party application mutatingwebhookconfiguration>-backup.yaml

     

  6. IMPORTANT: If you are connected directly (via SSH) to a control plane node, copy the backups to another location (e.g. the Supervisor cluster or a jumpbox machine); see the example after the notes below.
    • The intention is to restore these webhookconfigurations after the CNI is able to start up and reach healthy state.
    • It is advised to copy the backups elsewhere in the event that the current node is recreated or deleted.
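    
    A minimal example of copying the backups off the node, assuming a reachable jumpbox (the user, host, and destination path are placeholders):
    
    scp <third party application validatingwebhookconfiguration>-backup.yaml <user>@<jumpbox address>:<destination path>
    
    scp <third party application mutatingwebhookconfiguration>-backup.yaml <user>@<jumpbox address>:<destination path>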

  7. Perform deletions on each validatingwebhookconfiguration and mutatingwebhookconfiguration associated with the third party application and webhook service:

    CAUTION: Only delete the webhookconfigurations associated with the third party application and webhook service.

    Deleting system Kubernetes webhookconfigurations will mark the cluster as unsupported and potentially irrecoverable.

    kubectl delete validatingwebhookconfiguration <third party application validatingwebhookconfiguration>
    
    kubectl delete mutatingwebhookconfiguration <third party application mutatingwebhookconfiguration>

     

  8. Confirm that the CNI is started on all nodes. This may take some time to reconcile.
    kubectl get pods -A -o wide | egrep "antrea|calico"

     

  9. Check that all nodes are in Ready state:
    kubectl get nodes

     

  10. Note the status of the third party application pods:
    kubectl get pods -A | grep -v Run
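    
    For example, with a hypothetical Gatekeeper installation whose webhook service pods could not be scheduled, the output might resemble (illustrative only):
    
    NAMESPACE           NAME                                     READY   STATUS    RESTARTS   AGE
    gatekeeper-system   gatekeeper-controller-manager-<suffix>   0/1     Pending   0          XXm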

     

  11. If the CNI image still cannot be pulled, all of the third party application's pods may need to be temporarily scaled down; see the example after the commands below:
    This step is known to be necessary for Gatekeeper pods. Note down the original replica count.
    kubectl get deployment -n <third party application namespace>
    
    kubectl scale deployment -n <third party application namespace> <third party application deployment name> --replicas=0
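    
    For example, with a hypothetical Gatekeeper installation (deployment and namespace names vary by version):
    
    kubectl scale deployment -n gatekeeper-system gatekeeper-controller-manager --replicas=0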

     

  12. Check if the third party application's validatingwebhookconfiguration and mutatingwebhookconfiguration were automatically recreated:
    kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration -A

     

  13. Once the upgrade or change completes successfully:
    Recreate the third party application's validatingwebhookconfiguration and mutatingwebhookconfiguration from the backups if they were not automatically recreated:
    kubectl apply -f <third party application validatingwebhookconfiguration>-backup.yaml
    
    kubectl apply -f <third party application mutatingwebhookconfiguration>-backup.yaml

     

  14. If any of the third party application's deployments were scaled down, scale those pods back up to their original replica count:
    kubectl scale deploy -n <third party application namespace> <third party application deployment name> --replicas=<original replica count>

     

  15. Please see the Additional Information section below for future considerations.

Additional Information

Broadcom is not responsible for and cannot provide guidance on the configuration of third party applications.

Any issues with webhooks installed by a third party application should be escalated to the third party application owner.

Third party webhooks known to cause workload cluster upgrade issues:

  • Rancher
  • Gatekeeper
  • k8tz
  • Kyverno
  • Dynatrace
  • Linkerd

 

Expected system webhooks in the environment would be related to the CNI or any installed packages (PKGI) in the workload cluster.

For example, the expected system antrea webhooks are:

  • crdvalidator.antrea.io
  • crdmutator.antrea.io

 

Future Considerations

For webhooks that prevent image pulls and pod creation based on given namespaces, exempt (allow) the following namespaces, which are integral to VKS cluster lifecycle events; an illustrative excerpt follows this list:

  • kube-system
  • vmware-system-antrea
  • vmware-system-auth
  • vmware-system-cloud-provider
  • vmware-system-csi
  • tkg-system
  • secretgen-controller
  • vmware-system-supervisor-services
  • The namespace that houses VKS components, which is unique to each environment and can be retrieved with:
     kubectl get ns | grep svc-tkg
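
How namespaces are exempted depends on the third party product, so consult its documentation; many products expose their own namespace allow/deny lists. At the Kubernetes level, such an exemption is typically expressed through the webhook's namespaceSelector. The following is a hypothetical excerpt using the standard admissionregistration.k8s.io/v1 schema; the webhook name is a placeholder, and kubernetes.io/metadata.name is the label Kubernetes automatically applies to every namespace:

webhooks:
- name: <third party webhook name>
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
      - vmware-system-antrea
      - vmware-system-auth
      - vmware-system-cloud-provider
      - vmware-system-csi
      - tkg-system
      - secretgen-controller
      - vmware-system-supervisor-services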