You’ll notice that once you initiate the cluster creation, it gets stuck on the first control plane node. When you SSH into that node (ssh capv@<node-ip>), you’ll see that the following components are in a running state: kube-proxy, kube-vip, kube-controller-manager, etcd, kube-apiserver, and kube-scheduler:
CONTAINER       IMAGE           CREATED         STATE     NAME                      ATTEMPT
c92f7b6a3bd01   2f7e1c45a1b8f   8 minutes ago   Running   kube-proxy                0
e581c43a7fa9d   a6b4c83219ee7   9 minutes ago   Running   kube-vip                  0
9ae04d63c2d67   fdc31eab2481c   9 minutes ago   Running   kube-controller-manager   0
1347c8e83d1f2   b9fe2019d61ab   9 minutes ago   Running   etcd                      0
f20ad46eb8c78   67cd2198b44d3   9 minutes ago   Running   kube-apiserver            0
db519adfa2b7e   4f2a97ed6cc90   9 minutes ago   Running   kube-scheduler            0
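To reproduce this container list or dig into a specific component, you can use crictl on the node; a quick sketch, assuming crictl is available on the node image (as it is on standard TKGm nodes), with <container-id> taken from the first column above:
sudo crictl ps
sudo crictl logs <container-id>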
When examining the kubelet on the control plane node, you’ll see that the service is running (systemctl status kubelet.service), but the logs continuously show that the CNI is not initialised:
journalctl -xeu kubelet
Jan 01 12:00:00 workload-cluster kubelet[1677]: E0101 12:00:00.000000 1677 kubelet.go:2855] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 01 12:00:00 workload-cluster kubelet[1677]: E0101 12:00:00.000000 1677 kubelet.go:2855] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
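This error generally means that no CNI configuration has been written to the node yet; a quick check, assuming the standard CNI paths, is to look at the config and binary directories (on a stuck node, /etc/cni/net.d is typically empty or missing):
ls /etc/cni/net.d/
ls /opt/cni/bin/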
TKGm 2.4.0+
A common cause of this issue is that the Cluster API components (CAPI and CAPV) are unable to communicate with the new workload cluster. This can be confirmed by checking the CAPI logs:
E0101 12:00:00.000000 1 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"default/workload-cluster\": error creating client for remote cluster \"default/workload-cluster\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://198.51.100.1:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/workload-cluster-controlplane-v78ki" namespace="default" name="workload-cluster-controlplane-v78ki" reconcileID=""
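These controller logs can be pulled from the management cluster context with kubectl; a minimal sketch, assuming the default provider namespaces and deployment names used by Cluster API and the vSphere provider:
kubectl logs -n capi-system deployment/capi-controller-manager --tail=100
kubectl logs -n capv-system deployment/capv-controller-manager --tail=100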
CAPI is responsible for provisioning and managing the lifecycle of Kubernetes clusters. If CAPI on the management cluster cannot reach the API server of the new workload cluster, it cannot complete its setup tasks, including initialising the CNI. In the above case, a firewall rule was blocking communication to the workload cluster on port 6443.
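You can test reachability of the workload cluster’s API endpoint from the management cluster network; a minimal sketch, assuming 198.51.100.1 is the control plane endpoint shown in the error above:
curl -vk --connect-timeout 10 https://198.51.100.1:6443/version
nc -zv -w 10 198.51.100.1 6443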
The blocked port prevented the workload cluster from registering with the management cluster; once it was allowed through the firewall, the cluster deployed without issue. If you have a similar problem, review the documents below to ensure the networking configuration is correct:
If applicable, carefully review the proxy settings and any other networking configuration specified in the cluster configuration file used during creation (see the example proxy settings below):
Verify the overall network setup in the environment to ensure proper connectivity between the management cluster and the workload cluster:
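For reference, here is a minimal sketch of the proxy-related settings in a TKGm cluster configuration file; the values are placeholders, and TKG_NO_PROXY in particular should include the workload cluster endpoint and node/pod/service networks so that traffic to the cluster is not routed through the proxy:
TKG_HTTP_PROXY_ENABLED: "true"
TKG_HTTP_PROXY: http://proxy.example.com:3128
TKG_HTTPS_PROXY: http://proxy.example.com:3128
TKG_NO_PROXY: 10.0.0.0/8,192.168.0.0/16,.svc,.svc.cluster.local,198.51.100.1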