TKG - Workload Cluster Creation stalled during First Control-Plane node deployment

Article ID: 377079


Updated On:

Products

Tanzu Kubernetes Grid

Issue/Introduction

During the deployment of a new Workload Cluster, the first control-plane node comes up and receives an IP address; however, the cluster creation process stalls with errors such as the following.

 

containerd in the Workload Cluster Control-plane node

ssh capv@${CONTROL_PLANE_NODE_IPADDR}
sudo journalctl -u containerd
#> Nov 12 06:54:42 ${CLUSTER_NAME}-controlplane-###-### containerd[670]: time="2024-11-12T06:54:42.834559545Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
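The containerd error above points at an empty /etc/cni/net.d directory. A quick way to confirm on the node that no CNI configuration has been delivered yet is a sketch like the following; the helper function name is hypothetical, but the directory path and file extensions come from the CNI convention (*.conf, *.conflist, *.json):

```shell
# Hypothetical helper: succeeds if the CNI config directory contains at
# least one configuration file. On a healthy node this directory holds a
# config written by the CNI package (e.g. Antrea or Calico); while the
# cluster is stalled it is empty, matching the containerd error above.
cni_config_present() {
  dir="${1:-/etc/cni/net.d}"
  # CNI configs are *.conf, *.conflist, or *.json files
  [ -n "$(find "$dir" -maxdepth 1 \( -name '*.conf*' -o -name '*.json' \) 2>/dev/null | head -n 1)" ]
}

cni_config_present && echo "CNI config found" || echo "CNI config missing"
```

An empty directory here is expected in this failure mode: the CNI package is installed by kapp-controller from the Management Cluster, which is exactly the path that is failing.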

 

kubelet in the Workload Cluster Control-plane node

ssh capv@${CONTROL_PLANE_NODE_IPADDR}
sudo journalctl -u kubelet
#> Nov 12 05:50:43 ${CLUSTER_NAME}-controlplane-###-#### kubelet[1567]: E1120 05:50:43.564108    1567 kubelet.go:2855] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

 

kapp-controller pod in the Management Cluster

kubectl -n ${CLUSTER_NAMESPACE} describe pkgi ${CLUSTER_NAME}-kapp-controller
#> Status:
#>   Conditions:
#>     Message:            Error (see .status.usefulErrorMessage for details)
#>     Status:                True
#>     Type:                  ReconcileFailed
#>   Friendly Description:    Reconcile failed: Error (see .status.usefulErrorMessage for details)
#>   ...
#>   Useful Error Message:    kapp: Error: Getting app:
#>   Get "https://${CLUSTER_ENDPOINT_IPADDRESS}:6443/api/v1/namespaces/default/configmaps/${CLUSTER_NAME}-kapp-controller.app.apps.k14s.io": dial tcp ${CLUSTER_ENDPOINT_IPADDRESS}:6443: i/o timeout

 

This issue often occurs when the Management and Workload Clusters are on different subnets with strict firewall access controls between them.

Environment

Tanzu Kubernetes Grid 2.x

Cause

There can be numerous causes. One common possibility is that the Management Cluster cannot reach the new Workload Cluster's API endpoint IP address.

Check network reachability from Management Cluster to Workload Cluster API endpoint

ssh capv@${MANAGEMENT_CLUSTER_CONTROL_PLANE_IPADDRESS}
curl -vk https://${WORKLOAD_CLUSTER_ENDPOINT_IPADDRESS}:6443
* Trying ${WORKLOAD_CLUSTER_ENDPOINT_IPADDRESS}:6443...


(The connection hangs and eventually times out: there is no network reachability, for example due to a firewall between the subnets.)

 

Resolution

Review the access control settings of the firewall or gateway to restore network reachability for the relevant traffic path:

  • Case-1: From Management Cluster network to NSX-ALB VIP network
  • Case-2: From NSX-ALB VIP network to Workload Cluster network
  • Case-3: From Management Cluster network to Workload Cluster network when using kube-vip (VSPHERE_CONTROL_PLANE_ENDPOINT)
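After adjusting the firewall or gateway rules, the reachability check from the Cause section can be repeated. A minimal sketch, wrapping the same curl test in a helper (the function name and the short connect timeout are assumptions, not part of the product; substitute your cluster's endpoint VIP):

```shell
# Hypothetical helper: verify TCP/TLS reachability of a cluster API
# endpoint on port 6443 after fixing firewall rules.
check_endpoint() {
  # -k: the API server presents a certificate not trusted by this host
  # --connect-timeout 5: fail fast instead of hanging like the symptom above
  if curl -sk --connect-timeout 5 -o /dev/null "https://$1:6443/version"; then
    echo "reachable: $1"
  else
    echo "NOT reachable: $1"
  fi
}

# Run from a Management Cluster control-plane node:
check_endpoint "${WORKLOAD_CLUSTER_ENDPOINT_IPADDRESS}"
```

Once the endpoint is reachable, kapp-controller on the Management Cluster should reconcile the pending PackageInstalls and the stalled cluster creation should proceed.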