Management Cluster Creation Stuck on the First Control-Plane Node Because the DHCP Server Is Not Configured with Option 3 (Router)

Article ID: 383244

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

  • The Management Cluster creation task is stuck after creating the first control-plane node in vCenter.
  • The first control-plane node gets an IP address assigned to it after powering on.

+++pods/capi-system_capi-controller-manager-6c8f585878-#######_########-#####-####-####-##########/manager/0.log+++

I1114 16:37:40.344850       1 service.go:376] "VM is powered on" controller="vspherevm" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="VSphereVM" VSphereVM="tkg-system/ManagementCluster-controlplane-####" namespace="tkg-system" name="ManagementCluster-controlplane-####" reconcileID="161d1903-####-####-####-da6090a04ec2" Cluster="tkg-system/<Management-Cluster-Name>" VSphereMachine="tkg-system/ManagementCluster-controlplane-####" VSphereCluster="tkg-system/ManagementCluster-Name-####" Machine="tkg-system/ManagementCluster-controlplane-####" KubeadmControlPlane="tkg-system/ManagementCluster-kcp"

  • MachineHealthCheck keeps reporting the control-plane node as unhealthy and deletes it after 20 minutes, because the node failed to report startup (kubelet did not report the node status). See the log excerpt and the kubectl queries below.

+++pods/capi-system_capi-controller-manager-6c8f585878-#######_########-#####-####-####-##########/manager/0.log+++

2024-11-14T16:57:41.753158617Z stderr F I1114 16:57:41.753087       1 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="tkg-system/ManagementCluster-controlplane-####" namespace="tkg-system" name="ManagementCluster-controlplane-####" reconcileID="3c623cd8-####-####-####-ae8d8c59abd7" Cluster="tkg-system/ManagementCluster" target="tkg-system/ManagementCluster-controlplane-####/ManagementCluster-controlplane-####-####/" reason="NodeStartupTimeout" message="Node failed to report startup in 20m0s"
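
  • The remediation activity can also be observed from the bootstrap cluster with standard kubectl queries against the Cluster API resources (generic commands; the tkg-system namespace matches the log excerpts above):

# kubectl get machinehealthchecks -n tkg-system
# kubectl get machines -n tkg-system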

  • The kubelet service is not running on the control-plane node.

# systemctl status kubelet

● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/kubelet.service.d
     Active: activating (auto-restart) (Result: exit-code) since Fri 2024-11-15 19:01:21 UTC; 4s ago
       Docs: https://kubernetes.io/docs/home/
    Process: 1544 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
   Main PID: 1544 (code=exited, status=1/FAILURE)
        CPU: 38ms
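
  • To see why the kubelet service keeps exiting, its journal can be inspected with standard systemd tooling (a generic command, not specific to this article):

# journalctl -xu kubelet --no-pager | tail -n 50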

  • There are no containers running on the control-plane node.
  • kubeadm on the control-plane node is failing to use eth0 to set the advertise address for the API server, since the node's gateway IP address is set to "0.0.0.0" and no default route exists.

+++/var/log/cloud-init-output.log+++


[2024-11-16 23:04:12] W1116 23:04:12.494528     982 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/containerd/containerd.sock". Please update your configuration!
[2024-11-16 23:04:12] W1116 23:04:12.494808     982 common.go:192] WARNING: could not obtain a bind address for the API Server: no default routes found in "/proc/net/route" or "/proc/net/ipv6_route"; using: 0.0.0.0
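
  • This matches what kubeadm reports: it looks for a default route in /proc/net/route, and on an affected node querying for the default route with the standard iproute2 command returns nothing:

# ip route show default
(no output, because no default gateway was assigned by DHCP)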

  • The /var/log/cloud-init-output.log file on the control-plane node shows that the gateway IP address on eth0 was set to 0.0.0.0.

[2024-11-16 23:01:09] ci-info: +++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++
[2024-11-16 23:01:09] ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
[2024-11-16 23:01:09] ci-info: | Device |  Up  |           Address           |       Mask      | Scope  |     Hw-Address    |
[2024-11-16 23:01:09] ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
[2024-11-16 23:01:09] ci-info: |  eth0  | True |         x.x.x.5             | 255.255.255.X.  | global | 00:xx:xx:xx:xx:67 |
[2024-11-16 23:01:09] ci-info: |  eth0  | True | fe80::xxx:xxxx:xxxx:bc67/64 |        .        |  link  | 00:xx:xx:xx:xx:67 |
[2024-11-16 23:01:09] ci-info: |   lo   | True |          127.0.0.1          |    255.0.0.0    |  host  |         .         |
[2024-11-16 23:01:09] ci-info: |   lo   | True |           ::1/128           |        .        |  host  |         .         |
[2024-11-16 23:01:09] ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
[2024-11-16 23:01:09] ci-info: ++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++
[2024-11-16 23:01:09] ci-info: +-------+-------------+---------+-----------------+-----------+-------+
[2024-11-16 23:01:09] ci-info: | Route | Destination | Gateway |     Genmask     | Interface | Flags |
[2024-11-16 23:01:09] ci-info: +-------+-------------+---------+-----------------+-----------+-------+
[2024-11-16 23:01:09] ci-info: |   0   |  xx.xx.0.0  | 0.0.0.0 | 255.255.255.xx  |    eth0   |   U   |
[2024-11-16 23:01:09] ci-info: +-------+-------------+---------+-----------------+-----------+-------+

 

  • Running the following command to display the routing table on the control-plane node confirms that the default gateway on eth0 is set to "0.0.0.0". A routing table from a correctly configured node is shown after the output for comparison.

# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
xx.xx.0.0      0.0.0.0         255.255.255.xx U     1024   0        0 eth0
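
  • For comparison, on a correctly configured node the routing table contains a default route (destination 0.0.0.0 with flags UG) pointing at the gateway; the gateway address below is a placeholder:

# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         xx.xx.0.1       0.0.0.0         UG    1024   0        0 eth0
xx.xx.0.0       0.0.0.0         255.255.255.xx  U     1024   0        0 eth0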

 

Environment

VMware Tanzu Kubernetes Grid Management (TKGm)

Cause

  • The Management Cluster creation is failing because the control-plane node is configured with the default gateway 0.0.0.0.
  • The DHCP server scope for the Management Cluster network is not configured with Option 3 (Router), which is used to set the default gateway on the Management Cluster nodes.
  • The DHCP lease file located on the control-plane node is missing the ROUTER parameter.

# cat /run/systemd/netif/leases/2

# This is private data. Do not parse.
ADDRESS=x.x.x.5
NETMASK=255.255.255.x
SERVER_ADDRESS=10.x.x.6
T1=17018
T2=29781
LIFETIME=34036
DNS=x.x.x.8 x.x.x.9
NTP=x.x.x.1 x.x.x.2 x.x.x.3
HOSTNAME=ManagementCluster-controlplane-#####
CLIENTID=ffb6xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx109
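
  • For comparison, when the DHCP scope does include Option 3 (Router), the lease file also carries a ROUTER entry; the gateway address below is a placeholder:

# This is private data. Do not parse.
ADDRESS=x.x.x.5
NETMASK=255.255.255.x
ROUTER=x.x.x.1
SERVER_ADDRESS=10.x.x.6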

Resolution

  • If the Management Cluster is being deployed using a DHCP server, the Management Cluster scope on the DHCP server needs to be configured with Option 3 (Router) to set the default gateway for the cluster nodes. An example scope configuration is sketched after this list.
  • If the Management Cluster is being created using Node IPAM, the management cluster configuration file needs to be edited to include the MANAGEMENT_NODE_IPAM_IP_POOL_GATEWAY parameter; a sample entry is also sketched below. See Configure Node IPAM in Management Cluster Configuration for vSphere.
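
  • As an illustration only, on an ISC DHCP server (dhcpd) the scope for the Management Cluster network would declare the routers option; all addresses below are placeholders, and other DHCP servers (for example, Windows Server DHCP) expose the same setting as scope Option 003 Router:

subnet xx.xx.0.0 netmask 255.255.255.0 {
    range xx.xx.0.10 xx.xx.0.200;
    option routers xx.xx.0.1;                      # DHCP Option 3: default gateway for the nodes
    option domain-name-servers x.x.x.8, x.x.x.9;
}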
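
  • For the Node IPAM case, a minimal sketch of the relevant entries in the management cluster configuration file, using placeholder addresses (the pool address range and subnet prefix parameters shown alongside the gateway are the companion Node IPAM settings described in the guide referenced above):

MANAGEMENT_NODE_IPAM_IP_POOL_GATEWAY: "xx.xx.0.1"
MANAGEMENT_NODE_IPAM_IP_POOL_ADDRESSES: "xx.xx.0.10-xx.xx.0.200"
MANAGEMENT_NODE_IPAM_IP_POOL_SUBNET_PREFIX: "24"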