Management Cluster Creation Stuck on the First Control-Plane Node Because the DHCP Server Is Not Configured with Option 3 (Router)

Article ID: 383244

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

  • The Management Cluster creation task is stuck after creating the first control-plane node in vCenter.
  • The first control-plane node gets an IP address assigned to it after powering on.

+++pods/capi-system_capi-controller-manager-6c8f585878-#######_########-#####-####-####-##########/manager/0.log+++

I1114 16:37:40.344850       1 service.go:376] "VM is powered on" controller="vspherevm" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="VSphereVM" VSphereVM="tkg-system/ManagementCluster-controlplane-####" namespace="tkg-system" name="ManagementCluster-controlplane-####" reconcileID="161d1903-####-####-####-da6090a04ec2" Cluster="tkg-system/<Management-Cluster-Name>" VSphereMachine="tkg-system/ManagementCluster-controlplane-####" VSphereCluster="tkg-system/ManagementCluster-Name-####" Machine="tkg-system/ManagementCluster-controlplane-####" KubeadmControlPlane="tkg-system/ManagementCluster-kcp"

  • MachineHealthCheck keeps reporting the control-plane node as unhealthy and deletes it after 20 minutes, because the node failed to report startup (kubelet did not report the node status). See the log excerpt and the kubectl queries below.

+++pods/capi-system_capi-controller-manager-6c8f585878-#######_########-#####-####-####-##########/manager/0.log+++

2024-11-14T16:57:41.753158617Z stderr F I1114 16:57:41.753087       1 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="tkg-system/ManagementCluster-controlplane-####" namespace="tkg-system" name="ManagementCluster-controlplane-####" reconcileID="3c623cd8-####-####-####-ae8d8c59abd7" Cluster="tkg-system/ManagementCluster" target="tkg-system/ManagementCluster-controlplane-####/ManagementCluster-controlplane-####-####/" reason="NodeStartupTimeout" message="Node failed to report startup in 20m0s"
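
  • The remediation activity can also be observed from the bootstrap cluster with standard kubectl queries against the Cluster API resources (generic commands; the tkg-system namespace matches the log excerpts above):

# kubectl get machinehealthchecks -n tkg-system
# kubectl get machines -n tkg-system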

  • The kubelet service is not running on the control-plane node.

# systemctl status kubelet

● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/kubelet.service.d
     Active: activating (auto-restart) (Result: exit-code) since Fri 2024-11-15 19:01:21 UTC; 4s ago
       Docs: https://kubernetes.io/docs/home/
    Process: 1544 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
   Main PID: 1544 (code=exited, status=1/FAILURE)
        CPU: 38ms
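
  • To see why the kubelet service keeps exiting, its journal can be inspected with standard systemd tooling (a generic command, not specific to this article):

# journalctl -xu kubelet --no-pager | tail -n 50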

  • There are no containers running on the control-plane node.
  • kubeadm on the control-plane node is failing to use eth0 to set the advertise address for the API server, since the node's gateway IP address is set to "0.0.0.0" and no default route exists.

+++/var/log/cloud-init-output.log+++


[2024-11-16 23:04:12] W1116 23:04:12.494528     982 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/containerd/containerd.sock". Please update your configuration!
[2024-11-16 23:04:12] W1116 23:04:12.494808     982 common.go:192] WARNING: could not obtain a bind address for the API Server: no default routes found in "/proc/net/route" or "/proc/net/ipv6_route"; using: 0.0.0.0
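
  • This matches what kubeadm reports: it looks for a default route in /proc/net/route, and on an affected node querying for the default route with the standard iproute2 command returns nothing:

# ip route show default
(no output, because no default gateway was assigned by DHCP)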

  • The /var/log/cloud-init-output.log file on the control-plane node shows that the gateway IP address on eth0 was set to 0.0.0.0.

[2024-11-16 23:01:09] ci-info: +++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++
[2024-11-16 23:01:09] ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
[2024-11-16 23:01:09] ci-info: | Device |  Up  |           Address           |       Mask      | Scope  |     Hw-Address    |
[2024-11-16 23:01:09] ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
[2024-11-16 23:01:09] ci-info: |  eth0  | True |         x.x.x.5             | 255.255.255.X.  | global | 00:xx:xx:xx:xx:67 |
[2024-11-16 23:01:09] ci-info: |  eth0  | True | fe80::xxx:xxxx:xxxx:bc67/64 |        .        |  link  | 00:xx:xx:xx:xx:67 |
[2024-11-16 23:01:09] ci-info: |   lo   | True |          127.0.0.1          |    255.0.0.0    |  host  |         .         |
[2024-11-16 23:01:09] ci-info: |   lo   | True |           ::1/128           |        .        |  host  |         .         |
[2024-11-16 23:01:09] ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
[2024-11-16 23:01:09] ci-info: ++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++
[2024-11-16 23:01:09] ci-info: +-------+-------------+---------+-----------------+-----------+-------+
[2024-11-16 23:01:09] ci-info: | Route | Destination | Gateway |     Genmask     | Interface | Flags |
[2024-11-16 23:01:09] ci-info: +-------+-------------+---------+-----------------+-----------+-------+
[2024-11-16 23:01:09] ci-info: |   0   |  xx.xx.0.0  | 0.0.0.0 | 255.255.255.xx  |    eth0   |   U   |
[2024-11-16 23:01:09] ci-info: +-------+-------------+---------+-----------------+-----------+-------+

 

  • Running the following command to display the routing table on the control-plane node confirms that the default gateway on eth0 is set to "0.0.0.0". A routing table from a correctly configured node is shown after the output for comparison.

# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
xx.xx.0.0      0.0.0.0         255.255.255.xx U     1024   0        0 eth0
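
  • For comparison, on a correctly configured node the routing table contains a default route (destination 0.0.0.0 with flags UG) pointing at the gateway; the gateway address below is a placeholder:

# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         xx.xx.0.1       0.0.0.0         UG    1024   0        0 eth0
xx.xx.0.0       0.0.0.0         255.255.255.xx  U     1024   0        0 eth0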

 

Environment

VMware Tanzu Kubernetes Grid Management (TKGm)

Cause

  • The Management Cluster creation is failing because the control-plane node is configured with the default gateway 0.0.0.0.
  • The DHCP server scope for the Management Cluster network is not configured with Option 3 (Router), which is used to set the default gateway on the Management Cluster nodes.
  • The DHCP lease file located on the control-plane node is missing the ROUTER parameter.

# cat /run/systemd/netif/leases/2

# This is private data. Do not parse.
ADDRESS=x.x.x.5
NETMASK=255.255.255.x
SERVER_ADDRESS=10.x.x.6
T1=17018
T2=29781
LIFETIME=34036
DNS=x.x.x.8 x.x.x.9
NTP=x.x.x.1 x.x.x.2 x.x.x.3
HOSTNAME=ManagementCluster-controlplane-#####
CLIENTID=ffb6xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx109
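
  • For comparison, when the DHCP scope does include Option 3 (Router), the lease file also carries a ROUTER entry; the gateway address below is a placeholder:

# This is private data. Do not parse.
ADDRESS=x.x.x.5
NETMASK=255.255.255.x
ROUTER=x.x.x.1
SERVER_ADDRESS=10.x.x.6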

Resolution

  • If the Management Cluster is being deployed using a DHCP server, the Management Cluster scope on the DHCP server needs to be configured with Option 3 (Router) to set the default gateway for the cluster nodes. An example scope configuration is sketched after this list.
  • If the Management Cluster is being created using Node IPAM, the management cluster configuration file needs to be edited to include the MANAGEMENT_NODE_IPAM_IP_POOL_GATEWAY parameter; a sample entry is also sketched below. See Configure Node IPAM in Management Cluster Configuration for vSphere.
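
  • As an illustration only, on an ISC DHCP server (dhcpd) the scope for the Management Cluster network would declare the routers option; all addresses below are placeholders, and other DHCP servers (for example, Windows Server DHCP) expose the same setting as scope Option 003 Router:

subnet xx.xx.0.0 netmask 255.255.255.0 {
    range xx.xx.0.10 xx.xx.0.200;
    option routers xx.xx.0.1;                      # DHCP Option 3: default gateway for the nodes
    option domain-name-servers x.x.x.8, x.x.x.9;
}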
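
  • For the Node IPAM case, a minimal sketch of the relevant entries in the management cluster configuration file, using placeholder addresses (the pool address range and subnet prefix parameters shown alongside the gateway are the companion Node IPAM settings described in the guide referenced above):

MANAGEMENT_NODE_IPAM_IP_POOL_GATEWAY: "xx.xx.0.1"
MANAGEMENT_NODE_IPAM_IP_POOL_ADDRESSES: "xx.xx.0.10-xx.xx.0.200"
MANAGEMENT_NODE_IPAM_IP_POOL_SUBNET_PREFIX: "24"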