Scenario:
Control Plane machine is stuck in Provisioning state after a patching operation.
Example patch operation that could result in this issue:
Patching a KubeadmControlPlane (kcp) object to update the KCP control plane node IP addresses.
The capi-kubeadm-control-plane-controller-manager pod shows errors similar to:
E0227 16:58:26.009281 1 controller.go:324] "Reconciler error" err="failed to sync Machines: failed to update Machine: tkg-system/<MGMT_CLUSTER_NAME>-controlplane-l2dk4-cbgvv: failed to update Machine: failed to apply Machine tkg-system/<MGMT_CLUSTER_NAME>-controlplane-l2dk4-cbgvv: Internal error occurred: failed calling webhook \"default.machine.cluster.x-k8s.io\": failed to call webhook: Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-machine?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="tkg-system/<MGMT_CLUSTER_NAME>-controlplane-l2dk4" namespace="tkg-system" name="<MGMT_CLUSTER_NAME>-controlplane-l2dk4" reconcileID=1dc9dd37-a0b9-463d-b9ff-cf402631603b
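These errors can be viewed in the controller logs, for example (deployment name as in the restart step later in this article):
kubectl -n capi-kubeadm-control-plane-system logs deploy/capi-kubeadm-control-plane-controller-manager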
tanzu management-cluster get
Shows a fourth KCP control plane machine trying to come up
kubectl get machines -n MGMT_CLUSTER_NAMESPACE
Shows a fourth control plane machine stuck in the Provisioning state
When checking the certificates for the capi-system webhook and AVI, you confirm that the Certificate Authority is valid and known by matching the serial numbers of the two certificates.
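One possible way to perform this comparison (the webhook configuration and secret names below are typical defaults for a CAPI installation and may differ in your environment) is to decode the caBundle from the CAPI mutating webhook configuration and the CA certificate stored in the webhook serving secret, then compare the serial numbers reported by openssl:
kubectl get mutatingwebhookconfiguration capi-mutating-webhook-configuration -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -serial -subject
kubectl -n capi-system get secret capi-webhook-service-cert -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -serial -subject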
VMware Tanzu Kubernetes Grid Management (TKGm).
Using an AVI Load Balancer topology.
In some cases, if a control plane node is offline or down for an extended period of time, its DHCP IP address lease can expire. This issue can then surface when you attempt to patch the kcp object, for example by changing an IP address.
This can happen when previous etcd issues have left a control plane node offline for longer than the DHCP lease period.
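A quick way to check for such an IP mismatch (namespace shown as in the error above; adjust for your environment) is to compare the addresses recorded on the Machine objects with the IPs the nodes currently report:
kubectl get machines -n tkg-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[*].address}{"\n"}{end}'
kubectl get nodes -o wide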
Perform a rolling restart of the following deployments on the management cluster:
Namespace: capi-system
Deployment: capi-controller-manager
kubectl -n capi-system rollout restart deploy capi-controller-manager
Namespace: capv-system
Deployment: capv-controller-manager
kubectl -n capv-system rollout restart deploy capv-controller-manager
Namespace: capi-kubeadm-control-plane-system
Deployment: capi-kubeadm-control-plane-controller-manager
kubectl -n capi-kubeadm-control-plane-system rollout restart deploy capi-kubeadm-control-plane-controller-manager
Namespace: capi-kubeadm-bootstrap-system
Deployment: capi-kubeadm-bootstrap-controller-manager
kubectl -n capi-kubeadm-bootstrap-system rollout restart deploy capi-kubeadm-bootstrap-controller-manager
Namespace: tkg-system
Deployment: ALL Deployments
kubectl -n tkg-system rollout restart deploy
NOTE: Because there are several deployments in the tkg-system namespace, you can watch the pods come back up until they all return to the Running state with:
kubectl -n tkg-system get pods -w
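Once the restarts complete, you can also watch the Machine objects to confirm that the previously stuck control plane machine is reconciled and the cluster returns to the expected number of control plane machines, for example:
kubectl get machines -n tkg-system -w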