Scenario:
Control Plane machine is stuck in Provisioning state after a patching operation.
Example patch operation that could result in this issue:
Patching a KubeadmControlPlane (kcp) object to update the KCP control plane node IP addresses.
The capi-kubeadm-control-plane-controller-manager pod shows errors similar to:
E0227 16:58:26.009281 1 controller.go:324] "Reconciler error" err="failed to sync Machines: failed to update Machine: tkg-system/<MGMT_CLUSTER_NAME>-controlplane-l2dk4-cbgvv: failed to update Machine: failed to apply Machine tkg-system/<MGMT_CLUSTER_NAME>-controlplane-l2dk4-cbgvv: Internal error occurred: failed calling webhook \"default.machine.cluster.x-k8s.io\": failed to call webhook: Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-machine?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="tkg-system/<MGMT_CLUSTER_NAME>-controlplane-l2dk4" namespace="tkg-system" name="<MGMT_CLUSTER_NAME>-controlplane-l2dk4" reconcileID=1dc9dd37-a0b9-463d-b9ff-cf402631603b
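These errors can be viewed in the controller logs, for example (deployment name as in the restart step later in this article):
kubectl -n capi-kubeadm-control-plane-system logs deploy/capi-kubeadm-control-plane-controller-manager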
tanzu management-cluster get
Shows a fourth KCP control plane machine trying to come up
kubectl get machines -n MGMT_CLUSTER_NAMESPACE
Shows a fourth control plane machine stuck in the Provisioning state
When checking the certificates for the capi-system webhook and AVI, you confirm that the Certificate Authority is valid and known by matching the serial numbers of the two certificates.
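One possible way to perform this comparison (the webhook configuration and secret names below are typical defaults for a CAPI installation and may differ in your environment) is to decode the caBundle from the CAPI mutating webhook configuration and the CA certificate stored in the webhook serving secret, then compare the serial numbers reported by openssl:
kubectl get mutatingwebhookconfiguration capi-mutating-webhook-configuration -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -serial -subject
kubectl -n capi-system get secret capi-webhook-service-cert -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -serial -subject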
VMware Tanzu Kubernetes Grid Management (TKGm).
Using an AVI Load Balancer topology.
In some cases, if a control plane node is offline or down for an extended period of time, its DHCP IP address lease can expire. This issue can then surface when you attempt to patch the kcp object, for example by changing an IP address.
This can happen when previous etcd issues have left a control plane node offline for longer than the DHCP lease period.
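A quick way to check for such an IP mismatch (namespace shown as in the error above; adjust for your environment) is to compare the addresses recorded on the Machine objects with the IPs the nodes currently report:
kubectl get machines -n tkg-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[*].address}{"\n"}{end}'
kubectl get nodes -o wide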
Perform a rolling restart of the following deployments on the management cluster:
Namespace: capi-system
Deployment: capi-controller-manager
kubectl -n capi-system rollout restart deploy capi-controller-manager
Namespace: capv-system
Deployment: capv-controller-manager
kubectl -n capv-system rollout restart deploy capv-controller-manager
Namespace: capi-kubeadm-control-plane-system
Deployment: capi-kubeadm-control-plane-controller-manager
kubectl -n capi-kubeadm-control-plane-system rollout restart deploy capi-kubeadm-control-plane-controller-manager
Namespace: capi-kubeadm-bootstrap-system
Deployment: capi-kubeadm-bootstrap-controller-manager
kubectl -n capi-kubeadm-bootstrap-system rollout restart deploy capi-kubeadm-bootstrap-controller-manager
Namespace: tkg-system
Deployment: ALL Deployments
kubectl -n tkg-system rollout restart deploy
NOTE: Because there are several deployments in the tkg-system namespace, you can watch the pods come back up until they all return to the Running state with:
kubectl -n tkg-system get pods -w
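Once the restarts complete, you can also watch the Machine objects to confirm that the previously stuck control plane machine is reconciled and the cluster returns to the expected number of control plane machines, for example:
kubectl get machines -n tkg-system -w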