One of the three Supervisor Cluster control plane nodes is in a NotReady state, leaving etcd with only two of its three members and putting quorum at risk. Containers on the affected node are in an Exited state, and a few pods are stuck in a Terminating state.
Error shown on the Supervisor tab in the vCenter UI:
Cluster test is unhealthy:
Get "http://localhost:1080/external-cert/<supervior clone plane ip>/6443/version?timeout=2m0s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Output of kubectl get nodes:
root@test-1 [ ~ ]# kubectl get nodes
NAME     STATUS     ROLES                  AGE    VERSION
test-1   Ready      control-plane,master   571d   v1.25.6+vmware.wcp.2
test-2   NotReady   control-plane,master   571d   v1.25.6+vmware.wcp.2
test-3   Ready      control-plane,master   571d   v1.25.6+vmware.wcp.2
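To see why the node is reporting NotReady, describe it and review the Conditions section (test-2 is the affected node in this example):
kubectl describe node test-2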
SSH to the affected Supervisor control plane node and check the kubelet log:
journalctl -xeu kubelet
kubelet[34395]: E0528 hh:mm:ss.ss 34395 server.go:425] "Client rotation is on, will bootstrap in background"
kubelet[34395]: E0528 hh:mm:ss.ss 34395 bootstrap.go:265] "Client rotation is on, will bootstrap in background"
kubelet[34395]: E0528 hh:mm:ss.ss 34395 run.go:74] "command failed" err="FAILED_TO_RUN_KUBELET: unable to load bootstrap kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory"
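The last error indicates the kubelet cannot load its kubeconfig. As a quick check, confirm whether the kubeconfig and the kubelet client certificates exist on the node (the paths below are the defaults referenced in the log and in the resolution steps):
ls -l /etc/kubernetes/kubelet.conf /var/lib/kubelet/pki/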
The node is NotReady because the kubelet service fails to start: the client certificate referenced in /etc/kubernetes/kubelet.conf has expired, or the kubeconfig itself is missing or misconfigured, as the kubelet error above indicates. Without a valid client certificate the kubelet cannot authenticate to the API server, so containers on the node exit, pods hang in Terminating, and etcd is left with only two of its three members, putting quorum at risk.
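To gauge the impact on etcd, check the etcd pods from a healthy control plane node (an illustrative check; etcd runs as static pods in the kube-system namespace on kubeadm-style control planes such as the Supervisor):
kubectl get pods -n kube-system -o wide | grep etcd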
1. Ensure SSH access to all Supervisor control plane VMs.
2. Decode the certificate referenced by client-certificate in /etc/kubernetes/kubelet.conf and check whether it is still valid (see the example after this list).
3. If the certificate is misconfigured or no longer valid, follow the steps below.
4. Update /etc/kubernetes/kubelet.conf to match the structure below, ensuring the certificate paths point to a valid certificate at /var/lib/kubelet/pki/kubelet-client-current.pem:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [REDACTED]
    server: https://xx.xxx.xxx.xxx:6443
  name: workload-slot1rp11
contexts:
- context:
    cluster: workload-slot1rp11
    user: system:node:workload-slot1rp11-controlplane-48jpz-69fwc
  name: system:node:workload-slot1rp11-controlplane-48jpz-69fwc@workload-slot1rp11
current-context: system:node:workload-slot1rp11-controlplane-48jpz-69fwc@workload-slot1rp11
kind: Config
preferences: {}
users:
- name: system:node:workload-slot1rp11-controlplane-48jpz-69fwc
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
5. Restart the kubelet service:
systemctl restart kubelet.service
6. Verify the kubelet is running:
systemctl status kubelet.service
7. Confirm containers are running and pods are no longer terminating:
kubectl get pods -A -o wide
crictl ps -a
8. Ensure all nodes are in a Ready state:
kubectl get nodes -o wide
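Example certificate check for step 2 (a minimal sketch assuming client-certificate points to /var/lib/kubelet/pki/kubelet-client-current.pem; adjust the path to match what your kubelet.conf references):
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -enddate
If the kubeconfig embeds the certificate as client-certificate-data instead of a file path, decode it before inspecting:
grep client-certificate-data /etc/kubernetes/kubelet.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -subject -enddate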