Supervisor controlPlane Node NotReady Error "unable to load bootstrap kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory"

Article ID: 399217


Products

VMware vSphere Kubernetes Service

Issue/Introduction

One of the three Supervisor Cluster control plane nodes is in a NotReady state, causing etcd to lose quorum. Containers on the affected node are in an exited state, and some pods are stuck in a Terminating state.

Error from the Supervisor tab in the vCenter UI:

Cluster test is unhealthy:
Get "http://localhost:1080/external-cert/<supervior clone plane ip>/6443/version?timeout=2m0s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The output of kubectl get nodes shows the following:

root@test-1 [ ~ ]# kubectl get nodes
NAME     STATUS     ROLES                   AGE   VERSION
test-1   Ready      control-plane,master    571d  v1.25.6+vmware.wcp.2
test-2   NotReady   control-plane,master    571d  v1.25.6+vmware.wcp.2
test-3   Ready      control-plane,master    571d  v1.25.6+vmware.wcp.2

SSH to the affected Supervisor control plane node and check the kubelet log:

journalctl -xeu kubelet
kubelet[34395]: E0528 hh:mm:ss.ss  34395 server.go:425] "Client rotation is on, will bootstrap in background"
kubelet[34395]: E0528 hh:mm:ss.ss  34395 bootstrap.go:265] "Client rotation is on, will bootstrap in background"
kubelet[34395]: E0528 hh:mm:ss.ss  34395 run.go:74] "command failed" err="FAILED_TO_RUN_KUBELET: unable to load bootstrap kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory"

Environment

  • VMware vSphere Kubernetes Service
  • vSphere with Tanzu 8.x 
     

Cause

The node is NotReady because the kubelet service failed to start due to an expired certificate in /etc/kubernetes/kubelet.conf. This prevents the kubelet from connecting to the API server, causing containers to exit, pods to terminate, and etcd to lose quorum since only two of the three nodes are operational.
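
To confirm the expired certificate on the affected node, inspect the client certificate that kubelet.conf embeds or references. The commands below are a minimal check assuming the standard kubelet file locations; adjust the paths if they differ on the node.

# If kubelet.conf still exists and embeds the client certificate, decode it and print its validity dates
grep 'client-certificate-data' /etc/kubernetes/kubelet.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -dates

# Check the rotated kubelet client certificate on disk (standard kubelet PKI path)
openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem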

Resolution

1. Ensure SSH access to all control plane VMs of the Supervisor Cluster.
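
If SSH credentials for the Supervisor control plane VMs are not already at hand, the root password can typically be retrieved from the vCenter Server Appliance shell; the script path below is the usual location but may vary by vCenter version.

# Run on the vCenter Server Appliance shell; prints the Supervisor control plane IP and root password
/usr/lib/vmware-wcp/decryptK8Pwd.py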

2. Update /etc/kubernetes/kubelet.conf to match the structure below, ensuring the certificate paths point to a valid certificate at /var/lib/kubelet/pki/kubelet-client-current.pem

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [REDACTED]
    server: https://xx.xxx.xxx.xxx:6443
  name: workload-slot1rp11
contexts:
- context:
    cluster: workload-slot1rp11
    user: system:node:workload-slot1rp11-controlplane-48jpz-69fwc
  name: system:node:workload-slot1rp11-controlplane-48jpz-69fwc@workload-slot1rp11
current-context: system:node:workload-slot1rp11-controlplane-48jpz-69fwc@workload-slot1rp11
kind: Config
preferences: {}
users:
- name: system:node:workload-slot1rp11-controlplane-48jpz-69fwc
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
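
Before restarting the kubelet, it is worth confirming that the updated kubeconfig parses and that the certificate it points to has not expired. This is a quick sanity check using the same path as above.

# Confirm the updated kubeconfig is valid and readable by kubectl
kubectl --kubeconfig=/etc/kubernetes/kubelet.conf config view

# Confirm the referenced client certificate exists and is still within its validity period
openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem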

 

3. Restart the kubelet service:
systemctl restart kubelet.service

4. Verify the kubelet is running:
systemctl status kubelet.service

5. Confirm containers are running and pods are no longer terminating:
kubectl get pods -o wide
crictl ps -a

6. Check that etcd quorum is restored, following the related KB article.
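
As a quick check from a control plane node, etcd health and membership can be queried with etcdctl. The sketch below assumes etcdctl is available on the node and uses typical kubeadm-style certificate paths, which may differ on a Supervisor control plane VM; adjust them to match the files present under /etc/kubernetes/pki/etcd.

# Check health of all etcd members (adjust certificate paths if they differ on your node)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

# List members and confirm all three report as started
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table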

7. Ensure all nodes are in a Ready state:
kubectl get nodes -o wide