One of the three Supervisor Cluster control plane nodes is in a NotReady state, leaving etcd with only two of its three members and putting quorum at risk. Containers on the affected node are in an Exited state, and a few pods are stuck in a Terminating state.
Error shown on the Supervisor tab in the vCenter UI:
Cluster test is unhealthy:
Get "http://localhost:1080/external-cert/<supervior clone plane ip>/6443/version?timeout=2m0s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Output of kubectl get nodes:
root@test-1 [ ~ ]# kubectl get nodes
NAME     STATUS     ROLES                  AGE    VERSION
test-1   Ready      control-plane,master   571d   v1.25.6+vmware.wcp.2
test-2   NotReady   control-plane,master   571d   v1.25.6+vmware.wcp.2
test-3   Ready      control-plane,master   571d   v1.25.6+vmware.wcp.2
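To see why the node is reporting NotReady, describe it and review the Conditions section (test-2 is the affected node in this example):
kubectl describe node test-2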
SSH to the affected Supervisor control plane node and check the kubelet log:
journalctl -xeu kubelet
kubelet[34395]: E0528 hh:mm:ss.ss 34395 server.go:425] "Client rotation is on, will bootstrap in background"
kubelet[34395]: E0528 hh:mm:ss.ss 34395 bootstrap.go:265] "Client rotation is on, will bootstrap in background"
kubelet[34395]: E0528 hh:mm:ss.ss 34395 run.go:74] "command failed" err="FAILED_TO_RUN_KUBELET: unable to load bootstrap kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory"
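The last error indicates the kubelet cannot load its kubeconfig. As a quick check, confirm whether the kubeconfig and the kubelet client certificates exist on the node (the paths below are the defaults referenced in the log and in the resolution steps):
ls -l /etc/kubernetes/kubelet.conf /var/lib/kubelet/pki/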
The node is NotReady because the kubelet service fails to start: the client certificate referenced in /etc/kubernetes/kubelet.conf has expired, or the kubeconfig itself is missing or misconfigured, as the kubelet error above indicates. Without a valid client certificate the kubelet cannot authenticate to the API server, so containers on the node exit, pods hang in Terminating, and etcd is left with only two of its three members, putting quorum at risk.
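To gauge the impact on etcd, check the etcd pods from a healthy control plane node (an illustrative check; etcd runs as static pods in the kube-system namespace on kubeadm-style control planes such as the Supervisor):
kubectl get pods -n kube-system -o wide | grep etcd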
1. Ensure SSH access to all Supervisor control plane VMs.
2. Decode the certificate referenced by client-certificate in /etc/kubernetes/kubelet.conf and check whether it is still valid (see the example after this list).
3. If the certificate is misconfigured or no longer valid, follow the steps below.
4. Update /etc/kubernetes/kubelet.conf to match the structure below, ensuring the certificate paths point to a valid certificate at /var/lib/kubelet/pki/kubelet-client-current.pem:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [REDACTED]
    server: https://xx.xxx.xxx.xxx:6443
  name: workload-slot1rp11
contexts:
- context:
    cluster: workload-slot1rp11
    user: system:node:workload-slot1rp11-controlplane-48jpz-69fwc
  name: system:node:workload-slot1rp11-controlplane-48jpz-69fwc@workload-slot1rp11
current-context: system:node:workload-slot1rp11-controlplane-48jpz-69fwc@workload-slot1rp11
kind: Config
preferences: {}
users:
- name: system:node:workload-slot1rp11-controlplane-48jpz-69fwc
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
5. Restart the kubelet service:
systemctl restart kubelet.service
6. Verify the kubelet is running:
systemctl status kubelet.service
7. Confirm containers are running and pods are no longer terminating:
kubectl get pods -A -o wide
crictl ps -a
8. Ensure all nodes are in a Ready state:
kubectl get nodes -o wide
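Example certificate check for step 2 (a minimal sketch assuming client-certificate points to /var/lib/kubelet/pki/kubelet-client-current.pem; adjust the path to match what your kubelet.conf references):
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -enddate
If the kubeconfig embeds the certificate as client-certificate-data instead of a file path, decode it before inspecting:
grep client-certificate-data /etc/kubernetes/kubelet.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -subject -enddate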