Supervisor Cluster Unreachable on Port 6443 Due to Expired Kubelet Certificate
Article ID: 415623

Products

VMware vSphere Kubernetes Service

Issue/Introduction

One of the three Supervisor Cluster control plane nodes is in a NotReady state, causing etcd to lose quorum. Containers on the affected node are in an exited state, and a few pods are stuck in a Terminating state. The cluster is unreachable on port 6443, with functionality having ceased approximately two days prior to the report and no known environmental changes. While ping and tracert to the IP address (e.g., 10.x.x.x) succeed, connections to port 6443 fail, as shown by curl output similar to:

curl --insecure https://10.x.x.x:6443/healthz

curl: (28) Failed to connect to 10.x.x.x port 6443 after 21051 ms: Could not connect to server

Errors observed in the journalctl -xeu kubelet logs on the affected node include:

  • E1008 <> bootstrap.go:266] part of the existing bootstrap client certificate in /etc/kubernetes/kubelet.conf is expired: 2025-09-25 21:20:02 +0000 UTC
  • E1008 <> run.go:74] "command failed" err="failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory"
  • kubelet.service: Main process exited, code=exited, status=1/FAILURE
  • kubelet.service: Failed with result 'exit-code'.

Additionally, the output of kubectl get nodes may show a node in a NotReady state:

# kubectl get nodes

NAME     STATUS     ROLES                  AGE    VERSION
test-1   Ready      control-plane,master   571d   v1.25.6+vmware.wcp.2
test-2   NotReady   control-plane,master   571d   v1.25.6+vmware.wcp.2
test-3   Ready      control-plane,master   571d   v1.25.6+vmware.wcp.2

Cause

The underlying cause of the issue is an expired kubelet client certificate that is embedded directly (hard-coded) in the /etc/kubernetes/kubelet.conf file on the affected control plane node, rather than referenced via the rotated certificate file. This prevents the kubelet service from starting and from establishing a connection to the API server on port 6443. The other control plane nodes reference the rotated, valid certificates correctly, indicating a configuration inconsistency on the affected node.
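To confirm this condition, you can decode the certificate embedded in kubelet.conf and print its expiry date. The following is a minimal sketch; the function name `kubeconfig_cert_expiry` is illustrative, and it assumes the certificate is inlined as base64 under `client-certificate-data` (as on the affected node):

```shell
# Print the expiry of the client certificate embedded in a kubeconfig file.
# Hypothetical helper name; assumes an inline client-certificate-data entry.
kubeconfig_cert_expiry() {
  grep 'client-certificate-data' "$1" \
    | awk '{print $2}' \
    | base64 -d \
    | openssl x509 -noout -enddate
}

# Example: kubeconfig_cert_expiry /etc/kubernetes/kubelet.conf
# Prints a line such as: notAfter=Sep 25 21:20:02 2025 GMT
```

On a healthy node, kubelet.conf instead references /var/lib/kubelet/pki/kubelet-client-current.pem, so the grep finds no inline certificate data, which itself is a useful signal that the node is configured correctly.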

Resolution

  1. Ensure SSH access to all control plane VMs of the supervisor.
  2. Copy the /etc/kubernetes/kubelet.conf file from a healthy control plane node (where the certificate is correctly referenced as /var/lib/kubelet/pki/kubelet-client-current.pem) to the affected node. This replaces the embedded, expired certificate with the correct configuration.
  3. Restart the kubelet service on the affected node:
    systemctl restart kubelet.service
  4. Verify the kubelet service is running on the affected node:
    systemctl status kubelet.service
  5. Confirm containers are running and pods are no longer terminating:
    kubectl get pods -A -o wide
    crictl ps -a
  6. Ensure all nodes are in a Ready state:
    kubectl get nodes -o wide
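
The copy-and-restart portion of the steps above can be sketched as a small helper run from a jump host with SSH access to both nodes. The node IPs and the root login are placeholders you must replace for your environment; this is an illustrative sketch, not a supported automation:

```shell
# repair_kubelet_conf HEALTHY_IP AFFECTED_IP
# Backs up the broken kubelet.conf, copies the healthy node's version over,
# then restarts kubelet on the affected node. IPs and root login are
# placeholders -- adjust for your environment.
repair_kubelet_conf() {
  local healthy=$1 affected=$2
  # Back up the existing (broken) kubelet.conf on the affected node.
  ssh root@"$affected" 'cp /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.bak'
  # Stage the healthy node's kubelet.conf locally, then push it to the affected node.
  scp root@"$healthy":/etc/kubernetes/kubelet.conf /tmp/kubelet.conf
  scp /tmp/kubelet.conf root@"$affected":/etc/kubernetes/kubelet.conf
  # Restart kubelet and report whether it came up.
  ssh root@"$affected" 'systemctl restart kubelet.service && systemctl is-active kubelet.service'
}
```

After running the helper, continue with steps 5 and 6 (kubectl get pods -A -o wide, crictl ps -a, kubectl get nodes -o wide) to confirm the node returns to Ready.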

Additional Information