After approximately one year of operation, a user may be unable to connect to their workload clusters using kubectl.
The error typically indicates certificate expiration, preventing authentication to the kube-apiserver. This issue occurs when automatic certificate rotation is not enabled for the cluster.
- kubectl commands fail with certificate expiry errors
- Cannot access Kubernetes API after ~365 days of cluster operation
- Error message: "certificate has expired or is not yet valid"
- Database workloads remain running but cluster management is impossible
Data Service Manager (all versions) using DSM-Managed Infrastructure Policies
Kubernetes cluster certificates have a default validity period of 365 days (1 year). When these certificates expire:
1. The client certificates embedded in kubeconfig files become invalid
2. API server certificates expire, preventing secure communication
3. Controller manager, scheduler, and etcd certificates expire
Without automatic certificate rotation enabled, these certificates must be manually renewed before expiration to maintain cluster accessibility.
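Under the hood, `kubeadm certs check-expiration` reads the notAfter field of each certificate. The sketch below illustrates the same check with openssl against a throwaway self-signed certificate so it runs anywhere; on a real control plane node you would point openssl at /etc/kubernetes/pki/apiserver.crt instead:

```shell
# Generate a throwaway key + self-signed cert valid for 365 days (illustration only;
# on a control plane node, inspect /etc/kubernetes/pki/apiserver.crt instead)
openssl req -x509 -newkey rsa:2048 -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 365 -nodes -subj "/CN=demo" 2>/dev/null

# Print the expiry date -- the same value kubeadm reports per certificate
openssl x509 -enddate -noout -in /tmp/demo.crt

# Exit non-zero if the cert expires within 30 days (2592000 seconds)
openssl x509 -checkend 2592000 -noout -in /tmp/demo.crt && echo "OK: more than 30 days left"
```

The `-checkend` form is convenient for alerting well before the 365-day window closes.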
Prerequisites
- SSH access to control plane nodes and provider VM
- Backup of important data (recommended)
- List of control plane node IPs (1 or 3 nodes depending on cluster topology), which can be fetched from vCenter
Step 1: Verify Certificate Status
To SSH to a control plane node, the first step is to find its IP address.
Fetch the SSH key for the control plane nodes from the provider VM:

root@vxlan###### [ /opt/vmware/tdm-provider/provisioner ]# ls
config.yaml  dsm-tsql-config-from-pg.yaml  registry-credentials-auth  sshkey  sshkey.pub
SSH to a Control Plane node and check certificate expiration:
# SSH to control plane node
ssh -i sshkey capv@<CONTROL_PLANE_IP>

# Check certificate expiration
sudo kubeadm certs check-expiration
If the certificates have expired, that is visible in the RESIDUAL TIME column of the output (the sample below shows a healthy cluster with 364 days remaining):
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Sep 30, 2025 00:53 UTC   364d            ca                      no
apiserver                  Sep 30, 2025 00:53 UTC   364d            ca                      no
apiserver-etcd-client      Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
apiserver-kubelet-client   Sep 30, 2025 00:53 UTC   364d            ca                      no
controller-manager.conf    Sep 30, 2025 00:53 UTC   364d            ca                      no
etcd-healthcheck-client    Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
etcd-peer                  Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
etcd-server                Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
front-proxy-client         Sep 30, 2025 00:53 UTC   364d            front-proxy-ca          no
scheduler.conf             Sep 30, 2025 00:53 UTC   364d            ca                      no
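To flag expired entries non-interactively (for example from a monitoring job), the output can be scanned for the `<invalid>` residual time that kubeadm prints for expired certificates. A minimal sketch, run here against a captured sample so it is self-contained; in practice you would pipe `sudo kubeadm certs check-expiration` directly into the grep:

```shell
# Captured sample of check-expiration output; in practice, pipe the live
# command into the grep instead of using this variable.
OUTPUT='admin.conf   Sep 30, 2025 00:53 UTC   364d        ca   no
apiserver    Sep 29, 2024 00:53 UTC   <invalid>   ca   no'

# kubeadm prints "<invalid>" in the RESIDUAL TIME column for expired certificates
if printf '%s\n' "$OUTPUT" | grep -q '<invalid>'; then
  echo "expired certificates found"
fi
```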
Step 2: Renew Certificates on Each Control Plane Node
Important: Perform this on ALL control plane nodes in the cluster.
Renew all certificates:
sudo kubeadm certs renew all
Expected Output:
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
Step 3: Restart Control Plane Components
After renewing certificates, restart the control plane components:
# Restart kubelet service
sudo systemctl restart kubelet

# Wait for kubelet to stabilize
sleep 5

# Restart control plane pods
sudo crictl pods | grep kube-apiserver | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep kube-controller-manager | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep kube-scheduler | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep etcd | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep kube-vip | awk '{print $1}' | xargs -I {} sudo crictl stopp {}

# Wait for pods to restart
sleep 10

# Verify pods are running
sudo crictl pods | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd|kube-vip"
Expected output after restarting Control Plane components:
52ab8173e945a   8 seconds ago    Ready   etcd-<node-name>                      kube-system   1   (default)
21f1a03ab4c7f   14 seconds ago   Ready   kube-scheduler-<node-name>            kube-system   1   (default)
57d08437a6e0f   17 seconds ago   Ready   kube-controller-manager-<node-name>   kube-system   1   (default)
d84ab95dedce0   20 seconds ago   Ready   kube-apiserver-<node-name>            kube-system   1   (default)
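Instead of a fixed sleep, the restart can be followed by a polling loop that waits until the components report Ready. A minimal retry sketch; check_ready is a hypothetical stand-in (simulated here so the snippet runs anywhere) for the crictl pipeline shown above:

```shell
# Poll a readiness check until it succeeds or a timeout elapses.
# check_ready is a hypothetical stand-in; on a real node its body would be:
#   [ "$(sudo crictl pods | grep -E 'kube-apiserver|kube-controller|kube-scheduler|etcd' | grep -c Ready)" -ge 4 ]
check_ready() {
  # Simulated for this sketch: succeeds on the 3rd attempt
  ATTEMPT=$((ATTEMPT + 1))
  [ "$ATTEMPT" -ge 3 ]
}

ATTEMPT=0
for i in $(seq 1 30); do
  if check_ready; then
    echo "control plane ready after $i checks"
    break
  fi
  sleep 1
done
```

Polling bounds the wait at 30 attempts while returning as soon as the pods are actually Ready, which is more robust than a hard-coded sleep on slow nodes.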
Step 4: Update Kubeconfig Files
After certificate renewal, update your local kubeconfig:
Copy the new Kubeconfig:
# From your local machine, copy the renewed admin.conf
ssh -i sshkey capv@<CONTROL_PLANE_IP> "sudo cat /etc/kubernetes/admin.conf" > renewed-kubeconfig.yaml

# Test the new kubeconfig
kubectl --kubeconfig renewed-kubeconfig.yaml get nodes
NAME                       STATUS   ROLES           AGE   VERSION
matst-gcqf-db-6c40-hxl8z   Ready    control-plane   8h    v1.31.4+vmware.2-fips
matst-gcqf-db-6c40-lftgq   Ready    control-plane   8h    v1.31.4+vmware.2-fips
matst-gcqf-db-6c40-xzrkv   Ready    control-plane   8h    v1.31.4+vmware.2-fips
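The renewed expiry can also be confirmed straight from the kubeconfig by decoding the embedded client certificate. The sketch below first builds a synthetic kubeconfig fragment so it runs anywhere; with the real file, run only the final pipeline against renewed-kubeconfig.yaml:

```shell
# Build a synthetic kubeconfig fragment for illustration; with the real file,
# skip this setup and grep renewed-kubeconfig.yaml directly.
openssl req -x509 -newkey rsa:2048 -keyout /tmp/kc.key -out /tmp/kc.crt \
  -days 365 -nodes -subj "/CN=kubernetes-admin" 2>/dev/null
printf '    client-certificate-data: %s\n' "$(base64 -w0 /tmp/kc.crt)" > /tmp/kubeconfig.yaml

# Extract the base64-encoded client certificate, decode it, and print its expiry
grep client-certificate-data /tmp/kubeconfig.yaml | awk '{print $2}' | \
  base64 -d | openssl x509 -enddate -noout
```

A notAfter date roughly one year out confirms the kubeconfig carries the renewed certificate rather than the expired one.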
Impact Assessment
During testing, certificate renewal showed:
- Total process time: ~30 seconds per control plane node
- API downtime: ~3 seconds during API server restart
- Database impact: Minimal (single failed connection during API restart)
- Workload impact: Running pods continue without interruption
Monitoring During Renewal
To monitor impact during renewal, you can track database connectivity:
#!/bin/bash
# Monitor script to test connectivity during renewal
DB_URL="postgresql://user:pass@db-ip:5432/database"
while true; do
  if psql "$DB_URL" -c "SELECT 1;" -t &>/dev/null; then
    echo "[$(date '+%H:%M:%S')] DB OK"
  else
    echo "[$(date '+%H:%M:%S')] DB FAILED"
  fi
  sleep 0.5
done
Sample Output During Renewal:
[10:18:26] DB OK
[10:18:28] DB OK
[10:18:29] DB FAILED (failure #1)
[10:18:31] DB OK
[10:18:34] DB OK
Additional Notes
- Always perform certificate renewal during a maintenance window
- Test the process in a non-production environment first
- Keep backups of old kubeconfig files until renewal is confirmed successful
- For multi-node clusters, coordinate renewal across all Control Plane nodes
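The kubeconfig backup mentioned above can be a simple dated copy. A minimal sketch using an illustrative path; on a control plane node, back up /etc/kubernetes/admin.conf and the /etc/kubernetes/pki directory the same way (with sudo):

```shell
# Stand-in kubeconfig for illustration; substitute your real kubeconfig path
KUBECONFIG_FILE=/tmp/kubeconfig-demo.yaml
echo "apiVersion: v1" > "$KUBECONFIG_FILE"

# Dated backup, kept until renewal is confirmed successful
BACKUP="${KUBECONFIG_FILE}.bak-$(date +%F)"
cp "$KUBECONFIG_FILE" "$BACKUP"
echo "backed up to $BACKUP"
```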