Data Services Manager Workload Cluster Certificate Expiration and Renewal

Article ID: 412534

Products

VMware Data Services Manager

Issue/Introduction

After approximately one year of operation, users may be unable to connect to their workload clusters using kubectl.

The error typically indicates certificate expiration, preventing authentication to the kube-apiserver. This issue occurs when automatic certificate rotation is not enabled for the cluster.
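To confirm this from the client side, the client certificate embedded in the kubeconfig can be decoded and its expiry printed. The following is a minimal sketch; it assumes the first user entry in the kubeconfig is the one for the affected cluster:

# Decode the client certificate embedded in the kubeconfig and print its expiry
# (assumes the first user entry belongs to the affected workload cluster)
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' \
  | base64 -d | openssl x509 -noout -enddate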

 

Symptoms

- kubectl commands fail with certificate expiry errors
- Cannot access Kubernetes API after ~365 days of cluster operation
- Error message: "certificate has expired or is not yet valid"
- Database workloads remain running but cluster management is impossible

Environment

VMware Data Services Manager (all versions) using DSM-Managed Infrastructure Policies

Cause

Kubernetes cluster certificates have a default validity period of 365 days (1 year). When these certificates expire:
1. The client certificates embedded in kubeconfig files become invalid
2. API server certificates expire, preventing secure communication
3. Controller manager, scheduler, and etcd certificates expire

 

Without automatic certificate rotation enabled, these certificates must be manually renewed before expiration to maintain cluster accessibility.
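As an illustration, the validity window of an individual certificate can also be inspected directly with openssl on a control plane node; the path below assumes the standard kubeadm layout:

# Print the validity window of the API server certificate (standard kubeadm path assumed)
sudo openssl x509 -noout -dates -in /etc/kubernetes/pki/apiserver.crt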

 

Resolution

Prerequisites 
- SSH access to control plane nodes and provider VM
- Backup of important data (recommended)
- List of control plane node IP addresses (one or three nodes, depending on cluster topology); these can be fetched from vCenter

 

Step 1: Verify Certificate Status
To SSH to a control plane node, first find its IP address in vCenter (see Prerequisites above).

Then fetch the SSH key for the control plane nodes from the provider VM:

root@vxlan###### [ /opt/vmware/tdm-provider/provisioner ]# ls
config.yaml  dsm-tsql-config-from-pg.yaml  registry-credentials-auth  sshkey  sshkey.pub
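
If you prefer to run the remaining steps from your workstation instead of the provider VM, the key can be copied down first. This is a sketch; the provider VM address is a placeholder:

# Copy the private key from the provider VM to your workstation (host is a placeholder)
scp root@<PROVIDER_VM_IP>:/opt/vmware/tdm-provider/provisioner/sshkey .
chmod 600 sshkey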

SSH to a Control Plane node and check certificate expiration:


# SSH to control plane node
ssh -i sshkey capv@<CONTROL_PLANE_IP>

# Check certificate expiration
sudo kubeadm certs check-expiration


Review the EXPIRES and RESIDUAL TIME columns for expired or soon-to-expire certificates. Example output:


[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Sep 30, 2025 00:53 UTC   364d            ca                      no
apiserver                  Sep 30, 2025 00:53 UTC   364d            ca                      no
apiserver-etcd-client      Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
apiserver-kubelet-client   Sep 30, 2025 00:53 UTC   364d            ca                      no
controller-manager.conf    Sep 30, 2025 00:53 UTC   364d            ca                      no
etcd-healthcheck-client    Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
etcd-peer                  Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
etcd-server                Sep 30, 2025 00:53 UTC   364d            etcd-ca                 no
front-proxy-client         Sep 30, 2025 00:53 UTC   364d            front-proxy-ca          no
scheduler.conf             Sep 30, 2025 00:53 UTC   364d            ca                      no
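
For a 3-node control plane, the same check can optionally be run against every node in one pass. This is a sketch; the IPs are placeholders for the values fetched from vCenter:

# Check certificate expiration on all control plane nodes (replace placeholder IPs)
for ip in <CP_IP_1> <CP_IP_2> <CP_IP_3>; do
  echo "=== $ip ==="
  ssh -i sshkey capv@"$ip" "sudo kubeadm certs check-expiration"
done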


Step 2: Renew Certificates on Each Control Plane Node
Important: Perform this on ALL control plane nodes in the cluster.

Renew all certificates:


sudo kubeadm certs renew all


Expected Output:


[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
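
Optionally, re-run the expiration check at this point to confirm each certificate now shows a renewed expiry date and a residual time of roughly 364 days:

# Confirm the certificates were renewed
sudo kubeadm certs check-expiration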


Step 3: Restart Control Plane Components
After renewing certificates, restart the control plane components:


# Restart kubelet service
sudo systemctl restart kubelet

# Wait for kubelet to stabilize
sleep 5

# Restart control plane pods
sudo crictl pods | grep kube-apiserver | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep kube-controller-manager | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep kube-scheduler | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep etcd | awk '{print $1}' | xargs -I {} sudo crictl stopp {}
sleep 2
sudo crictl pods | grep kube-vip | awk '{print $1}' | xargs -I {} sudo crictl stopp {}

# Wait for pods to restart
sleep 10

# Verify pods are running
sudo crictl pods | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd|kube-vip"



Expected output after restarting Control Plane components:


52ab8173e945a   8 seconds ago    Ready   etcd-<node-name>                      kube-system   1   (default)
21f1a03ab4c7f   14 seconds ago   Ready   kube-scheduler-<node-name>            kube-system   1   (default)
57d08437a6e0f   17 seconds ago   Ready   kube-controller-manager-<node-name>   kube-system   1   (default)
d84ab95dedce0   20 seconds ago   Ready   kube-apiserver-<node-name>            kube-system   1   (default)
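
As an optional sanity check, you can confirm the kube-apiserver is now serving the renewed certificate; the loopback address and default port 6443 below are assumptions based on a standard kubeadm configuration:

# Inspect the certificate the API server is currently serving
echo | openssl s_client -connect 127.0.0.1:6443 2>/dev/null | openssl x509 -noout -dates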


Step 4: Update Kubeconfig Files
After certificate renewal, update your local kubeconfig:

Copy the new Kubeconfig:


# From your local machine, copy the renewed admin.conf
ssh -i sshkey capv@<CONTROL_PLANE_IP> "sudo cat /etc/kubernetes/admin.conf" > renewed-kubeconfig.yaml

# Test the new kubeconfig
kubectl --kubeconfig renewed-kubeconfig.yaml get nodes
NAME STATUS ROLES AGE VERSION
matst-gcqf-db-6c40-hxl8z Ready control-plane 8h v1.31.4+vmware.2-fips
matst-gcqf-db-6c40-lftgq Ready control-plane 8h v1.31.4+vmware.2-fips
matst-gcqf-db-6c40-xzrkv Ready control-plane 8h v1.31.4+vmware.2-fips
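
If this cluster's credentials are already part of your local kubeconfig, a common pattern is to back up the old file before switching to the renewed one. The paths below are illustrative:

# Back up the existing kubeconfig before replacing it (paths are examples)
cp ~/.kube/config ~/.kube/config.bak-$(date +%Y%m%d)

# Point kubectl at the renewed kubeconfig for subsequent commands
export KUBECONFIG=$PWD/renewed-kubeconfig.yaml
kubectl get nodes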

Impact Assessment
During testing, certificate renewal showed:
- Total process time: ~30 seconds per control plane node
- API downtime: ~3 seconds during API server restart
- Database impact: Minimal (single failed connection during API restart)
- Workload impact: Running pods continue without interruption

Monitoring During Renewal
To monitor impact during renewal, you can track database connectivity:

#!/bin/bash
# Monitor script to test database connectivity during certificate renewal
DB_URL="postgresql://user:pass@db-ip:5432/database"
FAILS=0
while true; do
    if psql "$DB_URL" -c "SELECT 1;" -t &>/dev/null; then
        echo "[$(date '+%H:%M:%S')] DB OK"
    else
        FAILS=$((FAILS + 1))
        echo "[$(date '+%H:%M:%S')] DB FAILED (failure #$FAILS)"
    fi
    sleep 0.5
done


Sample Output During Renewal:


[10:18:26] DB OK
[10:18:28] DB OK
[10:18:29] DB FAILED (failure #1)
[10:18:31] DB OK
[10:18:34] DB OK
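
To keep a record of connectivity for later review, the monitor output can also be written to a log file; the script and log file names below are illustrative, assuming the script above is saved as monitor.sh:

chmod +x monitor.sh
./monitor.sh | tee renewal-monitor.log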

 

 

Additional Information

- Always perform certificate renewal during a maintenance window
- Test the process in a non-production environment first
- Keep backups of old kubeconfig files until renewal is confirmed successful
- For multi-node clusters, coordinate renewal across all Control Plane nodes