Manually Replace vSphere Kubernetes Cluster/Guest Cluster Certificates

Products

VMware vSphere Kubernetes Service

VMware vSphere 7.0 with Tanzu

Tanzu Kubernetes Runtime

Issue/Introduction

When the Kubernetes certificates on the vSphere Kubernetes Cluster (also known as Guest Cluster) Control Plane VMs have expired:

Users will not be able to log into the Guest Clusters with the kubectl vsphere login command.
Guest Cluster control plane nodes will be unable to manage workloads.

The recommended approach to replace certificates is documented in the article: Replace vSphere with Tanzu Guest Cluster Certificates

This KB article is intended to be used only when the above certmgr script fails to rotate the certificates.

The certmgr script may fail to renew certificates due to:
- Failing etcd and kube-apiserver pods in the affected cluster
- Certificates already expired in the affected cluster
- vmware-system-user service account expiry

The certmgr script returns error messages similar to the following, indicating that the certificate for the etcd process has expired:

HH:MM:SS etcd_actions.go:66: etcd still not healthy result {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssZ","logger":"etcd-client","caller":"v#@v#.#.#/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc#######/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SSZ is after YYYY-MM-DDTHH:MM:SSZ\""}
Error: context deadline exceeded

time="YYYY-MM-DDTHH:MM:SSZ" level=fatal msg="execing command in container: command terminated with exit code 1"

There is no time synchronization issue in the environment.
The TKC or Guest Clusters cannot be accessed from the jump box.
kubectl commands fail with the error "tls: failed to verify certificate: x509: certificate has expired or is not yet valid":
- couldn't get current server API group list: Get "https://IP_Address:6443/api?timeout=32s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SS+00:00 is after YYYY-MM-DDTHH:MM:SSZ
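
The x509 error carries two timestamps: the time the client believes it is now, and the certificate's expiry time. Comparing them against a trusted clock is one way to confirm that the certificate has really expired and that the error is not caused by clock skew. The helper below is only a sketch; the function name `x509_expiry_times` is hypothetical, and it assumes the error text follows the "current time ... is after ..." shape shown above.

```shell
# Hypothetical helper: extract both timestamps from an x509 expiry error.
# Reads the error text on stdin and prints "current=<ts> not_after=<ts>".
# Lines that do not contain the expected phrase produce no output.
x509_expiry_times() {
  sed -n 's/.*current time \([^ ]*\) is after \([^ "\\]*\).*/current=\1 not_after=\2/p'
}

# Example use (run against a real cluster, not executed here):
#   kubectl get nodes 2>&1 | x509_expiry_times
```

If the `current` value matches the real time and is later than `not_after`, the certificate has expired and time synchronization is not the cause.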

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

Certificates expire regardless of whether or not this cluster is managed by TMC.

Cause

Kubernetes has a default certificate expiration time of 1 year.

VMware by Broadcom products adhere to this certificate expiration timeframe.

Resolution

Certificates can be manually rotated from within the affected vSphere Kubernetes Cluster (also known as Guest Cluster) using kubeadm.

If you are looking for Supervisor Cluster certificate rotation, please see: Replace vSphere with Tanzu Supervisor Certificates

Prerequisites

This manual certificate renewal procedure requires that vmware-system-user, the breakglass system account used by VMware by Broadcom Technical Support, has not expired on the cluster.

If both the vmware-system-user account and the guest cluster certificates have expired, please reach out to VMware by Broadcom Technical Support referencing this KB article.

Manual Renewal Steps

SSH into one of the affected cluster's control plane nodes as vmware-system-user:
- Documentation: SSH to Tanzu Kubernetes Cluster Nodes as the System User Using a Password
Confirm the status of the certificates on this affected cluster's control plane node:
- ```
kubeadm certs check-expiration
```
Perform the manual certificate rotation:
- ```
kubeadm certs renew all
```
Check that certificates renewed properly:
- ```
kubeadm certs check-expiration
```
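
To make the renewal check less error-prone, the output of the command above can be filtered for entries that are still reported as expired. This is a minimal sketch, not part of the documented procedure: the function name `expired_certs` is hypothetical, and it assumes that kubeadm marks expired entries with `<invalid>` in the RESIDUAL TIME column.

```shell
# Hypothetical helper: list entries that kubeadm still reports as expired.
# Assumption: expired entries show "<invalid>" in the RESIDUAL TIME column.
# Reads `kubeadm certs check-expiration` output on stdin, prints the first
# column (entry name) of each expired line, and returns non-zero if any
# expired entry was found.
expired_certs() {
  awk '/<invalid>/ { print $1; found = 1 } END { exit found ? 1 : 0 }'
}

# Example use on a control plane node (not executed here):
#   kubeadm certs check-expiration | expired_certs \
#     && echo "no expired certificates reported"
```

An empty result with a zero exit status suggests that every certificate managed by kubeadm was renewed on this node.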
Retrieve the container IDs (first column IDs) for the following processes:
- ```
crictl ps | egrep "CONTAINER|sched|kube-controller|apiserver|etcd"
```
- Note: The above "crictl ps" command only outputs containers in Running state.
- When certificates are expired, kube-apiserver and etcd containers will continue to crash repeatedly until the certificates are renewed.
- If kube-apiserver and etcd are currently down, these containers may pick up the renewed certificates on their next start.
- The kubelet system service keeps trying to restart failed kube-apiserver and etcd containers, waiting up to 5 minutes between attempts.
Use crictl stop to shut down the above container IDs in the following order so that these processes can use the new certificates.
- The system will automatically start these containers back up:
- ```
crictl stop <kube-scheduler container id>
```
- ```
crictl stop <kube-controller container id>
```
- ```
crictl stop <kube-apiserver container id>
```
- ```
crictl stop <etcd container id>
```
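
The stop sequence above could be scripted roughly as follows. This is a sketch under stated assumptions rather than a supported tool: the function name `stop_control_plane_containers` is hypothetical, and it relies on the `--name` (name filter) and `-q` (print only container IDs) options of `crictl ps`. As with the manual steps, only containers in Running state are returned, so a container that is currently crashed is skipped and picks up the new certificates on its next start.

```shell
# Hypothetical helper: stop the control plane containers in the documented
# order (scheduler, controller manager, API server, etcd). kubelet is
# expected to start them again, at which point they load the renewed
# certificates.
stop_control_plane_containers() {
  local name id
  for name in kube-scheduler kube-controller kube-apiserver etcd; do
    # --name filters running containers by name; -q prints only the IDs.
    for id in $(crictl ps --name "$name" -q); do
      echo "stopping $name container $id"
      crictl stop "$id"
    done
  done
}

# Example use on a control plane node (not executed here):
#   stop_control_plane_containers
```

Running `crictl ps -a` afterwards also lists exited containers, which helps confirm that new instances were started.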
For each control plane node in the affected guest cluster, repeat the above steps before proceeding with the next step.
- Because these are manual steps, the certificate renewal and container restarts must be performed directly on each control plane node for the affected cluster.
- Please ensure that certificates are renewed and the above noted containers are restarted across all control plane nodes.
- If this is done improperly, there will still be certificate errors and crashing processes due to certificate expiry in the cluster on one or more control plane nodes.
IMPORTANT: This manual rotation does not rotate all certificates. The certmgr script will need to be run on the affected cluster to finish rotating all certificates:
- Replace Guest Cluster Certificates KB using Certmgr
- After the manual rotation, the certmgr script is expected to rotate all certificates successfully. There is no harm in running the certmgr script multiple times.
- The status of certificates can be checked through the certmgr script using the following command:
  - ```
  ./certmgr tkc certificates list -n <affected cluster namespace> <affected cluster name>
```
- If the certmgr script fails again, please reach out to VMware by Broadcom Technical Support referencing this KB article for assistance.

Additional Information

How to rotate certificates in a Tanzu Kubernetes Grid cluster