vSphere Supervisor System Pod Certificate Expiry due to Cert-Manager Issues

Article ID: 390661

Products

  • VMware vSphere Kubernetes Service
  • Tanzu Kubernetes Runtime
  • VMware vSphere 7.0 with Tanzu
  • vSphere with Tanzu

Issue/Introduction

System services are not working properly, returning an error message that the corresponding service or webhook has an expired certificate.

The certificate is managed by cert-manager, which manages certificates for system pods in the Supervisor and Workload clusters. Cert-manager does not manage Kubernetes Certificates.

 

While connected to the vCenter Server Appliance (VCSA):

  • The logs for wcpsvc, the service that manages Workload Management, show errors similar to the following, where <service address> varies based on the system pod(s) with expired certificates:
    cat /var/log/vmware/wcp/wcpsvc.log
    
    Failed calling webhook, failing open <system service>.vmware.com: failed calling webhook "<webhook service>": failed to call webhook: Post "https://<webhook service>:<port>/<webhook service>?timeout=10s"
    
    Post: "https://<service address>:<port>/convert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
    
    x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SSZ is after YYYY-MM-DDTHH:MM:SSZ
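As a quick triage sketch, the log can be filtered for those expiry signatures. The two-line excerpt below is hypothetical so the command runs as-is; on the VCSA, the same grep would be pointed at /var/log/vmware/wcp/wcpsvc.log:

```shell
# Filter wcpsvc log lines for the certificate-failure signatures shown above.
# Real VCSA form (log path from this article):
#   grep -iE 'certificate has expired|unknown authority' /var/log/vmware/wcp/wcpsvc.log
# A hypothetical two-line excerpt is used here so the command is self-contained.
grep -iE 'certificate has expired|unknown authority' <<'EOF'
2024-01-01T00:00:00Z info wcpsvc: reconcile completed
2024-01-01T00:00:05Z error wcpsvc: Post "https://svc:443/convert?timeout=30s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid
EOF
```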

     

While connected to the Supervisor cluster context, one or more of the following symptoms are observed:

  • Workload clusters are stuck in a deleting state and deletion is not progressing.

  • Describing the stuck workload cluster shows a certificate error similar to the following, where values in angle brackets <> vary by environment:
    kubectl describe tkc <workload cluster name> -n <workload cluster namespace>
    
    kubectl describe cluster <workload cluster name> -n <workload cluster namespace>
    
    Message: error reconciling the Cluster topology: failed to create patch helper for MachineHealthCheck/<workload cluster node>: server side apply dry-run failed for modified object: Internal error occurred: failed calling webhook "default.machinehealthcheck.cluster.x-k8s.io": failed to call webhook: Post "https://capi-webhook-service.<capi system pod namespace>.svc:443/mutate-cluster-x-k8s-io-v1beta1-machinehealthcheck?timeout=10s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SSZ is after YYYY-MM-DDTHH:MM:SSZ
    Reason: TopologyReconcileFailed


  • System pods in the Supervisor cluster may be Running while logging certificate errors, or in CrashLoopBackOff state due to certificate errors:
    kubectl get pods -A
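To surface only the unhealthy pods from that listing, the output can be filtered. A minimal sketch against a hypothetical two-row sample of `kubectl get pods -A` output (namespaces and pod names are invented); on the Supervisor, the live command would be piped into the same filter. Note that pods which are Running but logging certificate errors will not be caught by this filter and still need a log check:

```shell
# Drop healthy (Running/Completed) rows to highlight failing pods.
# Live cluster form: kubectl get pods -A --no-headers | grep -vE 'Running|Completed'
# Hypothetical sample rows keep the snippet self-contained.
grep -vE 'Running|Completed' <<'EOF'
vmware-system-capw   capi-controller-manager-7c9f8b6d4-k2xzq   2/2   Running            0   10d
vmware-system-cert   cert-manager-6b9d5f8c7d-x4x2m             0/1   CrashLoopBackOff   8   10d
EOF
```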

     

  • The affected system pod's logs show bad certificate error messages for the service named in the certificate expiry errors above:
    kubectl logs -n <affected pod namespace> <affected pod name>
    
    http: TLS handshake error from <service IP address>:<port>: remote error: tls: bad certificate

     

  • Kube-apiserver pod logs show that the corresponding service certificate has expired:
    kubectl get pods -A | grep kube-apiserver
    
    kubectl logs -n kube-system <kube-apiserver pod name>
    
    x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SSZ is after YYYY-MM-DDTHH:MM:SSZ

     

  • There are Kubernetes Certificate objects that have expired, or whose Not After date has passed even though the certificate's status may not report it:
    kubectl get certificates -A
    
    kubectl get certificates -A -o yaml | grep After
    Not After: YYYY-MM-DD
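The date reported by the Certificate object can also be cross-checked against the actual serving certificate, which cert-manager stores in the Certificate's backing Secret. A minimal sketch: the kubectl form below uses placeholder names, and a throwaway self-signed certificate is generated here so the decode pipeline runs as-is:

```shell
# Cluster form (placeholders in <>): read tls.crt from the backing Secret and
# print its expiry:
#   kubectl get secret -n <namespace> <secret name> -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | openssl x509 -noout -enddate
# Self-contained demonstration with a throwaway self-signed certificate:
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 1 \
  -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null
base64 < /tmp/demo.crt | base64 -d | openssl x509 -noout -enddate
```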

Environment

vSphere Supervisor

This issue can occur regardless of whether the affected cluster is managed by Tanzu Mission Control (TMC).

Cause

Cert-manager is responsible for the automatic rotation of certificates for many vmware-system and kube-system pods, as well as installed packages.

  • Note: Cert-manager is a separate certificate management service for system pods in the Supervisor Cluster. Cert-manager does not manage Kubernetes Certificates.

If certificates are expired in system pods, then services reliant on those certificates will fail with certificate expiry errors.

Because certain system pods depend on other system pods, a single system pod's expired service certificate can cause multiple system pods to fail with certificate errors.

Cert-manager must be investigated and repaired to restore system pod certificate management.

 

These certificates do not show up in certificate checks using the certmgr tool. The certmgr tool is only for Kubernetes Certificates.

It is expected that cert-manager will automatically renew the certificates for system and vmware pods running on the Supervisor cluster prior to expiry.

However, there are circumstances (which will vary by scenario) where cert-manager fails to renew the certificates.

Resolution

Ideally, the cause of the cert-manager pod failing to renew the certificates before expiry should be investigated.

However, the cert-manager pod can be restarted to force it to renew the certificates.

Note: Cert-manager is a separate certificate management service for system pods in the Supervisor Cluster. Cert-manager does not manage Kubernetes Certificates.

  1. Connect to the Supervisor cluster context.

  2. Note down any Kubernetes Certificates that show as expired or whose Not After dates have already passed:
    kubectl get certificates -A
    
    kubectl get certificates -A -o yaml | grep After
    Not After: YYYY-MM-DD
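Whether a given Not After timestamp has already passed can also be checked programmatically. A minimal sketch assuming GNU date; NOT_AFTER is a hypothetical sample value that would be taken from the output above:

```shell
# Compare a certificate's Not After timestamp against the current time
# (GNU date assumed, as on the Supervisor control plane).
NOT_AFTER="2020-01-01T00:00:00Z"   # hypothetical; substitute the real value
if [ "$(date -u -d "$NOT_AFTER" +%s)" -lt "$(date -u +%s)" ]; then
  echo "EXPIRED: $NOT_AFTER"
else
  echo "still valid: $NOT_AFTER"
fi
```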

     

  3. Check cert-manager pod logs for the cause:
    kubectl get pods -A | grep cert-manager
    
    kubectl logs -n <cert-manager-namespace> <cert-manager pod name>

     

  4. All cert-manager pods can be restarted with the following command:
    kubectl rollout restart deploy -n <cert-manager-namespace>


  5. Confirm that all cert-manager pods are Running:
    kubectl get pods -n <cert-manager-namespace>

    Once cert-manager is Running properly, it should automatically renew any expired certificates of pods in the Supervisor cluster. However, the pods may need to be restarted in order to pick up the renewed certificates.


  6. Check that the Kubernetes Certificates no longer show as expired and that their Not After dates are in the future:
    kubectl get certificates -A
    
    kubectl get certificates -A -o yaml | grep After
    Not After: YYYY-MM-DD

     

  7. Restart the pods through the corresponding deployment to pick up the renewed certificates:
    kubectl get deploy -n <affected pod namespace>
    
    kubectl rollout restart deploy -n <affected pod namespace> <affected pod's deployment>
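If it is unclear which deployment owns an affected pod, the authoritative mapping is the pod's ownerReferences chain (Pod → ReplicaSet → Deployment). As a quick heuristic, deployment-managed pod names follow the pattern <deployment>-<replicaset hash>-<pod suffix>, so stripping the last two generated segments usually recovers the deployment name. A sketch with a hypothetical pod name:

```shell
# Authoritative lookup: the pod's owner is a ReplicaSet, whose own owner is
# the Deployment:
#   kubectl get pod -n <affected pod namespace> <affected pod name> \
#     -o jsonpath='{.metadata.ownerReferences[0].name}'
# Heuristic: strip the two generated suffixes from the pod name.
POD="capi-controller-manager-7c9f8b6d4-k2xzq"   # hypothetical pod name
echo "${POD%-*-*}"                              # -> capi-controller-manager
```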

     

  8. Check that the affected pod(s) are Running after the restart:
    kubectl get pods -n <affected pod namespace>


  9. Confirm that the affected pods are no longer reporting certificate expiry:
    kubectl logs -n <affected pod namespace> <affected pod name>