vSphere Supervisor System Pod Certificate Expiry due to Cert-Manager Issues

Article ID: 390661

Products

VMware vSphere Kubernetes Service

Tanzu Kubernetes Runtime

VMware vSphere 7.0 with Tanzu

vSphere with Tanzu

Issue/Introduction

System services are not working properly, returning an error message that the corresponding service or webhook has an expired certificate.

The affected certificate is managed by cert-manager, which manages certificates for system pods in Supervisor and Workload clusters. Cert-manager does not manage Kubernetes Certificates.

 

While connected to the vCenter Server Appliance (VCSA):

  • The wcpsvc service manages Workload Management. Its log shows errors similar to the following, where <service address> varies with the system pod(s) whose certificates have expired:
    • cat /var/log/vmware/wcp/wcpsvc.log
    • Failed calling webhook, failing open <system service>.vmware.com: failed calling webhook "<webhook service>": failed to call webhook: Post "https://<webhook service>:<port>/<webhook service>?timeout=10s"
    • Post: "https://<service address>:<port>/convert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
    • x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SSZ is after YYYY-MM-DDTHH:MM:SSZ"
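
The certificate-related lines can be separated from the rest of wcpsvc.log with a simple filter. The following is a minimal sketch, assuming a shell session on the VCSA. The function name wcp_cert_errors is hypothetical, and the patterns are taken only from the sample messages above, so other wording in the log will not be matched.

```shell
# Hypothetical helper: print only certificate-related lines read from stdin.
# The patterns are copied from the sample wcpsvc messages in this article.
wcp_cert_errors() {
  grep -E 'x509: certificate (has expired or is not yet valid|signed by unknown authority)|[Ff]ailed calling webhook'
}

# Example (on the VCSA):
#   wcp_cert_errors < /var/log/vmware/wcp/wcpsvc.log | tail -n 20
```

The function reads from standard input, so it can also be applied to a copied or rotated log file.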

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • System pods in the Supervisor cluster can be either Running with certificate errors or in CrashLoopBackOff state with certificate errors:
    • kubectl get pods -A
  • The logs of the affected system pod show similar bad certificate error messages:
    • kubectl logs -n <affected pod namespace> <affected pod name>
      • http: TLS handshake error from <service IP address>:<port>: remote error: tls: bad certificate
  • Kube-apiserver pod logs show that the corresponding service certificate has expired:
    • kubectl get pods -A | grep kube-apiserver
    • kubectl logs -n kube-system <kube-apiserver pod name>
      • x509: certificate has expired or is not yet valid: current time YYYY-MM-DDTHH:MM:SSZ is after YYYY-MM-DDTHH:MM:SSZ"
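
Because affected pods can be Running while still logging certificate errors, it can help to pull the expiry timestamps out of the log text. The following is a minimal sketch that assumes the log lines follow the x509 message format shown above. The function name x509_expiry_times is hypothetical.

```shell
# Hypothetical helper: read log text on stdin and print each distinct expiry
# timestamp found in "certificate has expired" messages, i.e. the value that
# follows "is after" in the sample message format shown in this article.
x509_expiry_times() {
  sed -n 's/.*certificate has expired or is not yet valid: current time .* is after \([^" ]*\).*/\1/p' | sort -u
}

# Example (Supervisor cluster context):
#   kubectl logs -n kube-system <kube-apiserver pod name> | x509_expiry_times
```

Several distinct timestamps in the output would suggest that more than one service certificate has expired.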

Environment

vSphere with Tanzu 7.0

vSphere with Tanzu 8.0

This issue can occur whether or not the affected cluster is managed by Tanzu Mission Control (TMC).

Cause

Cert-manager is responsible for automatically rotating certificates for many VMware system pods and kube-system pods, as well as packages.

  • Note: Cert-manager is a separate certificate management service for system pods. Cert-manager does not manage Kubernetes Certificates.

If certificates are expired in system pods, then services reliant on those certificates will fail with certificate expiry errors.

Some system pods depend on other system pods, so one expired service certificate can cause multiple system pods to fail with certificate errors.

Cert-manager must be investigated and repaired to restore certificate management for system pods.

 

These certificates do not show up in certificate checks using the certmgr tool. The certmgr tool is only for Kubernetes Certificates.

It is expected that cert-manager will automatically renew the certificates for system and vmware pods running on Supervisor and Workload clusters prior to expiry.

However, there are circumstances (which will vary by scenario) where cert-manager fails to renew the certificates.
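
Cert-manager normally tracks each certificate it issues as a Certificate custom resource with a Ready condition. The following is a minimal sketch for listing Certificates that are not Ready. It assumes that these resources can be listed in the cluster context in use and that kubectl prints the default columns (NAMESPACE, NAME, READY, SECRET, AGE). The function name certs_not_ready is hypothetical.

```shell
# Hypothetical helper: read `kubectl get certificates -A` output on stdin and
# print namespace/name for every Certificate whose READY column is not "True".
# Assumes the default column order: NAMESPACE NAME READY SECRET AGE.
certs_not_ready() {
  awk 'NR > 1 && $3 != "True" { print $1 "/" $2 }'
}

# Example (Supervisor cluster context):
#   kubectl get certificates -A | certs_not_ready
```

An empty result does not prove that every pod has loaded its renewed certificate. It only indicates that cert-manager reports the Certificates as Ready.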

Resolution

Ideally, the reason cert-manager failed to renew the certificates before expiry should be investigated.

However, the cert-manager pod can be restarted to force it to renew the certificates.

Note: Cert-manager is a separate certificate management service for system pods. Cert-manager does not manage Kubernetes Certificates.

  1. Connect to the Supervisor cluster context.

  2. Check cert-manager pod logs for the cause:
    • kubectl get pods -A | grep cert-manager
    • kubectl logs -n <cert-manager-namespace> <cert-manager pod name>

  3. All cert-manager pods can be restarted using the below command:
    • kubectl rollout restart deploy -n <cert-manager-namespace>

  4. Confirm that all cert-manager pods are Running:
    • kubectl get pods -n <cert-manager-namespace>
    • Once cert-manager is Running properly, it should automatically renew any expired certificates of pods in the Supervisor cluster. However, the pods may need to be restarted in order to pick up the renewed certificates.

  5. Restart the pods through the corresponding deployment to pick up the renewed certificates:
    • kubectl get deploy -n <affected pod namespace>
    • kubectl rollout restart deploy -n <affected pod namespace> <affected deployment name>

  6. Check that the affected pod(s) are Running after the restart:
    • kubectl get pods -n <affected pod namespace>

  7. Confirm that the affected pods are no longer reporting certificate expiry:
    • kubectl logs -n <affected pod namespace> <affected pod name>
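
The restart and verification steps above can be combined into a small helper. The following is a minimal sketch, not a supported tool. The function name restart_and_wait and the KUBECTL variable are hypothetical, and the 300 second timeout is an arbitrary choice. Setting KUBECTL="echo kubectl" prints the commands instead of running them.

```shell
# KUBECTL defaults to the real binary. Set KUBECTL="echo kubectl" for a dry
# run that only prints the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: restart each named deployment in a namespace and wait
# for its rollout to finish before moving on to the next one.
# Usage: restart_and_wait <namespace> <deployment> [<deployment> ...]
restart_and_wait() {
  ns="$1"
  shift
  for deploy in "$@"; do
    $KUBECTL rollout restart deploy -n "$ns" "$deploy" || return 1
    $KUBECTL rollout status deploy -n "$ns" "$deploy" --timeout=300s || return 1
  done
}

# Example (Supervisor cluster context), cert-manager first, then the
# affected deployment:
#   restart_and_wait <cert-manager-namespace> <cert-manager deployment names>
#   restart_and_wait <affected pod namespace> <affected deployment name>
```

A completed rollout only shows that the new pods started. The pod logs should still be checked as in step 7 to confirm that the certificate expiry errors have stopped.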