TMC SM failing after guest cluster upgrade with expired contour and envoy CA certificates

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After upgrading your guest cluster, you find that a number of pods in the TMC namespace are in CrashLoopBackOff.

You also see that the envoy pods are Running but not fully ready (only 1 of 2 containers is ready):

$ kubectl get pods -n <tmc local namespace> | grep contour-envoy
contour-envoy-6lx25 1/2 Running 0 6m4s
contour-envoy-8l4pk 1/2 Running 0 6m4s
contour-envoy-k5p2d 1/2 Running 0 6m4s

Describing an envoy pod shows that the readiness probe is failing:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------

Warning Unhealthy 101s (x102 over 6m32s) kubelet Readiness probe failed: Get "http://x.x.x.x:8002/ready": dial tcp x.x.x.x:8002: connect: connection refused

Checking envoy as described in the KB "Contour Envoy Pods Failure with SSLV3_ALERT_BAD_CERTIFICATE or CERTIFICATE_VERIFY_FAILED error", you can see that the CA certificates in the envoy (envoycert) and contour (contourcert) secrets have expired.

$ kubectl get secrets -n <tmc local namespace> envoycert -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates
notBefore=Jan 6 17:00:01 2025 GMT
notAfter=Jan 7 17:00:01 2026 GMT

$ kubectl get secrets -n <tmc local namespace> contourcert -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates
notBefore=Jan 6 17:00:01 2025 GMT
notAfter=Jan 7 17:00:01 2026 GMT

However, following the steps in the KB "Contour Envoy Pods Failure with SSLV3_ALERT_BAD_CERTIFICATE or CERTIFICATE_VERIFY_FAILED error" does not resolve the issue.
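The expiry check above can also be scripted so that it returns a status instead of requiring you to read the dates by eye. The following is a minimal sketch, not part of the original procedure: the function names cert_expired and check_secret_ca are hypothetical, and it assumes bash, openssl, and kubectl are available on the workstation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a PEM certificate on stdin.
# Returns 0 if it has expired, 1 if it is still valid, 2 if it cannot be parsed.
cert_expired() {
  local pem
  pem=$(cat)
  # Reject input that is not a parseable X.509 certificate.
  printf '%s\n' "$pem" | openssl x509 -noout >/dev/null 2>&1 || return 2
  # -checkend 0 exits 0 only when the certificate has not yet expired.
  if printf '%s\n' "$pem" | openssl x509 -noout -checkend 0 >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Hypothetical wrapper: check the ca.crt key of one secret.
# $1 = namespace, $2 = secret name (envoycert or contourcert in this article).
check_secret_ca() {
  local ns=$1 secret=$2 rc=0
  kubectl get secret -n "$ns" "$secret" -o jsonpath='{.data.ca\.crt}' \
    | base64 -d | cert_expired || rc=$?
  case $rc in
    0) echo "$secret: ca.crt has expired" ;;
    1) echo "$secret: ca.crt is still valid" ;;
    *) echo "$secret: could not read ca.crt" ;;
  esac
}
```

For example, `for s in envoycert contourcert; do check_secret_ca "<tmc local namespace>" "$s"; done` would report on both secrets in one pass.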

Environment

vCenter 8.0U3

VKS Guest Cluster with TMC installed and using contour for ingress.

Cause

The CA that signed the envoy/contour certificates has expired and was not renewed.

While the contour and envoy certificate objects may be renewed properly, the pods also use the corresponding secret object's CA.
If the secret's CA is not properly renewed, contour and/or envoy pods will not work properly and services using this ingress controller will fail.

In this environment, the envoy/contour certificates and CA are not managed by cert-manager and expired after one year.
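One way to see this condition directly is to verify the serving certificate in each secret against the CA stored in the same secret. The sketch below is illustrative only: the function names are hypothetical, and it assumes each secret also carries a tls.crt key alongside ca.crt (the usual layout for TLS secrets, but not confirmed by this article). An expired or mismatched CA makes the verification fail even when the leaf certificate itself looks current.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the leaf certificate verifies against the CA file.
# $1 = path to CA certificate (PEM), $2 = path to leaf certificate (PEM).
leaf_matches_ca() {
  local ca_file=$1 leaf_file=$2
  openssl verify -CAfile "$ca_file" "$leaf_file" >/dev/null 2>&1
}

# Hypothetical wrapper: extract ca.crt and tls.crt from a secret and verify them.
# Assumes the secret has a tls.crt key; adjust if your secret layout differs.
verify_secret_chain() {
  local ns=$1 secret=$2 tmp
  tmp=$(mktemp -d)
  kubectl get secret -n "$ns" "$secret" -o jsonpath='{.data.ca\.crt}' \
    | base64 -d > "$tmp/ca.crt"
  kubectl get secret -n "$ns" "$secret" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d > "$tmp/tls.crt"
  if leaf_matches_ca "$tmp/ca.crt" "$tmp/tls.crt"; then
    echo "$secret: tls.crt verifies against ca.crt"
  else
    echo "$secret: tls.crt does NOT verify against ca.crt (expired or mismatched CA)"
  fi
  rm -rf "$tmp"
}
```

Running `verify_secret_chain "<tmc local namespace>" envoycert` and the same for contourcert would show whether either secret is affected.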

Resolution

After confirming that the ca.crt has expired, perform the following steps:

1. Find the certgen job for contour

kubectl get jobs -A | grep contour-certgen

<tmc local namespace> contour-contour-certgen Complete 1/1 2s 372d

2. Capture the YAML for the job and also make a backup copy

kubectl get jobs -n <tmc local namespace> contour-contour-certgen -o yaml >contour-contour-certgen-job.yaml
kubectl get jobs -n <tmc local namespace> contour-contour-certgen -o yaml >contour-contour-certgen-job-bak.yaml

3. Edit the file contour-contour-certgen-job.yaml and remove the server-generated data: the status section and events, timestamps, UIDs, and any previous (last-applied) configuration.

vi contour-contour-certgen-job.yaml

4. Recreate the contour-certgen job

a. After saving the file, delete and recreate the contour-certgen job

kubectl delete jobs -n <tmc local namespace> contour-contour-certgen

kubectl apply -f contour-contour-certgen-job.yaml

b. Check that the job has run

kubectl get jobs -n <tmc local namespace> contour-contour-certgen

kubectl describe jobs -n <tmc local namespace> contour-contour-certgen

5. Check that the certificates in the secrets envoycert and contourcert have rotated successfully

kubectl get secrets -n <tmc local namespace> envoycert -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates

kubectl get secrets -n <tmc local namespace> contourcert -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates

6. Delete each of the envoy pods in <tmc local namespace>

kubectl delete pods -n <tmc local namespace> <envoy pod name>

7. Check that the envoy pods and the other TMC pods are back up and running

kubectl get pods -A | grep -vE "Running|Complete"
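The manual cleanup in step 3 can be sketched as a filter over the job's JSON output instead of hand-editing YAML. This is an illustrative alternative, not the procedure the article prescribes: the function name sanitize_job_json is hypothetical, and it assumes jq is installed. Besides the status, timestamps, and UIDs, it also drops spec.selector and the controller-uid labels, because Kubernetes generates these for each Job and typically rejects a recreated Job that still carries the old values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip server-generated fields from a Job manifest.
# Reads the Job as JSON on stdin and writes the cleaned JSON to stdout.
sanitize_job_json() {
  jq 'del(
        .status,
        .metadata.uid,
        .metadata.resourceVersion,
        .metadata.creationTimestamp,
        .metadata.generation,
        .metadata.managedFields,
        .metadata.annotations["kubectl.kubernetes.io/last-applied-configuration"],
        .metadata.labels["controller-uid"],
        .metadata.labels["batch.kubernetes.io/controller-uid"],
        .spec.selector,
        .spec.template.metadata.creationTimestamp,
        .spec.template.metadata.labels["controller-uid"],
        .spec.template.metadata.labels["batch.kubernetes.io/controller-uid"]
      )'
}

# Example usage (not executed here; the output file name is an assumption):
#   kubectl get jobs -n "<tmc local namespace>" contour-contour-certgen -o json \
#     | sanitize_job_json > contour-contour-certgen-job.json
#   kubectl delete jobs -n "<tmc local namespace>" contour-contour-certgen
#   kubectl apply -f contour-contour-certgen-job.json
```

Keep the unmodified backup from step 2 regardless, and review the cleaned file before deleting the existing job.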

Additional Information

Contour documentation: Rotate using the contour-certgen job