TKGS Cluster Reconcile was failing due to x509 certificate expiry for tkr-resolver-cluster-webhook in vmware-system-tkg

Article ID: 375172

Updated On:

Products

VMware vSphere with Tanzu

Issue/Introduction

While trying to create a new nodepool on TKGS clusters, the new nodepool is stuck in "Waiting".

 

Environment

vSphere 8.0 U3

Cause

The x509 certificate for the tkr-resolver-cluster-webhook in the vmware-system-tkg namespace had expired. The error appears in the vmware-system-tkg-controller-manager log:

grep -i "2024-08.*x509.*tkgs-cluster" vmware-system-tkg_vmware-system-tkg-controller-manager-*/manager/0.log | head -1
2024-08-15T07:46:42.342487103Z stderr F E0815 07:46:42.342398       1 tanzukubernetescluster_controller.go:462] vmware-system-tkg-controller-manager/tanzukubernetescluster-spec-controller/tkgs-cluster-ns/tkgs-cluster "msg"="Error while reconcilling cluster object requeuing for retry" "error"="Internal error occurred: failed calling webhook \"tkr-resolver-cluster-webhook.tanzu.vmware.com\": failed to call webhook: Post \"https://tkr-resolver-cluster-webhook-service.vmware-system-tkg.svc:443/mutate-cluster?timeout=10s\": x509: certificate has expired or is not yet valid: current time 2024-08-15T07:46:42Z is after 2024-07-25T16:43:20Z" "cluster.name"="tkgs-cluster"
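
To see at a glance when the certificate expired, the timestamp can be pulled out of an error line like the one above. This is a minimal sketch, assuming the message keeps the "current time ... is after ..." wording shown in the excerpt; the function name x509_expiry_from_log is hypothetical and not part of any VMware tooling.

```shell
# Read log lines on stdin and print the certificate expiry timestamp
# (the value after "is after") for each x509 expiry error found.
# Assumption: the error text matches the format in the excerpt above.
x509_expiry_from_log() {
  sed -n 's/.*x509: certificate has expired or is not yet valid: current time [^ ]* is after \([0-9TZ:-]*\).*/\1/p'
}
```

Piping the output of the grep command above into x509_expiry_from_log would print only the expiry timestamp; lines without the error produce no output.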

Resolution

The certificate is held in the secret "tkr-resolver-cluster-webhook-service-cert". The certificate itself is managed by cert-manager and should be rotated automatically.

1. Check the tkr-resolver-cluster-webhook-manager deployment and its certificate secret

kubectl get -n vmware-system-tkg deployment/tkr-resolver-cluster-webhook-manager
kubectl get secret -n vmware-system-tkg tkr-resolver-cluster-webhook-service-cert

2. Confirm that the Certificate is managed by cert-manager and that it has expired

kubectl get secret tkr-resolver-cluster-webhook-service-cert -n vmware-system-tkg -o yaml | grep cert-manager
kubectl get certificate -n vmware-system-tkg tkr-resolver-cluster-webhook-serving-cert -o yaml | grep -E "cert-manager|notAfter|notBefore|renewalTime"
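
Rather than comparing the notAfter value by eye, the check can be scripted. This is a rough sketch, not an official procedure: cert_expired is a hypothetical helper, and it assumes the timestamp is in the UTC form seen in the log (for example 2024-07-25T16:43:20Z), which sorts chronologically as a plain string.

```shell
# Return 0 if NOT_AFTER is earlier than NOW, 1 if it is not, 2 if NOT_AFTER is empty.
# Usage: cert_expired NOT_AFTER [NOW]   (NOW defaults to the current UTC time)
# Assumption: both values use the fixed-width form YYYY-MM-DDTHH:MM:SSZ.
cert_expired() {
  not_after=$1
  [ -n "$not_after" ] || return 2
  now=${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  # Appending "" forces awk to compare the values as strings.
  awk -v a="$not_after" -v b="$now" 'BEGIN { exit !((a "") < (b "")) }'
}

# Possible use against the cluster (assumes the Certificate exposes .status.notAfter):
#   cert_expired "$(kubectl get certificate -n vmware-system-tkg \
#     tkr-resolver-cluster-webhook-serving-cert -o jsonpath='{.status.notAfter}')" \
#     && echo "certificate has expired"
```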
 

3. Check that the cert-manager pods are running, then scale the deployments down and back up to restart them

kubectl get -n vmware-system-cert-manager pods
kubectl -n vmware-system-cert-manager scale deployments.apps --all --replicas=0
kubectl -n vmware-system-cert-manager scale deployments.apps --all --replicas=1
kubectl get -n vmware-system-cert-manager pods

4. Confirm that the Certificate is managed by cert-manager and that it has now rotated

kubectl get secret tkr-resolver-cluster-webhook-service-cert -n vmware-system-tkg -o yaml | grep cert-manager
kubectl get certificate -n vmware-system-tkg tkr-resolver-cluster-webhook-serving-cert -o yaml | grep -E "cert-manager|notAfter|notBefore|renewalTime"

5. If restarting cert-manager has not forced a rotation of the certificate:

# Only proceed if the secret carries cert-manager annotations, cert-manager is running, and the certificate has expired.
# Delete the tkr-resolver-cluster-webhook-service-cert secret; cert-manager should recreate it.
kubectl delete secret -n vmware-system-tkg tkr-resolver-cluster-webhook-service-cert && sleep 4

6. Check that the tkr-resolver-cluster-webhook-service-cert secret has been recreated

kubectl get -n vmware-system-tkg secret tkr-resolver-cluster-webhook-service-cert && sleep 4
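
A fixed sleep of 4 seconds may not be long enough for the secret to reappear. One alternative is to poll until the check succeeds. This is a minimal sketch; retry_until is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary values.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Possible use: check for the secret up to 15 times, 4 seconds apart.
#   retry_until 15 4 kubectl get -n vmware-system-tkg secret \
#     tkr-resolver-cluster-webhook-service-cert || echo "secret was not recreated"
```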

7. If required, restart the tkr-resolver-cluster-webhook-manager deployment by scaling it down and back up

kubectl scale -n vmware-system-tkg deployment/tkr-resolver-cluster-webhook-manager --replicas=0
kubectl scale -n vmware-system-tkg deployment/tkr-resolver-cluster-webhook-manager --replicas=1

 

At any time, you can confirm the certificate's state, and that it is managed by cert-manager, by running the following commands:

kubectl get secret tkr-resolver-cluster-webhook-service-cert -n vmware-system-tkg -o yaml | grep cert-manager
kubectl get certificate -n vmware-system-tkg tkr-resolver-cluster-webhook-serving-cert -o yaml | grep -E "cert-manager|notAfter|notBefore|renewalTime"

Additional Information

Related KB 313000: Failed calling CAPI webhook :x509:certificate has expired or is not yet valid
