When creating a new node pool on a TKGS cluster, the new node pool remains stuck in the "Waiting" state.
80u3
The x509 certificate for the tkr-resolver-cluster-webhook in the vmware-system-tkg namespace has expired. The error appears in the vmware-system-tkg-controller-manager log:
grep -i "2024-08.*x509.*tkgs-cluster" vmware-system-tkg_vmware-system-tkg-controller-manager-*/manager/0.log | head -1
2024-08-15T07:46:42.342487103Z stderr F E0815 07:46:42.342398 1 tanzukubernetescluster_controller.go:462] vmware-system-tkg-controller-manager/tanzukubernetescluster-spec-controller/tkgs-cluster-ns/tkgs-cluster "msg"="Error while reconcilling cluster object requeuing for retry" "error"="Internal error occurred: failed calling webhook \"tkr-resolver-cluster-webhook.tanzu.vmware.com\": failed to call webhook: Post \"https://tkr-resolver-cluster-webhook-service.vmware-system-tkg.svc:443/mutate-cluster?timeout=10s\": x509: certificate has expired or is not yet valid: current time 2024-08-15T07:46:42Z is after 2024-07-25T16:43:20Z" "cluster.name"="tkgs-cluster"
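To see when the certificate expired without reading the whole log line, the "is after" timestamp can be pulled out of the x509 error. The following is a minimal sketch, not part of the original procedure; the function name extract_cert_expiry is hypothetical, and it assumes the error text has the same wording as the log entry above.

```shell
# Print the expiry timestamp reported by the x509 error ("... is after <timestamp>").
# Reads log lines on stdin and prints one timestamp per matching line.
extract_cert_expiry() {
  sed -n 's/.*x509: certificate has expired or is not yet valid: current time [^ ]* is after \([0-9TZ:-]*\).*/\1/p'
}

# Usage against the collected logs:
# grep -h "x509" vmware-system-tkg_vmware-system-tkg-controller-manager-*/manager/0.log | extract_cert_expiry | sort -u
```

Lines that do not contain the x509 error produce no output, so the function can be fed an entire log file.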
The certificate is stored in the secret "tkr-resolver-cluster-webhook-service-cert". It is managed by cert-manager and should be rotated automatically.
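The validity dates can also be read directly from the certificate stored in the secret. This is a hedged sketch: the helper name pem_dates is hypothetical, and it assumes openssl is available and that the secret stores the certificate under the tls.crt key, as is usual for TLS secrets issued by cert-manager.

```shell
# Print the subject and validity window of a PEM certificate read from stdin.
pem_dates() {
  openssl x509 -noout -subject -dates
}

# Usage (assumes the certificate is stored under the tls.crt key of the secret):
# kubectl get secret -n vmware-system-tkg tkr-resolver-cluster-webhook-service-cert \
#   -o jsonpath='{.data.tls\.crt}' | base64 -d | pem_dates
```

A notAfter value earlier than the current time confirms the expiry reported in the log.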
1. Check tkr-resolver-cluster-webhook-manager components
kubectl get -n vmware-system-tkg deployment/tkr-resolver-cluster-webhook-manager
kubectl get secret -n vmware-system-tkg tkr-resolver-cluster-webhook-service-cert
2. Confirm that the Certificate is managed by cert-manager and that it has expired
kubectl get secret tkr-resolver-cluster-webhook-service-cert -n vmware-system-tkg -o yaml | grep cert-manager
kubectl get certificate -n vmware-system-tkg tkr-resolver-cluster-webhook-serving-cert -o yaml | egrep "cert-manager|notAfter|notBefore|renewalTime"
3. Check that the cert-manager pods are running. Scale the deployments down and back up to restart them
kubectl get -n vmware-system-cert-manager pods
kubectl -n vmware-system-cert-manager scale deployments.apps --all --replicas=0
kubectl -n vmware-system-cert-manager scale deployments.apps --all --replicas=1
kubectl get -n vmware-system-cert-manager pods
4. Confirm that the Certificate is managed by cert-manager and that it has now rotated
kubectl get secret tkr-resolver-cluster-webhook-service-cert -n vmware-system-tkg -o yaml | grep cert-manager
kubectl get certificate -n vmware-system-tkg tkr-resolver-cluster-webhook-serving-cert -o yaml | egrep "cert-manager|notAfter|notBefore|renewalTime"
5. If restarting cert-manager has not forced a rotation of the certificate:
# Only proceed if the secret is annotated by cert-manager, cert-manager is running, and the certificate has expired
# Delete the tkr-resolver-cluster-webhook-service-cert secret; cert-manager should recreate it
kubectl delete secret -n vmware-system-tkg tkr-resolver-cluster-webhook-service-cert && sleep 4
6. Check that the tkr-resolver-cluster-webhook-service-cert secret has been recreated
kubectl get -n vmware-system-tkg secret tkr-resolver-cluster-webhook-service-cert && sleep 4
7. If required, restart the tkr-resolver-cluster-webhook-manager deployment by scaling it down and back up
kubectl scale -n vmware-system-tkg deployment/tkr-resolver-cluster-webhook-manager --replicas=0
kubectl scale -n vmware-system-tkg deployment/tkr-resolver-cluster-webhook-manager --replicas=1
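Steps 5 and 6 rely on a fixed "sleep 4", which may be too short if cert-manager takes longer to recreate the secret. As an optional alternative, the check in step 6 can be polled until it succeeds. The sketch below is an illustration only; the function name retry and the attempt and delay values in the usage line are assumptions, not values from this article.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" >/dev/null 2>&1 && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage: poll for the recreated secret, up to 30 attempts, 2 seconds apart.
# retry 30 2 kubectl get -n vmware-system-tkg secret tkr-resolver-cluster-webhook-service-cert \
#   && echo "secret recreated" || echo "secret still missing"
```

The command's own output is discarded, so only the final status message is printed.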
At any time, you can confirm the certificate's state and that it is managed by cert-manager by running the following commands:
kubectl get secret tkr-resolver-cluster-webhook-service-cert -n vmware-system-tkg -o yaml | grep cert-manager
kubectl get certificate -n vmware-system-tkg tkr-resolver-cluster-webhook-serving-cert -o yaml | egrep "cert-manager|notAfter|notBefore|renewalTime"
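The notAfter value returned by the second command can be compared with the current time in a script rather than by eye. The following is a minimal sketch with a hypothetical helper named cert_expired; it assumes the Certificate resource reports status.notAfter as a UTC timestamp in the YYYY-MM-DDTHH:MM:SSZ form, matching the timestamps in the log entry in this article.

```shell
# Return 0 (true) if NOT_AFTER is earlier than NOW, otherwise 1.
# Both values are UTC timestamps in the YYYY-MM-DDTHH:MM:SSZ form,
# which sorts chronologically when compared as plain text.
# NOW defaults to the current UTC time.
cert_expired() {
  not_after=$1
  now=${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  [ "$not_after" \< "$now" ]
}

# Usage against the cluster (requires kubectl access):
# not_after=$(kubectl get certificate -n vmware-system-tkg tkr-resolver-cluster-webhook-serving-cert -o jsonpath='{.status.notAfter}')
# if cert_expired "$not_after"; then echo "expired: $not_after"; else echo "valid until: $not_after"; fi
```

With the two timestamps from the log entry, cert_expired "2024-07-25T16:43:20Z" "2024-08-15T07:46:42Z" returns success, meaning the certificate had already expired at that time.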
Related KB 313000: Failed calling CAPI webhook :x509:certificate has expired or is not yet valid