When attempting to create a new guest cluster, you will see the following warning:
Warning: Cluster refers to ClusterClass vmware-system-vks-public/builtin-generic-v3.4.0, but this ClusterClass hasn't been successfully reconciled. Cluster topology has not been fully validated. Please take a look at the ClusterClass status
cluster.cluster.x-k8s.io/cluster-name created
At the same time, you will observe that the capi-controller-manager pod on the Supervisor Cluster is crash-looping, with readiness probe failures similar to the following:
kubectl describe pod -n <capi-namespace> capi-controller-manager-xxx-xxx
Warning  Unhealthy  5m22s                 kubelet  Readiness probe failed: Get "http://198.51.100.1:9440/readyz": dial tcp 198.51.100.1:9440: connect: connection refused
Normal   Pulled     2m34s (x5 over 13m)   kubelet  Container image "projects.packages.broadcom.com/vsphere/iaas/tkg-service/3.4.1/tkg-service@sha256:###########" already present on machine
Normal   Created    2m34s (x5 over 13m)   kubelet  Created container: manager
Normal   Started    2m34s (x5 over 13m)   kubelet  Started container manager
Warning  BackOff    44s (x11 over 8m21s)  kubelet  Back-off restarting failed container manager in pod capi-controller-manager-##########-#####_#############(#####-#####-.33333)
Additionally, logs from runtime-extension-controller-manager on the Supervisor show TLS handshake failures, indicating a certificate problem:
kubectl logs -n <runtime-extension-namespace> runtime-extension-controller-manager-xxx-xxx
2025-01-01T10:00:00.000000000Z stderr F I1000 10:00:00.0000000 1 ???:1] "http: TLS handshake error from 198.51.100.1:50478: tls: failed to verify certificate: x509: certificate signed by unknown authority"
2025-01-01T10:00:00.000000000Z stderr F I1000 10:00:00.0000000 1 ???:1] "http: TLS handshake error from 198.51.100.1:50480: tls: failed to verify certificate: x509: certificate signed by unknown authority"
The certificate used by the runtime-extension-controller-manager had expired. Although a new certificate was generated, the running pod did not reload it. As a result, the controller continued to serve the expired certificate until it was restarted.
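You can verify the expiry date of the controller's serving certificate with openssl. In a real environment you would decode tls.crt from the webhook serving Secret (the Secret name varies by release and is not shown in this article); the sketch below generates a throwaway self-signed certificate as a stand-in so the check itself can be demonstrated end to end:

```shell
# In practice, extract the real certificate first (Secret name is a placeholder):
#   kubectl get secret -n <runtime-extension-namespace> <serving-cert-secret> \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/re-cert.pem
# Stand-in: generate a throwaway self-signed certificate valid for 1 day
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/re-key.pem \
  -out /tmp/re-cert.pem -days 1 -subj "/CN=demo" 2>/dev/null
# Print the expiry date of the certificate
openssl x509 -in /tmp/re-cert.pem -noout -enddate
# -checkend 0 exits nonzero if the certificate has already expired
if openssl x509 -in /tmp/re-cert.pem -noout -checkend 0 > /dev/null; then
  echo "certificate still valid"
else
  echo "certificate EXPIRED"
fi
```

If the check reports an expired certificate while the pod is still serving it, a restart is required, as described below.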
Restarting the runtime-extension-controller-manager pod forces it to load the new certificate, allowing cluster creation to function again.
kubectl get deployment -A | grep runtime
kubectl rollout restart deployment -n <namespace> <runtime-extension-controller-manager-deployment-name>
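To confirm the fix, watch the rollout and check the restarted controller's log for TLS handshake errors. The kubectl commands in the comments below require cluster access, so the grep check is demonstrated against a hypothetical captured log line (the line content is illustrative, not actual controller output):

```shell
# In a real environment (names are placeholders):
#   kubectl rollout status deployment -n <namespace> <runtime-extension-controller-manager-deployment-name>
#   kubectl logs -n <namespace> deploy/<runtime-extension-controller-manager-deployment-name> \
#     | grep "TLS handshake error"
# Hypothetical healthy log line after the restart, used as a stand-in:
log_line='I1000 10:05:00.000000 1 serving webhook server'
if printf '%s\n' "$log_line" | grep -q "TLS handshake error"; then
  echo "stale certificate still in use"
else
  echo "no TLS handshake errors"
fi
```

An empty grep result (no TLS handshake errors) after the rollout indicates the controller has picked up the renewed certificate.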
After the deployment restarts, the controller loads the updated certificate. As a result, CAPI stops crashing and cluster creation should succeed.