vSphere Supervisor Workload Cluster Error: clusterclass is not successfully reconciled: status of the VariablesReconciled condition on ClusterClass must be "True"

Products

VMware vSphere Kubernetes Service

Issue/Introduction

VKS cluster operations (for example, creating, scaling, or upgrading) do not start.

Example: Horizontally scaling a VKS cluster by changing the worker node count does not start.

- The VKS cluster MachineDeployment (MD) object remains in the "Running" phase instead of moving to "ScalingUp" or "ScalingDown", and the replica count does not change.

# kubectl get md -n <namespace>

NAME                                                          CLUSTER          REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE    VERSION
machinedeployment.cluster.x-k8s.io/clusterName-worker-l9crz   dmz-prod-cls01   5          5       5         0             Running   148d   v1.33.1+vmware.1-fips

Describing the Cluster object for the VKS cluster shows the message "ClusterClass is not successfully reconciled: status of VariablesReconciled condition on ClusterClass must be "True"" on the TopologyReconciled condition:

# kubectl describe cluster <clusterName> -n <namespace>

Message:
Observed Generation: 12
Reason: Available
Status: True
Type: WorkersAvailable
Last Transition Time: 2025-12-06T05:33:38Z
Message: ClusterClass is not successfully reconciled: status of VariablesReconciled condition on ClusterClass must be "True"
Observed Generation: 12
Reason: ReconcileFailed
Status: False
Type: TopologyReconciled
Last Transition Time: 2025-10-16T10:53:47Z
Message:
Observed Generation: 12
Reason: NotRollingOut
Status: False
Type: RollingOut
Last Transition Time: 2025-07-24T12:55:08Z

The following command shows that VariableDiscovery for the ClusterClass is failing because the connection to the runtime-extension-webhook-service.svc-tkg-domain-c### service fails with the error "certificate signed by unknown authority":

# kubectl get cc -n svc-tkg-domain-c### builtin-generic-v3.3.0 -o jsonpath='{.status.conditions}' | jq
[
{
"lastTransitionTime": "2025-08-18T03:43:44Z",
"status": "True",
"type": "RefVersionsUpToDate"
},
{
"lastTransitionTime": "2026-01-24T09:46:21Z",
"message": "VariableDiscovery failed: failed to call DiscoverVariables for patch default: failed to call extension handler \"discover-variables.runtime-extension\": http call failed: Post \"https://runtime-extension-webhook-service.svc-tkg-domain-c8.svc:443/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovervariables/discover-variables?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"x509: invalid signature: parent certificate cannot sign this kind of certificate\" while trying to verify candidate authority certificate \"serial:340174157205981460336478744338522218632\")",
"reason": "VariableDiscoveryFailed",
"severity": "Error",
"status": "False",
"type": "VariablesReconciled"
}
]
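
To check only the failing condition instead of reading the full JSON, a kubectl JSONPath filter can select the VariablesReconciled entry. The following is a minimal sketch, assuming the same namespace and ClusterClass name used in the command above; the function names cc_variables_status and cc_variables_ok are hypothetical and not part of any VMware tooling.

```shell
# Hypothetical helper: print the status ("True" or "False") of the
# VariablesReconciled condition for a ClusterClass.
# Usage: cc_variables_status <namespace> <clusterclass-name>
cc_variables_status() {
  kubectl get cc -n "$1" "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="VariablesReconciled")].status}'
}

# Hypothetical helper: succeed only when the condition is "True".
# Usage: cc_variables_ok <namespace> <clusterclass-name>
cc_variables_ok() {
  [ "$(cc_variables_status "$1" "$2")" = "True" ]
}

# Example (placeholders as used in this article):
#   cc_variables_ok svc-tkg-domain-c### builtin-generic-v3.3.0 || echo "VariableDiscovery is failing"
```

A non-zero exit status from cc_variables_ok corresponds to the "status": "False" entry shown in the output above.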

The runtime-extension-controller-manager-########## pod logs show the TLS error "failed to verify certificate: x509: certificate signed by unknown authority":

# kubectl logs -n svc-tkg-domain-## runtime-extension-controller-manager-###########

I1219 23:02:20.211585 1 ???:1] "http: TLS handshake error from 10.#.#.12:58377: tls: failed to verify certificate: x509: certificate signed by unknown authority"
I1219 23:02:53.945972 1 ???:1] "http: TLS handshake error from 10.#.#.12:2134: tls: failed to verify certificate: x509: certificate signed by unknown authority"
I1219 23:03:03.029886 1 ???:1] "http: TLS handshake error from 10.#.#.12:25239: tls: failed to verify certificate: x509: certificate signed by unknown authority"

The capi-controller-manager pod logs show that the connection to the runtime-extension-webhook-service.svc-tkg-domain-c8 service fails with the TLS error "unknown certificate authority":

nHandler="discover-variables.runtime-extension" hook="DiscoverVariables"
E1219 19:33:36.604631 1 controller.go:347] "Reconciler error" err="failed to discover variables for ClusterClass builtin-generic-v3.1.0: failed to call DiscoverVariables for patch default: failed to call extension handler \"discover-variables.runtime-extension\": http call failed: Post \"https://runtime-extension-webhook-service.svc-tkg-domain-c8.svc:443/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovervariables/discover-variables?timeout=10s\": remote error: tls: unknown certificate authority" controller="clusterclass" controllerGroup="cluster.x-k8s.io" controllerKind="ClusterClass" ClusterClass="vmware-system-monitoring/builtin-generic-v3.1.0" namespace="vmware-system-monitoring" name="builtin-generic-v3.1.0" reconcileID="7f03f0f2-3bad-43c6-a7b8-a86e1edbf271"
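
Because the controller logs are verbose, filtering for the two TLS messages quoted in this article can make the symptom easier to confirm. The sketch below is only an illustration: the function name tls_ca_errors is hypothetical, and the example invocation assumes that the capi-controller-manager deployment runs in the svc-tkg-domain namespace, as the workaround section suggests.

```shell
# Hypothetical filter: read log text on stdin and keep only the lines that
# contain either of the TLS error messages quoted in this article.
tls_ca_errors() {
  grep -E 'unknown certificate authority|certificate signed by unknown authority'
}

# Example (namespace is an assumption, see the workaround section):
#   kubectl logs -n <svc-tkg-domain namespace> deploy/capi-controller-manager | tls_ca_errors
```

If the filter prints nothing, the logs inspected do not contain either message and a different cause should be considered.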

Environment

VMware vSphere Kubernetes Service

VKS supervisor service 3.4.1 and higher

Cause

Certificate rotation is not handled correctly for the runtime-extension-controller system pod. After a rotation, the certificate in the runtime-extension-webhook-service-cert secret no longer matches the certificate served by the runtime-extension-controller pod.
The certificate in the runtime-extension-webhook-service-cert secret shows newer "notBefore" and "notAfter" dates and a different serial number than the certificate served by the runtime-extension pod.
- The runtime-extension-webhook-service-cert secret certificate:
  
  # kubectl get secret/runtime-extension-webhook-service-cert -n svc-tkg-domain-## -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates -serial
  
  Ex:
  
  # kubectl get secret/runtime-extension-webhook-service-cert -n svc-tkg-domain-c8 -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates -serial
  
  notBefore=Jan 31 07:01:27 2026 GMT
  notAfter=May 1 07:01:27 2026 GMT
  serial=14E01E25E2C695056BB1B0D86C271B96
- The runtime-extension Pod certificate:
  
  # kubectl get node $(kubectl get pod <runtime-extension-controller-POD-Name> -n svc-tkg-domain-c8 -o jsonpath='{.spec.nodeName}') -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' | xargs -I {} sh -c "echo | openssl s_client -connect {}:9442 2>/dev/null | openssl x509 -noout -dates -serial"
  
  Ex:
  
  # kubectl get node $(kubectl get pod runtime-extension-controller-manager-6cf4d59849-j2gww -n svc-tkg-domain-c8 -o jsonpath='{.spec.nodeName}') -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' | xargs -I {} sh -c "echo | openssl s_client -connect {}:9442 2>/dev/null | openssl x509 -noout -dates -serial"
  
  notBefore=Jan 2 23:32:33 2026 GMT
  notAfter=Apr 2 23:32:33 2026 GMT
  serial=AB1B66F11FA96D4050F78630996A9E4F

Note:

- Alternatively, run the following command directly against the IP address of the master node where the runtime-extension-controller pod is running, using port 9442:

# echo | openssl s_client -connect <master-node-IP>:9442 2>/dev/null | openssl x509 -noout -dates -serial
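
The comparison of the two certificates can be reduced to comparing their serial numbers. The following is a minimal sketch that works on the text printed by the openssl commands above; the function names cert_serial and compare_cert_serials are hypothetical helpers introduced only for illustration.

```shell
# Hypothetical helper: read `openssl x509 -noout -dates -serial` output on
# stdin and print only the serial number.
cert_serial() {
  sed -n 's/^serial=//p'
}

# Hypothetical helper: compare the serial numbers found in two openssl outputs.
# $1 = output for the secret certificate, $2 = output for the pod certificate.
# Returns non-zero when the serials differ or the secret serial is missing.
compare_cert_serials() {
  secret_serial=$(printf '%s\n' "$1" | cert_serial)
  pod_serial=$(printf '%s\n' "$2" | cert_serial)
  if [ -n "$secret_serial" ] && [ "$secret_serial" = "$pod_serial" ]; then
    echo "match: $secret_serial"
  else
    echo "MISMATCH: secret=$secret_serial pod=$pod_serial"
    return 1
  fi
}
```

With the example outputs shown in this section, the two serial numbers differ, so compare_cert_serials reports a mismatch, which is the condition described under Cause.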

Resolution

This issue will be resolved in an upcoming release of VKS supervisor service.

Until that release is available, use the workaround below.

Workaround

Restart the system pods affected by the CA mismatch to correct the certificate issue.

Connect to the Supervisor cluster context.

Find the runtime-extension-controller-manager deployment and its namespace, then restart it:

# kubectl get deploy -A | grep runtime
# kubectl rollout restart deploy runtime-extension-controller-manager -n <svc-tkg-domain namespace>

Check that the runtime-extension-controller pod restarted successfully:

# kubectl get pods -n <svc-tkg-domain namespace> | grep runtime

Restart the capi-controller-manager pod:

# kubectl rollout restart deploy -n <svc-tkg-domain namespace> capi-controller-manager

Check that the capi-controller-manager pods restarted successfully:

# kubectl get pods -n <svc-tkg-domain namespace> | grep capi-controller-manager

Confirm that the cluster no longer shows the ClusterClass error message after the above system pods have restarted:

# kubectl describe cluster -n <cluster namespace> <cluster name>
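
The restart steps above can be sketched as a small script. This is only an illustration under stated assumptions: the function names restart_and_wait and apply_workaround are hypothetical, the 180s default timeout is an arbitrary choice, and kubectl rollout status is used here in place of the manual pod check to wait for each restart to complete.

```shell
# Hypothetical helper: restart one deployment and block until the rollout
# finishes or the timeout expires.
# Usage: restart_and_wait <namespace> <deployment> [timeout]
restart_and_wait() {
  kubectl rollout restart deploy "$2" -n "$1" &&
    kubectl rollout status deploy "$2" -n "$1" --timeout="${3:-180s}"
}

# Hypothetical helper: restart both system deployments in the order given in
# this article, stopping at the first failure.
# Usage: apply_workaround <svc-tkg-domain namespace>
apply_workaround() {
  restart_and_wait "$1" runtime-extension-controller-manager &&
    restart_and_wait "$1" capi-controller-manager
}

# Example:
#   apply_workaround <svc-tkg-domain namespace>
```

After the script completes, the cluster conditions still need to be checked with the kubectl describe command above.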

Notes:

This certificate issue is expected to recur every 60 days; each occurrence requires restarting the above pods.
A fix is in progress and will be available in an upcoming release of VKS supervisor service.
If the above steps do not correct the issue, contact VMware by Broadcom Technical Support and upload a Workload Management Supervisor Cluster Support Bundle. See Gathering Logs for vSphere with Tanzu.