Example: Horizontally scale a VKS cluster worker nodes count by changing the number of nodes will not start.
# kubectl get md <clusterName> -n <namespace>NAME CLUSTER REPLICAS READY UPDATED UNAVAILABLE PHASE AGE VERSIONmachinedeployment.cluster.x-k8s.io/clusterName-worker-l9crz clusterName 5 5 5 0 Running 148d v1.##.1+vmware.1-fips
# kubectl describe cluster <clusterName> -n <namespace> Message: Observed Generation: 12 Reason: Available Status: True Type: WorkersAvailable Last Transition Time: [timestamp] Message: ClusterClass is not successfully reconciled: status of VariablesReconciled condition on ClusterClass must be "True" Observed Generation: 12 Reason: ReconcileFailed Status: False Type: TopologyReconciled Last Transition Time: [timestamp] Message: Observed Generation: 12 Reason: NotRollingOut Status: False Type: RollingOut Last Transition Time: [timestamp]
# kubectl get cc -n svc-tkg-domain-c### builtin-generic-v3.3.0 -o jsonpath='{.status.conditions}' | jq[ { "lastTransitionTime": "[timestamp]", "status": "True", "type": "RefVersionsUpToDate" }, { "lastTransitionTime": "[timestamp]", "message": "VariableDiscovery failed: failed to call DiscoverVariables for patch default: failed to call extension handler \"discover-variables.runtime-extension\": http call failed: Post \"https://runtime-extension-webhook-service.svc-tkg-domain-c##.svc:443/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovervariables/discover-variables?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"x509: invalid signature: parent certificate cannot sign this kind of certificate\" while trying to verify candidate authority certificate \"serial:340174157205######8522218632\")", "reason": "VariableDiscoveryFailed", "severity": "Error", "status": "False", "type": "VariablesReconciled" }]
The runtime-extension-controller-manager-########## pod logs was showing TLS error "failed to verify certificate: x509: certificate signed by unknown authority"# kubectl logs -n svc-tkg-domain-## runtime-extension-controller-manager-###########
[timestamp] 1 ???:1] "http: TLS handshake error from 10.#.#.12:58377: tls: failed to verify certificate: x509: certificate signed by unknown authority"[timestamp] 1 ???:1] "http: TLS handshake error from 10.#.#.12:2134: tls: failed to verify certificate: x509: certificate signed by unknown authority"[timestamp] 1 ???:1] "http: TLS handshake error from 10.#.#.12:25239: tls: failed to verify certificate: x509: certificate signed by unknown authority"
nHandler="discover-variables.runtime-extension" hook="DiscoverVariables"[timestamp] 1 controller.go:347] "Reconciler error" err="failed to discover variables for ClusterClass builtin-generic-v3.1.0: failed to call DiscoverVariables for patch default: failed to call extension handler \"discover-variables.runtime-extension\": http call failed: Post \"https://runtime-extension-webhook-service.svc-tkg-domain-c##.svc:443/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovervariables/discover-variables?timeout=10s\": remote error: tls: unknown certificate authority" controller="clusterclass" controllerGroup="cluster.x-k8s.io" controllerKind="ClusterClass" ClusterClass="vmware-system-monitoring/builtin-generic-v3.1.0" namespace="vmware-system-monitoring" name="builtin-generic-v3.1.0" reconcileID="7f03f0f2-3bad-####-a7b8-a86e1edbf271"
VMware vSphere Kubernetes Service
VKS supervisor service 3.4.1 and higher
notAfter' dates and a different serial number than the one assigned to the runtime-extension pod.# kubectl get secret/runtime-extension-webhook-service-cert -n svc-tkg-domain-## -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates -serialEx:# kubectl get secret/runtime-extension-webhook-service-cert -n svc-tkg-domain-c## -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates -serialnotBefore=Jan 31 07:01:27 #### GMTnotAfter=May 1 07:01:27 #### GMTserial=14E01E25####5056BB1B0D86C271B96# kubectl get node $(kubectl get pod <runtime-extension-controller-POD-Name> -n svc-tkg-domain-c8 -o jsonpath='{.spec.nodeName}') -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' | xargs -I {} sh -c "echo | openssl s_client -connect {}:9442 2>/dev/null | openssl x509 -noout -dates -serial"# kubectl get node $(kubectl get pod runtime-extension-controller-manager-6c###849-j2gww -n svc-tkg-domain-c8 -o jsonpath='{.spec.nodeName}') -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' | xargs -I {} sh -c "echo | openssl s_client -connect {}:9442 2>/dev/null | openssl x509 -noout -dates -serial"
notBefore=Jan 2 23:32:33 #### GMTnotAfter=Apr 2 23:32:33 #### GMTserial=AB1B66F11F####78630996A9E4F
echo | openssl s_client -connect <master-node-IP>:9442 2>/dev/null | openssl x509 -noout -dates -serialResolution
This issue is resolved in vSphere Kubernetes Service 3.6.0+v1.35 . Refer VMware vSphere Kubernetes Service Release Notes
Workaround
The system pod with the CA issue will need to be restarted to correct the certificate issue.
# kubectl get deploy -A | grep runtime# kubectl rollout restart deploy runtime-extension-controller-manager -n <svc-tkg-domain namespace>
# kubectl get pods -n <svc-tkg-domain namespace> | grep runtime# kubectl rollout restart deploy -n <svc-tkg-domain namespace> capi-controller-manager# kubectl get pods -n <svc-tkg-domain namespace> | grep capi-controller-manager# kubectl describe cluster -n <cluster namespace> <cluster name>
Notes: