While deploying a cluster with additional trusted CA certificates, node deployment fails at the first node, no additional nodes are deployed, and a message similar to the following is reported:
message: 'unable to retrieve kube-proxy daemonset from the guest cluster: failed
to get API group resources: unable to retrieve the complete list of server APIs:
apps/v1: Get "https://##.###.###.#:6443/apis/apps/v1?timeout=10s": dial tcp
##.###.###.#:6443: connect: connection refused'
The CAPI controller manager logs show messages similar to the following:
YYYY-MM-DDT18:46:58.222879194Z stderr F E0604 18:46:58.220446 1 controller.go:324] "Reconciler error" err="failed to get client: failed to create cluster accessor: error creating http client and mapper for remote cluster \"namespace/guest-cluster-name\": error creating client for remote cluster \"namespace/guest-cluster-name\": cluster is not reachable: Get \"https://##.###.###.#:6443/?timeout=5s\": dial tcp ##.###.###.#:6443: connect: connection refused" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="namespace/guest-cluster-name-659n9-gt28k" namespace="namespace" name="guest-cluster-name-659n9-gt28k" reconcileID="22d122f4-####-####-####-b8205723f296"
Running crictl ps on the node shows no running containers.
On the node, /var/log/cloud-init-output.log contains entries similar to the following:
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] Cloud-init v. 24.4 running 'modules:final' at Mon, 09 Jun 2025 05:40:48 +0000. Up 29.05 seconds.
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] + umount /var/lib/etcd
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] ++ ls -A /var/lib/etcd
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] + '[' '' ']'
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] + mount -t ext4 /dev/sdb1 /var/lib/etcd
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] + rm -rf /var/lib/etcd/lost+found
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] ++ ls -A /var/tmp/_var_lib_etcd
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] ls: cannot access '/var/tmp/_var_lib_etcd': No such file or directory
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] + '[' '' ']'
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] + set -xe
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] + cloud-init single --name write-files --frequency always
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] [YYYY-MM-DD 05:40:48] Cloud-init v. 24.4 running 'single' at Mon, 09 Jun 2025 05:40:48 +0000. Up 29.40 seconds.
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] [YYYY-MM-DD 05:40:48] YYYY-MM-DD 05:40:48,734 - log_util.py[WARNING]: Running module write-files (<module 'cloudinit.config.cc_write_files' from '/usr/lib/python3.11/site-packages/cloudinit/config/cc_write_files.py'>) failed
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] [YYYY-MM-DD 05:40:48] YYYY-MM-DD 05:40:48,736 - main.py[WARNING]: Ran write-files but it failed!
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] YYYY-MM-DD 05:40:48,780 - cc_scripts_user.py[WARNING]: Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] YYYY-MM-DD 05:40:48,780 - log_util.py[WARNING]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.11/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
YYYY-MM-DDT05:40:48.861557+00:00 localhost cloud-init[922]: [YYYY-MM-DD 05:40:48] Cloud-init v. 24.4 finished at Mon, 09 Jun 2025 05:40:48 +0000. Datasource DataSourceVMware [seed=guestinfo]. Up 29.58 seconds
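To pick out which cloud-init modules failed in a long cloud-init-output.log, a small filter over the log lines can help. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes the warning format shown in the entries above ("Running module <name> (...) failed"). For the excerpt above it would report write-files and scripts_user.

```python
import re

# Matches cloud-init warnings of the form
# "Running module <name> (<module '...'>) failed" anywhere in a line.
_FAILED_MODULE = re.compile(r"Running module ([\w-]+) \(.*\) failed")


def failed_cloud_init_modules(lines):
    """Return names of cloud-init modules reported as failed.

    Names are returned in the order they first appear, without duplicates.
    `lines` can be any iterable of strings, such as an open log file.
    """
    failed = []
    for line in lines:
        match = _FAILED_MODULE.search(line)
        if match and match.group(1) not in failed:
            failed.append(match.group(1))
    return failed
```

For example, passing the file object from open("/var/log/cloud-init-output.log", encoding="utf-8", errors="replace") returns the failed module names. A failed write-files module is consistent with a problem in the files cloud-init was asked to write, such as the trusted CA certificates.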
Subsequent cluster deployments in the same vSphere Namespace may also fail, even when no additional CA certificates are defined for them.
When new clusters are deployed with expired, single-encoded CA certificates, the deployment fails without any obvious mention of certificate errors.
VKS 3.3.x
This issue is caused by expired, single-encoded additional CA certificates in the secret used by the cluster.
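Because the failure gives no obvious certificate error, it can help to check how many base64 layers wrap each value in the secret. The following Python sketch assumes you have copied a value from the secret's data map (for example from the output of kubectl get secret with -o yaml); a correctly double-encoded value should report 2, a single-encoded one 1. The function and constant names are illustrative only. The sketch does not check expiry; the decoded PEM certificate can be inspected separately, for example with openssl x509 -noout -enddate.

```python
import base64
import binascii

PEM_MARKER = b"-----BEGIN CERTIFICATE-----"


def _b64decode(data):
    """Strictly decode base64 (ignoring line wrapping); return None if invalid."""
    compact = b"".join(data.split())  # drop newlines from wrapped encodings
    try:
        return base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError):
        return None


def encoding_depth(value):
    """Return how many base64 layers wrap a PEM certificate, or -1 if unknown.

    `value` is the string exactly as it appears in the Secret's data map.
    0 means raw PEM, 1 means single-encoded, 2 means double-encoded.
    """
    data = value.strip().encode("ascii")
    for depth in range(3):
        if data.lstrip().startswith(PEM_MARKER):
            return depth
        data = _b64decode(data)
        if data is None:
            return -1  # not valid base64 at this layer
    return -1
```

A result of 1 for a value in the secret matches the single-encoded case described above.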
Create a new vSphere Namespace so that no secrets containing expired or malformed certificates from other clusters are present.
Deploy the cluster using additional trusted CA certificates that have not expired and that are properly double base64-encoded, as described in the documentation example "v1beta1 Example: Cluster with Additional Trusted CA Certificates for SSL/TLS".
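As a minimal sketch of the double encoding, assuming each certificate is available as PEM text: the certificate is base64-encoded once, and that result is base64-encoded again to form the value placed in the secret's data map. The key name additional-ca-1 and the helper names are hypothetical; follow the documentation example for the actual key names and for how the Cluster spec references the secret. Confirm that each certificate has not expired before encoding it.

```python
import base64


def double_encode_pem(pem_text):
    """Base64-encode a PEM certificate twice, returning one unwrapped line."""
    once = base64.b64encode(pem_text.encode("ascii"))  # first layer
    twice = base64.b64encode(once)  # second layer, used as the data value
    return twice.decode("ascii")


def build_trusted_ca_data(certs):
    """Build a Secret data map from {key: pem_text}.

    Key names such as "additional-ca-1" are illustrative only.
    """
    return {key: double_encode_pem(pem) for key, pem in certs.items()}
```

On a system with GNU coreutils, a comparable value can be produced with base64 -w0 cert.pem | base64 -w0, where -w0 disables line wrapping.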