In a vSphere Supervisor cluster environment, new Workload Cluster nodes are failing to be created or deleted.
While connected to the Supervisor cluster context, one or more of the following symptoms are observed:
kubectl get vm,machine,vspheremachine -n <workload cluster namespace>
kubectl get kcp -n <workload cluster namespace>
kubectl describe kcp -n <workload cluster namespace> <kcp name for the workload cluster>
message: 'failed to create cluster accessor: error creating http client and mapper for remote cluster "<workload-cluster-namespace>/<workload cluster>": error creating client for remote cluster "<workload-cluster-namespace>/<workload cluster>": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://<workload cluster VIP>:6443/api/v1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)'
kubectl get cluster -n <workload cluster namespace> <cluster name> -o yaml | grep -i paused
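If the Cluster object has been paused (for example, by TMC during certain lifecycle operations), the grep will return a line similar to the illustrative output below, and Cluster API will not reconcile the cluster until it is unpaused. No output means the cluster is not paused.
  paused: true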
kubectl describe cluster -n <namespace> <cluster name> | grep -iA2 "endpoint"
Control Plane Endpoint:
Host: <workload cluster VIP>
Port: 6443
kubectl logs deployment/capi-kubeadm-control-plane-controller-manager -n vmware-system-capw -c manager
kubectl logs deployment/capi-controller-manager -n vmware-system-capw -c manager
kubectl logs deployment/capv-controller-manager -n svc-tkg-domain-c## -c manager
err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"<workload-cluster-namespace>/<workload cluster>\": error creating client for remote cluster \"<workload-cluster-namespace>/<workload cluster>\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://<workload cluster VIP>:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
vSphere 7.0 with Tanzu
vSphere 8.0 with Tanzu
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
This is caused by a known issue with the Cluster API pods in which they become out of sync with the actual state of the workload cluster(s) in the environment.
In this scenario, the Cluster API system pods, which manage and reconcile the nodes of workload clusters, previously failed to reach the workload cluster's VIP and have not attempted to reach it again.
Before restarting anything, run the following initial checks to confirm that the environment is healthy and that the workload cluster VIP is reachable. From the Supervisor cluster context:
kubectl get ep -n <workload cluster namespace>
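In a healthy environment, the output should include an Endpoints object for the cluster's control plane service (typically named <cluster name>-control-plane-service) listing the control plane VM IPs on port 6443. The service and its endpoints can be viewed together with:
kubectl get svc,ep -n <workload cluster namespace>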
kubectl get vm -o wide -n <workload cluster namespace>
kubectl describe cluster -n <namespace> <cluster name> | grep -iA2 "endpoint"
Control Plane Endpoint:
Host: <workload cluster VIP>
Port: 6443
curl -vk <workload cluster control-plane-service IP address>:6443
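If the VIP is reachable, the curl command should receive a response from the API server rather than timing out. A variation with an explicit https:// scheme (a minor adjustment to the command above) shows the TLS handshake and typically an HTTP 401 or 403 response for the unauthenticated request:
curl -vk https://<workload cluster control-plane-service IP address>:6443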
kubectl get md,kcp -n <workload cluster namespace>
From the workload cluster context, confirm that the nodes are Ready and that no pods are in a failed state:
kubectl get nodes
kubectl get pods -A | egrep -v "Run|Complete"
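In a healthy cluster, the egrep filter above returns only the header line. A similar check that filters on pod phase (assuming a reasonably recent kubectl; note that it will not surface crash-looping pods whose phase is still Running) is:
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded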
If the initial checks above show that the environment is healthy and the workload cluster VIP is reachable from both the Supervisor cluster and the workload cluster, restart the Cluster API system pods from the Supervisor cluster context:
kubectl rollout restart deploy -n vmware-system-capw capi-controller-manager
kubectl rollout restart deploy -n vmware-system-capw capi-kubeadm-control-plane-controller-manager
kubectl rollout restart deploy -n svc-tkg-domain-c## capv-controller-manager
kubectl rollout restart deploy -n svc-tkg-domain-c## capi-controller-manager
kubectl rollout restart deploy -n svc-tkg-domain-c## capi-kubeadm-control-plane-controller-manager
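The svc-tkg-domain-c## namespace name varies per environment (the suffix matches the vSphere cluster's domain ID). Where present, it can be located with:
kubectl get ns | grep svc-tkg-domain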
Confirm that the system pods were restarted successfully:
kubectl get pods -n vmware-system-capw
kubectl get pods -n svc-tkg-domain-c## | grep cap
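Optionally, wait for each rollout to finish before proceeding, for example:
kubectl rollout status deploy/capi-controller-manager -n vmware-system-capw
kubectl rollout status deploy/capi-kubeadm-control-plane-controller-manager -n vmware-system-capw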
Check that the management objects (MachineDeployment and KubeadmControlPlane) now accurately report the status of the workload cluster's nodes:
kubectl get md,kcp -n <workload cluster namespace>
kubectl get vm,machine -n <workload cluster namespace>
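To watch the controllers reconcile previously stuck Machines in real time, the watch flag can be added, for example:
kubectl get machine -n <workload cluster namespace> -w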