vSphere Supervisor Workload Cluster New Nodes Not Creating or Deleting due to CAPI pods failing to reach the Workload Cluster's VIP

Article ID: 394554


Updated On:

Products

Tanzu Kubernetes Runtime, VMware vSphere 7.0 with Tanzu, vSphere with Tanzu, VMware vSphere Kubernetes Service

Issue/Introduction

In a vSphere Supervisor cluster environment, new Workload Cluster nodes are failing to create or delete.

While connected to the Supervisor cluster context, one or more of the following symptoms are observed (a consolidated check script follows this list):

  • There is no virtual machine (vm), machine, or vspheremachine object created for a new node, or a machine is stuck in a Deleting state:
    • Note: vSphere 7 environments use wcpmachine instead of vspheremachine
      kubectl get vm,machine,vspheremachine -n <workload cluster namespace>
  • Describing the kubeadmcontrolplane (kcp) object shows two or more control plane nodes are Unavailable, with errors similar to the following:
    kubectl get kcp -n <workload cluster namespace>
    
    kubectl describe kcp -n <workload cluster namespace> <kcp name for the workload cluster>
    
    message: 'failed to create cluster accessor: error creating http client and mapper for remote cluster "<workload-cluster-namespace>/<workload cluster>": error creating client for remote cluster "<workload-cluster-namespace>/<workload cluster>": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://<workload cluster VIP>:6443/api/v1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)'
  • The affected workload cluster is not paused and the following command does not return "paused: true":
    kubectl get cluster -n <workload cluster namespace> <cluster name> -o yaml | grep -i paused
  • The latest system pod logs show an error message similar to the one below, indicating that the pod could not reach the workload cluster's VIP on port 6443:
    • The workload cluster VIP can be found using the following command:
      kubectl describe cluster -n <namespace> <cluster name> | grep -iA2 "endpoint"
      
      Control Plane Endpoint:
      Host:  <workload cluster VIP>
      Port:  6443
    • To view the logs of the associated Cluster API pods prior to the VKS Supervisor Service:
      kubectl logs deployment/capi-kubeadm-control-plane-controller-manager -n vmware-system-capw -c manager
      
      kubectl logs deployment/capi-controller-manager -n vmware-system-capw -c manager
    • Once the VKS Supervisor Service is installed, the Cluster API pods run in a different namespace, where svc-tkg-domain-c## will vary between environments:
      kubectl logs deployment/capv-controller-manager -n svc-tkg-domain-c## -c manager
      err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"<workload-cluster-namespace>/<workload cluster>\": error creating client for remote cluster \"<workload-cluster-namespace>/<workload cluster>\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://<workload cluster VIP>:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
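
For convenience, the symptom checks above can be combined into one pass with a short shell sketch. This is only a wrapper around the same kubectl commands shown in the list; the two variables are placeholders that must be set for your environment:

  # Placeholders - set these for your environment
  NS="<workload cluster namespace>"
  CLUSTER="<workload cluster name>"
  
  # Node-backing objects (vSphere 7 environments use wcpmachine instead of vspheremachine)
  kubectl get vm,machine,vspheremachine -n "$NS"
  
  # Control plane availability as reported by the kubeadmcontrolplane object
  kubectl describe kcp -n "$NS"
  
  # Confirm the cluster is not paused (no "paused: true" in the output)
  kubectl get cluster -n "$NS" "$CLUSTER" -o yaml | grep -i paused
  
  # Workload cluster VIP and port
  kubectl describe cluster -n "$NS" "$CLUSTER" | grep -iA2 "endpoint"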

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).

Cause

This is a known issue in which the Cluster API pods become out of sync with the actual state of the workload cluster(s) in the environment.

In this scenario, the Cluster API system pods, which manage and reconcile the nodes of workload clusters, previously failed to reach the workload cluster's VIP and have not attempted to reach it again. Restarting these pods clears the stale state and prompts a fresh connection attempt.

Resolution

Initial Checks

  1. While connected to the Supervisor cluster context, confirm that the endpoints for the affected workload cluster match the expected control plane VM IP addresses (see the comparison sketch after these checks):
    kubectl get ep -n <workload cluster namespace>
    kubectl get vm -o wide -n <workload cluster namespace>
    • If the above IP addresses do not match, reach out to VMware by Broadcom Technical Support referencing this KB article.

  2. While connected via SSH to a Supervisor control plane VM, check that the affected workload cluster's VIP is reachable:
    • Note down the workload cluster's VIP:
      kubectl describe cluster -n <namespace> <cluster name> | grep -iA2 "endpoint"
      
      Control Plane Endpoint:
      Host:  <workload cluster VIP>
      Port:  6443
    • Check that the VIP responds on port 6443:
      curl -vk https://<workload cluster VIP>:6443
  3. While connected via SSH to one of the workload cluster control plane VMs, check whether this control plane VM can reach the workload cluster's VIP:
    curl -vk https://<workload cluster VIP>:6443
  4. From the Supervisor cluster context, confirm whether the machinedeployment (md) for workers and/or the kubeadmcontrolplane (kcp) for control planes inaccurately reports that one or more associated nodes are Unavailable:
    kubectl get md,kcp -n <workload cluster namespace>
  5. While connected to the workload cluster's context, confirm that kubectl commands run without issue and that all system pods and nodes are healthy:
    kubectl get nodes
    
    kubectl get pods -A | egrep -v "Run|Complete"
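
For step 1, the endpoint comparison can be scripted when there are many addresses. The jsonpath query below is a convenience sketch (not part of the original procedure) that assumes standard v1 Endpoints objects; the VM IPs are printed for manual comparison:

  NS="<workload cluster namespace>"
  
  # Endpoint names and their registered IP addresses
  kubectl get ep -n "$NS" -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.subsets[*].addresses[*].ip}{"\n"}{end}'
  
  # Control plane VM IPs to compare against (see the IP column)
  kubectl get vm -o wide -n "$NS"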

If the initial checks above show that the environment is healthy and the workload cluster VIP is reachable from both the Supervisor cluster and the workload cluster, restart the Cluster API system pods as described below.

 

Restart the Cluster API system pods from the Supervisor cluster context.

  • Prior to VKS Supervisor Service:
    kubectl rollout restart deploy -n vmware-system-capw capi-controller-manager
    
    kubectl rollout restart deploy -n vmware-system-capw capi-kubeadm-control-plane-controller-manager
  • After VKS Supervisor Service is installed, where namespace svc-tkg-domain-c## will vary by environment:
    kubectl rollout restart deploy -n svc-tkg-domain-c## capv-controller-manager
    
    kubectl rollout restart deploy -n svc-tkg-domain-c## capi-controller-manager
    
    kubectl rollout restart deploy -n svc-tkg-domain-c## capi-kubeadm-control-plane-controller-manager
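
Optionally, wait for each rollout to complete before verifying the pods. The example below assumes a post-VKS environment; on pre-VKS setups, run the same command against the vmware-system-capw deployments instead:

  kubectl rollout status deploy -n svc-tkg-domain-c## capv-controller-manager
  
  kubectl rollout status deploy -n svc-tkg-domain-c## capi-controller-manager
  
  kubectl rollout status deploy -n svc-tkg-domain-c## capi-kubeadm-control-plane-controller-manager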


Confirm that the system pods were restarted successfully:

  • Prior to VKS Supervisor Service:
    kubectl get pods -n vmware-system-capw


  • After VKS Supervisor Service is installed, where namespace svc-tkg-domain-c## will vary by environment:
    kubectl get pods -n svc-tkg-domain-c## | grep cap
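
Restarted pods should show a STATUS of Running with a recent AGE. As an optional convenience (not part of the original procedure), sorting by creation time makes the freshly restarted pods easy to spot:

  kubectl get pods -n svc-tkg-domain-c## --sort-by=.metadata.creationTimestamp | grep cap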

 

Check that the management objects (machinedeployment and kubeadmcontrolplane) now accurately report the status of the workload cluster's nodes:

kubectl get md,kcp -n <workload cluster namespace>

kubectl get vm,machine -n <workload cluster namespace>
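
Reconciliation may take a few minutes after the restart. Optionally, watch the same objects until the Unavailable counts clear and any pending machines progress:

  kubectl get md,kcp -n <workload cluster namespace> -w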