Troubleshooting vSphere Kubernetes Cluster VIP Connection Issues

Article ID: 388260


Products

VMware vSphere with Tanzu, VMware vSphere 7.0 with Tanzu, vSphere with Tanzu, Tanzu Kubernetes Runtime

Issue/Introduction

This KB article is intended for troubleshooting scenarios in which a vSphere Kubernetes Cluster's VIP cannot be reached from within the cluster's control plane nodes.

vSphere Kubernetes Clusters are also known as Guest Clusters.

 

While connected to the Supervisor cluster context, one or more of the following symptoms may be present:

  • Describing the affected cluster shows a similar error message to the following:
    • failed to create etcd client: could not establish a connection to the etcd leader: [could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded, failed to connect to etcd node]

  • The kubeadm control plane (kcp) object, which reconciles control plane nodes, shows that all control plane nodes are unavailable.

  • Describing the kubeadm control plane (kcp) object shows similar error messages to the below:
    • failed to create etcd client: could not establish a connection to the etcd leader: [could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded, failed to connect to etcd node]
    •  Reason: RemediationFailed @ /

  • The control-plane-service External IP address for the affected cluster matches the expected VIP:
    • kubectl get service -n <cluster namespace> | grep "control-plane"

  • The endpoints (ep) for the affected cluster match the IP addresses of each control plane node in the cluster:
    • kubectl get ep -n <cluster namespace>

  • The affected cluster's control plane machines are Running and their VMs are poweredOn with IP addresses assigned:
    • kubectl get machine,vm -n <cluster namespace>

  • The environment's load balancer pod is Running:
    • kubectl get pods -A | egrep "ncp|ako|lbapi"
    • NSX-T uses the NCP pod. NSX-ALB/AVI uses the AKO pod. HAProxy uses a lbapi pod.


  • Certificates are not expired on the Supervisor cluster.
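
For reference, a minimal sketch of checking certificate expiration on a Supervisor control plane VM, assuming the standard kubeadm certificate layout under /etc/kubernetes/pki (run as root; on older kubeadm versions the first command is kubeadm alpha certs check-expiration):

  # List expiration dates of all kubeadm-managed certificates
  kubeadm certs check-expiration

  # Alternatively, inspect an individual certificate, for example the kube-apiserver certificate
  openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt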

 

While connected to the affected vSphere Kubernetes cluster's context, the following symptoms are present:

  • All kubectl commands fail, time out, or return the following error message (see the note after this list):
    • The connection to the server localhost:8080 was refused - did you specify the right host or port?

  • Certificates are not expired in the vSphere Kubernetes cluster.
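
The localhost:8080 error usually means kubectl could not load a kubeconfig and fell back to its default address, rather than proving the VIP itself is down. Below is a minimal sketch of confirming which endpoint kubectl targets on a control plane node, assuming the standard kubeadm admin kubeconfig path (run as root):

  # Use the cluster's admin kubeconfig instead of kubectl's localhost default
  export KUBECONFIG=/etc/kubernetes/admin.conf

  # The server: field shows the endpoint kubectl targets - for these clusters it is normally the cluster VIP
  grep 'server:' /etc/kubernetes/admin.conf

  # Retry a simple command; if it still times out, the VIP path is the likely problem
  kubectl get nodes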

 

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

This issue can occur regardless of whether this cluster is managed by Tanzu Mission Control (TMC).

Cause

When the vSphere Kubernetes cluster's VIP is inaccessible, kubectl commands from within the vSphere Kubernetes cluster will fail.

As a result, the Supervisor cluster will be unable to reach the affected cluster's nodes for management and remediation.

This issue can occur even when the Supervisor cluster is able to reach the cluster's VIP.

  • The cluster's VIP is expected to redirect requests sent to its IP address to one of the control plane nodes in the associated vSphere Kubernetes cluster.
  • If the vSphere Kubernetes cluster's kube-apiserver is unreachable due to issues routing from the VIP to one of the control plane nodes, the Supervisor cluster cannot communicate with the vSphere Kubernetes cluster.
  • Kubectl commands within the affected vSphere Kubernetes cluster will fail as these commands reach out to the cluster's VIP first before being routed to a kube-apiserver instance on one of the control plane nodes in the cluster.
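
One way to see this distinction in practice is sketched below: from an affected control plane node, probe the local kube-apiserver directly and then probe it through the VIP (6443 is the standard kube-apiserver port; the VIP value is environment-specific):

  # Probe the kube-apiserver listening locally on this control plane node
  curl -vk https://localhost:6443/healthz

  # Probe the same API through the cluster VIP - if the local probe answers (any HTTP
  # response, even 401/403, shows the connection works) but this one hangs or fails,
  # the problem is the path through the VIP rather than kube-apiserver itself
  curl -vk https://<cluster VIP>:6443/healthz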

Resolution

This KB article will provide steps to troubleshoot VIP connection failures within the affected vSphere Kubernetes cluster.

 

Checks from the Supervisor Cluster as Root

  1. SSH into a Supervisor cluster control plane VM from the VCSA as root.
  2. Check that the control-plane-service for the affected cluster is populated where the External IP address matches the affected cluster's VIP:
    • kubectl get service -n <cluster namespace> | grep "control-plane"
    • If the control-plane service is incorrect or empty, this indicates an issue with the environment's load balancer, which provisions and manages this service.


  3. Confirm that the endpoints (ep) are populated with the IP address for each control plane node in the affected cluster:
    • kubectl get ep -n <cluster namespace>
    • If the cluster was inaccessible for a long period of time, the control-plane endpoints may be incorrect or missing.
      • Please reach out to VMware by Broadcom Technical Support referencing this KB article for assistance regarding missing or incorrect endpoints.

  4. Check that the Supervisor control plane VM is able to curl the affected cluster's VIP over port 6443:
    • curl -vk <cluster VIP>:6443
    • If the Supervisor cluster is unable to curl the affected cluster's VIP, this indicates either a networking issue between the Management Network and the Workload Network, or an issue with the control-plane service associated with the VIP, which is managed by the environment's load balancer.
      • The cluster's VIP is expected to redirect requests sent to its IP address to one of the control plane nodes in the associated vSphere Kubernetes cluster.
      • Confirm that there are no issues with the environment's load balancer or control-plane service associated with the VIP.
      • Check whether the Supervisor control plane VM's eth1 and the affected cluster's control plane eth0 are on different network CIDRs (see the sketch after this list).
      • The Workload Network needs to be able to communicate with other workload networks and must be routable to the load balancer network.

  5. Confirm that the Supervisor control plane VM is able to curl the affected cluster's control plane node IP addresses over port 6443:
    • curl -vk <cluster control plane IP>:6443
    • This is similar to the concerns in the previous step and may also indicate a networking issue on the specific control plane node or the ESXi host it is running on.
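
A minimal sketch of comparing the network segments mentioned in step 4, using the interface names referenced above (eth1 on the Supervisor control plane VM, eth0 on the affected cluster's control plane node):

  # On the Supervisor control plane VM: workload network interface
  ip addr show eth1

  # On the affected cluster's control plane node: primary interface
  ip addr show eth0

  # Compare the resulting CIDRs and confirm the workload network routes to the load balancer network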

 

Checks from the affected vSphere Kubernetes Cluster as vmware-system-user

  1. SSH into a control plane node as vmware-system-user.
  2. Confirm the status of the nodes in the cluster:
    • kubectl get nodes
  3. If kubectl commands are not working at all, check the status and logs of etcd and kube-apiserver:
    • crictl ps | egrep "etcd|kube-apiserver"
    • crictl logs <container id>
    • If etcd and kube-apiserver are unhealthy, kubectl commands will fail and communication from the Supervisor cluster to the affected cluster will also fail.
    • If kube-apiserver logs report errors connecting to the affected cluster's VIP, this is more indicative of a VIP issue than a kube-apiserver issue. 
    • For either of the above issues, please reach out to VMware by Broadcom Support referencing this KB article for assistance.

  4. Ensure that the certificates have not expired in this cluster.
  5. Confirm that there is not a disk space issue on this node:
    • df -h

  6. Check if it is possible to reach the affected cluster's VIP at port 6443 from this control plane node:
    • curl -vk <affected cluster VIP>:6443
    • If this times out or fails, there is an issue with the control plane node reaching the VIP, which could be related to the load balancer used in the environment.

  7. Perform a packet capture to confirm that there is an issue with communicating into the affected cluster from the Supervisor cluster through the affected cluster's VIP:
    • Open separate terminal sessions and SSH as vmware-system-user into each control plane node in the affected cluster.

    • Start a packet capture on each control plane node in the affected cluster, listening for traffic from the VIP:
      • tcpdump src <affected cluster VIP> and port 6443

    • Open a separate terminal session to one of the Supervisor control plane nodes as root
      • curl -vk <affected cluster VIP>:6443

    • Confirm whether any packets reach any of the control plane nodes:
      • The VIP is intended to load balance requests sent from the Supervisor cluster to the control plane nodes of the affected cluster.
      • If the tcpdump captures 0 packets, there is a networking issue with the affected cluster's VIP. The Supervisor cluster's curl command is expected to reach one of the control plane nodes in the affected cluster through the VIP, so 0 packets indicates an issue with the load balancer, or that packets sent to the VIP are not being forwarded to any of the control plane nodes in the affected cluster.
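
A minimal sketch of the packet capture described in step 7, assuming tcpdump is available on the control plane nodes (the VIP value is environment-specific):

  # On each control plane node of the affected cluster (one terminal session per node):
  # capture traffic arriving from the VIP on the kube-apiserver port
  tcpdump -nn -i any src <affected cluster VIP> and port 6443

  # In a separate session on a Supervisor control plane node, generate test traffic:
  curl -vk https://<affected cluster VIP>:6443

  # If no node captures any packets for the curl, traffic sent to the VIP is not being
  # forwarded to the control plane nodes and the load balancer path should be investigated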