Troubleshooting vSphere Kubernetes Cluster VIP Connection Issues

Article ID: 388260


Products

VMware vSphere with Tanzu, VMware vSphere 7.0 with Tanzu, vSphere with Tanzu, Tanzu Kubernetes Runtime

Issue/Introduction

This KB article is intended for troubleshooting scenarios in which a vSphere Kubernetes Cluster's VIP cannot be reached from within the cluster's control plane nodes.

vSphere Kubernetes Clusters are also known as Guest Clusters.

 

While connected to the Supervisor cluster context, one or more of the following symptoms may be present:

  • Describing the affected cluster shows a similar error message to the following:
    • failed to create etcd client: could not establish a connection to the etcd leader: [could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded, failed to connect to etcd node]

  • The kubeadm control plane (kcp) object, which reconciles control plane nodes, shows that all control plane nodes are unavailable.

  • Describing the kubeadm control plane (kcp) object shows similar error messages to the below:
    • failed to create etcd client: could not establish a connection to the etcd leader: [could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded, failed to connect to etcd node]
    •  Reason: RemediationFailed @ /

  • The control-plane-service External IP address for the affected cluster matches the expected VIP:
    • kubectl get service -n <cluster namespace> | grep "control-plane"

  • The endpoints (ep) for the affected cluster match the IP addresses of each control plane node in the cluster:
    • kubectl get ep -n <cluster namespace>

  • The affected cluster's control plane machines are Running and their VMs are poweredOn with IP addresses assigned:
    • kubectl get machine,vm -n <cluster namespace>

  • The environment's load balancer pod is Running:
    • kubectl get pods -A | egrep "ncp|ako|lbapi"
    • NSX-T uses the NCP pod. NSX-ALB/AVI uses the AKO pod. HAProxy uses a lbapi pod.


  • Certificates are not expired on the Supervisor cluster.
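
For reference, a minimal sketch of checking certificate expiration on a Supervisor control plane VM, assuming the standard kubeadm certificate layout under /etc/kubernetes/pki (run as root; on older kubeadm versions the first command is kubeadm alpha certs check-expiration):

  # List expiration dates of all kubeadm-managed certificates
  kubeadm certs check-expiration

  # Alternatively, inspect an individual certificate, for example the kube-apiserver certificate
  openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt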

 

While connected to the affected vSphere Kubernetes cluster's context, the following symptoms are present:

  • All kubectl commands fail, time out, or return the following error message (see the note after this list):
    • The connection to the server localhost:8080 was refused - did you specify the right host or port?

  • Certificates are not expired in the vSphere Kubernetes cluster.
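
The localhost:8080 error usually means kubectl could not load a kubeconfig and fell back to its default address, rather than proving the VIP itself is down. Below is a minimal sketch of confirming which endpoint kubectl targets on a control plane node, assuming the standard kubeadm admin kubeconfig path (run as root):

  # Use the cluster's admin kubeconfig instead of kubectl's localhost default
  export KUBECONFIG=/etc/kubernetes/admin.conf

  # The server: field shows the endpoint kubectl targets - for these clusters it is normally the cluster VIP
  grep 'server:' /etc/kubernetes/admin.conf

  # Retry a simple command; if it still times out, the VIP path is the likely problem
  kubectl get nodes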

 

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

This issue can occur regardless of whether this cluster is managed by Tanzu Mission Control (TMC).

Cause

When the vSphere Kubernetes cluster's VIP is inaccessible, kubectl commands from within the vSphere Kubernetes cluster will fail.

As a result, the Supervisor cluster will be unable to reach the affected cluster's nodes for management and remediation.

This issue can occur even when the Supervisor cluster is able to reach the cluster's VIP.

  • The cluster's VIP is expected to redirect requests sent to its IP address to one of the control plane nodes in the associated vSphere Kubernetes cluster.
  • If the vSphere Kubernetes cluster's kube-apiserver is unreachable due to issues routing from the VIP to one of the control plane nodes, the Supervisor cluster cannot communicate with the vSphere Kubernetes cluster.
  • Kubectl commands within the affected vSphere Kubernetes cluster will fail as these commands reach out to the cluster's VIP first before being routed to a kube-apiserver instance on one of the control plane nodes in the cluster.
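
One way to see this distinction in practice is sketched below: from an affected control plane node, probe the local kube-apiserver directly and then probe it through the VIP (6443 is the standard kube-apiserver port; the VIP value is environment-specific):

  # Probe the kube-apiserver listening locally on this control plane node
  curl -vk https://localhost:6443/healthz

  # Probe the same API through the cluster VIP - if the local probe answers (any HTTP
  # response, even 401/403, shows the connection works) but this one hangs or fails,
  # the problem is the path through the VIP rather than kube-apiserver itself
  curl -vk https://<cluster VIP>:6443/healthz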

Resolution

This KB article will provide steps to troubleshoot VIP connection failures within the affected vSphere Kubernetes cluster.

 

Checks from the Supervisor Cluster as Root

  1. SSH into a Supervisor cluster control plane VM from the VCSA as root.
  2. Check that the control-plane-service for the affected cluster is populated where the External IP address matches the affected cluster's VIP:
    • kubectl get service -n <cluster namespace> | grep "control-plane"
    • If the control-plane service is incorrect or empty, this indicates an issue with the environment's load balancer, which provisions and manages this service.


  3. Confirm that the endpoints (ep) are populated with the IP address for each control plane node in the affected cluster:
    • kubectl get ep -n <cluster namespace>
    • If the cluster was inaccessible for a long period of time, the control-plane endpoints may be incorrect or missing.
      • Please reach out to VMware by Broadcom Technical Support referencing this KB article for assistance regarding missing or incorrect endpoints.

  4. Check that the Supervisor control plane VM is able to curl the affected cluster's VIP over port 6443:
    • curl -vk <cluster VIP>:6443
    • If the Supervisor cluster is unable to curl the affected cluster's VIP, this indicates either a networking issue between the Management Network and the Workload Network, or an issue with the control-plane service associated with the VIP, which is managed by the environment's load balancer.
      • The cluster's VIP is expected to redirect requests sent to its IP address to one of the control plane nodes in the associated vSphere Kubernetes cluster.
      • Confirm that there are no issues with the environment's load balancer or control-plane service associated with the VIP.
      • Check whether the Supervisor control plane VM's eth1 and the affected cluster's control plane eth0 are on different network CIDRs (see the sketch after this list).
      • The Workload Network needs to be able to communicate with other workload networks and must be routable to the load balancer network.

  5. Confirm that the Supervisor control plane VM is able to curl the affected cluster's control plane node IP addresses over port 6443:
    • curl -vk <cluster control plane IP>:6443
    • This is similar to the concerns in the previous step and may also indicate a networking issue on the specific control plane node or the ESXi host it is running on.
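
A minimal sketch of comparing the network segments mentioned in step 4, using the interface names referenced above (eth1 on the Supervisor control plane VM, eth0 on the affected cluster's control plane node):

  # On the Supervisor control plane VM: workload network interface
  ip addr show eth1

  # On the affected cluster's control plane node: primary interface
  ip addr show eth0

  # Compare the resulting CIDRs and confirm the workload network routes to the load balancer network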

 

Checks from the affected vSphere Kubernetes Cluster as vmware-system-user

  1. SSH into a control plane node as vmware-system-user.
  2. Confirm the status of the nodes in the cluster:
    • kubectl get nodes
  3. If kubectl commands are not working at all, check the status and logs of etcd and kube-apiserver:
    • crictl ps | egrep "etcd|kube-apiserver"
    • crictl logs <container id>
    • If etcd and kube-apiserver are unhealthy, kubectl commands will fail and communication from the Supervisor cluster to the affected cluster will also fail.
    • If kube-apiserver logs report errors connecting to the affected cluster's VIP, this is more indicative of a VIP issue than a kube-apiserver issue. 
    • For either of the above issues, please reach out to VMware by Broadcom Support referencing this KB article for assistance.

  4. Ensure that the certificates have not expired in this cluster.
  5. Confirm that there is not a disk space issue on this node:
    • df -h

  6. Check if it is possible to reach the affected cluster's VIP at port 6443 from this control plane node:
    • curl -vk <affected cluster VIP>:6443
    • If this times out or fails, there is an issue with the control plane node reaching the VIP, which could be related to the load balancer used in the environment.

  7. Perform a packet capture to confirm that there is an issue with communicating into the affected cluster from the Supervisor cluster through the affected cluster's VIP:
    • Open separate terminal sessions and SSH as vmware-system-user into each control plane node in the affected cluster.

    • Start a packet capture on each control plane node in the affected cluster, listening for traffic from the VIP:
      • tcpdump src <affected cluster VIP> and port 6443

    • Open a separate terminal session to one of the Supervisor control plane nodes as root
      • curl -vk <affected cluster VIP>:6443

    • Confirm whether any packets reach any of the control plane nodes:
      • The VIP is intended to load balance requests sent from the Supervisor cluster to the control plane nodes of the affected cluster.
      • If the tcpdump captures 0 packets, there is a networking issue with the affected cluster's VIP. The Supervisor cluster's curl command is expected to reach one of the control plane nodes in the affected cluster through the VIP, so 0 packets indicates an issue with the load balancer, or that packets sent to the VIP are not being forwarded to any of the control plane nodes in the affected cluster.
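
A minimal sketch of the packet capture described in step 7, assuming tcpdump is available on the control plane nodes (the VIP value is environment-specific):

  # On each control plane node of the affected cluster (one terminal session per node):
  # capture traffic arriving from the VIP on the kube-apiserver port
  tcpdump -nn -i any src <affected cluster VIP> and port 6443

  # In a separate session on a Supervisor control plane node, generate test traffic:
  curl -vk https://<affected cluster VIP>:6443

  # If no node captures any packets for the curl, traffic sent to the VIP is not being
  # forwarded to the control plane nodes and the load balancer path should be investigated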