Unable to connect to VKS cluster due to 'bad gateway' error

Article ID: 434115

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • When attempting to connect to a VKS cluster, a 'bad gateway' error occurs.
  • Additionally, one of the TKG control plane nodes is recreated every 120 minutes.
  • The /var/log/cloud-init-output.log file on the newly created control plane node shows a connection timeout when the node attempts to reach the Supervisor:

+ curl --connect-timeout 20 --retry 6 --retry-delay 10 --resolve supervisor.default.svc:6443:<Supervisor_control_plane_ip> https://supervisor.default.svc:6443/api/v1/namespaces/<tkg-domain-namespace>/services/machine-agent-service/proxy/x86/linux/v1alpha1/ --header 'Authorization: Bearer <###############>' --cacert /run/machine-agent/ca-cert -Lo /usr/local/bin/machineadm
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
...
curl: (##) Connection timed out after ##### milliseconds
Warning: Problem : timeout. Will retry in 10 seconds. 6 retries left.

  • Manual curl tests to the Supervisor IP address over ports 443 and 6443 fail from both the failing and the functioning guest cluster control plane nodes.

    curl -kv https://<Supervisor_control_plane_ip>:443
    curl -kv https://<Supervisor_control_plane_ip>:6443

  • The Load Balancer reports an error on the associated virtual service.
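
The log and port checks above can be scripted on a control plane node. The following is a minimal sketch, not a supported tool: count_timeouts and check_port are hypothetical helper names, SUPERVISOR_IP is an assumed environment variable, and the sketch assumes that bash and the coreutils timeout command are available on the node.

```shell
#!/bin/sh
# Hypothetical helpers (not part of VKS); names and defaults are illustrative.

# Count the lines in a log file that report a curl connection timeout.
count_timeouts() {
  grep -c 'Connection timed out' "$1" || true
}

# Print "open" if a TCP connection to host:port succeeds within timeout_s
# seconds, otherwise "unreachable". Relies on bash's /dev/tcp redirection.
check_port() {
  host="$1"; port="$2"; timeout_s="${3:-5}"
  if timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "open"
  else
    echo "unreachable"
  fi
}

# Run the checks only when SUPERVISOR_IP (an assumed variable) is set.
if [ -n "${SUPERVISOR_IP:-}" ]; then
  echo "timeouts in cloud-init log: $(count_timeouts /var/log/cloud-init-output.log)"
  for p in 443 6443; do
    echo "${SUPERVISOR_IP}:${p} $(check_port "$SUPERVISOR_IP" "$p")"
  done
fi
```

Running the script with SUPERVISOR_IP set to the Supervisor control plane IP prints one line per port. An "unreachable" result on both ports from a healthy node as well as the failing one is consistent with a fault on the Load Balancer path rather than on an individual node.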

Environment

  • vCenter Server 8.x
  • vSphere Kubernetes Service

Cause

An underlying fault or misconfiguration in the Load Balancer's virtual service prevents the newly created control plane node from communicating with the Supervisor.

Resolution

  1. Engage the internal network or Load Balancer administration team to review the Load Balancer virtual service health, logs, and active alarms.
  2. If Avi Load Balancer is in use and assistance is required to resolve the virtual service fault, open a support request (SR) with the Broadcom NSX Advanced Load Balancer (Avi) support team.
  3. If a different Load Balancer is in use, engage that Load Balancer vendor's support team.
  4. Once the Load Balancer issue is resolved, monitor the cluster to verify that the failed control plane node is recreated successfully and enters a running state.
  5. Confirm that all control plane nodes, worker nodes, and pods establish connectivity and report as healthy.
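
The node health check in the final steps can be sketched as a small filter over kubectl output. This is an illustrative example only: not_ready_nodes is a hypothetical helper name, and it assumes kubectl is already authenticated against the VKS cluster.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose STATUS column is not exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is also listed.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example usage against the workload cluster context:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports Ready. Any node name that is printed warrants a closer look with kubectl describe node.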

Additional Information

A potential cause for this behavior is reaching the virtual service limit on the Avi Load Balancer Service Engine.
Review Broadcom KB 413263 for details on modifying the Maximum Number of Virtual Services per Service Engine if this specific symptom is identified by the Load Balancer team.