A previously healthy Tanzu Kubernetes Grid (TKG) workload cluster is inaccessible and the status shows "creating"

Article ID: 319323

Products

Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:

A TKG workload cluster that was previously healthy is now in a "creating" state. When you run the tkg get cluster command, you see output similar to the following:

NAME       NAMESPACE   STATUS     CONTROLPLANE   WORKERS   KUBERNETES
clusterx   default     creating   0/3            0/4       v1.18.6+vmware.1
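If you manage several workload clusters, a small filter can pick out the ones reporting this state. The following is a minimal sketch that assumes the column layout shown above (STATUS as the third column); the function name find_creating_clusters is hypothetical, and other TKG CLI versions may print different columns.

```shell
#!/usr/bin/env bash
# Print the NAME of every cluster whose STATUS column reads "creating".
# Assumption: the header row comes first and STATUS is the third
# whitespace-separated column, as in the sample output above.
find_creating_clusters() {
  awk 'NR > 1 && $3 == "creating" { print $1 }'
}

# Usage (requires the tkg CLI and a configured management cluster):
#   tkg get cluster | find_creating_clusters
```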

You cannot run any commands against the cluster because its control plane is unavailable. When you run the kubectl get nodes command against the affected workload cluster, you see output similar to the following:

The connection to the server 192.168.10.50:6443 was refused - did you specify the right host or port?
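To confirm that the failure is at the network level rather than a kubeconfig problem, you can test whether the API server endpoint accepts TCP connections at all. This sketch relies on bash's built-in /dev/tcp redirection and the coreutils timeout command; the function name api_port_open is hypothetical. A refused connection here is consistent with the load balancer no longer listening on port 6443.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to HOST:PORT succeeds within 5 seconds,
# non-zero if it is refused or times out.
# Hypothetical helper; PORT defaults to 6443 (the API server port above).
api_port_open() {
  local host="$1" port="${2:-6443}"
  # Host and port are passed as positional arguments to avoid quoting issues.
  timeout 5 bash -c ': </dev/tcp/"$1"/"$2"' _ "$host" "$port" 2>/dev/null
}

# Usage, with the endpoint from the error message above:
#   api_port_open 192.168.10.50 6443 || echo "nothing accepts connections on 192.168.10.50:6443"
```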

You see that the /etc/haproxy/haproxy.cfg file is empty on the HAProxy load balancer VM for the affected workload cluster.

Environment

VMware Tanzu Kubernetes Grid Plus 1.x
VMware Tanzu Kubernetes Grid 1.x

Resolution

This issue can occur when the HAProxy load balancer for the affected cluster loses its configuration. This is a known issue in the HAProxy Data Plane API (https://github.com/haproxytech/dataplaneapi/issues/114) that is expected to be addressed in the next upstream release. There is currently no resolution for this issue in TKG.

Workaround:
To work around this issue, use the following steps to restore the HAProxy configuration.

SSH to the HAProxy load balancer VM as the capv user, using the SSH key provided during TKG management cluster deployment. For example:

ssh -i .ssh/id_rsa capv@<HAPROXY-VM-IP>

Verify that the /etc/haproxy/haproxy.cfg file is empty:

sudo cat /etc/haproxy/haproxy.cfg
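If you prefer a check that returns an exit status instead of reading cat output by eye, test -s reports whether a file exists and has a non-zero size. A minimal sketch follows; the function name cfg_is_empty is hypothetical. Because test -s only stats the file, it does not need read permission on haproxy.cfg itself, but it does need search permission on /etc/haproxy; if capv lacks that, run it from a root shell.

```shell
#!/usr/bin/env bash
# Return 0 when the config file exists but contains zero bytes.
# A missing file or a non-empty file both return non-zero.
# Hypothetical helper; the path defaults to the file named in this article.
cfg_is_empty() {
  local cfg="${1:-/etc/haproxy/haproxy.cfg}"
  [ -f "$cfg" ] && [ ! -s "$cfg" ]
}

# Usage on the HAProxy VM:
#   cfg_is_empty && echo "haproxy.cfg is empty; restore it from the backup"
```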

If the file is empty, run the following two commands to restore it from the backup file /etc/haproxy/haproxy.cfg.lkg and reboot the HAProxy VM:

sudo cp /etc/haproxy/haproxy.cfg.lkg /etc/haproxy/haproxy.cfg
sudo reboot
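The two commands above can be wrapped in a slightly more defensive helper that refuses to overwrite a non-empty configuration, refuses to restore from an empty backup, and syntax-checks the restored file with haproxy -c -f when the haproxy binary is on the PATH. This is a sketch, not a VMware-supplied tool: the function name restore_haproxy_cfg, the file name restore-haproxy.sh, and the return codes are all hypothetical. It must run as root, and it deliberately leaves the reboot to you.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restore haproxy.cfg from haproxy.cfg.lkg.
# Return codes: 0 restored, 1 live config not empty, 2 backup missing/empty,
#               3 copy failed, 4 restored file failed the syntax check.
restore_haproxy_cfg() {
  local dir="${1:-/etc/haproxy}"
  local cfg="$dir/haproxy.cfg" lkg="$dir/haproxy.cfg.lkg"

  # Only act when the live config is empty; never overwrite a non-empty one.
  if [ -s "$cfg" ]; then
    echo "$cfg is not empty; nothing to do" >&2
    return 1
  fi
  # The backup must exist and contain data, or restoring gains nothing.
  if [ ! -s "$lkg" ]; then
    echo "$lkg is missing or empty; cannot restore" >&2
    return 2
  fi
  cp "$lkg" "$cfg" || return 3
  # Syntax-check the restored file when the haproxy binary is available.
  if command -v haproxy >/dev/null 2>&1; then
    if ! haproxy -c -f "$cfg"; then
      echo "$cfg was restored but failed 'haproxy -c'" >&2
      return 4
    fi
  fi
  echo "restored $cfg from $lkg; reboot the VM to apply" >&2
}

# Usage on the HAProxy VM, after saving this function as restore-haproxy.sh:
#   sudo bash -c '. ./restore-haproxy.sh && restore_haproxy_cfg' && sudo reboot
```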

Additional Information

Impact/Risks:
You cannot manage the affected workload cluster.
