Check the guest cluster status and confirm that it is reported as unhealthy:
kubectl get tkc -A | grep -i unhealthy
NAMESPACE       NAME              CONTROL PLANE   WORKER   TKR NAME                         AGE   READY
tanzu-support   tkgs-cluster-v2   3               4        1.20.12+vmware.1-tkg.1.b9a42f3   42d   unhealthy
Next, describe the cluster and confirm the following error message:
kubectl describe cluster <cluster-name>
Message: failed to get etcdStatus for workload cluster tkgs-cluster-v2: failed to create etcd client: could not establish a connection to the etcd leader: unable to create etcd client: context deadline exceeded
Reason: RemediationFailed @ Machine/tkgs-cluster-v2-control-plane-6shw7
{
"lastTransitionTime": "2022-04-22T06:26:45Z",
"message": "failed to get etcdStatus for workload cluster tkgs-cluster-v2: failed to create etcd client: could not establish a connection to the etcd leader: unable to create etcd client: context deadline exceeded",
"reason": "RemediationFailed",
"severity": "Error",
"status": "False",
"type": "OwnerRemediated". <---- confirms the condition
},
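The same condition can also be viewed directly on the affected Machine object. A minimal check, substituting your own machine name and Supervisor namespace:
kubectl get machine tkgs-cluster-v2-control-plane-6shw7 -n <namespace> -o jsonpath='{.status.conditions}'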
This is a known bug in CAPI/KCP; the upstream fix is https://github.com/kubernetes-sigs/cluster-api/pull/5381
The May 12, 2022 release of vCenter (vCenter Server 7.0 Update 3e | 12 MAY 2022) uses a newer version of CAPI that includes the fix for this issue.
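If you want to confirm which CAPI controller build the Supervisor cluster is running before applying the workaround, one possible check (a sketch only; the vmware-system-capw namespace is an assumption and may differ between releases) is to list the controller deployments with their images:
kubectl get deployments -n vmware-system-capw -o wide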
As a workaround, the steps below can be followed to clear the error on the current version (vCenter Server 7.0 U2 - 7.0.2.00.500). You will need to delete the associated Machine object and let KCP remediate it.
Step 1. Describe the cluster and identify the unhealthy control plane node, as shown below.
kubectl describe cluster <cluster-name>
Message: failed to get etcdStatus for workload cluster tkgs-cluster-v2: failed to create etcd client: could not establish a connection
to the etcd leader: unable to create etcd client: context deadline exceeded
Reason: RemediationFailed @ Machine/tkgs-cluster-v2-control-plane-6shw7
Note the control plane node name from the above output: tkgs-cluster-v2-control-plane-6shw7
Step 2. Get the Machine object referencing the control plane node tkgs-cluster-v2-control-plane-6shw7 by running the command below:
kubectl get machine -n <namespace>
Note the machine name from the output of the above command; it will be used to delete the Machine object in the next step.
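If several machines are listed, one way to match a Machine object to its control plane node is to print each machine alongside its nodeRef (a sketch; assumes the CAPI version in use populates status.nodeRef):
kubectl get machine -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeRef.name}{"\n"}{end}'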
Step 3. Delete the machine object using the machine name identified in the previous step.
kubectl delete machine <machine-name> -n <namespace>
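After the Machine object is deleted, KCP should provision a replacement control plane machine. As an optional verification, watch the machines being recreated and then re-run the initial status check to confirm the cluster is no longer reported as unhealthy:
kubectl get machine -n <namespace> -w
kubectl get tkc -A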