TKGS - vSphere 7 guest cluster is showing as unhealthy and etcd error status is "OwnerRemediated"

Article ID: 334993


Products

VMware vSphere ESXi
VMware vSphere with Tanzu

Issue/Introduction

Symptoms:

Check the guest cluster status and confirm that it is reported as unhealthy:

kubectl get tkc -A | grep -i unhealthy 

NAMESPACE       NAME              CONTROL PLANE   WORKER   TKR NAME                         AGE   READY
tanzu-support   tkgs-cluster-v2   3               4        1.20.12+vmware.1-tkg.1.b9a42f3   42d   unhealthy


Next, describe the cluster and confirm that the following error message is present:

kubectl describe cluster <cluster-name>

Message: failed to get etcdStatus for workload cluster tkgs-cluster-v2: failed to create etcd client: could not establish a connection to the etcd leader: unable to create etcd client: context deadline exceeded

    Reason: RemediationFailed @ Machine/tkgs-cluster-v2-control-plane-6shw7

    {
      "lastTransitionTime": "2022-04-22T06:26:45Z",
      "message": "failed to get etcdStatus for workload cluster tkgs-cluster-v2: failed to create etcd client: could not establish a connection to the etcd leader: unable to create etcd client: context deadline exceeded",
      "reason": "RemediationFailed",
      "severity": "Error",
      "status": "False",
      "type": "OwnerRemediated"    <---- confirms the condition
    },
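
The condition can also be queried directly instead of reading the full describe output. The following is a minimal sketch, not an official procedure: it assumes the OwnerRemediated condition is recorded under .status.conditions of each Machine object (as in upstream Cluster API), and the function name list_failed_remediation is hypothetical.

```shell
# Hypothetical helper: print the machines in a namespace whose
# OwnerRemediated condition has status "False".
# Assumption: the condition lives under .status.conditions of each Machine.
list_failed_remediation() {
  ns=$1
  # Emit "<machine-name><TAB><OwnerRemediated status>" for every machine,
  # then keep only the rows whose status is exactly "False".
  kubectl get machine -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="OwnerRemediated")].status}{"\n"}{end}' |
    awk -F '\t' '$2 == "False" { print $1 }'
}

# Usage (namespace taken from the example output above):
#   list_failed_remediation tanzu-support
```

Machines that do not carry the condition at all produce an empty second column and are skipped.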


Environment

VMware vSphere 7.0 with Tanzu

Resolution

This is a known bug in Cluster API (CAPI) / KubeadmControlPlane (KCP): https://github.com/kubernetes-sigs/cluster-api/pull/5381

vCenter Server 7.0 Update 3e (released 12 May 2022) uses a newer version of CAPI that includes the fix for this issue.

 


Workaround:

As a workaround, follow the steps below to clear the error in the current version (vCenter Server 7.0 U2, 7.0.2.00.500). The workaround is to delete the associated machine object and let KCP remediate it.


Step 1.  Describe the cluster and identify the unhealthy control plane node, as shown below.

kubectl describe cluster <cluster-name>

Message: failed to get etcdStatus for workload cluster tkgs-cluster-v2: failed to create etcd client: could not establish a connection to the etcd leader: unable to create etcd client: context deadline exceeded
Reason: RemediationFailed @ Machine/tkgs-cluster-v2-control-plane-6shw7

Note the control plane node name from the above output: tkgs-cluster-v2-control-plane-6shw7
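
Pulling the name out of the Reason line can be scripted. This is a minimal sketch that assumes the describe output contains a line of the form "Reason: RemediationFailed @ Machine/<name>" as shown above; the helper name failed_machines is hypothetical.

```shell
# Hypothetical helper: read `kubectl describe cluster` output on stdin and
# print each distinct name that follows "RemediationFailed @ Machine/".
failed_machines() {
  # -n with the trailing "p" flag prints only the lines where the
  # substitution matched; sort -u removes repeated names.
  sed -n 's|.*RemediationFailed @ Machine/\([^[:space:]]*\).*|\1|p' | sort -u
}

# Usage:
#   kubectl describe cluster <cluster-name> | failed_machines
```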


Step 2.  Get the name of the machine that references the control plane node tkgs-cluster-v2-control-plane-6shw7 by running the following command:

kubectl get machine -n <namespace>

Note the machine name from the output; it is needed to delete the machine object in the next step.
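
When the namespace contains many machines, the lookup can be narrowed to the one of interest. This is a sketch under stated assumptions: it assumes Machine objects expose the node they back in .status.nodeRef.name (as in upstream Cluster API), and it also matches on the machine's own name in case the name noted in Step 1 is already the machine name. The function name machine_for_node is hypothetical.

```shell
# Hypothetical helper: print the machine whose own name or whose
# .status.nodeRef.name equals the given name.
# Assumption: .status.nodeRef.name is populated on Machine objects.
machine_for_node() {
  ns=$1
  node=$2
  # Emit "<machine-name><TAB><node-name>" per machine, then match either column.
  kubectl get machine -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeRef.name}{"\n"}{end}' |
    awk -F '\t' -v n="$node" '$1 == n || $2 == n { print $1 }'
}

# Usage:
#   machine_for_node <namespace> tkgs-cluster-v2-control-plane-6shw7
```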

Step 3.  Delete the machine object using the machine name identified in the previous step.

kubectl delete machine <machine-name> -n <namespace>
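
After the machine object is deleted, KCP is expected to remediate it, so the check from the Symptoms section can be repeated until no cluster is reported as unhealthy. The loop below is only a sketch: the function name wait_until_healthy and the default of 30 attempts with a 20 second pause are arbitrary assumptions, not documented values.

```shell
# Hypothetical helper: re-run the Symptoms check until no cluster is
# reported as unhealthy, or give up after a number of attempts.
#   $1 = number of attempts (assumed default: 30)
#   $2 = seconds to wait between attempts (assumed default: 20)
wait_until_healthy() {
  attempts=${1:-30}
  delay=${2:-20}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Trust the result only if kubectl itself succeeded.
    if out=$(kubectl get tkc -A 2>/dev/null) &&
       ! printf '%s\n' "$out" | grep -qi unhealthy; then
      echo "no cluster reported as unhealthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy (or kubectl failing) after $attempts checks" >&2
  return 1
}

# Usage:
#   wait_until_healthy          # assumed defaults
#   wait_until_healthy 10 60    # 10 checks, one minute apart
```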