Troubleshooting ETCD Issues in vSphere Supervisor / vSphere Kubernetes Service (VKS)

Article ID: 411644


Updated On:

Products

Tanzu Kubernetes Runtime
VMware vSphere Kubernetes Service

Issue/Introduction

This article provides steps to troubleshoot common ETCD issues in a vSphere Supervisor environment.

Environment

vSphere Supervisor

ETCD issues can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).

Cause

In the vSphere Supervisor product, a cluster's database is maintained and managed by the ETCD process.

This ETCD process relies on a healthy quorum matching the expected number of control plane nodes in the cluster.

If the ETCD quorum is unhealthy or broken, ETCD will experience issues and may fail.

Because many system services rely on ETCD and its database, the system will not function properly, or may fail outright, if ETCD is unhealthy.

This includes kubectl commands, which rely on the kube-apiserver; the kube-apiserver in turn depends on the health of ETCD and its database.

Resolution

In the vSphere Supervisor product, it is important to understand which cluster is affected by the ETCD issue. Each control plane node in a cluster runs its own instance of ETCD, and the Supervisor Cluster and each workload cluster in the environment maintain separate ETCD databases.

 

Troubleshooting the Supervisor Cluster's ETCD

  1. SSH into one of the Supervisor Control Plane VMs:
    • See "How to SSH into Supervisor Control Plane VMs" from KB article Troubleshooting vSphere Supervisor Control Plane VMs

    • Note: If ETCD is unhealthy with a broken quorum, the floating IP (FIP) referenced in the above KB article will be inaccessible.
      • Connect directly to the eth0 IP address of a Supervisor Control Plane VM instead.


  2. Check the expected member count and IPs for Supervisor Control Plane VMs in the ETCD quorum:
    etcdctl member list -w table
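
    The output should resemble the following, with one started member per Supervisor Control Plane VM (illustrative placeholders; actual IDs, names, and addresses will differ):

    +------------------+---------+----------------------+-----------------------------------+-----------------------------------+------------+
    |        ID        | STATUS  |         NAME         |             PEER ADDRS            |            CLIENT ADDRS           | IS LEARNER |
    +------------------+---------+----------------------+-----------------------------------+-----------------------------------+------------+
    | <etcd member id> | started | <control plane name> | https://<Supervisor VM ETH0>:2380 | https://<Supervisor VM ETH0>:2379 |      false |
    | <etcd member id> | started | <control plane name> | https://<Supervisor VM ETH0>:2380 | https://<Supervisor VM ETH0>:2379 |      false |
    | <etcd member id> | started | <control plane name> | https://<Supervisor VM ETH0>:2380 | https://<Supervisor VM ETH0>:2379 |      false |
    +------------------+---------+----------------------+-----------------------------------+-----------------------------------+------------+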



  3. Compare the above member list output with the current health of all members in the ETCD quorum:
    etcdctl --cluster=true endpoint health -w table

    This command performs a health check on each member of the ETCD quorum. See the below for an example of a healthy ETCD quorum:
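
    (Illustrative output with placeholder values; every member of a healthy quorum reports HEALTH as true.)

    +-----------------------------------+--------+-------------+-------+
    |              ENDPOINT             | HEALTH |     TOOK    | ERROR |
    +-----------------------------------+--------+-------------+-------+
    | https://<Supervisor VM ETH0>:2379 |   true | #.######ms  |       |
    | https://<Supervisor VM ETH0>:2379 |   true | #.######ms  |       |
    | https://<Supervisor VM ETH0>:2379 |   true | #.######ms  |       |
    +-----------------------------------+--------+-------------+-------+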

    An unhealthy ETCD quorum will report one or more members with a false health state and output a related error message:

    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.ssssssZ","logger":"client","caller":"v#@v<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001aa000/<Supervisor VM ETH0>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp <Supervisor VM ETH0>:2379: connect: connection refused\""}

     

  4. The status and leader of the ETCD quorum can be found with the following command:
    etcdctl --cluster=true endpoint status -w table

    See the below for an example of a healthy ETCD quorum:
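
    (Illustrative output with placeholder values; exactly one member reports IS LEADER as true.)

    +-----------------------------------+------------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |              ENDPOINT             |        ID        |     VERSION    | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +-----------------------------------+------------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://<Supervisor VM ETH0>:2379 | <etcd member id> | <etcd version> |  ## MB  |      true |      false |         # |     ###### |             ###### |        |
    | https://<Supervisor VM ETH0>:2379 | <etcd member id> | <etcd version> |  ## MB  |     false |      false |         # |     ###### |             ###### |        |
    | https://<Supervisor VM ETH0>:2379 | <etcd member id> | <etcd version> |  ## MB  |     false |      false |         # |     ###### |             ###### |        |
    +-----------------------------------+------------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+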


    Any unhealthy ETCD members will not appear in the above table and an error message will be output:

    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","logger":"etcd-client","caller":"v#@v<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001aa000/<localhost>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp <Supervisor VM ETH0>:2379: connect: connection refused\""}
    Failed to get the status of endpoint https://<Supervisor VM ETH0>:2379 (context deadline exceeded)

     

  5. Any Supervisor Control Plane VM that reports an unhealthy ETCD member should be connected to through SSH for further troubleshooting.

    Check on the status of the ETCD container in the affected, unhealthy Supervisor Control Plane VM:
    crictl ps --name etcd
    
    CONTAINER               IMAGE               CREATED        STATE         NAME      ATTEMPT     POD ID              POD
    <etcd container id>     <etcd image id>     # days ago    Running        etcd      #           <etcd pod id>       <etcd pod name>
    

    If the above command does not return an ETCD container in Running state, then ETCD is down or crashing on this particular Supervisor Control Plane VM.

  6. If ETCD is not Running in the affected, unhealthy Supervisor Control Plane VM, check for the latest etcd container in Exited state:
    crictl ps -a --name etcd
    
    CONTAINER               IMAGE               CREATED        STATE         NAME      ATTEMPT     POD ID              POD
    <etcd container id>     <etcd image id>     # days ago     Exited        etcd      #           <etcd pod id>       <etcd pod name>

     

  7. An ETCD container's logs can be viewed through crictl:
    crictl logs <etcd container id>

    You can also view the etcd container logs under the following directory:

    ls /var/log/pods/kube-system_etcd-*/etcd/

     

  8. The kubelet service is expected to start up an Exited ETCD container every 5 minutes.

    If there is no ETCD container in Running state and the only ETCD container remains in Exited state for longer than 5 minutes, check the status of the kubelet service:
    systemctl status kubelet
    
    ● kubelet.service - kubelet: The Kubernetes Node Agent
         Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
        Drop-In: /etc/systemd/system/kubelet.service.d
                 └─10-kubeadm.conf
         Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; # days ago

     

  9. If kubelet is not in the active (running) state, check its logs; the priority is to restore kubelet to a healthy, functional state:

    journalctl -xeu kubelet
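
    If the kubelet logs do not point to a deeper node problem (for example, a full root disk or expired certificates as listed under Additional Information), kubelet can typically be restarted and then re-checked:

    systemctl restart kubelet
    systemctl status kubelet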

     

  10. See the table under Additional Information for a list of ETCD KB articles.

 

Troubleshooting a Workload Cluster's ETCD

  1. Connect to the Supervisor Cluster context:
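
    For example (the server address, username, and context name are environment-specific placeholders):

    kubectl vsphere login --server=<Supervisor Cluster IP or FQDN> --vsphere-username <vSphere administrator username>
    kubectl config use-context <Supervisor Cluster context>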

  2. Confirm the expected number of control plane nodes for the affected workload cluster. The below is an example of a cluster with 3/3 healthy control plane nodes:
    kubectl get kcp -n <workload cluster namespace>
    
    NAME               CLUSTER           INITIALIZED   API SERVER    AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE    VERSION
    <kcp name>   <workload cluster name>    true       true           3          3           3         0           0        #d   <VKR version>
    

     

  3. There is expected to be one instance of ETCD per control plane node.
    • For a workload cluster with 3/3 control plane nodes, there should be an equivalent 3/3 ETCD quorum.
      • ETCD will be in a degraded state if only 2/3 control plane nodes are present in the workload cluster.

      • If only 1/3 control plane nodes remain in the workload cluster, ETCD will lose quorum and fail.

    • For a workload cluster built with only 1 control plane node, ETCD will function properly in a 1/1 quorum.
      • However, the loss of the only control plane node in the cluster will destroy the ETCD database and render the cluster irrecoverable.
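
    In general, an ETCD cluster of n members requires a quorum of floor(n/2) + 1 members to accept writes: for n = 3 the quorum is 2, so one member can be lost without breaking quorum, while for n = 1 the quorum is 1 and there is no tolerance for any loss.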


  4. The KCP object from Step 2 can be described for more information on how the Supervisor cluster views the status of all control plane nodes in the affected workload cluster:
    See the below for an example of a healthy KCP object with 3 control plane nodes:
    kubectl describe kcp <kcp name> -n <workload cluster namespace>
    
      Ready:                   true
      Ready Replicas:          3
      Replicas:                3
      Selector:                cluster.x-k8s.io/cluster-name=<workload cluster name>,cluster.x-k8s.io/control-plane
      Unavailable Replicas:    0
      Updated Replicas:        3
      Version:                 <VKR version>
    Events:                    <none>

     

  5. SSH into one of the control plane nodes of the affected workload cluster:
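
    • The SSH private key for the cluster's nodes is typically stored in a secret named <workload cluster name>-ssh in the cluster's namespace on the Supervisor Cluster. For example (placeholders are environment-specific, and the node user is typically vmware-system-user):

      kubectl get secret <workload cluster name>-ssh -n <workload cluster namespace> -o jsonpath='{.data.ssh-privatekey}' | base64 -d > cluster-ssh-key
      chmod 600 cluster-ssh-key
      ssh -i cluster-ssh-key vmware-system-user@<control plane node IP>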
  6. Check on the status of the ETCD container in the current workload cluster control plane node:
    crictl ps --name etcd
    
    CONTAINER               IMAGE               CREATED        STATE         NAME      ATTEMPT     POD ID              POD
    <etcd container id>     <etcd image id>     # days ago    Running        etcd      #           <etcd pod id>       <etcd pod name>

     

  7. If ETCD is not in Running state on the current workload cluster control plane node, check for any Exited ETCD containers:
    crictl ps -a --name etcd
    
    CONTAINER               IMAGE               CREATED        STATE         NAME      ATTEMPT     POD ID              POD
    <etcd container id>     <etcd image id>     # days ago     Exited        etcd      #           <etcd pod id>       <etcd pod name>

     

  8. An ETCD container's logs can be viewed through crictl:
    crictl logs <etcd container id>
    You can also view the etcd container logs under the following directory:
    ls /var/log/pods/kube-system_etcd-*/etcd/

     

  9. The overall status of the ETCD quorum on a workload cluster can be viewed through ETCDCTL.
    • Retrieve the ETCD container ID of a Running ETCD:

      crictl ps --name etcd
      CONTAINER               IMAGE               CREATED        STATE         NAME      ATTEMPT     POD ID              POD
      <etcd container id>     <etcd image id>     # days ago    Running        etcd      #           <etcd pod id>       <etcd pod name>
    • Using the above Running ETCD container ID, establish an alias for etcdctl, the CLI used to interface with the ETCD database:

      alias etcdctl='crictl exec <etcd container id> etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'

       

  10. Once the ETCDCTL alias is established, check the expected member list and IPs in the ETCD quorum:
    etcdctl member list -w table

    See the below for an example of a workload cluster with 3/3 control plane nodes:
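
    (Illustrative output with placeholder values; one started member is expected per control plane node.)

    +------------------+---------+---------------------------+------------------------------------+------------------------------------+------------+
    |        ID        | STATUS  |            NAME           |             PEER ADDRS             |            CLIENT ADDRS            | IS LEARNER |
    +------------------+---------+---------------------------+------------------------------------+------------------------------------+------------+
    | <etcd member id> | started | <control plane node name> | https://<Control Plane VM IP>:2380 | https://<Control Plane VM IP>:2379 |      false |
    | <etcd member id> | started | <control plane node name> | https://<Control Plane VM IP>:2380 | https://<Control Plane VM IP>:2379 |      false |
    | <etcd member id> | started | <control plane node name> | https://<Control Plane VM IP>:2380 | https://<Control Plane VM IP>:2379 |      false |
    +------------------+---------+---------------------------+------------------------------------+------------------------------------+------------+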

  11. View the health of each ETCD member and its quorum:
    etcdctl --cluster=true endpoint health -w table

    The below is an example of a workload cluster with 3/3 healthy ETCD quorum:
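
    (Illustrative output with placeholder values; every member of a healthy quorum reports HEALTH as true.)

    +------------------------------------+--------+-------------+-------+
    |              ENDPOINT              | HEALTH |     TOOK    | ERROR |
    +------------------------------------+--------+-------------+-------+
    | https://<Control Plane VM IP>:2379 |   true | #.######ms  |       |
    | https://<Control Plane VM IP>:2379 |   true | #.######ms  |       |
    | https://<Control Plane VM IP>:2379 |   true | #.######ms  |       |
    +------------------------------------+--------+-------------+-------+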

    If an ETCD member is unhealthy, its health will show as false and an error message similar to the following will be output:
    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.ssssssZ","logger":"client","caller":"v#@<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002b8fc0/<Control Plane VM IP>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp <Control Plane VM IP>:2379: connect: connection refused\""}

     

  12. Confirm the status of each ETCD member, its quorum, and which ETCD member is the leader:
    etcdctl --cluster=true endpoint status -w table
    See the below for an example of a workload cluster with 3/3 control plane nodes with healthy ETCD:
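
    (Illustrative output with placeholder values; exactly one member reports IS LEADER as true.)

    +------------------------------------+------------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |              ENDPOINT              |        ID        |     VERSION    | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +------------------------------------+------------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://<Control Plane VM IP>:2379 | <etcd member id> | <etcd version> |  ## MB  |      true |      false |         # |     ###### |             ###### |        |
    | https://<Control Plane VM IP>:2379 | <etcd member id> | <etcd version> |  ## MB  |     false |      false |         # |     ###### |             ###### |        |
    | https://<Control Plane VM IP>:2379 | <etcd member id> | <etcd version> |  ## MB  |     false |      false |         # |     ###### |             ###### |        |
    +------------------------------------+------------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+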


    Any unhealthy ETCD members will not show in the above table, and a similar error message will be output:
    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.ssssssZ","logger":"etcd-client","caller":"v#@<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002e7180/<localhost>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp <Control Plane VM IP>:2379: connect: connection refused\""}
    Failed to get the status of endpoint https://<Control Plane VM IP>:2379 (context deadline exceeded)

     

  13. The kubelet service is expected to start up an Exited ETCD container every 5 minutes.

    If there is not an ETCD container in Running state and the only ETCD container remains in Exited state for longer than 5 minutes, check the status of kubelet service:
    systemctl status kubelet
    
    ● kubelet.service - kubelet: The Kubernetes Node Agent
         Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
        Drop-In: /etc/systemd/system/kubelet.service.d
                 └─10-kubeadm.conf
         Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; # days ago

     

  14. If kubelet is not in the active (running) state, check its logs; the priority is to restore kubelet to a healthy, functional state:

    journalctl -xeu kubelet

     

  15. See the table under Additional Information for a list of ETCD KB articles.

 

Additional Information

ETCD Knowledge Base Articles

Issue: ETCD is failing because the control plane node is out of disk space.
KB: vSphere Supervisor Root Disk Space Full at 100%

Issue: ETCD is failing because of an expired certificate.
KB: Replace vSphere with Tanzu Guest Cluster/vSphere Kubernetes Cluster Certificates
KB: Replace vSphere with Tanzu Supervisor Certificates

Issue: ETCD is failing when trying to use /bin/sh, with the error "OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown"
KB: Liveness probe failed on etcd pods after Supervisor upgrade to v1.30.10 impacting kube-api

Issue: The ETCD database is full, with the error "etcdserver: mvcc: database space exceeded"
KB: vSphere Kubernetes Cluster Unhealthy, Kubectl Commands Failing due to ETCD Database Full or Exceeded

Issue: ETCD logs show panic errors such as "panic: assertion failed: Page expected to be:"
KB: etcd and kube-apiserver pods are in CrashLoopBackOff on Guest Cluster after a Power Outage Event

Issue: ETCD logs repeatedly show "etcdserver: leader changed"
KB: Kubectl Commands Failing with "etcdserver: leader changed"

Issue: The ETCD quorum shows 3/3 control plane nodes, but one of the ETCD members does not match the existing control plane nodes of the cluster.
KB: Stale ETCD Member Prevents Workload Cluster Upgrade

Issue: ETCD is running on each control plane node, but its logs report that it cannot connect to the other control plane nodes.
KB: ETCD Unhealthy in Control Plane Nodes due to VMs Unable to Communicate

Issue: ETCD is unhealthy because one control plane node was manually deleted.
KB: Recover Guest Cluster after a Control Plane Node was Deleted Manually