This article provides steps to troubleshoot common ETCD issues in a vSphere Supervisor environment.
ETCD issues can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
In the vSphere Supervisor product, a cluster's database is maintained and managed by the ETCD process.
This ETCD process relies on a healthy quorum that matches the expected number of control plane nodes in the cluster.
If the ETCD quorum is unhealthy or broken, ETCD will experience issues and may fail.
Because many system services rely on ETCD and its database, the system will not function properly, or will fail outright, when ETCD is unhealthy.
This includes kubectl commands, which rely on the kube-apiserver; the kube-apiserver in turn depends on the health of ETCD and its database.
In the vSphere Supervisor product, it is important to understand which cluster is affected by the ETCD issue. Each control plane node in a cluster runs its own ETCD instance, and the Supervisor Cluster and each workload cluster in the environment maintain separate ETCD databases.
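One way to scope the issue (a sketch, assuming you are logged in to the Supervisor cluster context) is to list the Cluster objects for the workload clusters managed by this Supervisor:
kubectl get clusters -A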
See "How to SSH into Supervisor Control Plane VMs" from KB article Troubleshooting vSphere Supervisor Control Plane VMs
etcdctl member list -w table
etcdctl --cluster=true endpoint health -w table
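For reference, a healthy three-member quorum returns output similar to the following; the endpoints and timings are illustrative placeholders, and exact columns can vary by etcd version:
+-------------------------------------+--------+-------------+-------+
|               ENDPOINT              | HEALTH |    TOOK     | ERROR |
+-------------------------------------+--------+-------------+-------+
| https://<Supervisor VM 1 ETH0>:2379 |  true  |  9.123456ms |       |
| https://<Supervisor VM 2 ETH0>:2379 |  true  | 10.234567ms |       |
| https://<Supervisor VM 3 ETH0>:2379 |  true  | 11.345678ms |       |
+-------------------------------------+--------+-------------+-------+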
If the ETCD quorum is unhealthy, the above command will instead report one or more members with a HEALTH value of false and output a related error message:
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.ssssssZ","logger":"client","caller":"v#@v<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001aa000/<Supervisor VM ETH0>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp <Supervisor VM ETH0>:2379: connect: connection refused\""}
etcdctl --cluster=true endpoint status -w table
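In a healthy cluster, all members appear in the status table with exactly one leader, similar to the following; the values shown are illustrative placeholders, and exact columns can vary by etcd version:
+-------------------------------------+-------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
|               ENDPOINT              |     ID      |    VERSION     | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------------------+-------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://<Supervisor VM 1 ETH0>:2379 | <member id> | <etcd version> |  ## MB  |   true    |   false    |     #     |     #      |         #          |        |
| https://<Supervisor VM 2 ETH0>:2379 | <member id> | <etcd version> |  ## MB  |   false   |   false    |     #     |     #      |         #          |        |
| https://<Supervisor VM 3 ETH0>:2379 | <member id> | <etcd version> |  ## MB  |   false   |   false    |     #     |     #      |         #          |        |
+-------------------------------------+-------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+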
Any unhealthy ETCD members will not appear in the above table and an error message will be output:
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","logger":"etcd-client","caller":"v#@v<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001aa000/<localhost>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp <Supervisor VM ETH0>:2379: connect: connection refused\""}
Failed to get the status of endpoint https://<Supervisor VM ETH0>:2379 (context deadline exceeded)
If ETCD is unhealthy, check whether the ETCD container is running on each Supervisor Control Plane VM:
crictl ps --name etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
<etcd container id> <etcd image id> # days ago Running etcd # <etcd pod id> <etcd pod name>
If the above command does not return an ETCD container in Running state, then ETCD is down or crashing on this particular Supervisor Control Plane VM.
crictl ps -a --name etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
<etcd container id> <etcd image id> # days ago Exited etcd # <etcd pod id> <etcd pod name>
crictl logs <etcd container id>
You can also view the etcd container logs under the following directory:
ls /var/log/pods/kube-system_etcd-*/etcd/
Also confirm that kubelet is healthy on the Supervisor Control Plane VM, since kubelet manages the static ETCD pod:
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; # days ago
If kubelet is not in the active (running) state, check its logs; the priority should be to restore kubelet to a healthy, functional state:
journalctl -xeu kubelet
For a workload cluster, start by checking its KubeadmControlPlane (kcp) object from the Supervisor context:
kubectl get kcp -n <workload cluster namespace>
NAME CLUSTER INITIALIZED API SERVER AVAILABLE REPLICAS READY UPDATED UNAVAILABLE AGE VERSION
<kcp name> <workload cluster name> true true 3 3 3 0 0 #d <VKR version>
ETCD will be in a degraded state if only 2 of 3 control plane nodes are available in the workload cluster.
If only 1 of 3 control plane nodes is available in the workload cluster, ETCD will lose quorum and fail.
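For context, ETCD requires a majority of members to accept writes: quorum = floor(n/2) + 1. In a 3-member cluster the quorum is 2, so losing one member leaves ETCD degraded but functional, while losing two members breaks quorum and ETCD stops serving requests.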
kubectl describe kcp <kcp name> -n <workload cluster namespace>
Ready: true
Ready Replicas: 3
Replicas: 3
Selector: cluster.x-k8s.io/cluster-name=<workload cluster name>,cluster.x-k8s.io/control-plane
Unavailable Replicas: 0
Updated Replicas: 3
Version: <VKR version>
Events: <none>
On each workload cluster control plane VM, confirm that the ETCD container is in a Running state:
crictl ps --name etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
<etcd container id> <etcd image id> # days ago Running etcd # <etcd pod id> <etcd pod name>
crictl ps -a --name etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
<etcd container id> <etcd image id> # days ago Exited etcd # <etcd pod id> <etcd pod name>
crictl logs <etcd container id>
You can also view the etcd container logs under the following directory:
ls /var/log/pods/kube-system_etcd-*/etcd/
Retrieve the ETCD container ID of a Running ETCD:
crictl ps --name etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
<etcd container id> <etcd image id> # days ago Running etcd # <etcd pod id> <etcd pod name>
Using the Running ETCD container ID from above, set an alias for the etcdctl CLI used to interface with the ETCD database:
alias etcdctl='crictl exec <etcd container id> etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
etcdctl member list -w table
See the below for an example of a workload cluster with 3/3 control plane nodes:
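(Member IDs, node names, and IP addresses below are illustrative placeholders.)
+---------------+---------+------------------------+--------------------------------------+--------------------------------------+------------+
|      ID       | STATUS  |          NAME          |              PEER ADDRS              |             CLIENT ADDRS             | IS LEARNER |
+---------------+---------+------------------------+--------------------------------------+--------------------------------------+------------+
| <member 1 id> | started | <control plane node 1> | https://<Control Plane VM 1 IP>:2380 | https://<Control Plane VM 1 IP>:2379 |   false    |
| <member 2 id> | started | <control plane node 2> | https://<Control Plane VM 2 IP>:2380 | https://<Control Plane VM 2 IP>:2379 |   false    |
| <member 3 id> | started | <control plane node 3> | https://<Control Plane VM 3 IP>:2380 | https://<Control Plane VM 3 IP>:2379 |   false    |
+---------------+---------+------------------------+--------------------------------------+--------------------------------------+------------+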
etcdctl --cluster=true endpoint health -w table
The below is an example of a workload cluster with 3/3 healthy ETCD quorum:
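(The endpoints and timings shown are illustrative placeholders; exact columns can vary by etcd version.)
+--------------------------------------+--------+-------------+-------+
|               ENDPOINT               | HEALTH |    TOOK     | ERROR |
+--------------------------------------+--------+-------------+-------+
| https://<Control Plane VM 1 IP>:2379 |  true  |  9.123456ms |       |
| https://<Control Plane VM 2 IP>:2379 |  true  | 10.234567ms |       |
| https://<Control Plane VM 3 IP>:2379 |  true  | 11.345678ms |       |
+--------------------------------------+--------+-------------+-------+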
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.ssssssZ","logger":"client","caller":"v#@<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002b8fc0/<Control Plane VM IP>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp <Control Plane VM IP>:2379: connect: connection refused\""}
etcdctl --cluster=true endpoint status -w table
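In a healthy workload cluster with 3/3 control plane nodes, all three members appear in the status table with exactly one leader; the values below are illustrative placeholders, and exact columns can vary by etcd version:
+--------------------------------------+-------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
|               ENDPOINT               |     ID      |    VERSION     | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------------------+-------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://<Control Plane VM 1 IP>:2379 | <member id> | <etcd version> |  ## MB  |   true    |   false    |     #     |     #      |         #          |        |
| https://<Control Plane VM 2 IP>:2379 | <member id> | <etcd version> |  ## MB  |   false   |   false    |     #     |     #      |         #          |        |
| https://<Control Plane VM 3 IP>:2379 | <member id> | <etcd version> |  ## MB  |   false   |   false    |     #     |     #      |         #          |        |
+--------------------------------------+-------------+----------------+---------+-----------+------------+-----------+------------+--------------------+--------+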
Any unhealthy ETCD members will not appear in the table above, and an error message similar to the following will be output:
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.ssssssZ","logger":"etcd-client","caller":"v#@<etcd version>/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002e7180/<localhost>:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp <Control Plane VM IP>:2379: connect: connection refused\""}
Failed to get the status of endpoint https://<Control Plane VM IP>:2379 (context deadline exceeded)
Also confirm that kubelet is healthy on each workload cluster control plane VM:
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; # days ago
If kubelet is not in active (running) state, its logs should be checked and your priority should be to restore kubelet to a healthy, functional state:
journalctl -xeu kubelet
Below are common ETCD issues and their related Knowledge Base (KB) articles:
| Issue | Knowledge Base Article (KB) |
| ETCD is failing because the control plane node is out of disk space. | vSphere Supervisor Root Disk Space Full at 100% |
| ETCD is failing because of an expired certificate | Replace vSphere with Tanzu Guest Cluster/vSphere Kubernetes Cluster Certificates |
| ETCD is failing on trying to use /bin/sh | Liveness probe failed on etcd pods after Supervisor upgrade to v1.30.10 impacting kube-api |
| ETCD Database is full | vSphere Kubernetes Cluster Unhealthy, Kubectl Commands Failing due to ETCD Database Full or Exceeded |
| ETCD logs show panic errors | etcd and kube-apiserver pods are in CrashLoopBackOff on Guest Cluster after a Power Outage Event |
| ETCD logs repeat "etcdserver: leader changed" | Kubectl Commands Failing with "etcdserver: leader changed" |
| ETCD quorum shows 3/3 control plane nodes, but one of the ETCD members does not match the existing control plane nodes of the cluster | Stale ETCD Member Prevents Workload Cluster Upgrade |
| ETCD is running on each control plane node, but its logs report that it cannot connect to the other control plane nodes. | ETCD Unhealthy in Control Plane Nodes due to VMs Unable to Communicate |
| ETCD is unhealthy because one control plane node was manually deleted | Recover Guest Cluster after a Control Plane Node was Deleted Manually |