This KB article addresses a workload cluster showing as unhealthy due to networking issues affecting ETCD.
While connected to the Supervisor cluster context, the following symptoms are present:
kubectl get machines -n <workload cluster namespace>
If the cluster is intended to have 3 control plane nodes but 2 of them are missing, that is a separate issue from this KB article.
kubectl get kcp -n <workload cluster namespace>
While connected to the affected workload cluster context, the following symptoms are present:
kubectl get pods -A
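If kubectl is still responsive in the workload cluster, failing system pods can be surfaced by filtering out healthy ones (an optional example; adjust the filter as needed):

kubectl get pods -A | grep -vE 'Running|Completed'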
While connected via SSH to each control plane node in the affected workload cluster, the following symptoms are observed on each control plane VM:
crictl ps --name etcd
CONTAINER IMAGE CREATED STATE NAME
<container ID> <image ID> # days ago Running etcd
crictl logs <etcd container ID>
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"<etcd peer ID>","error":"failed to dial <etcd peer ID> on stream Message (dial tcp <control plane IP>:2380: i/o timeout)"}
{"level":"info","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"<etcd peer ID>"}
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"<etcd peer ID>","rtt":"0s","error":"dial tcp <control plane IP>:2380: i/o timeout"
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC;
vSphere Supervisor
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
When ETCD is not healthy in a cluster's control plane nodes, system pods in the cluster will fail.
Most visibly, kubectl commands will fail because kube-apiserver is crashing; kube-apiserver depends on ETCD being healthy.
ETCD maintains the cluster's database and requires quorum (a majority of its members) to operate in a healthy state.
In this scenario, a networking issue is preventing the ETCD processes on the control plane VMs from communicating with each other.
As a result, ETCD cannot maintain a healthy quorum and logs errors indicating that it cannot reach its quorum peers.
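To check quorum health directly from a control plane node, etcdctl can be run inside the ETCD container. This is a minimal sketch assuming the kubeadm-default certificate paths under /etc/kubernetes/pki/etcd; adjust the paths if the cluster stores its ETCD certificates elsewhere:

crictl exec <etcd container ID> etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

When quorum is lost, this check is expected to time out or report the endpoint as unhealthy.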
The following steps confirm that the ETCD issue is caused by the workload cluster's control plane nodes being unable to communicate with each other over ETCD port 2379 (the same network issue also affects peer traffic on port 2380, as seen in the ETCD log errors above).
kubectl get vm -o wide -n <workload cluster namespace>
If the workload cluster is expected to have 3 control plane VMs but only one is present, that is a separate issue from this KB article.
crictl ps --name etcd
If the ETCD container process is not running, that is a separate issue from this KB article.
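If no running container is returned, listing all containers (including exited ones) can help confirm whether the ETCD container stopped rather than never started:

crictl ps -a --name etcd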
crictl logs <etcd container ID>
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"<etcd peer ID>","error":"failed to dial <etcd peer ID> on stream Message (dial tcp <control plane IP>:2380: i/o timeout)"}
{"level":"info","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"<etcd peer ID>"}
{"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.65021Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"<etcd peer ID>","rtt":"0s","error":"dial tcp <control plane IP>:2380: i/o timeout"
curl -vk <this control plane's IP>:2379
* Trying <this control plane's IP>:2379...
* Connected to <this control plane's IP> (<this control plane's IP>) port 2379 (#0)
> GET / HTTP/1.1
> Host: <this control plane's IP>:2379
> User-Agent: curl/#.#.#
> Accept: */*
>
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server
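The "Empty reply from server" result above is expected: the TCP connection to the local node on port 2379 succeeded, and ETCD closed it because it received plain HTTP on a TLS-only port. This confirms the local listener is up; it can also be verified with ss, where available on the node:

ss -tlnp | grep 2379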
curl -vk <different control plane IP>:2379
* Trying <different control plane IP>:2379...