ETCD Unhealthy in Control Plane Nodes due to VMs Unable to Communicate

Article ID: 410907

Products

Tanzu Kubernetes Runtime

Issue/Introduction

This KB article covers a workload cluster showing as unhealthy due to networking issues affecting ETCD.

 

While connected to the Supervisor cluster context, the following symptoms are present:

  • All control plane nodes for the affected workload cluster are present and Running:
    kubectl get machines -n <workload cluster namespace>

If the cluster is intended to have 3 control plane nodes and one or more are missing, that is a separate issue from this KB article.
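
    Illustrative output when all control plane machines are present and Running (column layout varies by Cluster API version; names and values are placeholders):

    NAME                                CLUSTER          NODENAME                            PROVIDERID         PHASE     AGE   VERSION
    <cluster name>-control-plane-<id>   <cluster name>   <cluster name>-control-plane-<id>   vsphere://<UUID>   Running   30d   v1.xx.x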

  • The KubeadmControlPlane (KCP) object, which manages the control plane nodes, shows that zero (0) control plane nodes are Ready:
    kubectl get kcp -n <workload cluster namespace>
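
    Illustrative output for this failure state (column layout varies by Cluster API version); note that READY reports zero nodes:

    NAME             CLUSTER          INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
    <cluster name>   <cluster name>   true          false                  3          0       3         3             30d   v1.xx.x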

     

While connected to the affected workload cluster context, the following symptoms are present:

  • All kubectl commands are failing with an error:
    kubectl get pods -A
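
    The exact error varies with how the kube-apiserver is failing; a typical example:

    The connection to the server <control plane VIP>:6443 was refused - did you specify the right host or port?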

     

While SSHed into each control plane node in the affected workload cluster, the following symptoms are observed on each control plane VM:

  • The ETCD container process is stable in a Running state and is not crashing repeatedly:
    crictl ps --name etcd
    
    CONTAINER           IMAGE               CREATED             STATE               NAME
    <container ID>      <image ID>          # days ago          Running             etcd
    

     

  • ETCD logs show errors similar to the below, indicating that ETCD cannot communicate with its quorum peers:
    crictl logs <etcd container ID>
    
    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"<etcd peer ID>","error":"failed to dial <etcd peer ID> on stream Message (dial tcp <control plane IP>:2380: i/o timeout)"}
    {"level":"info","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"<etcd peer ID>"}
    
    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"<etcd peer ID>","rtt":"0s","error":"dial tcp <control plane IP>:2380: i/o timeout"


  • The kubelet service shows as Active: active (running):
    systemctl status kubelet
    
    ● kubelet.service - kubelet: The Kubernetes Node Agent
         Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
        Drop-In: /etc/systemd/system/kubelet.service.d
                 └─10-kubeadm.conf
         Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC;

Environment

vSphere Supervisor

This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).

Cause

When ETCD is not healthy in a cluster's control plane nodes, system pods in the cluster will fail.

Most visibly, kubectl commands will not work because the kube-apiserver is crashing; kube-apiserver depends on ETCD being healthy.

ETCD maintains the cluster's database and requires quorum, a majority of its members (for example, 2 of 3), to operate in a healthy state.

In this scenario, a networking issue is preventing the ETCD processes on the control plane VMs from communicating with each other.

As a result, ETCD cannot maintain a healthy quorum and returns errors that it cannot talk to its quorum peers.

Resolution

The following steps confirm that the ETCD issue is caused by the workload cluster's control plane nodes being unable to communicate with each other over the ETCD ports: 2379 for client traffic and 2380 for peer traffic (the port referenced in the log errors above).

  1. Connect to the Supervisor cluster context:
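    For example, using the kubectl vsphere plugin (server and username are placeholders):
    kubectl vsphere login --server=<Supervisor VIP or FQDN> --vsphere-username <username>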

  2. Note down the IPs of all control plane nodes for the affected workload cluster:
    kubectl get vm -o wide -n <workload cluster namespace>

    If the workload cluster is expected to have 3 control plane VMs but only one is present, that is a separate issue from this KB article.
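
    Illustrative output (column layout varies by release); note the control plane VM IPs:

    NAME                                POWER-STATE   CLASS             IMAGE              PRIMARY-IP           AGE
    <cluster name>-control-plane-<id>   PoweredOn     <VM class name>   <TKR image name>   <control plane IP>   30d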

  3. SSH into each of the workload cluster's control plane VMs
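    On vSphere Supervisor, one common method is to retrieve the workload cluster's SSH private key from its namespace and log in as the system user (secret and user names may vary by release):
    kubectl get secret <cluster name>-ssh -n <workload cluster namespace> -o jsonpath='{.data.ssh-privatekey}' | base64 -d > cluster-ssh-key
    chmod 600 cluster-ssh-key
    ssh -i cluster-ssh-key vmware-system-user@<control plane IP>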


  4. Check the status of the ETCD container process on each control plane VM:
    crictl ps --name etcd

    If the ETCD container process is not running, that is a separate issue from this KB article.

  5. Confirm that the ETCD logs show errors indicating that ETCD cannot communicate with the other control plane VM IPs:
    crictl logs <etcd container ID>
    
    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"<etcd peer ID>","error":"failed to dial <etcd peer ID> on stream Message (dial tcp <control plane IP>:2380: i/o timeout)"}
    {"level":"info","ts":"YYYY-MM-DDTHH:MM:SS.sssssZ","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"<etcd peer ID>"}
    
    {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.65021Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"<etcd peer ID>","rtt":"0s","error":"dial tcp <control plane IP>:2380: i/o timeout"

     

  6. Confirm that each control plane VM can connect to its own running ETCD over port 2379 (ping is disabled by default in this product's VMs, so curl is used instead). The Empty reply from server result below is expected and confirms that the port is open and ETCD is listening; the endpoint expects TLS, so a plain HTTP request receives an empty reply:
    curl -vk <this control plane's IP>:2379
    
    *   Trying <this control plane's IP>:2379...
    * Connected to <this control plane's IP> (<this control plane's IP>) port 2379 (#0)
    > GET / HTTP/1.1
    > Host: <this control plane's IP>:2379
    > User-Agent: curl/#.#.#
    > Accept: */*
    >
    * Empty reply from server
    * Closing connection 0
    curl: (52) Empty reply from server
    

     

  7. Confirm that each control plane VM cannot communicate with the other control plane VMs over ETCD port 2379; the connection attempt hangs at Trying and eventually times out:
    curl -vk <different control plane IP>:2379
    
    *   Trying <different control plane IP>:2379...
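
    The ETCD log errors reference peer port 2380, so the same check can be repeated against that port; the behavior is expected to match:
    curl -vk <different control plane IP>:2380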

     

  8. The above checks confirm the following:
    1. ETCD is running on each control plane VM and is not crashing.
    2. ETCD logs state that it is unable to communicate with the other control plane VMs.
    3. Each control plane VM receives a Connected response from its own ETCD process, indicating that ETCD itself is working.
    4. However, the control plane VMs cannot communicate with each other over ETCD port 2379, despite ETCD being in a Running state on all control plane VMs.

  9. Work with your networking team to find out why the cluster's control plane VMs cannot communicate with each other over their eth0 interfaces on ETCD ports 2379 and 2380.
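    To give the networking team the relevant addresses, each node's eth0 details can be captured on the VM, for example:
    ip addr show eth0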
  10. Once network communication between the control plane VMs is restored, ETCD will recover and the cluster will stabilize automatically, provided there are no further environmental issues.
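
    Recovery can be verified from the Supervisor cluster context by re-running the earlier check and confirming that the KCP object reports the control plane nodes as Ready:
    kubectl get kcp -n <workload cluster namespace>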