vSphere Kubernetes Cluster Unhealthy, due to ETCD Database Full

Article ID: 345903

Products

VMware vSphere Kubernetes Service

Issue/Introduction

kubectl commands fail in the affected vSphere Kubernetes cluster.
The Kubernetes Status under the Supervisor cluster reports an error.

Running the following commands from a healthy Supervisor Control Plane node returns the error "context deadline exceeded":

"etcdctl endpoint health --cluster=true -w table"
"etcdctl endpoint status --cluster=true -w table"

Running "crictl ps" on the unhealthy Supervisor node shows the etcd and kube-apiserver containers crashing and restarting in a loop.
The "/var/log/containers/etcd-<>_kube-system_etcd-<>.log" file on the unhealthy Supervisor node shows the following error:

<timestamp>.357959735Z stderr F {"level":"fatal","ts":"YYYY-MM-DDTHH:MM:SS.XXXZ","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 908, fileSize(15368192) - offset(15368064) - padBytes(4) = entryLimit(124)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:32\nruntime.main\n\truntime/proc.go:250"}

On the unhealthy Supervisor node, etcd stopped writing WAL files under the /var/lib/etcd/member/wal/ directory when the issue occurred, whereas on a healthy node these files are continuously updated.
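One way to compare a healthy and an unhealthy node is to list the WAL files that were modified recently. The sketch below is illustrative only: the helper name `recent_wal_files` and the 5-minute window are assumptions, not values from this article, and it relies on a `find` that supports `-maxdepth` and `-mmin`.

```shell
#!/bin/sh
# Hypothetical helper: list *.wal files modified within the last N minutes.
# Arguments: WAL directory (defaults to the path used in this article),
#            window in minutes (default 5, an assumed value).
recent_wal_files() {
    wal_dir="${1:-/var/lib/etcd/member/wal}"
    minutes="${2:-5}"
    # -mmin -N matches files whose content changed less than N minutes ago.
    find "$wal_dir" -maxdepth 1 -name '*.wal' -mmin "-$minutes"
}

# Usage on a Supervisor node:
#   recent_wal_files /var/lib/etcd/member/wal 5
```

On a healthy node this is expected to print at least one file. Empty output on the affected node is consistent with etcd no longer writing to the WAL.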

Environment

VMware vSphere Kubernetes Service

Cause

The error is caused by etcd attempting to write an entry that exceeds the remaining space in the current WAL segment.
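The numbers in the example log line can be checked by hand. The short sketch below reproduces the arithmetic from the message, using only the values shown in that log entry; the variable names are chosen for illustration.

```shell
#!/bin/sh
# Values copied from the example error message in this article.
file_size=15368192
offset=15368064
pad_bytes=4
rec_bytes=908

# Space left for one more record in the current WAL segment,
# as computed in the log message itself.
entry_limit=$((file_size - offset - pad_bytes))
echo "entryLimit=${entry_limit}"

# The record is larger than the remaining space, hence the error.
if [ "$rec_bytes" -gt "$entry_limit" ]; then
    echo "record of ${rec_bytes} bytes exceeds entryLimit of ${entry_limit} bytes"
fi
```

With these values the remaining space is 124 bytes, while the record is 908 bytes.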

One possible reason for etcd to stop writing WAL files in "/var/lib/etcd/member/wal/" and end up in a crash loop is insufficient disk space on the Supervisor nodes.

Resolution

Verify there's enough disk space in the Supervisor nodes with "df -h". Usage of around 80% or below should be healthy.
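As a rough aid, the 80% guideline can be applied automatically to `df` output. This is a minimal sketch, assuming `awk` is available on the node; the function name `over_threshold` is hypothetical, and it reads `df -P` output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: print "<mount point> <usage>" for every filesystem
# whose usage is above the given percentage (default 80).
# Expects `df -P` output on stdin; mount points containing spaces are
# not handled.
over_threshold() {
    threshold="${1:-80}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 > t) print $6 " " $5
    }'
}

# Usage on a Supervisor node:
#   df -P | over_threshold 80
```

Any line printed points at a filesystem that is worth cleaning up first.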
If there is not enough free disk space, follow the steps below to clean up:
- Check the /root partition for old log files and remove them.
- Run the following commands:
```
cd /var/log/vmware/audit
rm *log.gz
journalctl --vacuum-time=2d
```
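Before deleting anything, it can help to see which files the `rm *log.gz` pattern would match. The following is a sketch only; `list_rotated_logs` is a hypothetical helper name, and it assumes a `find` that supports `-maxdepth`.

```shell
#!/bin/sh
# Hypothetical helper: list the compressed log files that the
# "rm *log.gz" step would remove, without deleting them.
# Argument: log directory (defaults to the path used in this article).
list_rotated_logs() {
    log_dir="${1:-/var/log/vmware/audit}"
    find "$log_dir" -maxdepth 1 -type f -name '*log.gz'
}

# Usage on a Supervisor node:
#   list_rotated_logs /var/log/vmware/audit
#   journalctl --disk-usage    # space currently used by the journal
```

Reviewing this list first reduces the risk of removing files from the wrong directory.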

If the issue persists after freeing up sufficient disk space, follow the steps below:

Move the most recent incomplete WAL file out of the /var/lib/etcd/member/wal/ directory:
- mv /var/lib/etcd/member/wal/<filename>.wal /root/
The next time kubelet restarts the etcd container, etcd will resume writing files to "/var/lib/etcd/member/wal/" and stabilize.
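The file to move can be located by modification time. The sketch below assumes that the most recently modified `.wal` file is the incomplete one, which should be confirmed manually. It only prints the `mv` command instead of running it, and the helper names `newest_wal` and `print_move_command` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified *.wal file.
# Argument: WAL directory (defaults to the path used in this article).
newest_wal() {
    wal_dir="${1:-/var/lib/etcd/member/wal}"
    # ls -t sorts by modification time, newest first.
    ls -t "$wal_dir"/*.wal 2>/dev/null | head -n 1
}

# Dry run: print the move command for review rather than executing it.
print_move_command() {
    f=$(newest_wal "$1")
    [ -n "$f" ] || { echo "no .wal files found" >&2; return 1; }
    echo "mv $f /root/"
}

# Usage on the unhealthy Supervisor node:
#   print_move_command /var/lib/etcd/member/wal
```

Printing the command first lets the operator verify the file name before moving anything out of the etcd data directory.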
