vSphere Kubernetes Cluster Unhealthy, Kubectl Commands Failing due to ETCD Database Full or Exceeded

Article ID: 379425

Products

VMware vSphere 7.0 with Tanzu

VMware Tanzu Kubernetes Grid Service (TKGs)

vSphere with Tanzu

VMware vSphere with Tanzu

Issue/Introduction

Kubectl commands are failing in the affected vSphere Kubernetes cluster context.

When connected to the Supervisor cluster context, the following symptoms are present:

  • Describing the affected cluster shows the below error message for the control plane nodes in the cluster:
    • kubectl describe cluster -n <affected cluster namespace> <affected cluster>
    • "Following machines are reporting unknown etcd member status"
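
The Supervisor-side check above can be scripted. The following is a minimal sketch, assuming kubectl is already set to the Supervisor cluster context. The function name has_unknown_etcd_status is hypothetical and is not part of any VMware tooling; it only searches the describe output for the message quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): reads `kubectl describe cluster`
# output on stdin and exits 0 if the etcd member status message quoted in
# this article is present, non-zero otherwise.
has_unknown_etcd_status() {
    grep -q "reporting unknown etcd member status"
}

# Example usage (fill in the placeholders for your environment):
#   kubectl describe cluster -n <affected cluster namespace> <affected cluster> \
#     | has_unknown_etcd_status && echo "control plane reports unknown etcd member status"
```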

When connected to the affected cluster context, the following symptoms are present:

  • All kubectl commands are failing or timing out.

  • The logs for kube-apiserver show the following error message:
    • kubectl logs -n kube-system <kube-apiserver pod name>
    • "etcdserver: mvcc: database space exceeded"
  • The logs for ETCD show error messages containing the below error:
    • "alarm:NOSPACE"
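
To confirm both log signatures in one pass, the log output can be piped through a small filter. This is a sketch only: scan_etcd_space_errors is a hypothetical helper name, and it assumes the logs contain the two strings exactly as quoted above.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and prints how many lines
# contain each of the two error signatures quoted in this article.
scan_etcd_space_errors() {
    awk '
        /etcdserver: mvcc: database space exceeded/ { exceeded++ }
        /alarm:NOSPACE/                             { nospace++ }
        END {
            printf "database_space_exceeded=%d alarm_nospace=%d\n", exceeded, nospace
        }'
}

# Example usage (fill in the placeholder for your environment):
#   kubectl logs -n kube-system <kube-apiserver pod name> | scan_etcd_space_errors
```

The same filter can be applied to ETCD logs, however they are collected in your environment. A non-zero count for either signature matches the symptoms listed above.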

When connected via SSH to one of the affected cluster's control plane nodes, the following symptoms are present:

  • The ETCD database file on the control plane node is 2.0 GB or larger:
    • ls -ltrh /var/lib/etcd/member/snap
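
The size comparison can also be done numerically rather than by reading the ls output. The sketch below makes several assumptions: the 2 GB limit is treated as 2 GiB (2 × 1024³ bytes), the database file is named db inside the snap directory listed above, and GNU stat is available on the node. The helper name etcd_db_over_limit is hypothetical.

```shell
#!/bin/sh
# Assumption: the 2 GB default limit is interpreted as 2 GiB in bytes.
ETCD_LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))

# Hypothetical helper: takes a file size in bytes, prints the percentage of
# the limit in use, and exits 0 only when the size is at or above the limit.
etcd_db_over_limit() {
    size_bytes=$1
    pct=$((size_bytes * 100 / ETCD_LIMIT_BYTES))
    echo "${pct}% of limit used"
    [ "$size_bytes" -ge "$ETCD_LIMIT_BYTES" ]
}

# Example usage on a control plane node (assumes the file name "db" and GNU stat):
#   etcd_db_over_limit "$(stat -c %s /var/lib/etcd/member/snap/db)" \
#     && echo "ETCD database has reached the limit"
```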

Environment

vSphere with Tanzu 7.0

vSphere with Tanzu 8.0

This can occur on a vSphere Kubernetes cluster whether or not it is managed by Tanzu Mission Control (TMC).

Cause

ETCD's keyspace data limit has been reached or exceeded.

The default ETCD database storage size limit is 2 GB.

Once this limit is reached or exceeded, ETCD raises a NOSPACE alarm and rejects further writes until space is reclaimed and the alarm is cleared.

Kube-apiserver is reliant on ETCD being healthy.

Without kube-apiserver in a healthy state, kubectl commands will fail.

Resolution

Please open a ticket with VMware by Broadcom Technical Support, referencing this KB, for assistance in cleaning up the ETCD database and restoring it to an operational state.


Once ETCD is operational again, investigate the root cause of what is filling the ETCD database.

Otherwise, the issue is likely to recur, and how soon depends on how quickly the database is being filled.
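
As a starting point for that investigation, the number of stored objects per resource type can hint at what is growing. The following is a sketch under stated assumptions, not a supported procedure: it assumes the cluster's Kubernetes version exposes the apiserver_storage_objects metric on the kube-apiserver /metrics endpoint, and top_stored_objects is a hypothetical helper name. Object counts do not measure size in bytes, so treat the result as a lead rather than a conclusion.

```shell
#!/bin/sh
# Hypothetical helper: reads kube-apiserver metrics text on stdin and prints
# the resource types with the highest stored-object counts, largest first.
# The optional first argument is how many lines to print (default 10).
top_stored_objects() {
    limit=${1:-10}
    grep '^apiserver_storage_objects{' \
        | awk '{ print $2, $1 }' \
        | sort -g -r \
        | head -n "$limit"
}

# Example usage once kubectl is working again in the affected cluster context:
#   kubectl get --raw /metrics | top_stored_objects 5
```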