Error message "etcdserver: mvcc: database space exceeded" or "etcdserver: no space"

Article ID: 298690

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The kube-apiserver may fail with the error message "etcdserver: mvcc: database space exceeded" in the kube-apiserver logs.

The etcd logs may also show the error message "etcdserver: no space" repeatedly.

Environment

VMware Tanzu Kubernetes Grid Integrated Edition

Cause

The default quota for the etcd db file size is 2GB. When the db file reaches this quota, etcd raises a space alarm and rejects further writes to the db, reporting that there is no more space.
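
As a rough illustration of the quota described above, the following sketch pulls the db size out of the JSON printed by `etcdctl endpoint status -w json` and expresses it as a percentage of the 2GB default. The helper names `extract_db_sizes` and `quota_percent` are hypothetical, and the sketch assumes the JSON reports the size in bytes under a `dbSize` key.

```shell
# Default etcd quota in bytes (2GB), as stated above.
DEFAULT_QUOTA_BYTES=2147483648

# Hypothetical helper: read `etcdctl endpoint status -w json` output on stdin
# and print each member's dbSize (assumed to be in bytes), one per line.
extract_db_sizes() {
  grep -Eo '"dbSize":[0-9]+' | grep -Eo '[0-9]+'
}

# Hypothetical helper: print how much of the quota a db size uses, as a whole percent.
# $1 = db size in bytes, $2 = quota in bytes (optional, defaults to 2GB).
quota_percent() {
  echo $(( $1 * 100 / ${2:-$DEFAULT_QUOTA_BYTES} ))
}

# Usage on a master node (not executed here):
#   /var/vcap/jobs/etcd/bin/etcdctl endpoint status -w json | extract_db_sizes
#   quota_percent 1073741824
```

A value close to 100 would suggest that the db file is near the quota and that the steps in the Resolution section are needed.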

Resolution

Reclaim the disk space consumed by the db file by following the steps below.

Step 1: Back up the db

Log into any master node and execute the following command. When it completes, copy the snapshot.db file to a safe place. Setting the environment variable LOCAL to true restricts the etcd endpoints to the local master instance, which is required because the snapshot command must run against exactly one instance.

$ LOCAL=true /var/vcap/jobs/etcd/bin/etcdctl snapshot save snapshot.db
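
One way to copy the snapshot to a safe place and confirm that the copy is intact is sketched below. The helper name `copy_verified` is hypothetical, and the sketch assumes `sha256sum` is available on the node and that the destination is a file path rather than a directory.

```shell
# Hypothetical helper: copy a file and confirm the copy has the same SHA-256 checksum.
# $1 = source file (e.g. snapshot.db), $2 = destination file path.
# Returns 0 when the checksums match, non-zero otherwise.
copy_verified() {
  cv_src=$1
  cv_dest=$2
  cp "$cv_src" "$cv_dest" || return 1
  cv_src_sum=$(sha256sum < "$cv_src" | awk '{print $1}')
  cv_dest_sum=$(sha256sum < "$cv_dest" | awk '{print $1}')
  [ "$cv_src_sum" = "$cv_dest_sum" ]
}

# Usage (not executed here; the destination path is a placeholder):
#   copy_verified snapshot.db /path/to/safe/place/snapshot.db && echo "copy verified"
```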


Step 2: Compact

etcd uses a multi-version concurrency control (MVCC) mechanism to manage the keyspace. It never removes data in place; every change, including the deletion of a key/value, is appended as a new revision. Compacting the history discards old revisions and prevents eventual storage space exhaustion. Log into any master node and execute the following commands.

Note that the commands only need to be executed once, on one master node.

Execute the command below to get the latest revision.

$ /var/vcap/jobs/etcd/bin/etcdctl endpoint status -w json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'


Then execute the command below to compact away old revisions. Replace $REVISION with the revision value returned by the previous command.

$ /var/vcap/jobs/etcd/bin/etcdctl compact $REVISION 
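
The two commands above can be chained so that the revision does not have to be copied by hand. The following is a minimal sketch; the helper name `extract_revision` is hypothetical. When the status output covers several endpoints it contains one revision per member, and this sketch keeps the smallest one, which is a conservative assumption rather than a documented requirement.

```shell
# Hypothetical helper: read `etcdctl endpoint status -w json` output on stdin
# and print a single revision number.
# If several members are listed, the smallest revision is printed (assumption:
# compacting to the lowest reported revision is the conservative choice).
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | sort -n | head -n 1
}

# Usage on a master node (not executed here):
#   REVISION=$(/var/vcap/jobs/etcd/bin/etcdctl endpoint status -w json | extract_revision)
#   /var/vcap/jobs/etcd/bin/etcdctl compact "$REVISION"
```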


Step 3: Defragment

On each master node, one node at a time, run the following commands.

## Step 3.1: Stop etcd and all other Kubernetes-related jobs (such as kube-apiserver, ncp, csi-*, kube-*, and vsphere-*), which would otherwise flap or crash-loop while etcd is unhealthy. Do not stop the bosh-*, blackbox, and system-* jobs. Replace $KUBERNETES-RELATED-JOB below with a job name and repeat the command for each job. (Note: Stopping the other Kubernetes-related jobs saves time for the "monit stop/start" commands in this step and in step 3.5, because monit processes one job at a time and waits for each job to succeed or fail, up to that job's timeout setting, before moving on to the next job. If the other jobs are left running, monit could take more than 10 minutes to process the request for etcd.)

$ monit stop etcd
$ monit stop $KUBERNETES-RELATED-JOB


## Step 3.2: Back up /var/vcap/store/etcd/member. Replace $BACKUP-PATH with a location that has enough free disk space. Run `du -sh /var/vcap/store/etcd/member` to see how much disk space is needed.
$ cp -r /var/vcap/store/etcd/member $BACKUP-PATH
 
## Step 3.3: Defragment. (Note: "etcdctl defrag" can still be used as of TKGI v1.22; if etcdutl is needed instead, it is available at /var/vcap/packages/etcd/bin/etcdutl.)
$ /var/vcap/jobs/etcd/bin/etcdctl defrag --data-dir /var/vcap/store/etcd
 
## Step 3.4: Change ownership
$ chown vcap:vcap /var/vcap/store/etcd/member/snap/db
 
## Step 3.5: Start etcd and all other kubernetes-related jobs that were stopped in step 3.1.
$ monit start etcd
$ monit start $KUBERNETES-RELATED-JOB

## Once "monit summary" shows that etcd is running, repeat step 3 on the next master node. Errors in the etcd logs can be ignored at this point, and there is no need to wait for the other jobs to reach the 'running' state.
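
The job names to pass to "monit stop" and "monit start" in steps 3.1 and 3.5 could be collected from "monit summary" instead of being typed one by one. The following is a minimal sketch; the helper name `select_k8s_jobs` is hypothetical, it assumes that "monit summary" prints each job as a line of the form `Process 'NAME'  STATUS`, and its pattern only covers the example job names listed in step 3.1, so it may need extending for a given cluster.

```shell
# Hypothetical helper: read `monit summary` output on stdin and print the
# Kubernetes-related job names from step 3.1, one per line.
# etcd, bosh-*, blackbox, and system-* jobs are never matched by the pattern.
select_k8s_jobs() {
  sed -n "s/^Process '\([^']*\)'.*/\1/p" |
    grep -E '^(kube-.*|ncp|csi-.*|vsphere-.*)$'
}

# Usage on a master node (not executed here):
#   monit stop etcd
#   for job in $(monit summary | select_k8s_jobs); do monit stop "$job"; done
```

Matching on an explicit list of names, rather than excluding the bosh-*, blackbox, and system-* jobs, keeps the sketch from stopping a job that should be left running.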

Step 4: Disarm all alarms

Log into any master node and execute the commands below. Note that the commands only need to be executed once, on one master node.

## Step 4.1: List all alarms
$ /var/vcap/jobs/etcd/bin/etcdctl alarm list
 
## Step 4.2: Disarm all alarms
$ /var/vcap/jobs/etcd/bin/etcdctl alarm disarm
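
After disarming, the alarm list can be checked again to confirm that no space alarm remains. The following is a minimal sketch; the helper name `has_nospace_alarm` is hypothetical, and it assumes that "etcdctl alarm list" prints active alarms as lines containing `alarm:NOSPACE`.

```shell
# Hypothetical helper: read `etcdctl alarm list` output on stdin and succeed
# (exit status 0) if any line reports a NOSPACE alarm.
has_nospace_alarm() {
  grep -q 'alarm:NOSPACE'
}

# Usage on a master node (not executed here):
#   if /var/vcap/jobs/etcd/bin/etcdctl alarm list | has_nospace_alarm; then
#     echo "NOSPACE alarm still active"
#   fi
```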