The kube-apiserver may fail, with the error message "etcdserver: mvcc: database space exceeded" appearing in the kube-apiserver logs.
The etcd logs may also show the error message "etcdserver: no space" repeatedly.
VMware Tanzu Kubernetes Grid Integrated Edition
The default quota for the etcd db file size is 2GB. Once the db file reaches 2GB, etcd reports errors saying it can no longer write to the db because no space is left.
TKGI clusters run etcd compaction every 5 minutes. Compaction does not reduce the db file size; it only releases the space held by unneeded revisions so that it can be reused. This is usually sufficient to keep the db file from growing indefinitely. However, a massive number of keys, or very large keys and values, stored in etcd could require more than 2GB of storage space due to:
First, use etcdctl to check whether etcd is filled with a massive number of objects and/or large objects. If you confirm that a massive number of objects and/or large objects were recreated unnecessarily, review whether they can be reduced at the source. This resolves the space issue at its root.
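To make the 2GB quota concrete, the sketch below computes how much of the quota a given db file size uses. The helper name `db_usage_percent` is hypothetical, and the default quota is assumed to be 2 GiB (2147483648 bytes). On a master node, the db size would typically be taken from the `dbSize` field printed by `etcdctl endpoint status -w json`.

```shell
# Hypothetical helper (not part of TKGI or etcd): print the percentage of the
# quota that the db file currently uses.
# $1 = db file size in bytes
# $2 = quota in bytes (assumption: defaults to 2 GiB = 2147483648 bytes)
db_usage_percent() {
  size_bytes=$1
  quota_bytes=${2:-2147483648}
  echo $(( size_bytes * 100 / quota_bytes ))
}

# Example: a 1.5 GiB db file measured against the assumed default quota.
db_usage_percent 1610612736
```

A value close to 100 means writes are about to be rejected, so the checks and cleanup steps described in this article should be run before that point.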
// bosh ssh into any master node and switch to root user
$ sudo -i
// check total value size
// check total value size for particular object type, such as secret
// list size of each object with the particular type
~# for key in $(
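As an illustration of the per-object size check, the sketch below ranks objects by value size. It is not the original command set. It assumes that Kubernetes objects are stored under the `/registry` prefix and that `etcdctl get` is called with its `--prefix`, `--keys-only` and `--print-value-only` flags. The helper name `largest_entries` is hypothetical.

```shell
# Hypothetical helper: read "<bytes> <key>" lines on stdin and print the
# N largest entries (default 10), largest first.
largest_entries() {
  sort -rn | head -n "${1:-10}"
}

# Assumed usage on a master node as root (commented out because it needs a
# running etcd; the /registry/secrets prefix is an assumption):
# for key in $(/var/vcap/jobs/etcd/bin/etcdctl get /registry/secrets --prefix --keys-only); do
#   printf '%s %s\n' "$(/var/vcap/jobs/etcd/bin/etcdctl get "$key" --print-value-only | wc -c)" "$key"
# done | largest_entries 5
```

Sorting by size first makes it easier to see whether a few very large objects or a large count of small ones is consuming the space.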
Second, if the issue is caused by an auto-compaction failure rather than by a massive number of objects and/or large objects, you can manually reclaim the disk space consumed by the db file with the following steps.
Log into any master node and execute the following command. When it completes, copy the snapshot.db file to a safe place. Setting the environment variable LOCAL to true restricts the etcd endpoints to the local master instance only, which the snapshot command needs because it works against exactly one instance.
etcd uses an MVCC mechanism to manage the keyspace. It never removes data in place; it always appends new data, even when a key/value is deleted. Compacting the history therefore prevents eventual storage space exhaustion. Log into any master node and execute the following commands.
Note that you only need to execute the commands one time on one master node.
Execute the command below to get the latest revision.
$ /var/vcap/jobs/etcd/bin/etcdctl endpoint status -w json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
Then execute the command below to compact away old revisions. Replace $REVISION with the revision value returned by the previous command.
$ /var/vcap/jobs/etcd/bin/etcdctl compact $REVISION
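The two commands above can be chained so that the revision does not have to be copied by hand. This is a minimal sketch: `extract_revision` is a hypothetical helper, and it assumes the JSON printed by `etcdctl endpoint status -w json` contains a `"revision":<number>` field, as the command above implies. If several endpoints are reported, it keeps only the first revision.

```shell
# Hypothetical helper: read endpoint-status JSON on stdin and print the first
# revision number found in it.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Assumed usage on one master node (commented out because it modifies the cluster):
# REVISION=$(/var/vcap/jobs/etcd/bin/etcdctl endpoint status -w json | extract_revision)
# /var/vcap/jobs/etcd/bin/etcdctl compact "$REVISION"
```

Quoting `"$REVISION"` makes the compact command fail visibly if the extraction returned nothing, instead of running with a missing argument.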
For each master node, run the following commands.
## Once "monit summary" shows etcd is running, repeat the steps on the next master node. No need to worry about errors in the etcd logs for now. No need to wait for other jobs to be in 'running' state.
Log into any master node and execute the commands below. Note that you only need to execute the commands one time on one master node.
# Step 1: List all alarms
$ /var/vcap/jobs/etcd/bin/etcdctl alarm list
# Step 2: Disarm all alarms
$ /var/vcap/jobs/etcd/bin/etcdctl alarm disarm
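If this step is scripted, it can help to disarm only when a space alarm is actually raised. The following is a minimal sketch, assuming `etcdctl alarm list` prints one line per alarm in the form `memberID:<id> alarm:NOSPACE`. The helper name `has_nospace_alarm` is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if the alarm listing read from stdin
# contains a NOSPACE alarm, fail otherwise.
has_nospace_alarm() {
  grep -q 'alarm:NOSPACE'
}

# Assumed usage on one master node (commented out because it modifies the cluster):
# if /var/vcap/jobs/etcd/bin/etcdctl alarm list | has_nospace_alarm; then
#   /var/vcap/jobs/etcd/bin/etcdctl alarm disarm
# fi
```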