This guide provides the steps to safely perform offline defragmentation on an etcd cluster, one node at a time. This process is used to reclaim space when the database size (dbSize) is significantly larger than its actual usage (dbSizeInUse).
Important: This is a high-risk maintenance operation. Ensure you have a complete, verified backup of your etcd cluster or snapshot before proceeding.
Symptoms:
kubectl commands fail with "no route to host" errors.
Impact/Risks:
The etcd container restarts frequently in the cluster, making the control plane unstable until the fragmentation is addressed.
Phase 1: Preparation
Before modifying any etcd node, you must prepare the cluster and verify its health.
1. Pause Cluster Reconciliation Run this kubectl command from your management cluster to prevent the cluster controller from making conflicting changes during maintenance.
kubectl patch cluster <Cluster> -n <Namespace> --type merge -p '{"spec":{"paused": true}}'
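To confirm the patch took effect, you can read the field back (using the same placeholders):
kubectl get cluster <Cluster> -n <Namespace> -o jsonpath='{.spec.paused}'
# Expected output: true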
2. Access a Control Plane Node SSH into any one of your control plane (CP) nodes.
3. Set up etcdctl Alias The etcdctl utility is typically located inside the etcd container image rather than on the host. This alias makes it accessible. You will need to re-create the alias in each new SSH session you open on each node.
Note: The snapshot path may vary based on your container runtime and configuration.
alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
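As a quick check that the alias resolves to a working binary, print the client version:
etcdctl version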
4. Check Cluster Health Verify that all members are healthy and identify the leader and followers.
# List all members and their status
etcdctl -w table member list
# Check health and database size for all endpoints
etcdctl -w table endpoint status --cluster
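If your etcdctl version supports it (v3.4+), you can also run a per-endpoint health probe across the cluster:
# Check the health of every member
etcdctl -w table endpoint health --cluster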
5. Check for Fragmentation Run the following command to see the percentage of wasted space. A high percentage indicates a need for defragmentation.
etcdctl endpoint status --cluster -w json | jq '.[] | ((.Status.dbSize - .Status.dbSizeInUse)/.Status.dbSize)*100'
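If you prefer output labeled per endpoint, a variant of the same jq filter (the formatting here is just one option) is:
etcdctl endpoint status --cluster -w json | jq -r '.[] | "\(.Endpoint): \(((.Status.dbSize - .Status.dbSizeInUse) / .Status.dbSize) * 100 | floor)% reclaimable"'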
6. Compact Revision History Before defragmenting, compact the history. First, get the current revision number from any healthy endpoint:
# Example of getting the revision from the first endpoint
REVISION=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
echo $REVISION
Now, use that revision number to compact the database. This discards all history prior to this revision.
etcdctl compaction $REVISION
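By default the command returns once the compaction is accepted; if you want it to block until old revisions are physically removed from the backend, etcdctl accepts a --physical flag:
etcdctl compaction --physical $REVISION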
Phase 2: Defragment Each Node
Perform these steps on one node at a time, starting with the followers. Defragment the leader last.
WARNING: Do not proceed to the next node until the current node has successfully rejoined the cluster and the cluster is healthy.
1. Stop Kubelet and etcd Pod On the CP node you are servicing:
# Stop kubelet to prevent it from restarting static pods
systemctl stop kubelet
# Manually stop the etcd and kube-apiserver containers
crictl rm -f $(crictl ps --label io.kubernetes.container.name=etcd -q)
crictl rm -f $(crictl ps --label io.kubernetes.container.name=kube-apiserver -q)
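As a sanity check before touching the data directory, confirm both containers are gone; these commands should print no container IDs:
crictl ps --label io.kubernetes.container.name=etcd -q
crictl ps --label io.kubernetes.container.name=kube-apiserver -q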
2. Back Up etcd Data Create a backup of this specific node's etcd data directory.
mkdir -p /root/etcdbkp
cp -a /var/lib/etcd /root/etcdbkp/
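A quick verification of the copy before proceeding (paths assume the backup location used above):
# The backup should be roughly the same size as the live data directory
du -sh /var/lib/etcd /root/etcdbkp/etcd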
3. Defragment the Database Run the etcdctl defrag command, pointing it to the data directory.
# Set the alias again if this is a new session or script
alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
# Run defragmentation
etcdctl defrag --data-dir /var/lib/etcd/
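Note: On etcd v3.5 and later, offline defragmentation is also provided by the standalone etcdutl binary, and etcdctl defrag with --data-dir is deprecated there. If etcdutl is present at the same image path, the equivalent command is:
etcdutl defrag --data-dir /var/lib/etcd/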
4. Restart Services Start the kubelet, which will in turn restart the etcd and kube-apiserver static pods.
systemctl start kubelet
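It can take a minute for the static pods to be recreated. You can watch for the containers to reappear before checking cluster health:
crictl ps --label io.kubernetes.container.name=etcd
crictl ps --label io.kubernetes.container.name=kube-apiserver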
5. Verify Cluster Health Wait a few moments, then check the cluster status from the same node.
# Set the alias again if needed
alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
# Check that all members are healthy
etcdctl -w table member list
# Check that the dbSize for this node is now reduced
etcdctl -w table endpoint status --cluster
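You can also re-run the fragmentation check from the preparation phase; the percentage for the node you just serviced should now be close to zero:
etcdctl endpoint status --cluster -w json | jq '.[] | ((.Status.dbSize - .Status.dbSizeInUse)/.Status.dbSize)*100'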
6. Repeat for Other Nodes Once you have confirmed the node is healthy and has rejoined the cluster, repeat Phase 2 (Steps 1-5) for the remaining follower nodes. After all followers are complete, perform the procedure on the leader node.
Phase 3: Post-Maintenance
After all nodes have been successfully defragmented and the cluster is fully healthy:
1. Disarm etcd Alarms From any CP node, disarm any alarms that may have been triggered during maintenance.
# Set the alias again if needed
alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
etcdctl alarm disarm
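To confirm the cluster is now alarm-free, list the active alarms; the command prints nothing when none are set:
etcdctl alarm list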
2. Resume Cluster Reconciliation Run this kubectl command from your management cluster to re-enable cluster reconciliation.
kubectl patch cluster <Cluster> -n <Namespace> --type merge -p '{"spec":{"paused": false}}'