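Exceeding the default 2GB etcd database quota triggers a NOSPACE alarm, which puts etcd into a maintenance mode that accepts only key reads and deletes. The resulting write failures can lead to leader election failures and subsequent VIP loss.
1. Set up the etcdctl alias. The alias redefines the etcdctl command so that it runs the binary from the containerd snapshot with the certificate, key, and CA paths required for authentication.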
alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
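Because the alias path contains a wildcard, it can expand to more than one file if several container snapshots contain etcdctl, and the extra paths would then be passed to etcdctl as arguments. A minimal sketch of one way to pin a single binary follows. find_etcdctl is a hypothetical helper name, not part of etcd or containerd, and the default root is the snapshot directory used in this article.

```shell
# find_etcdctl (hypothetical helper): print the first executable etcdctl
# found under the containerd snapshot root and return 0, or return 1 if none.
find_etcdctl() {
  root="${1:-/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots}"
  for candidate in "$root"/*/fs/usr/local/bin/etcdctl; do
    # If the glob matches nothing it stays literal and fails the -x test.
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

For example, ETCDCTL_BIN=$(find_etcdctl) stores the resolved path, which can then replace the wildcard path in the alias.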
2. Fetch cluster information
etcdctl -w table member list
etcdctl -w table endpoint --cluster status
etcdctl -w table endpoint --cluster health
3. Take a backup of etcd and verify the size of the backup:
etcdctl snapshot save /root/etcd-backup.db
etcdctl snapshot status /root/etcd-backup.db -w table
4. Copy the backup file to a CP node:
scp /root/etcd-backup.db capv@<CP NODE>:/tmp
5. Temporarily increase the etcd database size to 3GB by adding the --quota-backend-bytes flag to the etcd command in etcd.yaml on all control plane nodes.
Note: The default maximum etcd database size is 2GB. Raising it to 3GB is a temporary workaround. Once the database size has been reduced, remove the --quota-backend-bytes line from etcd.yaml on all control plane nodes.
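As an illustration of the value to set, the sketch below computes 3 GiB in bytes and prints the flag line. It assumes that 3GB here means 3 GiB (3 * 1024^3 bytes) and that the flag goes into the etcd container's command list in /etc/kubernetes/manifests/etcd.yaml. The indentation is illustrative and should match the existing entries in the manifest.

```shell
# Assumption: "3GB" is interpreted as 3 GiB.
QUOTA_BYTES=$((3 * 1024 * 1024 * 1024))

# Line to add to the etcd command list in etcd.yaml (indentation is illustrative).
QUOTA_FLAG="    - --quota-backend-bytes=${QUOTA_BYTES}"
echo "$QUOTA_FLAG"
```

The computed value is 3221225472 bytes.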
6. Pause cluster reconciliation:
SSH into the TCA-CP that corresponds to the Management Cluster and list the available contexts to find the management cluster context:
kubectl config get-contexts
Switch to the management cluster context (replace XXXX with the name of the management cluster context returned by the command above):
kubectl config use-context XXXX
Pause reconciliation of the cluster:
kubectl patch cluster <Clustername> -n tkg-system --type merge -p '{"spec":{"paused": true}}'
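The patch body differs only in its boolean value, so it can be generated by a small helper. The sketch below is illustrative: pause_patch is a hypothetical function name, and using the same patch with false to resume reconciliation after the maintenance is an assumption based on the spec.paused field used above, not a step taken from this article.

```shell
# pause_patch (hypothetical helper): print the merge patch for spec.paused.
# Argument: "true" to pause, "false" to resume (resuming is an assumption).
pause_patch() {
  case "$1" in
    true|false)
      printf '{"spec":{"paused": %s}}\n' "$1"
      ;;
    *)
      echo "usage: pause_patch true|false" >&2
      return 1
      ;;
  esac
}
```

It could be used as: kubectl patch cluster <Clustername> -n tkg-system --type merge -p "$(pause_patch true)"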
7. Stop services on all control plane nodes. Run the following commands on every control plane node so that kubelet, kube-apiserver, and etcd are stopped on all of them:
systemctl stop kubelet.service
export API_CONTAINER_ID=$(crictl ps -q --name kube-apiserver)
export ETCD_CONTAINER_ID=$(crictl ps -q --name etcd)
crictl stop $API_CONTAINER_ID
crictl stop $ETCD_CONTAINER_ID
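If a container has already exited, the corresponding variable is empty and crictl stop is called without an ID. A minimal sketch of a guarded variant follows. stop_static_pod is a hypothetical helper name, and it relies only on the crictl ps and crictl stop invocations already used above.

```shell
# stop_static_pod (hypothetical helper): stop the running container(s)
# whose name matches $1, or report that there is nothing to stop.
stop_static_pod() {
  id=$(crictl ps -q --name "$1")
  if [ -z "$id" ]; then
    echo "no running container matches $1; nothing to stop"
    return 0
  fi
  # Left unquoted on purpose so that several IDs are passed as separate arguments.
  crictl stop $id
}
```

For example, stop_static_pod kube-apiserver followed by stop_static_pod etcd.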
8. Back up the etcd data and the static pod manifests:
mkdir -p /root/etcdbkp
cp -r /etc/kubernetes/manifests /root/etcdbkp/kubernetes-manifests
cp -r /var/lib/etcd /root/etcdbkp/etcd-backup
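A copy that silently fails or lands in the wrong place defeats the purpose of this step. The sketch below shows one way to copy a directory and confirm that the copy matches the source. backup_dir is a hypothetical helper name, and refusing to overwrite an existing destination is a design choice of this sketch, not something the article prescribes.

```shell
# backup_dir (hypothetical helper): copy directory $1 to the new path $2
# and verify the copy. Returns non-zero on any failure.
backup_dir() {
  src="$1"
  dest="$2"
  if [ -e "$dest" ]; then
    echo "destination $dest already exists; refusing to overwrite" >&2
    return 1
  fi
  mkdir -p "$(dirname "$dest")" || return 1
  # -a preserves ownership, permissions, and timestamps.
  cp -a "$src" "$dest" || return 1
  # diff -r exits non-zero if any file differs or is missing.
  diff -r "$src" "$dest" >/dev/null
}
```

For example, backup_dir /var/lib/etcd /root/etcdbkp/etcd-backup, run while etcd is stopped so that the files do not change during the comparison.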
alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
9. Offline etcd defrag:
etcdctl defrag --data-dir /var/lib/etcd/
Restart services on all the CP nodes:
systemctl start kubelet
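After kubelet starts, the etcd static pod can take some time to come back. A minimal sketch of a polling loop is shown below. wait_for_etcd is a hypothetical helper, and the attempt count and delay are arbitrary defaults. It polls etcdctl endpoint status rather than endpoint health, on the assumption that an active NOSPACE alarm may cause the health check to report the member as unhealthy.

```shell
# wait_for_etcd (hypothetical helper): poll until the local etcd endpoint
# answers a status request. $1 = max attempts, $2 = delay in seconds.
wait_for_etcd() {
  attempts="${1:-30}"
  delay="${2:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if etcdctl endpoint status >/dev/null 2>&1; then
      echo "etcd reachable after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "etcd still unreachable after $attempts attempt(s)" >&2
  return 1
}
```

When pasting this into an interactive shell, define the function after the etcdctl alias has been set, because shell aliases are expanded when the function is defined.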
10. Disarm the NOSPACE alarm so that etcd accepts writes again:
etcdctl alarm disarm
Steps to verify the size of the etcd database
alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
etcdctl -w table endpoint --cluster status
etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }'| sort -nr
etcdctl get /registry/events --prefix=true --keys-only | grep -v '^$' | awk -F'/' '{ print $4 }' | sort | uniq -c
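The two key-listing commands above show which resource types hold the most keys and how many events each namespace holds.
Defragment all cluster members online to release the freed space:
etcdctl defrag --cluster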
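To make the per-namespace event count easier to reuse, the pipeline can be wrapped in a function that reads key names on standard input. This is a sketch: count_events_by_namespace is a hypothetical name, and it assumes keys of the form /registry/events/<namespace>/<name>, which is what the awk field $4 in the event-count command relies on.

```shell
# count_events_by_namespace (hypothetical helper): read etcd key names on
# stdin and print "<count> <namespace>" lines, largest count first.
count_events_by_namespace() {
  grep -v '^$' \
    | awk -F'/' '{ print $4 }' \
    | sort | uniq -c \
    | awk '{ print $1, $2 }' \
    | sort -nr
}
```

For example: etcdctl get /registry/events --prefix=true --keys-only | count_events_by_namespace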