Procedure to recover etcd when only one control plane node is functional

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

Running 'kubectl get nodes' shows control plane nodes reporting an unhealthy status.
Node conditions show KubeletNotReady or NetworkUnavailable.
Etcd pods are in an unhealthy state (e.g. CrashLoopBackOff, Pending, Not Ready, Error, or continuously restarting).
The cluster appears to have lost the majority of its control plane nodes, and only one control plane node is functional.
The 'etcdctl endpoint status -w table' command output shows "etcdserver: no leader".

Environment

TKGm 2.5.X

Cause

The etcd cluster has lost quorum, leaving only a single healthy control plane node. Recovery requires resetting the etcd state on that node and rebuilding the cluster from it.
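
The quorum arithmetic behind this cause can be sketched in shell. etcd needs a strict majority of its members (floor(n/2) + 1) to elect a leader, so a three-member cluster with one survivor cannot make progress. The function names quorum_size and has_quorum are illustrative helpers, not part of etcd or TKG tooling.

```shell
# quorum_size N: print how many members an N-member etcd cluster
# needs in order to elect a leader (a strict majority).
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL HEALTHY: succeed only if HEALTHY members form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Example: three control plane nodes with a single survivor.
if has_quorum 3 1; then
  echo "quorum held"
else
  echo "quorum lost"
fi
```

With three members the majority is two, so the example takes the "quorum lost" branch.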

Resolution

Back up the degraded etcd cluster, then forcibly reset it to a single-node configuration on the last healthy control plane node. This restores etcd quorum and, with it, API server functionality.

Resuming Cluster API reconciliation afterwards triggers the automated provisioning and bootstrapping of replacement nodes, which restores high availability of the etcd cluster.

Follow the step-by-step instructions below.

Preparation and Pre-checks

Set Context (Mandatory)

kubectl config get-contexts
Switch Context

kubectl config use-context <mgmt-cluster-context>
Verify Context

kubectl get nodes
Identify Healthy Control Plane Node

kubectl get nodes -o wide
kubectl get pods -n kube-system | grep etcd

Ensure: Only one etcd pod is running. The node hosting it is the recovery node.
Confirm etcd cluster is unhealthy

kubectl exec -n kube-system <etcd-pod> -- etcdctl endpoint health --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key

If the above command is successful, DO NOT proceed.
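
The stop condition above can be sketched as a small guard. It assumes the health command exits non-zero when the endpoint is unhealthy; recovery_gate is a hypothetical helper name, and the command to check is passed in as arguments.

```shell
# recovery_gate CMD...: run the given health-check command.
# A passing check means etcd is still healthy, so the reset must not run.
recovery_gate() {
  if "$@" >/dev/null 2>&1; then
    echo "STOP: etcd is healthy, do not proceed"
    return 1
  fi
  echo "PROCEED: etcd health check failed"
  return 0
}

# Intended use (placeholders as in the command above):
#   recovery_gate kubectl exec -n kube-system ETCD_POD -- etcdctl endpoint health ...
```

Inverting the exit status this way makes the guard safe to chain in front of the destructive steps that follow.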

Pause Reconciliation and Back Up

Pause the reconciliation

kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":true}}'
Confirm reconciliation is paused

kubectl get cluster <cluster-name> -n <namespace> -o yaml | grep paused

Expected output: "paused: true"
Backup (Mandatory)

mkdir -p /root/etcd-backup
cp -r /etc/kubernetes/manifests /root/etcd-backup/kubernetes-manifests
cp -r /var/lib/etcd /root/etcd-backup/etcd-data
ls -l /root/etcd-backup
du -sh /root/etcd-backup/*

Optional external backup:

scp -r /root/etcd-backup user@<remote-host>:/backup/
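
Before continuing, the backup contents can be checked with a sketch like the following. It assumes that the manifests copy contains etcd.yaml and that the etcd data directory contains a member subdirectory; verify_backup is a hypothetical helper name.

```shell
# verify_backup DIR: check that DIR holds the manifests and etcd data copies.
# Assumptions: the manifests copy includes etcd.yaml, and the etcd data
# directory includes a "member" subdirectory.
verify_backup() {
  dir=$1
  status=0
  for path in "$dir/kubernetes-manifests/etcd.yaml" "$dir/etcd-data/member"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path"
      status=1
    fi
  done
  return "$status"
}

# Intended use on the recovery node:
#   verify_backup /root/etcd-backup
```

A non-zero return means the backup is incomplete and the reset steps should not be started.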
Take etcd Snapshot (Mandatory)

kubectl exec -it -n kube-system <etcd-pod> -- sh
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /tmp/etcd-snapshot.db
etcdctl snapshot status /tmp/etcd-snapshot.db -w table
exit

Copy the snapshot:

kubectl cp kube-system/<etcd-pod>:/tmp/etcd-snapshot.db ./etcd-snapshot.db
ls -lh etcd-snapshot.db

Reset and Reconfigure

Stop the control plane components

systemctl stop kubelet
Verify containers stopped

crictl ps | grep -E "etcd|apiserver"
If they are still running, stop them manually with crictl

crictl stop $(crictl ps -q --name kube-apiserver) 2>/dev/null

crictl stop $(crictl ps -q --name etcd) 2>/dev/null
Reset etcd State

rm -rf /var/lib/etcd
mkdir -p /var/lib/etcd
Verify Certificates

ls -l /etc/kubernetes/pki/etcd/

The following certificate files should be present:
- ca.crt
- server.crt / server.key
- peer.crt / peer.key
If any are missing, run the command below

kubeadm init phase certs all --config /etc/kubernetes/kubeadm-config.yaml
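
The certificate check can also be scripted instead of read off the ls output. The sketch below looks only for the five files listed above; missing_etcd_certs is a hypothetical helper name.

```shell
# missing_etcd_certs DIR: print each expected etcd certificate file that is
# absent from DIR; succeed only when nothing is missing.
missing_etcd_certs() {
  dir=$1
  missing=0
  for f in ca.crt server.crt server.key peer.crt peer.key; do
    if [ ! -f "$dir/$f" ]; then
      echo "$f"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Intended use on the recovery node:
#   missing_etcd_certs /etc/kubernetes/pki/etcd
```

Any file name printed indicates that the kubeadm certificate command above is needed.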
Update etcd Manifest

vi /etc/kubernetes/manifests/etcd.yaml

Update:

--name=<node-name>
--initial-cluster=<node-name>=https://<node-ip>:2380
--initial-cluster-state=new
--initial-advertise-peer-urls=https://<node-ip>:2380
--listen-peer-urls=https://<node-ip>:2380

Ensure: --initial-cluster lists only this node, and all entries for the old nodes are removed.
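
To avoid typos when editing the manifest by hand, the flag values can be generated from the node name and IP. This is a minimal sketch: single_node_flags is a hypothetical helper, and the node name and address in the example call are made up for illustration.

```shell
# single_node_flags NAME IP: print the etcd flags for a one-member cluster,
# one per line, ready to compare against /etc/kubernetes/manifests/etcd.yaml.
single_node_flags() {
  name=$1
  ip=$2
  printf '%s\n' \
    "--name=${name}" \
    "--initial-cluster=${name}=https://${ip}:2380" \
    "--initial-cluster-state=new" \
    "--initial-advertise-peer-urls=https://${ip}:2380" \
    "--listen-peer-urls=https://${ip}:2380"
}

# Hypothetical node name and address, for illustration only.
single_node_flags cp-node-1 10.0.0.11
```

The output is meant for comparison with the manifest only; the sketch does not modify any file.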
Start kubelet and verify the status

systemctl start kubelet
systemctl status kubelet
Verify etcd Running

crictl ps | grep etcd
Check etcd Logs

crictl logs $(crictl ps -q --name etcd)

Ensure: The logs show no cluster ID mismatch and no TLS errors.
Validate etcd

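export ETCDCTL_API=3
etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
etcdctl endpoint health --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key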

Expected:
- Only 1 member
- Status = started
- Endpoint healthy
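
The first two expectations can be checked mechanically by piping the member list output into a small filter. The sketch assumes the default comma-separated output of etcdctl member list, where the second field is the member status; check_single_member is a hypothetical helper name.

```shell
# check_single_member: read `etcdctl member list` output on stdin
# (assumed default format: ID, status, name, peer URLs, client URLs, learner)
# and succeed only if there is exactly one member and it is "started".
check_single_member() {
  awk -F', ' '
    NF > 1 { total++; if ($2 == "started") started++ }
    END { exit !(total == 1 && started == 1) }
  '
}

# Intended use on the recovery node:
#   etcdctl member list ...same TLS flags as above... | check_single_member
```

A non-zero exit means either more than one member is still registered or the single member has not started.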

Validate and Resume reconciliation

Restore kubectl access by exporting the admin kubeconfig

export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
Check kube-vip

kubectl get pods -n kube-system | grep vip
Identify Blocking Webhooks (after an etcd recovery, orphaned admission webhooks can block API reconciliation)

kubectl get validatingwebhookconfiguration
kubectl get mutatingwebhookconfiguration
Delete the blocking webhook

kubectl delete validatingwebhookconfiguration <webhook-name>
kubectl delete mutatingwebhookconfiguration <webhook-name>
Global Deletion: Required only if the targeted deletion above fails or the specific blocking webhooks cannot be identified.

Warning: This temporarily disables all cluster security policies and sidecar injections until managing controllers recreate them.

kubectl delete validatingwebhookconfiguration --all
kubectl delete mutatingwebhookconfiguration --all
Resume Reconciliation

kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":false}}'

Monitor Recovery and Final Validation

Monitor recovery

kubectl get machine -n <namespace> -w
kubectl get nodes -w
Force delete stuck machines if needed.

kubectl delete machine <name> -n <namespace> --force --grace-period=0
Validate that the etcd cluster returns to 3 members

kubectl get pods -n kube-system | grep etcd
kubectl exec -n kube-system <etcd-pod> -- etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key

Expected: 3 members and all are started.
Cluster Health Check

kubectl get nodes
kubectl get pods -A
Rollback (If the etcd cluster is still unhealthy or if there are any failures)

systemctl stop kubelet
rm -rf /var/lib/etcd
cp -r /root/etcd-backup/etcd-data /var/lib/etcd
cp -r /root/etcd-backup/kubernetes-manifests/* /etc/kubernetes/manifests/
systemctl start kubelet

Note:

Do not run: etcdctl member add
Do not reuse broken nodes with stale data or incorrect certificates.