Procedure to recover ETCD when only one control Plane node is functional

Article ID: 436120

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

  • Running 'kubectl get nodes' shows control plane nodes reporting an unhealthy status
  • Node conditions show KubeletNotReady or NetworkUnavailable
  • etcd pods are in an unhealthy state (e.g. CrashLoopBackOff, Pending, NotReady, Error, or continuously restarting)
  • The cluster has lost the majority of its control plane nodes and only one control plane node is functional
  • The output of 'etcdctl endpoint status -w table' shows "etcdserver: no leader"

Environment

TKGm 2.5.X

Cause

The etcd cluster has lost quorum and only a single healthy control plane node remains. etcd must be reset to a single-member cluster on that node and then rebuilt.

Resolution

Back up the degraded etcd cluster, then forcefully reset it to a single-node configuration on the last healthy control plane node. This restores etcd quorum and API server functionality.

Resuming Cluster API reconciliation afterwards triggers the automated provisioning and bootstrapping of replacement nodes, which restores high availability of the etcd cluster.

Follow the step-by-step instructions below.

Preparation and Pre-checks

  1. Set Context (Mandatory)

    kubectl config get-contexts

  2. Switch Context

    kubectl config use-context <mgmt-cluster-context>

  3. Verify Context

    kubectl get nodes

  4. Identify Healthy Control Plane Node

    kubectl get nodes -o wide
    kubectl get pods -n kube-system | grep etcd

    Ensure: Only one etcd pod is running. The node hosting it is the recovery node.

  5. Confirm etcd cluster is unhealthy

    kubectl exec -n kube-system <etcd-pod> -- etcdctl endpoint health --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key

    If the above command succeeds and reports the endpoint as healthy, etcd still has quorum. DO NOT proceed with this procedure.
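
The two checks in steps 4 and 5 above can be wrapped in small shell helpers so that the decision to continue is explicit. This is a minimal sketch and not part of the original procedure: the function names `count_running_etcd` and `recovery_needed` are hypothetical, and the parsing assumes the default `kubectl get pods` column layout and kubeadm's `etcd-<node-name>` naming for the static pods.

```shell
# Hypothetical helper: count etcd pods in the Running state.
# Reads `kubectl get pods -n kube-system --no-headers` output on stdin and
# assumes the default columns (NAME READY STATUS ...) and kubeadm's
# etcd-<node-name> naming for the static pods.
count_running_etcd() {
  awk '$1 ~ /^etcd-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Hypothetical guard: run the health check passed as arguments and return 0
# only when that check FAILS, i.e. when recovery is actually warranted.
recovery_needed() {
  if "$@" >/dev/null 2>&1; then
    echo "etcd endpoint is healthy; do not proceed with the reset" >&2
    return 1
  fi
  echo "etcd endpoint is unhealthy; recovery may be required"
}

# Example usage (commented out because it needs a live cluster):
#   kubectl get pods -n kube-system --no-headers | count_running_etcd
#   recovery_needed kubectl exec -n kube-system <etcd-pod> -- etcdctl endpoint health \
#     --endpoints=https://127.0.0.1:2379 \
#     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#     --cert=/etc/kubernetes/pki/etcd/server.crt \
#     --key=/etc/kubernetes/pki/etcd/server.key
```

Note that the guard treats any failure of the wrapped command as "unhealthy", including a failure to reach the API server, so its result should be read together with the pod status rather than on its own.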

Pause Reconciliation and Backup

  1. Pause the reconciliation 

    kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":true}}'

  2. Confirm reconciliation is paused

    kubectl get cluster <cluster-name> -n <namespace> -o yaml | grep paused

    Expected output: "paused: true"

  3. Backup (Mandatory)

    Run the following as root on the healthy control plane node:

    mkdir -p /root/etcd-backup
    cp -r /etc/kubernetes/manifests /root/etcd-backup/kubernetes-manifests
    cp -r /var/lib/etcd /root/etcd-backup/etcd-data
    ls -l /root/etcd-backup
    du -sh /root/etcd-backup/*


    Optional external backup:

    scp -r /root/etcd-backup user@<remote-host>:/backup/

  4. Take etcd Snapshot (Mandatory)

    kubectl exec -it -n kube-system <etcd-pod> -- sh
    export ETCDCTL_API=3
    etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /tmp/etcd-snapshot.db
    etcdctl snapshot status /tmp/etcd-snapshot.db -w table
    exit


    Copy the snapshot:

    kubectl cp kube-system/<etcd-pod>:/tmp/etcd-snapshot.db ./etcd-snapshot.db
    ls -lh etcd-snapshot.db
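
Because step 4 of the reset below deletes /var/lib/etcd, it can be worth confirming that each backup copy is complete before moving on. The following is a minimal sketch under stated assumptions: `backup_dir` is a hypothetical helper (not part of the original procedure), it uses `cp -a` instead of `cp -r` so that ownership and permissions are preserved, and it only compares file counts, which is a sanity check rather than a full integrity check.

```shell
# Hypothetical helper: copy a directory to a new backup location and confirm
# that the copy holds the same number of regular files as the source.
# Refuses to run if the destination already exists, so an earlier backup is
# never silently merged with or nested inside a new one.
backup_dir() {
  src=${1:-}
  dest=${2:-}
  if [ ! -d "$src" ]; then
    echo "source directory '$src' not found" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "destination '$dest' already exists" >&2
    return 1
  fi
  mkdir -p "$(dirname "$dest")" || return 1
  cp -a "$src" "$dest" || return 1
  src_count=$(find "$src" -type f | wc -l | tr -d ' ')
  dest_count=$(find "$dest" -type f | wc -l | tr -d ' ')
  if [ "$src_count" != "$dest_count" ]; then
    echo "file count mismatch: $src_count in source, $dest_count in backup" >&2
    return 1
  fi
  echo "backed up $src_count file(s) from $src to $dest"
}

# Example usage on the control plane node (paths taken from the steps above):
#   backup_dir /etc/kubernetes/manifests /root/etcd-backup/kubernetes-manifests
#   backup_dir /var/lib/etcd /root/etcd-backup/etcd-data
```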

Reset and Reconfigure

  1. Stop the control plane components

    systemctl stop kubelet

  2. Verify containers stopped

    crictl ps | grep -E "etcd|apiserver"

  3. If any containers are still running, stop them manually with crictl

    crictl stop $(crictl ps -q --name kube-apiserver) 2>/dev/null

    crictl stop $(crictl ps -q --name etcd) 2>/dev/null

  4. Reset etcd State

    rm -rf /var/lib/etcd
    mkdir -p /var/lib/etcd


  5. Verify Certificates

    ls -l /etc/kubernetes/pki/etcd/

    The following certificate files should be present:

    • ca.crt
    • server.crt / server.key
    • peer.crt / peer.key

  6. If any certificates are missing, regenerate them with the command below

    kubeadm init phase certs all --config /etc/kubernetes/kubeadm-config.yaml

  7. Update etcd Manifest

    vi /etc/kubernetes/manifests/etcd.yaml

    Update:

    --name=<node-name>
    --initial-cluster=<node-name>=https://<node-ip>:2380
    --initial-cluster-state=new
    --initial-advertise-peer-urls=https://<node-ip>:2380
    --listen-peer-urls=https://<node-ip>:2380


    Ensure: --initial-cluster lists only this node and the entries for the old nodes are removed.

  8. Start kubelet and verify the status

    systemctl start kubelet
    systemctl status kubelet

  9. Verify etcd Running

    crictl ps | grep etcd

  10. Check etcd Logs 

    crictl logs $(crictl ps -q --name etcd)

    Ensure: There is no cluster ID mismatch or TLS errors.

  11. Validate etcd

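    export ETCDCTL_API=3
    etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
    etcdctl endpoint health --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key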

    Expected: 

    • Only 1 member
    • Status = started
    • Endpoint healthy
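
The manifest values from step 7 and the member check from step 11 lend themselves to small helpers that reduce typing mistakes. This is a sketch under stated assumptions, not part of the original procedure: all three function names are hypothetical, the peer port 2380 is taken from the flags listed in step 7, and `check_etcd_members` assumes the default comma-separated output of `etcdctl member list` (ID, status, name, peer URLs, client URLs, learner flag).

```shell
# Hypothetical helper: print the five flags from step 7 for a single-member
# cluster, given the node name and node IP.
single_node_etcd_flags() {
  node_name=${1:-}
  node_ip=${2:-}
  if [ -z "$node_name" ] || [ -z "$node_ip" ]; then
    echo "usage: single_node_etcd_flags <node-name> <node-ip>" >&2
    return 1
  fi
  peer_url="https://${node_ip}:2380"
  printf '%s\n' \
    "--name=${node_name}" \
    "--initial-cluster=${node_name}=${peer_url}" \
    "--initial-cluster-state=new" \
    "--initial-advertise-peer-urls=${peer_url}" \
    "--listen-peer-urls=${peer_url}"
}

# Hypothetical helper: count the members listed in --initial-cluster in a
# manifest file. After editing, the result should be 1.
count_initial_cluster_members() {
  grep -e '--initial-cluster=' "$1" | sed 's/.*--initial-cluster=//' | tr ',' '\n' | grep -c .
}

# Hypothetical helper: read `etcdctl member list` output on stdin and succeed
# only if there are exactly $1 members and every one of them is "started".
check_etcd_members() {
  awk -F', ' -v want="${1:-0}" '
    NF >= 2 { total++; if ($2 == "started") started++ }
    END { exit !((total + 0) == (want + 0) && (started + 0) == (want + 0)) }
  '
}

# Example usage on the recovery node (commented out; needs etcd running):
#   single_node_etcd_flags <node-name> <node-ip>
#   count_initial_cluster_members /etc/kubernetes/manifests/etcd.yaml
#   etcdctl member list <TLS flags from step 11> | check_etcd_members 1
```

Once the replacement control plane nodes have joined, the same member check can be run with 3 as the expected count.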

Validate and Resume Reconciliation

  1. Restore kubectl access by exporting the admin kubeconfig

    export KUBECONFIG=/etc/kubernetes/admin.conf
    kubectl get nodes

  2. Check kube-vip

    kubectl get pods -n kube-system | grep vip

  3. Identify Blocking Webhooks (during etcd recovery, orphaned admission webhooks can block API reconciliation)

    kubectl get validatingwebhookconfiguration
    kubectl get mutatingwebhookconfiguration


  4. Delete the blocking webhook

    kubectl delete validatingwebhookconfiguration <webhook-name>
    kubectl delete mutatingwebhookconfiguration <webhook-name>


  5. Global Deletion: Required only if the targeted deletion above fails or the specific blocking webhooks cannot be identified.

    Warning: This temporarily disables all cluster security policies and sidecar injections until managing controllers recreate them.

    kubectl delete validatingwebhookconfiguration --all
    kubectl delete mutatingwebhookconfiguration --all


  6. Resume Reconciliation 

    kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":false}}'
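
Since steps 4 and 5 above delete webhook configurations, saving their definitions first gives a record of what was removed in case a managing controller does not recreate one. This is an optional sketch that is not part of the original procedure; `backup_webhooks` is a hypothetical helper and the output directory in the usage line is only an example.

```shell
# Hypothetical helper: save all validating and mutating webhook configurations
# as YAML files in the given directory before any of them are deleted.
backup_webhooks() {
  out_dir=${1:-}
  if [ -z "$out_dir" ]; then
    echo "usage: backup_webhooks <output-dir>" >&2
    return 1
  fi
  mkdir -p "$out_dir" || return 1
  for kind in validatingwebhookconfiguration mutatingwebhookconfiguration; do
    kubectl get "$kind" -o yaml > "$out_dir/$kind.yaml" || return 1
  done
  echo "webhook configurations saved to $out_dir"
}

# Example usage (commented out because it needs a reachable API server):
#   backup_webhooks /root/etcd-backup/webhooks
```

The saved files are intended mainly as a reference. Re-applying them as-is may require removing server-generated metadata such as resourceVersion and uid first.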

Monitor Recovery and Final Validation

  1. Monitor recovery

    kubectl get machine -n <namespace> -w
    kubectl get nodes -w

  2. Force delete stuck machines if needed.

    kubectl delete machine <name> -n <namespace> --force --grace-period=0

  3. Validate the full etcd cluster returns to 3 members

    kubectl get pods -n kube-system | grep etcd
    kubectl exec -n kube-system <etcd-pod> -- etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key

    Expected: 3 members, all in the started state.

  4. Cluster Health Check

    kubectl get nodes
    kubectl get pods -A


  5. Rollback (only if the etcd cluster is still unhealthy or a step has failed)

    systemctl stop kubelet
    rm -rf /var/lib/etcd
    cp -r /root/etcd-backup/etcd-data /var/lib/etcd
    cp -r /root/etcd-backup/kubernetes-manifests/* /etc/kubernetes/manifests/
    systemctl start kubelet
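
As an alternative to watching `kubectl get nodes -w` by hand in step 1 above, recovery can be polled with a bounded loop. This is a minimal sketch, not part of the original procedure: both function names are hypothetical, and the parsing assumes the default `kubectl get nodes --no-headers` layout in which the second column is STATUS. Nodes shown as "Ready,SchedulingDisabled" are deliberately not counted.

```shell
# Hypothetical helper: count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical helper: poll until at least $1 nodes are Ready, checking every
# $2 seconds for at most $3 attempts. Returns 1 if the limit is reached.
wait_for_ready_nodes() {
  want=$1
  interval=$2
  attempts=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    ready=$(kubectl get nodes --no-headers 2>/dev/null | count_ready_nodes)
    if [ "$ready" -ge "$want" ]; then
      echo "$ready node(s) Ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $want Ready node(s)" >&2
  return 1
}

# Example usage (commented out because it needs a live cluster):
#   wait_for_ready_nodes <expected-node-count> 30 60
```

A bounded loop makes it easier to notice when provisioning has stalled, which is the point at which step 2 (force deleting stuck machines) becomes relevant.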

Note:

  1. Do not run: etcdctl member add
  2. Do not reuse broken nodes with stale data or incorrect certificates.