Single Control Plane Node cluster upgrade stuck because of etcd quorum loss

Article ID: 432867

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • During a Tanzu Kubernetes Cluster (TKC) upgrade, the etcd and kube-apiserver pods on the original (first) Control Plane (CP) node entered a CrashLoopBackOff state. The first node lost quorum because it could not establish a connection with the second CP node, which was provisioned during the rolling upgrade, leaving etcd in a "No Leader" state. As a result, the load balancer (LB) pool member for the original node flapped continuously as the services repeatedly restarted.

  • The etcd logs show the node stuck in the pre-candidate state, sending pre-vote requests and failing to find a leader. In a Raft-based cluster, a node that cannot communicate with a majority of the members recorded in its state cannot win an election, so no leader is elected and etcd cannot serve requests. The local etcd process then exits or fails its health checks, which in turn causes the kube-apiserver (which depends on etcd) to crash.

    Symptom: kube-apiserver and etcd pods failing on the original CP node.

    Log Evidence (etcd logs):

    "msg":"failed to publish local member to cluster through raft"

    "error":"etcdserver: request timed out"

    "msg":"60cc8f2283ba4868 [logterm: 2, index: 52859795] sent MsgPreVote request to d8defb09eb99f3b7"

    "output":"{\"health\":\"false\",\"reason\":\"RAFT NO LEADER\"}"
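The health output quoted above is a JSON object embedded as an escaped string inside the log line. As a minimal sketch for triaging such lines, the following Python decodes a fragment of that shape and flags the no-leader condition. The helper name `classify_health` and the step of wrapping the fragment in braces are assumptions made for illustration; they are not part of etcd or VKS tooling.

```python
import json


def classify_health(log_fragment: str) -> str:
    """Classify an etcd health fragment like the one quoted above.

    log_fragment is assumed to be a '"output":"{...}"' key/value pair
    copied from the log; it is wrapped in braces to form valid JSON.
    """
    outer = json.loads("{" + log_fragment + "}")
    inner = json.loads(outer["output"])  # the value is itself a JSON document
    if inner.get("health") == "true":
        return "healthy"
    if inner.get("reason") == "RAFT NO LEADER":
        return "no-leader"
    return "unhealthy"


# Fragment copied from the log evidence; the raw string keeps the backslashes.
fragment = r'"output":"{\"health\":\"false\",\"reason\":\"RAFT NO LEADER\"}"'
print(classify_health(fragment))
```

A result of `no-leader` from the original CP node, together with repeated `MsgPreVote` messages, matches the symptom described in this article.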

Environment

VMware vSphere Kubernetes Service

Cause

The cluster lost etcd quorum. During the upgrade transition, the first node initiated Raft elections and tried to reach the second node, but it did not receive the vote it needed to reach consensus. This is common during an upgrade of a single control plane node cluster, because etcd briefly has two members: a two-member cluster needs both members to form a majority (2 out of 2), so the loss of one node, or of communication with it, leaves the remaining node without quorum.
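The majority arithmetic behind this can be sketched in a few lines of Python. Raft requires a strict majority of members, that is `floor(n / 2) + 1`. The function names `quorum` and `fault_tolerance` are illustrative only.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

For two members the quorum is 2 and the tolerated failure count is 0, which is why the transient two-member state during the upgrade cannot survive the loss of either member or of the link between them.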

Resolution

Reference: Recovery ETCD Quorum Loss for VKS Cluster