VKS Cluster Upgrade Stalls with "etcdserver too many learner members in cluster" error

Article ID: 438089

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • A VKS Guest Cluster upgrade stalls during the Control Plane rollout. A new Control Plane node is provisioned in vCenter and exists in Kubernetes as a Machine object, but it never registers as a Node object.
  • The cluster remains in the RollingOut phase: the older replicas stay healthy, while the new replica is stuck in Provisioned status, which blocks further operations.
  • The /var/log/cloud-init-output.log file on the newly provisioned Control Plane node shows the following failure during the etcd join phase:

    error execution phase etcd-join: error creating local etcd static pod manifest file: etcdserver: too many learner members in cluster

  • Listing the etcd members on an existing healthy Control Plane node (using the etcdctl alias defined in step 3 of the Resolution) shows a stale, unstarted learner entry in the last row:
    etcdctl member list --write-out=table
    +-----------------+-----------+--------------------+---------------------------+---------------------------+------------+
    |        ID       |  STATUS   |           NAME     |           PEER ADDRS      |        CLIENT ADDRS       | IS LEARNER |
    +-----------------+-----------+--------------------+---------------------------+---------------------------+------------+
    | <etcd_member_id>|   started | <etcd_member_name> | https://198.51.100.1:2380 | https://198.51.100.1:2379 |      false |
    | <etcd_member_id>|   started | <etcd_member_name> | https://198.51.100.2:2380 | https://198.51.100.2:2379 |      false |
    | <etcd_member_id>|   started | <etcd_member_name> | https://198.51.100.3:2380 | https://198.51.100.3:2379 |      false |
    | <etcd_member_id>| unstarted |                    | https://198.51.100.4:2380 |                           |       true |
    +-----------------+-----------+--------------------+---------------------------+---------------------------+------------+
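
The log check described in the third bullet can be scripted. The following is a minimal sketch, assuming shell access to the new Control Plane node; the function name has_learner_error is hypothetical and not part of any VKS tooling.

```shell
# has_learner_error (hypothetical helper): print matching log lines with their
# line numbers. Exits 0 if the learner-limit error is present, non-zero otherwise.
# $1: log file path; defaults to the cloud-init output log named in this article.
has_learner_error() {
    log_file=${1:-/var/log/cloud-init-output.log}
    grep -n -F 'too many learner members in cluster' "$log_file"
}
```

Running has_learner_error with no argument on the stuck node checks the default log path; a non-zero exit status suggests the stall has a different cause.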

Environment

  • vSphere Kubernetes Service (VKS)
  • vCenter Server 8.x
  • vSphere Supervisor

Cause

  • The etcd member list contains a stale learner entry: a member that was added as a learner but never started.
  • By default, etcd allows only one learner member in a cluster at a time.
  • While the stale entry exists, the new Control Plane node cannot be added as a learner. The etcd-join phase fails, node initialization exits, and the Node object is never created.
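
To illustrate the limit, the sketch below counts learners in the comma-separated output of etcdctl member list --write-out=simple and reports whether another learner could be added. It is an illustration, not how etcd itself enforces the limit. The function names and the MAX_LEARNERS variable are hypothetical, and the default limit of 1 is an assumption taken from the Cause statement above.

```shell
# count_learners (hypothetical helper): read `etcdctl member list --write-out=simple`
# output on stdin and print how many members have the learner field set to true.
# Expected fields: ID, status, name, peer addrs, client addrs, is-learner.
count_learners() {
    awk -F', ' '{ gsub(/[ \t\r]+$/, "", $6) } $6 == "true" { n++ } END { print n + 0 }'
}

# can_add_learner (hypothetical helper): exit 0 only if the learner count read
# from stdin is below the limit. MAX_LEARNERS=1 is an assumption, see lead-in.
can_add_learner() {
    max_learners=${MAX_LEARNERS:-1}
    [ "$(count_learners)" -lt "$max_learners" ]
}
```

With the member list shown in the Issue section, count_learners prints 1 and can_add_learner fails, which mirrors the rejection the new node receives when it tries to join.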

Resolution

  1. Connect via SSH to a healthy Control Plane node of the affected guest cluster.

  2. Retrieve the container ID of the running etcd container:
    crictl ps --name etcd

  3. Define an etcdctl alias that runs the CLI inside that container, substituting the container ID from step 2:
    alias etcdctl='crictl exec <etcd container id> etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'

  4. Identify the stale, unstarted learner member ID by running:
    etcdctl member list --write-out=table

  5. Remove the stale learner member from the etcd cluster:
    etcdctl member remove <stale_etcd_member_id>

  6. Restart etcd on every healthy Control Plane node to refresh the member state. On each node, find the etcd container ID (crictl ps --name etcd) and stop the container. Because etcd runs as a static pod, the kubelet restarts it automatically:
    crictl stop <Container_ID_of_etcd>

  7. From the Supervisor Cluster context, delete the affected Machine object to trigger a clean Cluster API rollout:
    kubectl delete machine <stuck_machine_name> -n <namespace>
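
Steps 2 through 5 can be sketched as one script, run on a healthy Control Plane node. This is a minimal sketch, not a supported tool: the function names and the ETCDCTL and DRY_RUN variables are hypothetical, and it assumes the comma-separated layout of etcdctl member list --write-out=simple (ID, status, name, peer addrs, client addrs, is-learner). It defaults to a dry run so that nothing is removed until the output has been reviewed.

```shell
# etcdctl_in_container: run etcdctl inside the first running etcd container,
# using the certificate paths from step 3.
etcdctl_in_container() {
    container_id=$(crictl ps --name etcd -q | head -n 1)
    [ -n "$container_id" ] || { echo "no running etcd container found" >&2; return 1; }
    crictl exec "$container_id" etcdctl \
        --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        --cacert /etc/kubernetes/pki/etcd/ca.crt "$@"
}

# stale_learner_ids: read simple-format member list on stdin and print the IDs
# of members that are both unstarted and learners.
stale_learner_ids() {
    awk -F', ' '{ gsub(/[ \t\r]+$/, "", $6) } $2 == "unstarted" && $6 == "true" { print $1 }'
}

# remove_stale_learners: list members, then remove each stale learner.
# ETCDCTL overrides the etcdctl command; DRY_RUN=1 (the default) only prints
# what would be removed.
remove_stale_learners() {
    etcdctl_cmd=${ETCDCTL:-etcdctl_in_container}
    members=$("$etcdctl_cmd" member list --write-out=simple) || return 1
    ids=$(printf '%s\n' "$members" | stale_learner_ids)
    [ -n "$ids" ] || { echo "no stale learner found"; return 0; }
    for id in $ids; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would remove member $id"
        else
            "$etcdctl_cmd" member remove "$id"
        fi
    done
}
```

Calling remove_stale_learners first prints the member it would remove; DRY_RUN=0 remove_stale_learners then performs the removal, after which steps 6 and 7 continue as written. A shell function replaces the alias from step 3 because bash does not expand aliases in non-interactive scripts by default.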