VKS Clusters fail to deploy with error "rpc not supported for learner"

Article ID: 439838

Products

Tanzu Kubernetes Runtime
VMware vSphere Kubernetes Service

Issue/Introduction

During a VKS cluster upgrade or initial zonal deployment, a control plane node may become stuck in the "Provisioning" phase, preventing the upgrade or deployment from completing. The node successfully registers with etcd as a "learner" (a read-only replica) but is never promoted to a full voting member, leaving the cluster unable to progress without manual intervention.

How to identify this issue

You are likely hitting this bug if ALL of the following are true:

  • A VKS cluster upgrade or creation is stuck or has stalled.
  • The cluster has 3 control plane nodes configured. 
  • One or more of the following error messages appear in Kubernetes events or in the kubeadm cloud-init log on the affected control plane machine:
    • "1 of 2 etcd members is healthy, 1 learner etcd member, at least 1 healthy member required for etcd quorum." (This error occurs when the second control plane node fails to join the etcd cluster.)
    • "ControlPlaneAvailable: Failed to get etcd members"
    • "error: rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"
    • "error execution phase etcd-join: error creating local etcd static pod manifest file: failed to get member XYZ status: etcdserver: rpc not supported for learner" (The third control plane node is just as likely to fail to join as the second.)

 

Where to look:

  • Kubernetes events on the Supervisor cluster for the affected VKS cluster namespace.
  • Kubeadm log messages in the cloud-init log on the new (stuck) control plane VM.
  • Cluster-API machine status (kubectl get machines -n <namespace> -o yaml).
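
One way to speed up the checks above is a small filter that keeps only lines carrying the error signatures listed in this article. This is a minimal sketch, not an official tool: the function name match_learner_errors is hypothetical, and the cloud-init log path in the usage comment is an assumption based on common cloud-init defaults rather than something this article specifies.

```shell
# Hypothetical helper: read text on stdin and print only the lines that
# contain one of the error signatures quoted in this article.
# Exit status follows grep: 0 if at least one line matched, 1 otherwise.
match_learner_errors() {
  grep -F \
    -e 'rpc not supported for learner' \
    -e '1 learner etcd member' \
    -e 'Failed to get etcd members'
}

# Usage on the Supervisor:
#   kubectl get events -n <namespace> | match_learner_errors
# Usage on the stuck control plane VM (log path is an assumption):
#   match_learner_errors < /var/log/cloud-init-output.log
```

An empty result does not rule the issue out; it only means none of the listed signatures appeared in the text that was scanned.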

Environment

VKr v1.35.2 and earlier (within the v1.35.x line)
VKr v1.34.5 and earlier (within the v1.34.x line)

Cause

When a new control plane node joins an etcd cluster, it first enters a "learner" state where it syncs data but cannot vote. Once synced, kubeadm promotes it by calling the etcd MemberList API. Due to an upstream Kubernetes bug in the embedded etcd gRPC client, the MemberList call can accidentally be sent to the new learner node itself instead of to one of the existing voting members. The learner correctly rejects this call ("rpc not supported for learner"), but kubeadm treats the rejection as a fatal error, waits 2 minutes, and then gives up, leaving the node permanently stuck as a learner that is never promoted.

 

Resolution

VKr v1.35.5 and VKr v1.34.8 contain the fix for the issue described in this KB.
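
As a rough aid, the version ranges in this article can be expressed as a small check. This is only a sketch: the function name vkr_learner_bug_status is hypothetical, the parsing assumes the version string starts with MAJOR.MINOR.PATCH (optionally prefixed with "v"), and treating releases later than the fix versions as fixed is an assumption. Versions that fall between the affected and fixed versions named here are reported as unknown, because this article does not cover them.

```shell
# Hypothetical helper: classify a VKr version string against the versions
# named in this article. Prints "affected", "fixed", or "unknown".
vkr_learner_bug_status() {
  _v=${1#v}             # drop an optional leading "v"
  _v=${_v%%[!0-9.]*}    # drop any build suffix, e.g. "+vmware.1"
  case $_v in
    [0-9]*.[0-9]*.[0-9]*) ;;         # need at least MAJOR.MINOR.PATCH
    *) echo unknown; return 0 ;;
  esac
  _major=${_v%%.*}
  _rest=${_v#*.}
  _minor=${_rest%%.*}
  _patch=${_rest#*.}
  _patch=${_patch%%.*}
  case "$_major.$_minor" in
    1.35) _last_affected=2; _first_fixed=5 ;;
    1.34) _last_affected=5; _first_fixed=8 ;;
    *) echo unknown; return 0 ;;
  esac
  if [ "$_patch" -le "$_last_affected" ]; then
    echo affected
  elif [ "$_patch" -ge "$_first_fixed" ]; then
    echo fixed      # assumption: releases after the fix version keep the fix
  else
    echo unknown    # between the listed versions; not covered by this article
  fi
}
```

For example, `vkr_learner_bug_status v1.35.2` prints "affected" and `vkr_learner_bug_status v1.34.8` prints "fixed".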

---

Workaround for previous versions

Recovery requires two actions, performed in order:

 

 (a) Remove the stuck etcd learner member from the VKS cluster.

 (b) Ask Cluster-API to safely replace the failed control plane Machine by adding a remediation annotation to it.

 

Using the annotation (instead of deleting the Machine directly) is the recommended approach because:

  • Cluster-API performs the replacement gracefully: it cordons, drains, and brings up a new Machine in the correct order.
  • It works for ANY user, including SSO users who do not have permission to delete Machine objects.
  • It works regardless of which VCF / VKS administrator role you are assigned, as long as you can patch Machines in the namespace.

 

 

 Step 1. Identify the stuck etcd learner member ID.

  a) List the etcd pods on the workload cluster:

kubectl --kubeconfig=<vks-cluster-kubeconfig> \
  get pods -n kube-system -l component=etcd

  b) Pick a HEALTHY VOTING pod (do NOT pick the pod whose name matches the stuck control plane node), then list members from inside it:

kubectl --kubeconfig=<vks-cluster-kubeconfig> \
  exec -n kube-system <healthy-etcd-pod> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/server.crt \
          --key=/etc/kubernetes/pki/etcd/server.key \
          member list -w table

 

            Note the ID (hex) of any member where IS LEARNER = true.
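
If you prefer not to read the learner ID off the table by eye, the member list can be filtered instead. The sketch below assumes the default (simple) output of etcdctl 3.4 or later, which prints one comma-separated line per member ending with the learner flag; it does not parse the table format used in Step 1. The function name list_etcd_learners is hypothetical.

```shell
# Hypothetical helper: read `etcdctl member list` output in the default
# (simple) format on stdin and print "<member-id> <member-name>" for every
# member whose last field (the learner flag) is "true".
# Assumed line layout: ID, status, name, peer URLs, client URLs, is-learner
list_etcd_learners() {
  awk -F', ' 'NF >= 6 && $NF == "true" { print $1, $3 }'
}

# Usage: run the Step 1 etcdctl command without "-w table" and pipe it in:
#   kubectl --kubeconfig=<vks-cluster-kubeconfig> \
#     exec -n kube-system <healthy-etcd-pod> -- \
#     etcdctl --endpoints=https://127.0.0.1:2379 \
#             --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#             --cert=/etc/kubernetes/pki/etcd/server.crt \
#             --key=/etc/kubernetes/pki/etcd/server.key \
#             member list | list_etcd_learners
```

The first column of the output is the hex ID to pass to the removal command in Step 2.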

 

 Step 2. Remove the learner from etcd, executing from the SAME healthy voting pod you used in Step 1b:

kubectl --kubeconfig=<vks-cluster-kubeconfig> \
  exec -n kube-system <healthy-etcd-pod> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/server.crt \
          --key=/etc/kubernetes/pki/etcd/server.key \
          member remove <learner-id-in-hex>

Note: During this step, you may see the error `Error from server: etcdserver: rpc not supported for learner`, because the etcd client may still send the operation to a learner node. You may also see errors related to the API Server, because the API Server on the second node may not be ready yet. In either case, simply repeat the command.
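
Because the removal may need to be repeated, a generic retry wrapper can save some typing. This is a minimal sketch with a hypothetical function name (retry_cmd) and arbitrary attempt and delay values; it simply reruns whatever command it is given until that command succeeds or the attempts run out.

```shell
# Hypothetical helper: retry_cmd <max-attempts> <delay-seconds> <command...>
# Runs the command until it exits 0 or max-attempts is reached.
# Returns 0 on success, 1 if every attempt failed.
retry_cmd() {
  _retry_max=$1
  _retry_delay=$2
  shift 2
  _retry_i=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$_retry_i" -ge "$_retry_max" ]; then
      return 1
    fi
    _retry_i=$((_retry_i + 1))
    sleep "$_retry_delay"
  done
}

# Usage (attempt count and delay are illustrative):
#   retry_cmd 5 10 kubectl --kubeconfig=<vks-cluster-kubeconfig> \
#     exec -n kube-system <healthy-etcd-pod> -- \
#     etcdctl --endpoints=https://127.0.0.1:2379 \
#             --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#             --cert=/etc/kubernetes/pki/etcd/server.crt \
#             --key=/etc/kubernetes/pki/etcd/server.key \
#             member remove <learner-id-in-hex>
```

If an attempt actually removed the member but still reported an error, later attempts will fail because the member no longer exists, so rerun the member list from Step 1 to confirm the learner is gone before retrying further.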

 Step 3. Trigger Cluster-API to recreate the failed control plane node.

  a) Find the Machine object on the Supervisor whose node corresponds to the stuck control plane:

kubectl get machines.cluster.x-k8s.io -n <namespace> \
         -l cluster.x-k8s.io/cluster-name=<cluster-name>

     The "NODENAME" column should match the learner member name you removed in Step 2.

  b) Annotate that Machine to request safe remediation:

kubectl annotate machine -n <namespace> <machine-name> \
  cluster.x-k8s.io/remediate-machine=""

 

     Cluster-API will recreate the Machine automatically. No delete permission is required.

         

 Step 4. Monitor the new node as it joins. If the join fails again (the bug can recur), repeat steps 1–3.
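
One possible way to monitor the join is to summarize the etcd member list and watch for the learner count to drop to zero. As with the earlier parsing sketch, this assumes the default (simple) output of etcdctl 3.4 or later, and the function name etcd_member_summary is hypothetical.

```shell
# Hypothetical helper: read `etcdctl member list` output in the default
# (simple) format on stdin and print how many voting members and learners
# it contains. Lines with fewer than 6 comma-separated fields are ignored.
etcd_member_summary() {
  awk -F', ' '
    NF >= 6 { if ($NF == "true") learners++; else voters++ }
    END     { printf "voters=%d learners=%d\n", voters, learners }
  '
}

# Usage: pipe the Step 1 etcdctl command (without "-w table") into it:
#   ... etcdctl <flags from Step 1> member list | etcd_member_summary
```

For the 3-node control plane described in this article, the join can be considered complete once the summary reads voters=3 learners=0. A learner that persists would suggest the bug has recurred.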

Additional Information

References:
 • https://github.com/kubernetes/kubernetes/pull/137251
 • https://github.com/kubernetes/kubernetes/pull/138403
 • https://github.com/etcd-io/etcd/pull/21641