During a VKS cluster upgrade or initial zonal deployment, a control plane node may become stuck in the "Provisioning" phase, preventing the upgrade or deployment from completing. The node successfully registers with etcd as a "learner" (a non-voting replica) but is never promoted to a full voting member, leaving the cluster unable to progress without manual intervention.
You are likely hitting this bug if ALL of the following are true:
Where to look:
VKr v1.35.2 and earlier (within the v1.35.x line)
VKr v1.34.5 and earlier (within the v1.34.x line)
When a new control plane node joins an etcd cluster, it first enters a "learner" state where it syncs data but cannot vote. Once synced, kubeadm promotes it by calling the etcd MemberList API. Due to an upstream Kubernetes bug in the embedded etcd gRPC client, the MemberList call can accidentally be sent to the new learner node itself instead of to one of the existing voting members. The learner correctly rejects this call ("rpc not supported for learner"), but kubeadm treats the rejection as a fatal error, waits 2 minutes, and then gives up. The node is left permanently stuck as a learner that is never promoted.
VKr v1.35.5 and v1.34.8 contain the fix for the issue described in this KB.
---
Recovery requires two actions, performed in order:
(a) Remove the stuck etcd learner member from the VKS cluster.
(b) Ask Cluster-API to safely replace the failed control plane Machine by adding a remediation annotation to it.
Using the annotation (instead of deleting the Machine directly) is the recommended approach because:
Step 1. Identify the stuck etcd learner member ID.
a. List the etcd pods in the VKS cluster:
kubectl --kubeconfig=<vks-cluster-kubeconfig> \
get pods -n kube-system -l component=etcd
b. Choose a healthy etcd pod (NOT the one whose name
matches the stuck control plane node), then list members from
inside it:
kubectl --kubeconfig=<vks-cluster-kubeconfig> \
exec -n kube-system <healthy-etcd-pod> -- \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member list -w table
Note the ID (hex) of any member where IS LEARNER = true.
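If you want to pick out the learner ID in a script rather than by eye, the filter below is a minimal sketch. It assumes etcdctl's default ("simple") output for `member list`, where each line is comma-separated and the last field is the learner flag. The function name `learner_ids` is hypothetical and is not part of etcdctl.

```shell
# learner_ids: read `etcdctl member list` output (default "simple" format)
# on stdin and print the hex ID of every member whose learner flag is true.
# Assumed line layout (not the table layout shown with `-w table`):
#   ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS, IS LEARNER
learner_ids() {
  # Split on ", " and compare the last field, so that an empty NAME or
  # CLIENT ADDRS field on an unstarted member does not shift the columns.
  awk -F', ' '$NF == "true" { print $1 }'
}
```

To use it, run the `member list` command from Step 1 without `-w table` and pipe the output into `learner_ids`. Confirm the printed ID against the table output before removing anything.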
Step 2. Remove the learner from etcd, executing from the SAME healthy
voting pod you used in Step 1b:
kubectl --kubeconfig=<vks-cluster-kubeconfig> \
exec -n kube-system <healthy-etcd-pod> -- \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member remove <learner-id-in-hex>
Note: During this step, you may encounter the following error: `Error from server: etcdserver: rpc not supported for learner`, as the etcd client may still attempt to execute the operation against a learner node. You may also encounter errors related to the API Server, as the API Server on the second node may not yet be ready. If you encounter any of the above issues, simply repeat the command.
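Repeating the command can be wrapped in a small retry helper. The following is a sketch only: `retry_cmd` is a hypothetical helper name, and the attempt count and delay in the usage comment are illustrative assumptions, not values taken from this KB.

```shell
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_cmd() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command still failing after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example usage (assumed values: up to 5 attempts, 10 seconds apart):
#   retry_cmd 5 10 kubectl --kubeconfig=<vks-cluster-kubeconfig> \
#     exec -n kube-system <healthy-etcd-pod> -- \
#     etcdctl --endpoints=https://127.0.0.1:2379 \
#     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#     --cert=/etc/kubernetes/pki/etcd/server.crt \
#     --key=/etc/kubernetes/pki/etcd/server.key \
#     member remove <learner-id-in-hex>
```

The helper stops at the first successful exit, so the removal is not issued again once etcd has accepted it.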
Step 3. Trigger Cluster-API to recreate the failed control plane node.
kubectl get machines.cluster.x-k8s.io -n <namespace> \
-l cluster.x-k8s.io/cluster-name=<cluster-name>
The "NODENAME" column should match the learner member name you
removed in Step 2.
kubectl annotate machines.cluster.x-k8s.io -n <namespace> <machine-name> \
cluster.x-k8s.io/remediate-machine=""
Cluster-API will recreate the Machine automatically. No delete permission is required.
Step 4. Monitor the new node as it joins. If the join fails again (the bug can recur), repeat steps 1–3.
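One way to monitor the replacement is to poll the phase of the new Machine object. The sketch below assumes you have already looked up the name of the replacement Machine with the `kubectl get machines.cluster.x-k8s.io` command from Step 3. The helper name `wait_for_machine_running`, the poll count, and the interval are hypothetical choices for illustration.

```shell
# wait_for_machine_running NAMESPACE MACHINE [MAX_POLLS] [INTERVAL_SECONDS]
# Polls the Cluster-API Machine's .status.phase until it is "Running".
# Returns 0 when Running, 1 if the phase becomes "Failed" or polls run out.
wait_for_machine_running() {
  ns=$1
  machine=$2
  max_polls=${3:-60}   # assumed default: 60 polls
  interval=${4:-30}    # assumed default: 30 seconds between polls
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # Treat a failed lookup as an empty phase and keep polling.
    phase=$(kubectl get machines.cluster.x-k8s.io -n "$ns" "$machine" \
      -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    echo "machine $machine phase: ${phase:-unknown}"
    case $phase in
      Running) return 0 ;;
      Failed)  return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Example usage:
#   wait_for_machine_running <namespace> <new-machine-name>
```

A non-zero return only means the Machine did not reach Running within the polling window. Check the etcd member list as in Step 1 to decide whether the join failed again.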