After an etcd repair, KCP does not allow deletion of control plane nodes as part of cluster reconcile/remediation.

Article ID: 427318

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • In a VKS Guest Cluster (Workload Cluster), a Control Plane node may get stuck in a Remediating state following a storage or network outage.

  • Guest Cluster status.conditions shows WorkersAvailable as False and ControlPlaneMachinesReady as False.

  • Control Plane Machine is stuck in the Deleting phase.

  • KCP (KubeadmControlPlane) pod logs in the Supervisor Cluster report the following error:

    "Reconciler error" err="failed to remove etcd member for deleting Machine ... cluster has fewer than 2 control plane nodes; removing an etcd member is not supported"

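The Machine symptom above can be checked from the Supervisor Cluster with a small helper. This is a minimal sketch, assuming kubectl already targets the Supervisor Cluster; the function name list_deleting_machines is hypothetical and not part of any VMware tooling.

```shell
# Hypothetical helper: print the names of Machines whose phase is "Deleting".
# Assumes the current kubectl context targets the Supervisor Cluster.
# $1 is the namespace that holds the Guest Cluster objects.
list_deleting_machines() {
  kubectl get machines -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase |
    awk '$2 == "Deleting" { print $1 }'
}
```

Run it as list_deleting_machines <namespace>. An empty result means no Machine in that namespace is currently in the Deleting phase.
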
Environment

VMware vSphere Kubernetes Service

Cause

The KCP controller includes a safety guardrail that refuses to remove an etcd member when the cluster has fewer than 2 control plane nodes. If multiple control plane nodes fail simultaneously (for example, during an infrastructure outage) and are then manually repaired or replaced, KCP detects an inventory mismatch. It refuses to delete the "phantom" Machine object because it cannot safely run the etcd member removal against a quorum that no longer exists, so reconciliation loops on the same error.
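
One way to see the mismatch described above is to compare the nodeRef names recorded on the Machine objects with the Nodes that exist in the Guest Cluster. The following is a sketch only, not how KCP itself performs the check. The function name find_phantom_node_refs and the SUPERVISOR_CTX and GUEST_CTX kubeconfig context variables are assumptions introduced for illustration.

```shell
# Hypothetical helper: print Machine nodeRef names that have no matching Node
# in the Guest Cluster. SUPERVISOR_CTX and GUEST_CTX are assumed kubeconfig
# context names; they are not defined by this article.
find_phantom_node_refs() {
  ns="$1"
  # nodeRef names recorded on Machine objects (Supervisor Cluster side).
  refs="$(kubectl --context "$SUPERVISOR_CTX" get machines -n "$ns" \
    -o jsonpath='{range .items[*]}{.status.nodeRef.name}{"\n"}{end}')"
  # Node names that actually exist in the Guest Cluster.
  nodes="$(kubectl --context "$GUEST_CTX" get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
  # Print each non-empty nodeRef that is absent from the Node list.
  printf '%s\n' "$refs" | while IFS= read -r ref; do
    [ -n "$ref" ] || continue
    printf '%s\n' "$nodes" | grep -Fxq -- "$ref" || printf '%s\n' "$ref"
  done
}
```

Any name it prints belongs to a Machine whose Node is missing from the Guest Cluster, which is a candidate for the workaround in the Resolution section.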

Resolution

This is a known issue. A permanent fix is scheduled for VKS 3.7.

Workaround: To break the reconciliation loop, you must trick KCP into believing the node exists so it can proceed with the deletion logic.

  1. Identify the Node Name: On the Supervisor Cluster, retrieve the nodeRef name of the Machine stuck in the Deleting phase:

    Bash

    kubectl get machine <machine-name> -n <namespace> -o jsonpath='{.status.nodeRef.name}'

  2. Create a Dummy Node Object: On the Guest Cluster, create a temporary placeholder Node object using the name retrieved in Step 1.

    YAML
     
    apiVersion: v1
    kind: Node
    metadata:
      labels:
        node-role.kubernetes.io/control-plane: ""
      name: <node-name-from-step-1>
    spec: {}
    

    Save the manifest as dummy-node.yaml and apply it against the Guest Cluster with kubectl apply -f dummy-node.yaml.

  3. Monitor Deletion: Once the dummy node exists, the KCP controller retries reconciliation and proceeds with the Machine deletion and etcd member removal. After the Machine object is gone from the Supervisor Cluster, KCP automatically scales up new control plane nodes to meet the desired replica count.
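
The three steps above can be sketched as a pair of shell functions. This is an illustration under stated assumptions, not a supported script: the function names, the SUPERVISOR_CTX and GUEST_CTX kubeconfig context variables, and the 10 minute timeout are all hypothetical choices.

```shell
# Hypothetical helper: print a placeholder Node manifest for the given name,
# mirroring the YAML from Step 2.
render_dummy_node() {
  cat <<EOF
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role.kubernetes.io/control-plane: ""
  name: $1
spec: {}
EOF
}

# Hypothetical wrapper around Steps 1-3. SUPERVISOR_CTX and GUEST_CTX are
# assumed kubeconfig context names; the 10m timeout is an arbitrary choice.
unblock_machine_deletion() {
  machine="$1"
  ns="$2"
  # Step 1: read the nodeRef name from the stuck Machine.
  node="$(kubectl --context "$SUPERVISOR_CTX" get machine "$machine" -n "$ns" \
    -o jsonpath='{.status.nodeRef.name}')"
  if [ -z "$node" ]; then
    echo "Machine $machine has no nodeRef; stopping" >&2
    return 1
  fi
  # Step 2: create the placeholder Node in the Guest Cluster.
  render_dummy_node "$node" |
    kubectl --context "$GUEST_CTX" apply -f - || return 1
  # Step 3: block until the Machine object is removed from the Supervisor.
  kubectl --context "$SUPERVISOR_CTX" wait --for=delete "machine/$machine" \
    -n "$ns" --timeout=10m
}
```

Call it as unblock_machine_deletion <machine-name> <namespace>. If the wait times out, the KCP pod logs are the place to check whether the original reconciler error is still being reported.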