VKS control plane nodes are stuck in "Deleting" state with error "Waiting for pre-terminate hooks to succeed (hooks: pre-terminate.delete.hook.machine.cluster.x-k8s.io/kcp-cleanup)"

Article ID: 419762

Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

  • One of the VKS guest cluster control plane nodes is stuck in the "Deleting" state.

  • The describe output of the Machine and the KCP confirms that a "PreTerminateHook" is applied as part of the KCP cleanup.

      * Machine <node-name>:
      * Deleting: Machine deletion in progress since more than 15m, stage: WaitingForPreTerminateHook
      * Control plane components: Machine is deleting
      * EtcdMemberHealthy: Machine is deleting

          Message:               Waiting for pre-terminate hooks to succeed (hooks: pre-terminate.delete.hook.machine.cluster.x-k8s.io/kcp-cleanup)
          Observed Generation:   4
          Reason:                WaitingForPreTerminateHook
          Status:                True
          Type:                  Deleting

  • Per the CAPI logs, there is no issue with pods failing to drain from the affected node, nor is a PDB (Pod Disruption Budget) violation halting the drain process.

  • Per the KCP logs, KCP is unable to move the etcd leadership to another healthy control plane node in the cluster:

    "Reconciler error" err="failed to move leadership to candidate Machine <Node-name>: failed to create etcd client: etcd leader is reported as <etcd-ID of the node> with name \"<Node-name>\", but we couldn't find a corresponding Node in the cluster" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="<namespace>/<name of the KCP>" namespace="<namespace>" name="<name of the KCP>" reconcileID="<ID>" 

  • Checking the etcd status from within the cluster confirms that all members are in a "healthy" state and that the etcd quorum is intact.

  • The affected node is the etcd leader.

  • All the system pods in the cluster, especially those on the affected node, are up and running.

  • Inside the Guest Cluster, the affected node is in a "Ready,SchedulingDisabled" state. In addition, the "control-plane" role/label is missing from the affected node. Below is how the output of "kubectl get nodes" looks:

    NAME            STATUS                     ROLES
    Test-CP-node1   Ready,SchedulingDisabled   <none>
    Test-CP-node2   Ready                      control-plane
    Test-CP-node3   Ready                      control-plane

  • The logs in /var/log/cloud-init-output.log on the affected node confirm that the bootstrap was successful. However, the labels and taints associated with a control plane node were never applied to it. Under normal circumstances, the output includes the control plane label and taint line (third line below):

    * Certificate signing request was sent to apiserver and approval was received.
    * The Kubelet was informed of the new secure connection details.
    * Control plane label and taint were applied to the new node.
    * The Kubernetes control plane instances scaled up.
    * A new etcd member was added to the local/stacked etcd cluster.
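The etcd health check described above can be performed from within the guest cluster. The commands below are a sketch: the etcd pod name and the certificate paths are the kubeadm stacked-etcd defaults and are assumptions that may need adjusting for your environment.

```shell
# Check etcd health from inside one of the etcd pods in the guest cluster.
# Pod name and certificate paths are kubeadm defaults (assumptions).
kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

# Show member status, including which member is the leader (IS LEADER column).
kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table
```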




Environment

VMware vSphere Kubernetes Service

Cause

In Kubernetes, a control plane node carries specific roles/labels and taints that differentiate it from the worker/workload nodes in the cluster. CAPI/KCP relies on these labels to identify a control plane node within the cluster and take any action required on it. Below are the minimum expected labels and taints on a control plane node.

Labels: 
        node-role.kubernetes.io/control-plane=
        node-role.kubernetes.io/master=

Taints: node-role.kubernetes.io/control-plane:NoSchedule

If these labels are missing, CAPI cannot recognize the node and therefore cannot perform any required reconciliation/remediation on it.
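Whether the expected labels and taints are present can be verified with the read-only commands below (the node name is a placeholder):

```shell
# List all labels on the node; the control-plane role label should be present.
kubectl get node <node-name> --show-labels

# Inspect the taints applied to the node.
kubectl describe node <node-name> | grep -A 2 "Taints:"
```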

Resolution

Add the missing role/label to the control plane node so that CAPI can successfully complete its reconciliation. CAPI will then remove the affected node gracefully and spin up a replacement node without disrupting the etcd quorum.

  1. The missing label(s) and taint(s) can be added with the commands below, depending on which label(s) and taint(s) are missing from the node.

    kubectl label node <node name> node-role.kubernetes.io/control-plane=""
    kubectl label node <node name> node-role.kubernetes.io/master=""
    kubectl taint node <node name> node-role.kubernetes.io/control-plane="":NoSchedule


  2. Adding the missing label(s) and taint(s) above should allow CAPI to complete the cluster reconciliation; allow it some time. If the node is still stuck in the "Deleting" state, restart the KCP controller pods on the Supervisor cluster using the command below.

    root@<ID> [ ~ ]# kubectl rollout restart deployment capi-kubeadm-control-plane-controller-manager -n svc-tkg-domain-c<ID>
    deployment.apps/capi-kubeadm-control-plane-controller-manager restarted

Once this is done, cluster reconciliation should start, and the control plane node stuck in "Deleting" should be removed and replaced with a new node.
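Reconciliation progress can be watched from the Supervisor cluster; the namespace below is a placeholder:

```shell
# Watch the Machine objects in the guest cluster's namespace on the Supervisor.
# The stuck Machine should leave the Deleting phase and a replacement appear.
kubectl get machines -n <namespace> -w

# Confirm the KubeadmControlPlane reports all replicas ready again.
kubectl get kubeadmcontrolplane -n <namespace>
```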