Recover a Guest Cluster after a Control Plane node was deleted manually


Article ID: 374762


Products

VMware vSphere Kubernetes Service

Issue/Introduction

  •  Control plane nodes were deleted manually.

  • "Role" for the Control Plane node shows as <none> while running kubectl get nodes -A command.

  • The etcd member list shows three Control Plane (CP) VMs; however, one of the entries reports a status of "context deadline exceeded" and corresponds to the VM that was previously deleted.

    The capw-controller-manager logs on the Supervisor contain entries similar to the following:

    controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed attempt to reconcile etcd members: cluster has fewer than 2 control plane nodes; removing an etcd member is not supported" "name"="<Control Plane VM name>" "namespace"="<namespace>" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane".
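
For reference, the affected node typically appears similar to the output below (node names, ages, and versions are masked/illustrative); healthy control plane nodes show the control-plane,master roles, while the newly created node shows <none>:

kubectl get nodes
NAME                                STATUS   ROLES                  AGE   VERSION
guest-cluster-control-plane-#####   Ready    control-plane,master   ##d   v1.##.#+vmware.#
guest-cluster-control-plane-#####   Ready    <none>                 ##m   v1.##.#+vmware.#
guest-cluster-workers-#####         Ready    <none>                 ##d   v1.##.#+vmware.#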

Environment

vSphere Kubernetes Service 8.x
vSphere Kubernetes Service 7.x

Cause

Manually deleting a Control Plane node during the upgrade process is not recommended, as it can lead to etcd data loss. In this case, a Control Plane node was manually deleted, causing the upgrade to stall for the guest cluster. The etcd member removal process requires a minimum of two healthy etcd members in order to proceed and complete the upgrade.
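
The stalled rollout can typically be confirmed from the Supervisor cluster by inspecting the KubeadmControlPlane object for the guest cluster (a sketch, assuming the standard Cluster API resource names; the namespace and object name are placeholders):

kubectl get kubeadmcontrolplane -n <namespace>
kubectl describe kubeadmcontrolplane <guest-cluster-control-plane-name> -n <namespace>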

 

Resolution

 

  • Step 1:
    • Restart the kubelet on the newly created control plane node with systemctl restart kubelet, as shown below. The node will be re-added to the cluster at the Kubernetes layer.
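
For example, after SSHing to the newly created control plane VM (a sketch; the node name is a placeholder):

systemctl restart kubelet
systemctl status kubelet        # confirm the service is active (running)
kubectl get node <node-name>    # run against the guest cluster context; the node should report Ready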
  • Step 2:

    • Edit the newly created control plane node to add the missing annotation, label, and taint.

    • Compare the YAML of the unhealthy node with that of a healthy node to identify the missing parameters such as taints and labels.

    • Run the command kubectl get no <node-name> -o yaml and look for the following label, annotation, and taint.

Label: node-role.kubernetes.io/control-plane:""

Annotation: kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock

Taint: node-role.kubernetes.io/master:NoSchedule

    • Taint the node using the following command:

kubectl taint node <node-name> node-role.kubernetes.io/control-plane:NoSchedule

    • Label the node using the command:

kubectl label node <node-name> node-role.kubernetes.io/control-plane=""

OR

kubectl label node <node-name> node-role.kubernetes.io/master=""

Choose between master and control-plane based on how the other healthy control plane nodes are labeled.
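
If the cri-socket annotation listed above is also missing, it can be added in the same way (a sketch; the socket path should match the value found on the healthy node):

kubectl annotate node <node-name> kubeadm.alpha.kubernetes.io/cri-socket=/run/containerd/containerd.sock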

    • Verify the node status as shown below - it should now report Ready with the roles control-plane,master.
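
kubectl get node <node-name>
# The recovered node should now show STATUS Ready and ROLES control-plane,master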
  • Step 3:
    • Once all the required CP nodes are added back to the cluster, check the etcd member list and cluster health. SSH to the guest cluster control plane VM and verify the following:

# etcdctl -w table member list
+------------------+---------+----------------------------------+-------------------------+--------------------------+------------+
|        ID        | STATUS  |                    NAME          |        PEER ADDRS       |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+----------------------------------+-------------------------+--------------------------+------------+
| 4f############9b | started | guest-cluster-control-plane-#####| https://##.##.#.##:2380 | https://##.###.#.##:2379 |      false |
| 9d############05 | started | guest-cluster-control-plane-#####| https://##.##.#.##:2380 | https://##.###.#.##:2379 |      false |
| b7############cd | started | guest-cluster-control-plane-#####| https://##.##.#.##:2380 | https://##.###.#.##:2379 |      false |
+------------------+---------+----------------------------------+-------------------------+--------------------------+------------+


# etcdctl -w table endpoint --cluster health
+-------------------------+--------+-------------+-------+
|         ENDPOINT        | HEALTH |    TOOK     | ERROR |
+-------------------------+--------+-------------+-------+
| https://##.##.#.##:2379 |   true | 20.672251ms |       |
| https://##.##.#.##:2379 |   true | 19.126599ms |       |
| https://##.##.#.##:2379 |   true | 27.634253ms |       |
+-------------------------+--------+-------------+-------+
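
Optionally, the endpoint status can also be checked to confirm that a leader has been elected and all members are in sync (this uses the standard etcdctl subcommand; output will vary by environment):

# etcdctl -w table endpoint status --cluster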

  • Step 4:
    • Make sure the old/deleted control plane VMs are not powered on on the ESXi hosts. If they are, power them off and delete them with customer confirmation, as those VMs are no longer part of the cluster.

Additional Information

Set the etcd alias on the TKC CP VM before running the etcdctl commands:

  •  Identify the current etcdctl location
find / | grep bin | grep etcdctl
  • Replace the * in the command below with the snapshot directory from the etcdctl path identified above
alias etcdctl="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt"
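
For example, if the find command returns a path under a snapshot directory named 1234 (a hypothetical snapshot ID; use the actual value from your node), the alias would look like the following, and a quick version check confirms it resolves before running the member list commands:

alias etcdctl="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1234/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt"
etcdctl version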