Stale ETCD member prevents vSphere with Tanzu Guest Cluster Upgrade


Article ID: 319400


Updated On:

Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

Symptoms:

  • TKGS Guest Cluster is stuck in Updating phase.
  • A new Control Plane VM is deployed, powered on, and partially configured, but never reaches the Ready state and is repeatedly redeployed.
  • In /var/log/vmware/wcp/wcpsvc.log on the vCenter Server appliance, the following messages appear:
PowerState:poweredOn Phase:Created Conditions:[{Type:Ready Status:False Severity:Info LastTransitionTime:2023-01-01 14:06:23 +0000 UTC Reason:NotReady Message:VM <NAMESPACE>/<CLUSTER>-control-plane-abc65 doesn't have an IP assigned}
 
  • From an SSH session to the new Control Plane node that is failing to join, the following messages appear in /var/log/cloud-init-output.log:
+++ [2023-01-01 14:06:23+00:00] running 'kubeadm join phase control-plane-join etcd'



I0101 14:06:23.485076    2076 local.go:148] creating etcd client that connects to etcd pods
I0101 14:06:23.504109    2076 etcd.go:101] etcd endpoints read from pods: https://10.244.0.10:2379,https://10.244.0.11:2379
I0101 14:06:32.378923    2076 etcd.go:247] etcd endpoints read from etcd: https://10.244.0.10:2379,https://10.244.0.11:2379
I0101 14:06:32.378947    2076 etcd.go:119] update etcd endpoints: https://10.244.0.10:2379,https://10.244.0.11:2379
I0101 14:06:32.378957    2076 local.go:156] [etcd] Getting the list of existing members
I0101 14:06:32.385975    2076 local.go:164] [etcd] Checking if the etcd member already exists:
https://10.244.0.14:2380
I0101 14:06:32.386025    2076 local.go:179] [etcd] Adding etcd member: https://10.244.0.14:2380

{"level":"warn","ts":"2023-01-01T14:06:32.398Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-b3f8cf7f-a54e-4c85-a0a9-21874e0a4742/10.244.0.10:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
 
 
  • From an SSH session to the existing, functional Control Plane nodes, the following messages appear in /var/log/pods/kube-system_etcd-<CLUSTER_NAME>-control-plane-6klxs_<ETCD_CONTAINER_ID>/etcd/0.log:
2023-01-01T14:06:27.88448600Z stderr F 2023-01-01 14:06:27.884486 I | embed: rejected connection from "10.244.0.14:50840" (error "EOF", ServerName "")
2023-01-01T14:06:27.88448633Z stderr F 2023-01-01 14:06:27.884391 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2023-01-01T14:06:27.88462487Z stderr F 2023-01-01 14:06:27.884445 W | etcdserver: not enough started members, rejecting member add {ID:507ef0b3a3472f4b RaftAttributes:{PeerURLs:[https://10.244.0.14:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2023-01-01T14:06:27.909179548Z stderr F 2023-01-01 14:06:27.909094 W | etcdserver: failed to reach the peerURL(https://10.244.0.13:2380) of member b505642f490d572 (Get "https://10.244.0.13:2380/version": dial tcp 10.244.0.13:2380: i/o timeout)
 
  • In the logging above, the new member's IP is 10.244.0.14, while the existing, healthy Control Plane nodes report a failure to reach 10.244.0.13, the stale member. The quorum messages also show that this cluster is configured with 3 Control Plane nodes. A quick way to confirm these messages is shown in the sketch below.
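
The following sketch can be used to confirm the symptoms above by searching the logs already referenced. The search strings are taken from the example messages and may differ slightly between releases, so treat them as illustrative rather than exact matches.

  • On the vCenter Server appliance, search for the NotReady / missing IP message:
# grep "doesn't have an IP assigned" /var/log/vmware/wcp/wcpsvc.log
  • On the new Control Plane node that is failing to join, search the cloud-init output for the rejected etcd member add:
# grep "not enough started members" /var/log/cloud-init-output.log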



Environment

VMware vSphere 7.0 with Tanzu
VMware vSphere 8.0 with Tanzu

Cause

ETCD requires quorum, a majority of started members, in order to function. A new member cannot be added to a cluster that has lost quorum, and a member add is also rejected if it would drop the cluster below quorum. ETCD clusters are built with an odd number of members, and quorum for an N-member cluster is (N/2)+1, rounded down. For a 3-member cluster with 1 member down, the cluster still has quorum with 2 started members; however, adding a new member would grow the cluster to 4 members and raise the quorum requirement to 3 while only 2 members are started, so the add is rejected (this is the "not enough started members" error seen above). The new node therefore cannot join until the stale 3rd member is either brought back online or removed from the member list.
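
The quorum state can be inspected directly from a functional Control Plane node. The following is a minimal sketch, using the etcdctl alias defined in the Resolution section below and assuming a recent etcdctl v3:

# etcdctl endpoint health --cluster
# etcdctl endpoint status --cluster -w table

Endpoints are derived from each member's registered client URLs, so a member that never started may not appear in this output at all; the member list in the Resolution section is the authoritative view of cluster membership.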

Resolution

To resolve this failure condition, the stale ETCD member must be removed from the ETCD cluster, working from one of the functional Guest Cluster Control Plane nodes. The following steps detail how to identify the stale member and then remove it.


1. SCOPE AND IDENTIFY THE PROBLEM MEMBER

  • SSH to one of the functional Control Plane nodes and switch to the root user:
# sudo su
  • Gather the ETCD container ID:
# crictl ps
  • Create an alias to reference this container ID when running the "etcdctl" command:
# alias etcdctl='crictl exec <ETCD_CONTAINER_ID FROM CRICTL PS OUTPUT> etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'
  • List ETCD members:
# etcdctl member list
 
Example Output:
# etcdctl member list
b505642f490d572, unstarted, , https://10.244.0.13:2380, false
15837190f1c0921b, started, test-cluster-control-plane-6klxs, https://10.244.0.10:2380, https://10.244.0.10:2379, false
6b68c8f8f1a74b37, started, test-cluster-control-plane-xlz7q, https://10.244.0.11:2380, https://10.244.0.11:2379, false
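
In the example output above, the stale member is the one shown as unstarted with no node name and no client URL (b505642f490d572 here). As an optional cross-check before removing it, confirm that the member's peer address does not belong to any running node. A sketch, assuming the admin kubeconfig is at its usual kubeadm location on the Control Plane node (path is an assumption):

# kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide

The stale member's IP (10.244.0.13 in this example) should not appear in the INTERNAL-IP column.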

 
2. REMOVE THE PROBLEM MEMBER FROM ETCD CLUSTER
  • Use etcdctl member remove to remove the stale member, passing the member ID identified in the previous step:
# etcdctl member remove b505642f490d572

Example Output:
# etcdctl member remove b505642f490d572
 Member b505642f490d572 removed from cluster d78b4e500f76b
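
After the removal, re-running the member list is a simple verification step; a minimal sketch using the same etcdctl alias:

# etcdctl member list

Only started members should remain. On the next reconciliation, the new Control Plane node should be able to complete its etcd join and the Guest Cluster update should proceed.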

 



Additional Information

Impact/Risks:
When a stale, unreachable member is present in an ETCD cluster, it can prevent new members from joining. New Control Plane nodes will deploy but fail the cluster join operation; because they can never become healthy, they are deleted and recreated repeatedly until the ETCD membership is corrected.
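
This redeploy loop can be observed from the Supervisor Cluster context by watching the Machine objects for the affected Guest Cluster; a sketch, assuming the vSphere Namespace placeholder <NAMESPACE> used earlier in this article:

# kubectl get machines -n <NAMESPACE> -w

New Control Plane Machines will cycle between provisioning and deletion until the stale ETCD member is removed as described above.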