Stale ETCD member prevents vSphere Workload Cluster Upgrade
Article ID: 319400

Products

VMware vSphere Kubernetes Service Tanzu Kubernetes Runtime

Issue/Introduction

A workload cluster upgrade is stuck because a new control plane node never reaches Running state on the desired version.

 

While SSHed directly into the new control plane node, the following error messages are present:

cat /var/log/cloud-init-output.log

running 'kubeadm join phase control-plane-join etcd'
…
…
…
local.go:148] creating etcd client that connects to etcd pods
etcd.go:101] etcd endpoints read from pods: https://<CP node IP A>:2379,https://<CP node IP B>:2379
etcd.go:247] etcd endpoints read from etcd: https://<CP node IP A>:2379,https://<CP node IP B>:2379
etcd.go:119] update etcd endpoints: https://<CP node IP A>:2379,https://<CP node IP B>:2379
local.go:156] [etcd] Getting the list of existing members
local.go:164] [etcd] Checking if the etcd member already exists: https://<new CP node IP>:2380
local.go:179] [etcd] Adding etcd member: https://<new CP node IP>:2380
"caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-<id>/<CP node IP A>:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}

 

 
While connected to the upgrading workload cluster's context, the following errors are present:
kubectl get pods -A | grep etcd

kubectl logs <etcd pod name> -n kube-system

stderr F YYYY-MM-DD HH:MM:SS.sssss I | embed: rejected connection from "<new CP node IP>:50840" (error "EOF", ServerName "")
stderr F YYYY-MM-DD HH:MM:SS.sssss W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
stderr F YYYY-MM-DD HH:MM:SS.sssss W | etcdserver: not enough started members, rejecting member add {ID:<etcd member id> RaftAttributes:{PeerURLs:[https://<new CP node IP>:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
stderr F YYYY-MM-DD HH:MM:SS.sssss W | etcdserver: failed to reach the peerURL(https://<CP node IP A or B>:2380) of member <etcd member id>(Get "https://<CP node IP A or B>:2380/version": dial tcp <CP node IP A or B>:2380: i/o timeout)
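To confirm whether a stale member is registered, the etcd member list can be inspected from a healthy control plane node. This is a read-only diagnostic sketch only; the certificate paths below assume the standard kubeadm layout under /etc/kubernetes/pki/etcd and may differ in your environment:

```shell
# List registered etcd members (read-only). A stale member shows a peer URL
# pointing at a control plane node that no longer exists or is unreachable.
# Certificate paths are assumed from the default kubeadm layout.
etcdctl member list -w table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

Do not remove etcd members yourself in this scenario; per the Resolution section, engage VMware by Broadcom Technical Support.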

 

Environment

vSphere Supervisor

Workload cluster with 3 Control Plane Nodes

Cause

In this scenario, ETCD has a stale member in its otherwise full 3/3 quorum, and it refuses to add the new control plane node to that quorum.

  • ETCD is a critical system process that maintains the database for the workload cluster.

  • It is expected to run as a full 3/3 quorum when the workload cluster is set to have 3 control plane nodes.

  • One ETCD process runs on each control plane node to maintain and sync the database across the workload cluster.

  • However, if ETCD's member list already contains 3 members but one of them is stale and not started, ETCD rejects adding a 4th member: the number of started members (2) would be less than the quorum of the enlarged cluster (3).
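The quorum arithmetic behind the "not enough started members" error can be sketched as follows (a minimal illustration; the `quorum` helper is hypothetical and mirrors etcd's strict-majority rule):

```python
def quorum(n: int) -> int:
    """Strict majority of an n-member etcd cluster."""
    return n // 2 + 1

# Current cluster: 3 registered members, one of them stale (not started).
registered, started = 3, 2

# Adding the new control plane node would grow the member list to 4,
# raising the quorum requirement to 3 while only 2 members are started.
new_size = registered + 1
print(quorum(new_size))            # 3
print(started < quorum(new_size))  # True -> member add is rejected
```

This matches the etcd log line above: "the number of started member (2) will be less than the quorum number of the cluster (3)".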

Resolution

If the errors in the Issue/Introduction section exactly match your workload cluster's stuck upgrade, reach out to VMware by Broadcom Technical Support, referencing this KB article.

If the errors do not match exactly, you are encountering a different issue than the one described in this KB article.

Additional Information

Impact/Risks:
When a stale member is present in an ETCD cluster but is unreachable, it prevents new members from joining the cluster. New Control Plane nodes will deploy but fail their cluster join operations. Because the new nodes cannot join the cluster, they never reach healthy status and are deleted and recreated repeatedly until the ETCD problem is corrected.