A workload cluster upgrade is stuck on a new control plane node that never reaches the Running state on the desired version.
After connecting over SSH directly to the new control plane node, the following error messages are present in the cloud-init output log:
cat /var/log/cloud-init-output.log
running 'kubeadm join phase control-plane-join etcd'
…
…
…
local.go:148] creating etcd client that connects to etcd pods
etcd.go:101] etcd endpoints read from pods: https://<CP node IP A>:2379,https://<CP node IP B>:2379
etcd.go:247] etcd endpoints read from etcd: https://<CP node IP A>:2379,https://<CP node IP B>:2379
etcd.go:119] update etcd endpoints: https://<CP node IP A>:2379,https://<CP node IP B>:2379
local.go:156] [etcd] Getting the list of existing members
local.go:164] [etcd] Checking if the etcd member already exists: https://<new CP node IP>:2380
local.go:179] [etcd] Adding etcd member: https://<new CP node IP>:2380
"caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-<id>/<CP node IP A>:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
The logs of an existing etcd pod in the workload cluster show errors similar to the following:
kubectl get pods -A | grep etcd
kubectl logs <etcd pod name> -n kube-system
stderr F YYYY-MM-DD HH:MM:SS.sssss I | embed: rejected connection from "<new CP node IP>:50840" (error "EOF", ServerName "")
stderr F YYYY-MM-DD HH:MM:SS.sssss W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
stderr F YYYY-MM-DD HH:MM:SS.sssss W | etcdserver: not enough started members, rejecting member add {ID:<etcd member id> RaftAttributes:{PeerURLs:[https://<new CP node IP>:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
stderr F YYYY-MM-DD HH:MM:SS.sssss W | etcdserver: failed to reach the peerURL(https://<CP node IP A or B>:2380) of member <etcd member id>(Get "https://<CP node IP A or B>:2380/version": dial tcp <CP node IP A or B>:2380: i/o timeout)
vSphere Supervisor
Workload cluster with 3 Control Plane Nodes
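In this scenario, etcd still lists a stale member among its three registered members, so only two members are started. Adding the new control plane node would create a four-member cluster with a quorum of three, which the two started members cannot satisfy, so etcd rejects the request to add the new member.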
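The quorum arithmetic behind the "started member (2) will be less than the quorum number of the cluster (3)" message can be illustrated with a short sketch. This is a simplified model, not etcd's actual implementation; the function names `quorum` and `can_add_member` are hypothetical, and the model assumes that the new member is counted toward the cluster size before it has started.

```python
def quorum(member_count: int) -> int:
    """Majority required for a cluster with member_count voting members."""
    return member_count // 2 + 1


def can_add_member(registered: int, started: int) -> bool:
    """Simplified model (an assumption, not etcd source code): the add is
    accepted only if the currently started members still form a quorum
    once the new, not yet started, member is counted."""
    return started >= quorum(registered + 1)


# Scenario from the etcd log: 3 registered members, 1 stale, so 2 started.
print(quorum(3 + 1))         # 3
print(can_add_member(3, 2))  # False: stale member blocks the add
print(can_add_member(3, 3))  # True: all three members are healthy
```

With all three existing members started, the same add request would satisfy the quorum of three, which is why the stale member is what blocks the upgrade.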
If the errors in the Issue/Introduction section exactly match the state of your stuck workload cluster upgrade, please contact VMware by Broadcom Technical Support and reference this KB article.
If the errors do not match exactly, the cluster is affected by a different issue than the one described in this KB article.
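As an aid for comparing collected logs against the messages quoted above, the following is a minimal sketch that checks for the key substrings from this article. The function name `missing_signatures`, the `SIGNATURES` table, and the choice of substrings are illustrative assumptions; the sketch expects the log text to have been saved already (for example from `/var/log/cloud-init-output.log` and `kubectl logs`), and a match does not replace analysis by Technical Support.

```python
# Substrings copied from the error messages quoted in this article.
# Which substrings to treat as the signature is an assumption of this sketch.
SIGNATURES = {
    "cloud-init": [
        "re-configuration failed due to not enough started members",
    ],
    "etcd": [
        "Reject add member request",
        "not enough started members, rejecting member add",
        "failed to reach the peerURL",
    ],
}


def missing_signatures(cloud_init_log: str, etcd_log: str) -> list:
    """Return the signature substrings that were NOT found in the logs.

    An empty list means every message quoted in the article is present.
    """
    logs = {"cloud-init": cloud_init_log, "etcd": etcd_log}
    missing = []
    for source, patterns in SIGNATURES.items():
        for pattern in patterns:
            if pattern not in logs[source]:
                missing.append(f"{source}: {pattern}")
    return missing


if __name__ == "__main__":
    # Hypothetical file names holding the previously collected log output.
    with open("cloud-init-output.log") as f:
        cloud_init_text = f.read()
    with open("etcd-pod.log") as f:
        etcd_text = f.read()
    for entry in missing_signatures(cloud_init_text, etcd_text):
        print("not found:", entry)
```

If the script prints nothing, all quoted messages were found in the collected logs; any printed entry indicates a message from this article that is absent.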