Workload Cluster Upgrade Stuck or Change Stuck at New Control Plane VM due to ESXi Join Time-outs
Article ID: 417792

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After initiating a workload cluster upgrade or a change to the workload cluster's control plane nodes, the newest control plane VM reaches the Running state but the rollout does not progress.

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • The new control plane machine reaches Running state:
    kubectl get machine -n <workload cluster namespace>

     

  • However, the KubeadmControlPlane (KCP) object, which manages the control plane nodes, reports the new control plane VM as unhealthy and unavailable:
    kubectl get kcp -n <workload cluster namespace>

     

  • Describing the cluster shows that Remediation Failed for the new control plane node:
    kubectl describe cluster <workload cluster name> -n <workload cluster namespace>
    
    Remediation Failed @ <new control plane VM>

     

While connected to the workload cluster context, the following symptoms are observed:

  • The new control plane node appears in the list of nodes but has no Roles assigned, showing ROLES <none>:
    kubectl get nodes
    
    NAME                           STATUS      ROLES
    <new control plane node>       Ready       <none>

     

  • There are no pods running on the new control plane node:
    kubectl get pods -A -o wide | grep <new control plane node>

     

While SSH'd into the new control plane node, the following symptoms are observed:

  • There are no running containers on the new control plane node:
    crictl ps -a

     

  • cloud-init-output logs show errors similar to the following, indicating that the new control plane node failed to join the ETCD quorum:
    cat /var/log/cloud-init-output.log
    
    adding etcd member as learner
    retrying of unary invoker failed
    error execution phase control-plane-join-etcd: error creating local static pod manifest file: etcdserver: Peer URLs already exists
    kubeadm join phase control-plane-join etcd failed
    fatal error, exiting
    removing member from cluster status
    
    promoting a learner as a voting member
    retrying of unary invoker failed
    etcdserver: can only promote a learner member which is in sync with leader
    etcdserver: request timed out, waiting for the applied index took too long
    etcdserver: can only promote a learner member 
    error execution phase control-plane-join-etcd: error creating local static pod manifest file: etcdserver: can only promote a learner member
    kubeadm join phase control-plane-join etcd failed
    fatal error, exiting
    removing member from cluster status

     

  • The new control plane node can successfully reach the workload cluster's VIP, as well as the ETCD and kube-apiserver endpoints of the other control plane nodes in the workload cluster.

Environment

vSphere Supervisor

This issue can occur regardless of whether or not the workload cluster is managed by Tanzu Mission Control (TMC).

Cause

When a new control plane node is created, it pulls ETCD database information from the existing healthy control plane nodes by joining the existing ETCD quorum as a learner.

However, ETCD enforces a join time-out of 2 minutes by default; if the new control plane node fails to join within that window, the system considers the new control plane node a failure.

As seen in the cloud-init-output logs, if the kubeadm control-plane-join etcd phase fails, clean-up operations are initiated that tear down all containers on the new control plane node.

Kubernetes manifests and ETCD data are removed from the failed new control plane node in preparation for recreating it.

In clusters with a single control plane node, it has been observed that the system may fail to properly recreate the failed new control plane node, leaving it in Running state with a Remediation Failed status.

High resource usage or network slowness can cause time-outs during the ETCD new-member join process.
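When diagnosing this state, it can help to inspect the current ETCD member list from a healthy control plane node to check for a stale or stuck learner entry. The following is a minimal sketch; the container lookup and the kubeadm certificate paths are standard defaults assumed here, not taken from this article:

```shell
# Find the running etcd container on a healthy control plane node,
# then list the ETCD members. A member stuck in learner state or a
# leftover entry for the failed node points at the join time-out issue.
# Certificate paths below are the standard kubeadm locations (assumption).
ETCD_CONTAINER=$(crictl ps --name etcd -q | head -n1)

crictl exec "$ETCD_CONTAINER" etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table
```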

Resolution

IMPORTANT: Do not delete an older control plane VM to force the upgrade or rolling redeployment to progress. Doing so can cause data loss and potentially destroy the cluster's entire database.

The cause of the ETCD join time-outs will need to be determined and resolved.

This can be due to heavy resource usage on the other control plane node(s) in the workload cluster, or networking issues between the existing control plane node(s) and the new control plane node.

 

While connected to the workload cluster context, you can use kubectl top to isolate heavy resource usage:

kubectl top pods -A --sort-by=memory

You can also trace which node a pod is running on by using the -o wide flag:

kubectl get pod <pod name> -n <pod namespace> -o wide

Any customer application pods consuming excessive resources on the control plane nodes can be temporarily scaled down to allow the next control plane node recreation to successfully join ETCD.
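As an illustration, a resource-heavy customer deployment can be scaled down and later restored once the new control plane node has joined. The deployment and namespace names below are placeholders:

```shell
# Record the current replica count so it can be restored later
# (deployment and namespace names are placeholders).
kubectl get deployment <deployment name> -n <namespace> \
  -o jsonpath='{.spec.replicas}'

# Temporarily scale the workload down.
kubectl scale deployment <deployment name> -n <namespace> --replicas=0

# After the new control plane node joins ETCD successfully, restore it.
kubectl scale deployment <deployment name> -n <namespace> --replicas=<original count>
```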

 

Networking tests should be performed between the control plane nodes in the workload cluster to determine whether latency is causing the ETCD join process to take longer than the 2-minute time-out.

  • kube-apiserver uses port 6443. ETCD uses ports 2379 (client) and 2380 (peer).

  • By default, ping is disabled on vSphere Supervisor VMs.
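Because ICMP ping is disabled, TCP-level checks are a practical alternative for verifying reachability of the ports above. A minimal sketch using bash's /dev/tcp; the peer IP below is a placeholder to replace with another control plane node's address:

```shell
#!/bin/bash
# check_port: report whether a TCP port on a host accepts connections
# within a short timeout. Useful when ICMP ping is disabled.
check_port() {
  local host="$1" port="$2" timeout_s="${3:-3}"
  if timeout "$timeout_s" bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable ${host}:${port}"
  else
    echo "UNREACHABLE ${host}:${port}"
  fi
}

# Placeholder address: replace with another control plane node's IP.
PEER_IP="192.0.2.10"
for port in 2379 2380 6443; do   # etcd client, etcd peer, kube-apiserver
  check_port "$PEER_IP" "$port"
done
```

A port reported UNREACHABLE between control plane nodes, or one that only connects after a long delay, indicates the network path that needs investigation.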

Additional Information