After initiating a workload cluster upgrade or a change to the workload cluster's control plane nodes, the newest control plane VM reaches the Running state but does not progress further.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get machine -n <workload cluster namespace>
kubectl get kcp -n <workload cluster namespace>
kubectl describe cluster <workload cluster name> -n <workload cluster namespace>
The cluster description reports a message similar to:
Remediation Failed @ <new control plane VM>
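For more detail on the machine that failed remediation, the corresponding Machine object can be described (the machine name below is a placeholder for the new control plane VM):
kubectl describe machine <new control plane VM> -n <workload cluster namespace>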
While connected to the workload cluster context, the following symptoms are observed:
kubectl get nodes
NAME                       STATUS   ROLES
<new control plane node>   Ready    <none>
kubectl get pods -A -o wide | grep <new control plane node>
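To confirm whether the control plane static pods (such as etcd and kube-apiserver) were created on the new node, the pods scheduled to that node can also be listed with a field selector (the node name is a placeholder):
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<new control plane node>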
While connected via SSH to the new control plane node, the following symptoms are observed:
crictl ps -a
cat /var/log/cloud-init-output.log
adding etcd member as learner
retrying of unary invoker failed
error execution phase control-plane-join-etcd: error creating local static pod manifest file: etcdserver: Peer URLs already exists
kubeadm join phase control-plane-join etcd failed
fatal error, exiting
removing member from cluster status
promoting a learner as a voting member
retrying of unary invoker failed
etcdserver: can only promote a learner member which is in sync with leader
etcdserver: request timed out, waiting for the applied index took too long
etcdserver: can only promote a learner member
error execution phase control-plane-join-etcd: error creating local static pod manifest file: etcdserver: can only promote a learner member
kubeadm join phase control-plane-join etcd failed
fatal error, exiting
removing member from cluster status
vSphere Supervisor
This issue can occur regardless of whether the workload cluster is managed by Tanzu Mission Control (TMC).
When a new control plane node is created, it joins the existing healthy ETCD quorum as a learner and syncs the ETCD database from the existing healthy control plane nodes.
However, the ETCD join process has a default time-out of 2 minutes, which means that if the new control plane node fails to join within 2 minutes, the system considers the new control plane node a failure.
As shown in the cloud-init-output logs, if the kubeadm control-plane-join etcd phase fails, clean-up operations are initiated and tear down all containers on the new control plane node.
Kubernetes manifests and ETCD data are removed from the new control plane node that was detected as a failure, in preparation for recreating it.
In clusters with a single control plane node, it has been observed that the system may not be able to properly recreate the failed new control plane node, and it is left in Running state with a Remediation Failed status.
High resource usage or network slowness can cause time-outs in the ETCD new-member join process.
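To check the current ETCD membership, for example to see whether a learner member from the failed join is still listed, the member list can be queried from an existing healthy control plane node. This is a sketch that assumes the standard kubeadm certificate paths and a containerd-managed etcd container:
# Run over SSH on an existing healthy control plane node
crictl exec $(crictl ps --name etcd -q) etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table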
IMPORTANT: It is not appropriate to delete an older control plane VM to progress the upgrade or rolling redeployment. This can cause data loss and potential destruction of the cluster's entire database.
The cause of the ETCD join time-outs will need to be determined and resolved.
This can be due to heavy resource usage on the other control plane node(s) in the workload cluster, or networking issues between the existing control plane node(s) and the new control plane node.
While connected to the workload cluster context, you can use kubectl top to isolate heavy resource usage:
kubectl top pods -A --sort-by=memory
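Node-level usage can be checked as well, assuming the metrics API is available in the workload cluster:
kubectl top nodes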
You can also trace which node a pod is running on by using the -o wide flag:
kubectl get pod <pod name> -n <pod namespace> -o wide
Any customer application pods consuming excessive resources on the control plane nodes can be temporarily scaled down to allow the next control plane node recreation to successfully join ETCD.
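For example, a resource-heavy application Deployment can be temporarily scaled to zero and scaled back up once the new control plane node has joined successfully (the Deployment name and namespace below are placeholders):
kubectl scale deployment <deployment name> -n <deployment namespace> --replicas=0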
Networking tests should be performed between the control plane nodes in the workload cluster to determine if there is latency that would cause the ETCD join process to take longer than 2 minutes.
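As a basic check, latency can be measured from an existing control plane node to the new control plane node's IP over SSH (the IP address is a placeholder; connectivity to the ETCD peer port 2380 between control plane nodes should also be verified):
ping -c 5 <new control plane node IP>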