Upgrade of VKS guest cluster stuck as a control plane node cannot be found

search cancel

Upgrade of VKS guest cluster stuck as a control plane node cannot be found

book

Article ID: 414462

calendar_today

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

You upgrade a vSphere Kubernetes Service (VKS) guest cluster and the upgrade gets stuck updating control plane nodes.

Cluster will have following error:

Control plane components: Node xyz does not exist

You fetch all control plane machines and see that the machine xyz does exists and is in a running state.

kubectl get machine -n <NAMESPACE> | grep -v worker

Use kubeconfig of the cluster to look at the nodes:

export TKC=<YOUR GUEST CLUSTER NAME> NS=<CLUSTER'S NAMESPACE>
kubectl get secret -n ${NS} ${TKC}-kubeconfig -o jsonpath={.data.value} | base64 -d > ${TKC}-kubeconfig
KUBECONFIG=${TKC}-kubeconfig kubectl get nodes
${KUBECONFIG} kubectl get nodes --show-labels| grep -v "worker"

You will see the node is Ready but is not marked as Control Plane. Also, this node will not have "node-role.kubernetes.io/control-plane=" label on it.

ssh into the control plane node (see the Additional Information section below) and tail /var/log/clou-init-output.log. You will see error similar to:

[####-##-## ##:##:##] [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap
[####-##-## ##:##:##] I0913 14:56:41.226935 1926 kubelet.go:332] [kubelet-start] preserving the crisocket >information for the node
[####-##-## ##:##:##] I0913 14:56:41.227454 1926 patchnode.go:32] [patchnode] Uploading the CRI socket >"unix:///var/run/containerd/containerd.sock" to Node "cluster-name-xyz" as an annotation
[####-##-## ##:##:##] error execution phase kubelet-wait-bootstrap: error writing CRISocket for this node: Unauthorized
[####-##-## ##:##:##] To see the stack trace of this error execute with --v=5 or higher
[####-##-## ##:##:##] + set -xe
[####-##-## ##:##:##] + touch /root/kubeadm-complete
[####-##-## ##:##:##] + vmware-rpctool 'info-set guestinfo.kubeadm.phase complete'

This means that there was an error during cloud init and kubeadm join failed.

Environment

All Kubernetes clusters can have this issue

Resolution

Please contact Broadcom support for assistance.

Additional Information

Feedback

thumb_up Yes

thumb_down No