Guest cluster upgrade stuck with node in NotReady state in vSphere Kubernetes Service

Article ID: 440423

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • A newly deployed Control Plane node remains in a NotReady state.

  • The kubelet logs and the Ready condition of the impacted node show that the CNI plugin is not initialized and that image pulls are failing:

    journalctl -xeu kubelet 

    Ready            False        KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

    kubelet[1445]: E0506   1445 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"<pod name>\" with ImagePullBackOff: \"Back-off pulling image \\\"localhost:5000/tkg/packages/core/<pod>@sha256:####\\\": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image \\\"localhost:5000/tkg/packages/core/<pod>@sha256:####\\\": failed to resolve reference \\\"localhost:5000/tkg/packages/core/antrea@sha256:####\\\": localhost:5000/tkg/packages/core/<pod>@sha256:####: not found\"" pod="kube-system/<pod name>" podUID="<pod ID>"

  • The machine and VM objects are in the Running state.

  • The node has successfully joined the etcd cluster. 
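
The Ready condition shown above can also be read directly with kubectl. The following is a minimal sketch rather than an official procedure: it assumes the current kubectl context points at the guest cluster, and the function name ready_message is hypothetical.

```shell
# Hypothetical helper: print the message of a node's Ready condition.
# Assumes the current kubectl context is the affected guest cluster.
ready_message() {
  node="$1"
  # The JSONPath filter selects the condition whose type is "Ready".
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}

# Usage (replace the placeholder with the affected node's name):
#   kubectl get nodes
#   ready_message <node name>
```

On an affected node, the message would be expected to contain the "cni plugin not initialized" text quoted above.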

Environment

vSphere Kubernetes Service

Cause

The issue occurs when a new Control Plane node joins the cluster but fails to initialize the network plugin or pull the required container images. This is typically caused by the node lacking the expected node-role labels or taints, which prevents the capi-controller-manager and clusterbootstrap controllers from correctly identifying and configuring the node during the rollout sequence.
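
One way to confirm this cause is to check whether the node carries the control-plane role labels and taint. The sketch below is illustrative only: the function name check_cp_node is hypothetical, and it assumes that the label and taint keys to look for are the ones applied in the Resolution section.

```shell
# Hypothetical helper (not part of any VMware tooling): report whether a
# node carries the control-plane role labels and taint.
# Assumes the current kubectl context points at the guest cluster.
check_cp_node() {
  node="$1"
  for key in node-role.kubernetes.io/control-plane node-role.kubernetes.io/master; do
    # "-l <key>" selects nodes on which the label key exists, whatever its value.
    if kubectl get nodes -l "$key" -o name | grep -qxF "node/$node"; then
      echo "label present: $key"
    else
      echo "label MISSING: $key"
    fi
  done
  # Keys of all taints on the node, space-separated (empty if there are none).
  taints=$(kubectl get node "$node" -o jsonpath='{.spec.taints[*].key}')
  case " $taints " in
    *" node-role.kubernetes.io/control-plane "*)
      echo "taint present: node-role.kubernetes.io/control-plane" ;;
    *)
      echo "taint MISSING: node-role.kubernetes.io/control-plane" ;;
  esac
}

# Usage: check_cp_node <node name>
```

Any line reported as MISSING points to a label or taint that was not set on the node.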

Resolution

Manually apply the missing role labels and taint to the affected Control Plane node, replacing <node name> with the name of that node:

  • kubectl label node <node name> node-role.kubernetes.io/control-plane=""
  • kubectl label node <node name> node-role.kubernetes.io/master=""
  • kubectl taint node <node name> node-role.kubernetes.io/control-plane="":NoSchedule
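
After the commands above have been run, the result can be checked from the same kubectl session. This is a hedged sketch: the function name verify_cp_node and the default 300-second timeout are assumptions, and it presumes the node reports Ready once the controllers have reconciled it.

```shell
# Hypothetical helper: show the node's labels, then block until the node
# reports Ready or the timeout expires.
# The default timeout of 300s is an arbitrary assumption.
verify_cp_node() {
  node="$1"
  timeout="${2:-300s}"
  # Confirm that the role labels are now present on the node.
  kubectl get node "$node" --show-labels
  # Returns non-zero if the node is not Ready within the timeout.
  kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"
}

# Usage: verify_cp_node <node name>
#        verify_cp_node <node name> 600s
```

If the wait times out, re-checking the kubelet logs on the node is a reasonable next step.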