TKGM Workload Cluster upgrade stuck in upgradeStalled state due to new CP VM stuck in Provisioning

Article ID: 380702

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

  • The TKGM workload cluster upgrade stops progressing and may eventually show "upgradeStalled" in the cluster status.
  • One new control plane node is deployed and powered on, but it never moves from the "Provisioning" state to "Running".
  • The vspherevm and vspheremachine objects for the new control plane node are created; the vspherevm object is powered on and the vspheremachine object has a "PROVIDERID" assigned.
  • On the new control plane node, "crictl ps" shows no running containers.
  • On the new control plane node, "systemctl status kubelet" shows that the kubelet is not running.
  • "systemctl status containerd" shows that the service is running, but it reports errors pulling the coredns image.
  • The workload cluster uses a private image registry.
  • /var/log/cloud-init-output.log on the new control plane node reports errors like:

    [ERROR ImagePull]: failed to pull image IMAGE_REPO_FQDN/PROJECT/coredns:v1.9.3 vmware.7-fips.1: output: E1024 16:57:44.689441 1881 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Not Found desc = failed to pull and unpack image \"IMAGE_REPO_FQDN/PROJECT/coredns:v1.9.3 vmware.7-fips.1\": failed to resolve reference
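
The symptoms above can be checked with a few commands. The following is a minimal sketch, assuming root shell access to the new control plane node and a kubectl context that points at the management cluster. The function names are illustrative and are not part of TKG.

```shell
#!/bin/sh
# Illustrative helpers; the function names are hypothetical.

# Print the unique image references that failed to pull, as recorded in a
# cloud-init output log (defaults to the path named in this article).
failed_images() {
  log="${1:-/var/log/cloud-init-output.log}"
  grep -F '[ERROR ImagePull]' "$log" |
    sed -n 's/.*failed to pull image \(.*\): output:.*/\1/p' |
    sort -u
}

# Run on the new control plane node (root privileges are usually required).
node_symptoms() {
  crictl ps                                # expect no running containers
  systemctl status kubelet --no-pager      # expect kubelet not running
  systemctl status containerd --no-pager   # expect running, with pull errors
  failed_images
}

# Run against the management cluster; $1 is the workload cluster's namespace.
mgmt_view() {
  kubectl get machines,vspheremachines,vspherevms -n "$1"
}
```

The output of failed_images is the exact reference to look for in the private registry.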

     

Environment

TKGM workload clusters that use a private image registry may encounter this issue during an upgrade. It has been observed on upgrades from 2.2 to 2.3.

Cause

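This failure occurs because the coredns image referenced in the error has not been added to the private image registry used by the workload cluster. The image pull fails on the new node, so the node never completes bootstrapping and remains in "Provisioning".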
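
One way to confirm this cause is to try pulling the image named in the error directly on the new node. The following is a minimal sketch, assuming crictl is configured for containerd on that node. The image_pullable function and the PULL_CMD override are illustrative names, not part of any tool.

```shell
#!/bin/sh
# Succeeds (exit status 0) only if the given image reference can be pulled.
# PULL_CMD is a hypothetical override for trying another client;
# the default is "crictl pull".
image_pullable() {
  ${PULL_CMD:-crictl pull} "$1" >/dev/null 2>&1
}

# Usage on the node, with the reference copied from the ImagePull error:
#   image_pullable 'IMAGE_REPO_FQDN/PROJECT/coredns:<tag>' \
#     || echo "image missing or registry unreachable"
```

A failed pull does not by itself distinguish a missing image from an authentication or network problem, so compare the result with the "failed to resolve reference" message in the log.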

Resolution

Add the missing image to the private image registry, using the same repository path and tag shown in the error, then recreate the failing machine object so that the upgrade can progress.
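
The two steps can be sketched as follows. This is a minimal example, not a definitive procedure. It assumes a workstation with Docker that can reach both a source registry holding the image and the private registry, and that deleting the stuck Machine object in the management cluster causes a replacement to be created. All arguments are placeholders, the function names are hypothetical, and the helper only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper names; replace the placeholder arguments with real values.

# Map a source image reference onto the private registry project,
# keeping the final path component (name:tag) unchanged.
retarget_image() {
  src="$1"; dest_repo="$2"
  printf '%s/%s\n' "${dest_repo%/}" "${src##*/}"
}

# Print (do not run) the commands that copy the image and recreate the machine.
# Arguments: source image, private repo (IMAGE_REPO_FQDN/PROJECT),
#            Machine name, namespace of the workload cluster.
plan_fix() {
  src="$1"; dest_repo="$2"; machine="$3"; ns="$4"
  dst="$(retarget_image "$src" "$dest_repo")"
  echo "docker pull $src"
  echo "docker tag $src $dst"
  echo "docker push $dst"
  # The delete must be run against the management cluster context.
  echo "kubectl delete machine $machine -n $ns"
}
```

The destination reference must match the one in the ImagePull error exactly, including the tag. After the Machine object is deleted, "kubectl get machines -n NAMESPACE" in the management cluster should show a new control plane machine moving from "Provisioning" to "Running".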