TKGM Workload Cluster upgrade stuck in upgradeStalled state because the new Control Plane VM is stuck in Provisioning

VMware Tanzu Kubernetes Grid Management

The TKGM workload cluster upgrade fails to progress and may eventually show "upgradeStalled" in the cluster status.
One new Control Plane node is deployed and powered on, but it never moves from the "Provisioning" state to "Running".
The vspherevm and vspheremachine objects associated with the new Control Plane node are created; the vspherevm object is powered on and the vspheremachine object has a "PROVIDERID" assigned.
The new Control Plane node has no containers running, as shown by the "crictl ps" command.
The kubelet service is not running on the new Control Plane node, as shown by "systemctl status kubelet".
"systemctl status containerd" shows that the containerd service is running, but it reports errors pulling the coredns image.
The Workload Cluster uses a private image registry.
The /var/log/cloud-init-output.log file on the new Control Plane node reports errors like:

[ERROR ImagePull]: failed to pull image IMAGE_REPO_FQDN/PROJECT/coredns:v1.9.3 vmware.7-fips.1: output: E1024 16:57:44.689441 1881 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Not Found desc = failed to pull and unpack image "IMAGE_REPO_FQDN/PROJECT/coredns:v1.9.3 vmware.7-fips.1\": failed to resolve reference
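To see exactly which images could not be pulled, the log can be filtered on the new Control Plane node. The following is a minimal sketch, assuming the error lines follow the format shown above. The function name failed_image_pulls is hypothetical and is not part of any TKG tooling.

```shell
# Hypothetical helper: list the unique image references that failed to pull,
# based on the "[ERROR ImagePull]" lines in a log file.
# Assumes the line format shown above; defaults to the cloud-init output log.
failed_image_pulls() {
  log_file="${1:-/var/log/cloud-init-output.log}"
  # Capture the text between "failed to pull image " and ": output:",
  # then drop duplicate entries.
  sed -n 's/.*\[ERROR ImagePull\]: failed to pull image \(.*\): output:.*/\1/p' "$log_file" | sort -u
}
```

Running failed_image_pulls with no argument reads /var/log/cloud-init-output.log; a different log path can be passed as the first argument. Each printed reference is a candidate for an image that is missing from the private registry.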

TKGM Workload clusters may encounter this issue during an upgrade. It has been observed on 2.2 to 2.3 upgrades.

This failure occurs because the coredns image referenced in the error has not been added to the private image registry used by the Workload Cluster.

Add the missing image to the private image registry, then recreate the failing machine object so that the upgrade can progress.
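The two steps above can be sketched as shell helpers. This is a sketch under stated assumptions, not a definitive procedure: it assumes docker is available on a host that can reach both the source registry and the private registry, and that kubectl is using the management cluster context. The function names and the DRY_RUN variable are hypothetical, and SOURCE_IMAGE, TARGET_IMAGE, NAMESPACE and MACHINE_NAME are placeholders for values from your environment.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy one image from a source registry to the private registry.
# Both arguments are full image references (placeholders, not real paths).
mirror_image() {
  src="$1"
  dst="$2"
  run docker pull "$src" &&
  run docker tag "$src" "$dst" &&
  run docker push "$dst"
}

# Delete the stuck Control Plane Machine object so that a replacement
# can be created. Assumes kubectl targets the management cluster.
recreate_machine() {
  namespace="$1"
  machine="$2"
  run kubectl delete machine "$machine" --namespace "$namespace"
}
```

With DRY_RUN=1 set, "mirror_image SOURCE_IMAGE TARGET_IMAGE" and "recreate_machine NAMESPACE MACHINE_NAME" only print the commands they would run, which allows a review before anything changes. "kubectl get machines --namespace NAMESPACE" lists the Machine objects so that the one stuck in Provisioning can be identified first.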
