TMC-SM extension pods are stuck in ImagePullBackOff following a Supervisor Cluster upgrade

Products

VMware Tanzu Platform - Kubernetes

Issue/Introduction

After upgrading the Supervisor Cluster, TMC-SM extension deployments fail to start.
Describing any pod in ImagePullBackOff shows that the container image pull fails due to missing authentication.
Namespace affected: svc-tmc-c#
Impacted pods belong to the following deployments:
agent-updater, sync-agent, tmc-auto-attach, vsphere-resource-retriever, intent-agent, cluster-health-extension, extension-manager and extension-updater.
Sample error log (kubectl describe / kubelet)

MONTH DD HH:MM:SS <Supervisor_UUID> kubelet[61986]: E1226 HH:MM:SS.856470 61986 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"agentupdater-workload\" with ErrImagePull: \"failed to pull and unpack image \\\"<image_repository>/tmc-sm/tap-tmc-docker-virtual.usw1.packages.broadcom.com/extensions/agent-updater/agentupdater-workload@sha256:<digest>

Environment

TMC-SM

Cause

Although a valid imagePullSecret (tmc-registry-pull) exists, the Kubernetes Deployments do NOT reference it in their pod template.
As a result, the kubelet attempts anonymous pulls, resulting in ImagePullBackOff.
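
To confirm this cause before changing anything, the deployments whose pod template carries no imagePullSecrets can be listed. The following is a minimal sketch, not a supported tool: the function name deployments_missing_pull_secret is hypothetical, and the namespace argument stands in for the actual svc-tmc-c# namespace.

```shell
# Sketch: print the name of every deployment in a namespace whose pod
# template has no imagePullSecrets entry.
# Assumption: kubectl is already pointed at the Supervisor Cluster.
deployments_missing_pull_secret() {
  local namespace="$1"
  # Emit "<deployment-name><TAB><pull secret names>" per deployment,
  # then keep only the rows where the second column is empty.
  kubectl -n "$namespace" get deploy \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.imagePullSecrets[*].name}{"\n"}{end}' \
    | awk -F '\t' '$2 == "" { print $1 }'
}
```

Usage: deployments_missing_pull_secret <namespace>. An empty result would suggest that every deployment already references a pull secret and that the failure has a different cause.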

Resolution

Validate registry pull secret exists:

kubectl get secrets -n svc-tmc-c#

Note: The expected output should include: tmc-registry-pull kubernetes.io/dockerconfigjson

Confirm which ServiceAccount the pods use:

kubectl -n svc-tmc-c# get pod <pod-name> -o jsonpath='{.spec.serviceAccountName}{"\n"}'
If the output is blank or default, the pods use the default ServiceAccount.

Patch the default ServiceAccount to include the imagePullSecret:

kubectl -n svc-tmc-c# patch serviceaccount default -p '{"imagePullSecrets":[{"name":"tmc-registry-pull"}]}'

Validate that the ServiceAccount patch is applied:

kubectl -n svc-tmc-c# get sa default -o yaml | sed -n '/imagePullSecrets/,+5p'
Expected snippet:

imagePullSecrets:
- name: tmc-registry-pull

Decode and verify the secret content:

kubectl -n svc-tmc-c# get secret tmc-registry-pull -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

Validate image registry access manually from the node:

crictl pull --creds '<robot_user>:<password>' <registry>/tmc-sm/tap-tmc-docker-virtual.usw1.packages.broadcom.com/extensions/agent-updater/agentupdater-workload@sha256:<digest>

If the pull succeeds, the registry is reachable and the credentials are valid.

Add imagePullSecrets to each Deployment. ServiceAccount injection is not automatic across the TKG/TMC layers, so hard-wire the pull secret into each deployment using the following command:

kubectl -n svc-tmc-c# patch deploy <deployment> --type='json' -p='[{"op":"add","path":"/spec/template/spec/imagePullSecrets","value":[{"name":"tmc-registry-pull"}]}]'

(Repeat for each affected deployment.)
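
Repeating the patch by hand for eight deployments is error-prone, so the step can be wrapped in a loop. This is a sketch under assumptions: the deployment object names are assumed to match the names listed in the Issue/Introduction section exactly (confirm with kubectl get deploy first), and the function name patch_tmc_deployments is hypothetical.

```shell
# Sketch: apply the same JSON patch to every affected deployment.
# $1 = namespace (the actual svc-tmc-c# namespace)
# $2 = pull secret name, defaulting to tmc-registry-pull
patch_tmc_deployments() {
  local namespace="$1"
  local secret="${2:-tmc-registry-pull}"
  local patch deploy
  # Same patch body as the single-deployment command in this article.
  patch='[{"op":"add","path":"/spec/template/spec/imagePullSecrets","value":[{"name":"'"$secret"'"}]}]'
  # Assumed deployment names; adjust if the cluster uses different ones.
  for deploy in agent-updater sync-agent tmc-auto-attach \
                vsphere-resource-retriever intent-agent \
                cluster-health-extension extension-manager extension-updater; do
    kubectl -n "$namespace" patch deploy "$deploy" --type='json' -p="$patch" \
      || echo "patch failed for $deploy" >&2
  done
}
```

Usage: patch_tmc_deployments <namespace>. A failure on one deployment is reported on stderr and the loop continues, so a single misnamed deployment does not block the rest.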
Roll out new pods by scaling each deployment down and back up using the following commands:

kubectl -n svc-tmc-c# scale deploy <deployment-name> --replicas=0   # scale down
kubectl -n svc-tmc-c# scale deploy <deployment-name> --replicas=2   # scale up
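
The scale-down and scale-up pair can also be scripted so that the original replica count is restored instead of a fixed value. The sketch below is one possible approach, not the documented procedure: it reads the current replica count first, falls back to 2 (the value used above) when the count is empty or 0, and then waits for the rollout. The function name restart_deployment and the 180-second timeout are assumptions.

```shell
# Sketch: bounce one deployment and wait for it to become ready.
# $1 = namespace, $2 = deployment name
restart_deployment() {
  local namespace="$1" deploy="$2" replicas
  # Remember how many replicas the deployment currently requests.
  replicas="$(kubectl -n "$namespace" get deploy "$deploy" -o jsonpath='{.spec.replicas}')"
  # Fall back to 2, as in this article, if the count is empty or 0.
  if [ -z "$replicas" ] || [ "$replicas" = "0" ]; then
    replicas=2
  fi
  kubectl -n "$namespace" scale deploy "$deploy" --replicas=0
  kubectl -n "$namespace" scale deploy "$deploy" --replicas="$replicas"
  # Block until the new pods are ready or the (assumed) timeout expires.
  kubectl -n "$namespace" rollout status "deploy/$deploy" --timeout=180s
}
```

Usage: restart_deployment <namespace> <deployment-name>. A non-zero exit status from the final command indicates that the pods did not become ready within the timeout.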
Validate:

kubectl get pods -n svc-tmc-c#

All TMC-SM pods should now be Running.

Additional Information

The DaemonSet domain-local-ds is responsible for distributing Harbor registry certificates to the Supervisor control plane nodes.

The issue likely occurs because the existing nodes are replaced with new VMs during the Supervisor upgrade. If the domain-local-ds pods fail to execute correctly on these new nodes, the certificates are not populated.
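
One way to check this is to compare the desired and ready pod counts of the DaemonSet. The sketch below assumes the DaemonSet object is named domain-local-ds exactly as written above, and it searches all namespaces because this article does not state which namespace the DaemonSet lives in. The function name daemonset_not_ready is hypothetical.

```shell
# Sketch: report namespaces where the DaemonSet has fewer ready pods
# than desired. Prints nothing when every instance is fully ready.
# $1 = DaemonSet name, defaulting to domain-local-ds
daemonset_not_ready() {
  local name="${1:-domain-local-ds}"
  # Emit "<namespace><TAB><desired><TAB><ready>" for each match,
  # then keep only rows where the two counts differ.
  kubectl get ds -A --field-selector "metadata.name=$name" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.status.desiredNumberScheduled}{"\t"}{.status.numberReady}{"\n"}{end}' \
    | awk -F '\t' '$2 != $3 { printf "%s: %s desired, %s ready\n", $1, $2, $3 }'
}
```

Usage: daemonset_not_ready. Any output would point to nodes where the DaemonSet pods are not ready, which is consistent with certificates not having been populated on the new VMs.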