You’re upgrading your Supervisor and Guest Clusters from v1.26 to v1.27, with plans to continue testing through v1.28 and v1.29.x. During the upgrade from v1.26 to v1.27, the CSI pods fail to start, and your applications go offline. All CSI pods report repeated image pull failures.
In the kubelet logs, you see the following error:
Error: ErrImagePull
Failed to pull image “localhost:5000/tkg/sandbox/
rpc error: code = NotFound desc = failed to pull and unpack image “localhost:5000/tkg/sandbox/
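To see which pods are affected without reading kubelet logs on each node, you can list the pods whose containers are waiting on an image pull. This is a minimal sketch: the function name `pods_failing_pull` is hypothetical, and you supply the namespace that holds the CSI pods.

```shell
# Hypothetical helper: list pods in a namespace whose containers are
# stuck waiting on an image pull.
# Argument: <namespace>
pods_failing_pull() {
  ns="$1"
  # Print "<pod name><TAB><waiting reasons>" for every pod, then keep only
  # the lines that show an image pull failure.
  kubectl -n "$ns" get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | grep -E 'ErrImagePull|ImagePullBackOff'
}
```

Example: pods_failing_pull <namespace>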
Attempts to manually place the CSI image into the registry are unsuccessful. The image you provide lacks the proper tags and digests, so Kubernetes is still unable to resolve it. The CSI app remains stuck and does not deploy.
Your CSI app and its associated PackageInstall (PKGI) are both in a paused state. The CSI app references image version 3.1.0, while the PKGI reports version 3.2.0 with reconcileSucceeded=true. However, neither is actively reconciling.
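You can read the pause flag directly from each object. The sketch below assumes the flag is stored in `spec.paused` on both the app and the PKGI, which is the same field the unpause commands patch; the function name `pause_state` is hypothetical.

```shell
# Hypothetical helper: print the pause flag of an app or PKGI object.
# Arguments: <namespace> <kind: app|pkgi> <name>
pause_state() {
  ns="$1"; kind="$2"; name="$3"
  # spec.paused is empty when the field was never set, so report "false".
  state="$(kubectl -n "$ns" get "$kind" "$name" -o jsonpath='{.spec.paused}')"
  echo "${kind}/${name} paused=${state:-false}"
}
```

Example: pause_state <namespace> app <app name>, then pause_state <namespace> pkgi <pkgi name>. If both report paused=true, neither object is being reconciled.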
Based on state inspection and upgrade behavior, you determine that the pause was introduced manually during or before the 1.26 lifecycle. The paused state prevents controllers from finding the correct image on the worker nodes. This causes your controller pods to remain unscheduled, and CSI functionality fails.
When you attempt to confirm the origin of the pause, you find that kube-apiserver audit logs are missing from the WCP support bundle. This prevents definitive tracing of the event, but you have a high degree of certainty that a manual pause was the root cause.
Manually unpause the CSI app and PKGI object. This action allows reconciliation to complete successfully.
kubectl -n <namespace> patch app <app name> -p '{"spec":{"paused":false}}' --type=merge
kubectl -n <namespace> patch pkgi <pkgi name> -p '{"spec":{"paused":false}}' --type=merge
After unpausing, confirm that the pause state does not revert during subsequent upgrades, and continue to monitor the CSI app’s reconciliation health.
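One way to check reconciliation health after the patch is to wait on the object's status condition. This is a sketch, not a required step: it assumes the app and PKGI both expose a `ReconcileSucceeded` condition, as the PKGI status above suggests. The function name `wait_reconciled` and the 300s default timeout are hypothetical.

```shell
# Hypothetical helper: block until an app or PKGI reports ReconcileSucceeded,
# or fail when the timeout expires.
# Arguments: <namespace> <kind: app|pkgi> <name> [timeout, default 300s]
wait_reconciled() {
  ns="$1"; kind="$2"; name="$3"; timeout="${4:-300s}"
  kubectl -n "$ns" wait --for=condition=ReconcileSucceeded \
    "$kind/$name" --timeout="$timeout"
}
```

Example: wait_reconciled <namespace> app <app name>. A non-zero exit status means the condition was not met within the timeout, which is worth investigating before you continue to the next upgrade.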