During a TKGm Workload Cluster upgrade, the following symptoms are observed:
- Logging in to the stuck node and checking the containerd logs shows the image pull being canceled:

  # ssh capv@<node-ip>
  # sudo -i
  # journalctl -u containerd

  "PullImage \"projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>\" failed" error="rpc error: code = Canceled desc = failed to pull and unpack image \"projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>\": context canceled"

- "crictl images" output on the node does not show any Antrea image.
- "crictl pull projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>" on the node hangs and does not complete.
- "projects.registry.vmware.com" is reachable through ping and curl commands from the problematic node.

A possible cause for this issue is an underlying networking problem on the ESXi host where the stuck node is hosted.
The fact that other control plane/worker nodes were upgraded successfully usually indicates that cluster-level networking works fine.
Since the registry is reachable, it is possible that containerd is timing out when trying to pull the image due to slow connectivity with the registry.
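As a rough, quantitative check of registry latency from the node, the response time of the registry's API endpoint can be measured with curl (which the symptoms confirm is available on the node). This is a sketch: the 5-second cutoff is an illustrative threshold, not a containerd default.

```shell
# Measure how long the registry's /v2/ endpoint takes to answer from this node.
REGISTRY="projects.registry.vmware.com"
t=$(curl -s -o /dev/null --max-time 30 -w '%{time_total}' "https://${REGISTRY}/v2/")

# Flag the node if the round trip took more than 5 seconds (illustrative cutoff).
awk -v t="$t" 'BEGIN { exit !(t+0 > 5) }' \
  && echo "slow response from ${REGISTRY}: ${t}s" \
  || echo "registry responded in ${t}s"
```

Running the same check from a healthy node gives a baseline to compare against.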
To verify connectivity issues with the registry based on the described symptoms, follow these steps:
1. Pause the Workload Cluster so Cluster API does not reconcile the node while you debug:
   # kubectl patch cluster <Workload Cluster> --type merge -p '{"spec":{"paused": true}}' -n <namespace>
2. Log in to the problematic node:
   # ssh capv@<node-ip>
   # sudo -i
3. Pull the Antrea image manually and observe whether it hangs or is noticeably slow:
   # ctr image pull projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>
4. Once done, unpause the Workload Cluster:
   # kubectl patch cluster <Workload Cluster> --type merge -p '{"spec":{"paused": false}}' -n <namespace>

If slow connectivity with the registry is detected from the problematic VM, but not from other VMs, this can indicate an issue with the underlying ESXi host.
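When comparing the manual pull timings between the problematic node and a healthy node, a small helper like the one below can make the judgment explicit. This is a sketch: the 5x ratio and 60-second gap are assumed thresholds, not VMware guidance.

```shell
# Compare pull durations (in seconds) measured on the problematic node vs a
# healthy node; a large gap suggests a host-level issue rather than a
# registry-side or cluster-wide one.
suspect_host() {
  local bad="$1" good="$2"
  # Flag when the problematic node is >5x slower AND at least 60s slower.
  awk -v b="$bad" -v g="$good" 'BEGIN { exit !(b > 5*g && b - g > 60) }'
}

# Example: 600s on the stuck node vs 20s on a healthy one
suspect_host 600 20 && echo "investigate the ESXi host" || echo "connectivity looks comparable"
# prints "investigate the ESXi host"
```

If the timings are comparable on both nodes, the ESXi host is less likely to be the culprit and the registry side should be examined instead.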
On vSphere Client, migrate the VM to a different ESXi host following How to Migrate Your Virtual Machine to a New Compute Resource.
Once migrated, reboot the VM and verify its status. If containerd still complains about the Antrea image pull, run the "ctr image pull" command above again and verify connectivity with the registry.
If the issue is resolved after migrating the VM to a different ESXi host, this most likely means there's an issue with the original ESXi host's networking.
It is recommended to place the problematic ESXi host in Maintenance Mode after migrating or powering off all its hosted VMs; see "Place an ESXi Host in Maintenance Mode in the VMware Host Client".
From the vSphere Client: right-click the ESXi host > Maintenance Mode > Enter Maintenance Mode