System pods in a vSphere Supervisor workload cluster are failing to run and remain stuck in CrashLoopBackOff or ContainerCreating state.
This can cause newly created system pods in the workload cluster to hang, or an upgrade of the workload cluster to stall or fail to progress.
While connected to the affected workload cluster's context, the following symptoms are present:
kubectl get pods -A | grep -v Run
kubectl describe pod <pod-name> -n <pod-namespace>
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost:5000/vmware/pause:#.#": failed to pull image "localhost:5000/vmware/pause:#.#": failed to pull and unpack image "localhost:5000/vmware/pause:#.#": failed to resolve reference "localhost:5000/vmware/pause:#.#": failed to do request: Head "http://localhost:5000/v2/vmware.io/": dial tcp 127.0.0.1:5000: connect: connection refused
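The two checks above can be sketched as small shell helpers. This is only an illustration, not a supported tool: the function names `filter_stuck_pods` and `has_pause_sandbox_error` are hypothetical, and the first helper assumes the default `kubectl get pods -A` column layout (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# namespace/name for pods whose STATUS column mentions CrashLoopBackOff or
# ContainerCreating. Assumes the default column order.
filter_stuck_pods() {
  awk '$1 != "NAMESPACE" && $4 ~ /CrashLoopBackOff|ContainerCreating/ { print $1 "/" $2 }'
}

# Hypothetical helper: reads `kubectl describe pod` output on stdin and
# succeeds only if it contains the pause sandbox image error shown above.
has_pause_sandbox_error() {
  grep -q 'failed to get sandbox image.*pause'
}

# Example use against a live cluster (not executed here):
#   kubectl get pods -A | filter_stuck_pods
#   kubectl describe pod <pod-name> -n <pod-namespace> | has_pause_sandbox_error && echo "pause image error found"
```

Reading from stdin keeps the helpers independent of the cluster connection, so they can also be run against saved command output.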
While connected over SSH to the affected node(s), one or more of the following symptoms are present:
crictl ps -a | grep docker
crictl images list | grep pause
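The pause image check on a node can be scripted along these lines. This is a minimal sketch: the function name `find_pause_images` is hypothetical, and it assumes the image listing prints the repository in the first column and the tag in the second.

```shell
# Hypothetical helper: reads the crictl image listing on stdin, prints every
# image whose repository name ends in "pause" as repo:tag, and returns
# non-zero when no pause image is listed.
find_pause_images() {
  awk '$1 ~ /(^|\/)pause$/ { print $1 ":" $2; found = 1 } END { exit !found }'
}

# Example use on an affected node (not executed here):
#   crictl images list | find_pause_images || echo "no pause image on this node"
```

A non-zero exit status on a node matches the missing pause image symptom described in this article.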
vSphere with Tanzu 7.0
vSphere with Tanzu 8.0
This can occur on a vSphere Kubernetes cluster whether or not it is managed by Tanzu Mission Control (TMC).
Containerd cannot start vmware-system or kube-system pods without an available pause image on the affected node.
The missing pause image was garbage collected because of a known Kubernetes issue related to DiskPressure on the node: https://github.com/kubernetes/kubernetes/issues/81756
This issue can impact any number of nodes in the affected cluster.
Existing pods that had successfully started when the pause image was originally present will be unaffected by this issue until they are recreated.
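Because the cause is tied to DiskPressure, it can help to see which nodes currently report that condition. The sketch below is an assumption-laden illustration: the function name `nodes_with_disk_pressure` is hypothetical, and it expects tab-separated "node name, DiskPressure status" lines such as those produced by the `kubectl` JSONPath query shown in the comment. The condition may already have cleared on a node where garbage collection has freed space, so an empty result does not rule this issue out.

```shell
# Hypothetical helper: reads "<node-name><TAB><DiskPressure status>" lines on
# stdin and prints the names of nodes whose status is True.
nodes_with_disk_pressure() {
  awk -F '\t' '$2 == "True" { print $1 }'
}

# Example use against a live cluster (not executed here):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
#     | nodes_with_disk_pressure
```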
Please open a ticket with VMware by Broadcom Technical Support, referencing this KB, for assistance in recovering the pause image and adding a label that prevents garbage collection of the missing images.
NOTE: Because this is a potential garbage collection issue which is fixed by adding a label, all nodes in the cluster (including any future nodes!) will need the label added to the corresponding pause and docker registry images.
A fix for this issue was implemented in TKR v1.32 for vSphere 8.0 and in later TKR versions.
The affected cluster shows Ready True state despite this missing pause image issue because health checks have found that the CNI (antrea or calico) is Healthy.