vmware-system or kube-system pods in a vSphere Kubernetes cluster fail to start and remain stuck in the ContainerCreating or CrashLoopBackOff state.
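One way to list the pods in this state is to filter the output of `kubectl get pods -A`. The following is a minimal sketch, not a supported tool. It assumes `kubectl` is on the PATH, the affected cluster's context is selected, and the default column layout (NAMESPACE, NAME, READY, STATUS) is in use. The function names `find_stuck_pods` and `list_stuck_pods` are hypothetical.

```python
import subprocess

# States described in this article for the affected pods.
STUCK_STATES = {"ContainerCreating", "CrashLoopBackOff"}
# Assumption: affected namespaces start with one of these prefixes.
NAMESPACE_PREFIXES = ("vmware-system", "kube-system")


def find_stuck_pods(kubectl_output):
    """Return (namespace, pod, status) tuples for system pods in a stuck state.

    Expects the plain-text table printed by `kubectl get pods -A`.
    """
    stuck = []
    for line in kubectl_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        namespace, name, _ready, status = fields[:4]
        if namespace.startswith(NAMESPACE_PREFIXES) and status in STUCK_STATES:
            stuck.append((namespace, name, status))
    return stuck


def list_stuck_pods():
    """Run kubectl against the current context and filter its output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-A"],
        capture_output=True, text=True, check=True,
    )
    return find_stuck_pods(result.stdout)
```

Calling `list_stuck_pods()` from a workstation with access to the cluster returns the matching pods. The parsing is kept separate from the `kubectl` call so that it can also be run against saved command output.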
While connected to the affected cluster's context, the following symptoms are present:
While connected via SSH to the affected node(s), the following symptoms are present:
vSphere with Tanzu 7.0
vSphere with Tanzu 8.0
This can occur on a vSphere Kubernetes cluster regardless of whether it is managed by Tanzu Mission Control (TMC).
Containerd cannot start vmware-system or kube-system pods without an available pause image on the affected node.
The missing pause image was garbage collected because of a known Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/81756
This issue can impact any number of nodes in the affected cluster.
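To confirm on a given node whether a pause image is still present, the containerd image list can be inspected. The following is a minimal sketch under stated assumptions: `crictl` is installed on the node and run with sufficient privileges, its `images` subcommand prints the default table with the repository in the first column, and the pause image's repository path ends in `pause`. The function names `has_pause_image` and `node_has_pause_image` are hypothetical.

```python
import subprocess


def has_pause_image(crictl_images_output):
    """Return True if any listed image repository ends in '/pause' or is 'pause'.

    Expects the plain-text table printed by `crictl images`.
    """
    for line in crictl_images_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue  # blank line
        repository = fields[0]
        # Assumption: the last path component of the repository is "pause".
        if repository.rsplit("/", 1)[-1] == "pause":
            return True
    return False


def node_has_pause_image():
    """Run `crictl images` on the local node and check for a pause image."""
    result = subprocess.run(
        ["crictl", "images"],
        capture_output=True, text=True, check=True,
    )
    return has_pause_image(result.stdout)
```

A result of `False` from `node_has_pause_image()` on an affected node is consistent with the cause described above. This check only reports whether the image is present and does not restore it.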
Please open a ticket with VMware by Broadcom Technical Support, referencing this KB, for assistance in recovering the pause image and adding a label to prevent garbage collection of the missing images.
Note that because the fix is a label that protects the images from garbage collection, the label must be added to the corresponding pause and docker registry images on every node in the cluster, including any nodes added in the future.
A fix for this issue was implemented in vSphere 8.0 TKRs beginning in 1.32:
https://github.com/kubernetes-sigs/image-builder/pull/1373/files