vSphere Kubernetes Cluster Pods Stuck ContainerCreating or CrashLoopBackOff due to Missing Pause Image

Article ID: 379856

Products

VMware vSphere 7.0 with Tanzu

VMware vSphere with Tanzu

vSphere with Tanzu

Issue/Introduction

vmware-system or kube-system pods in a vSphere Kubernetes cluster fail to start and remain stuck in the ContainerCreating or CrashLoopBackOff state.

While connected to the affected cluster's context, the following symptoms are present:

  • vmware-system or kube-system pods are stuck in ContainerCreating or CrashLoopBackOff state
  • Describing a pod stuck in ContainerCreating or CrashLoopBackOff state shows an error similar to the following: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost:5000/vmware/pause:X.X": failed to pull image "localhost:5000/vmware/pause:X.X": failed to pull and unpack image "localhost:5000/vmware/pause:X.X": failed to resolve reference "localhost:5000/vmware/pause:X.X": ... connect: connection refused
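The two cluster-side symptoms above can be checked in bulk with a small helper. The following is a minimal sketch, not a supported tool: it assumes the default column layout of "kubectl get pods -A" (NAMESPACE, NAME, READY, STATUS, ...), treats any namespace starting with "vmware-system" as in scope, and the function names are hypothetical.

```python
import re

# Matches the image reference quoted in the sandbox error from the pod events.
SANDBOX_IMAGE_RE = re.compile(r'failed to get sandbox image "([^"]+)"')

# Pod states named in this article.
STUCK_STATES = {"ContainerCreating", "CrashLoopBackOff"}


def stuck_pods(get_pods_output):
    """Return (namespace, pod, status) tuples for stuck system pods.

    Expects the text printed by `kubectl get pods -A`.
    Assumption: STATUS is the fourth whitespace-separated column.
    """
    found = []
    for line in get_pods_output.splitlines():
        cols = line.split()
        if len(cols) < 4 or cols[3] not in STUCK_STATES:
            continue  # header row, healthy pod, or unrelated line
        namespace = cols[0]
        # Assumption: prefix match covers namespaces such as vmware-system-*.
        if namespace == "kube-system" or namespace.startswith("vmware-system"):
            found.append((namespace, cols[1], cols[3]))
    return found


def missing_sandbox_image(describe_output):
    """Return the image named in the sandbox error, or None if absent.

    Expects the text printed by `kubectl describe pod`.
    """
    match = SANDBOX_IMAGE_RE.search(describe_output)
    return match.group(1) if match else None
```

Feed stuck_pods() the output of "kubectl get pods -A", then pass the "kubectl describe pod" output for each returned pod to missing_sandbox_image(). A non-None result is the pause image reference to look for on the node.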

 

While connected over SSH to the affected node(s), the following symptoms are present:

  • The docker registry container is not running or was recently restarted.
  • The output of "crictl images" does not contain an entry for the above pause image version.
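The image check on the node can be scripted when several nodes need to be inspected. This is a minimal sketch that assumes the usual "crictl images" table, where the first column is the repository and the second is the tag. The function name is hypothetical, and the image reference passed in must include a tag.

```python
def has_image(crictl_images_output, image_ref):
    """Return True if image_ref (repository:tag) is listed in the output.

    Expects the text printed by `crictl images` on the node.
    Assumption: columns are IMAGE, TAG, IMAGE ID, SIZE, in that order.
    """
    # Split on the last colon so a registry port (localhost:5000) stays
    # part of the repository. The reference must therefore carry a tag.
    repo, _, tag = image_ref.rpartition(":")
    for line in crictl_images_output.splitlines():
        cols = line.split()
        if len(cols) >= 2 and cols[0] == repo and cols[1] == tag:
            return True
    return False
```

Call has_image() with the captured "crictl images" output and the pause image reference taken from the pod error message. A False result matches the symptom described above.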

Environment

vSphere with Tanzu 7.0

vSphere with Tanzu 8.0

This can occur on a vSphere Kubernetes cluster whether or not it is managed by Tanzu Mission Control (TMC).

Cause

Containerd cannot start vmware-system or kube-system pods without an available pause image on the affected node.

The missing pause image was garbage collected as a result of a known Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/81756

This issue can impact any number of nodes in the affected cluster.

Resolution

Please open a ticket with VMware by Broadcom Technical Support, referencing this KB, for assistance in recovering the pause image and adding a label that prevents garbage collection of the affected images.

Note that because the fix is a label that protects the images from garbage collection, the label must be added to the corresponding pause and docker registry images on every node in the cluster, including any nodes created in the future.

Additional Information

A fix for this issue was implemented in vSphere 8.0 TKRs beginning in 1.32:
https://github.com/kubernetes-sigs/image-builder/pull/1373/files