vSphere Supervisor Workload Cluster Pods are stuck in ContainerCreating or CrashLoopBackOff due to Missing vmware Pause Image

Article ID: 379856

Products

VMware vSphere 7.0 with Tanzu

VMware vSphere with Tanzu

vSphere with Tanzu

Tanzu Kubernetes Runtime

Issue/Introduction

System pods in a vSphere Supervisor workload cluster fail to run and remain stuck in the CrashLoopBackOff or ContainerCreating state.

As a result, newly created system pods in the workload cluster do not start, and an upgrade of the workload cluster can hang or fail to progress.

 

While connected to the affected workload cluster's context, the following symptoms are present:

  • System pods are stuck in ContainerCreating, CrashLoopBackOff, or Exited state:
    • kubectl get pods -A | grep -v Run
  • Describing one of the above pods shows an error message similar to the following. The pause image version varies by environment, but in each case the pod is trying to pull the image from the node's local docker-registry process:
    • kubectl describe pod <pod-name> -n <pod-namespace>
    • Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost:5000/vmware/pause:#.#": failed to pull image "localhost:5000/vmware/pause:#.#": failed to pull and unpack image "localhost:5000/vmware/pause:#.#": failed to resolve reference "localhost:5000/vmware/pause:#.#": failed to do request: Head "http://localhost:5000/v2/vmware.io/": dial tcp 127.0.0.1:5000: connect: connection refused
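
The two checks above can be combined into small filters. The following is a minimal sketch, not an official VMware procedure: the function names list_unhealthy_pods and has_pause_pull_error are hypothetical helpers, the STATUS column position assumes the default kubectl get pods -A output, and unlike grep -v Run this filter also hides Completed pods.

```shell
#!/bin/sh
# Hypothetical helpers (not part of kubectl) for the symptom checks above.

# Reads `kubectl get pods -A` output on stdin and prints the header plus
# every pod whose STATUS column (4th field) is not Running or Completed.
list_unhealthy_pods() {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Reads `kubectl describe pod` output on stdin and succeeds (exit 0) if it
# contains a sandbox pause image failure like the one quoted in this article.
has_pause_pull_error() {
  grep -Eq 'failed to get sandbox image "[^"]*/pause:'
}

# Example usage against the affected workload cluster context:
#   kubectl get pods -A | list_unhealthy_pods
#   kubectl describe pod <pod-name> -n <pod-namespace> | has_pause_pull_error \
#     && echo "pause image pull failure found"
```

Both helpers only read text from stdin, so they can also be run against saved command output from a support bundle.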

 

While connected via SSH to the affected node(s), one or more of the following symptoms are present:

  • The docker registry container is not running or was recently restarted:
    • crictl ps -a | grep docker
  • The pause image named in the above error is not present on the node:
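
The node-side checks could be scripted roughly as follows. This is a sketch under stated assumptions: crictl images is used to look for the pause image because this article does not show the exact command, the helper names are hypothetical, and the "/pause" match string is inferred from the error message above.

```shell
#!/bin/sh
# Hypothetical helpers for checks run over SSH on an affected node.

# Reads `crictl ps -a` output on stdin and prints docker registry container
# rows, so the STATE and CREATED columns can be inspected for restarts.
registry_containers() {
  grep -i 'docker' || echo "no docker registry container listed"
}

# Reads `crictl images` output on stdin (assumed command, see lead-in) and
# reports whether any pause image is still present on the node.
pause_image_status() {
  if grep -q '/pause'; then
    echo "pause image present"
  else
    echo "pause image missing"
  fi
}

# Example usage on the node:
#   crictl ps -a  | registry_containers
#   crictl images | pause_image_status
```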

Environment

vSphere with Tanzu 7.0

vSphere with Tanzu 8.0

This can occur on a vSphere Kubernetes cluster whether or not it is managed by Tanzu Mission Control (TMC).

Cause

Containerd cannot start vmware-system or kube-system pods without an available pause image on the affected node.

The missing pause image was garbage collected because of a known Kubernetes issue related to DiskPressure on the node: https://github.com/kubernetes/kubernetes/issues/81756
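
Because the garbage collection is tied to DiskPressure, it can help to see which nodes currently report that condition. The sketch below is an illustration rather than a step from this article: the helper names are hypothetical, and a node may no longer report DiskPressure once images have already been garbage collected, so a clean result does not rule out this issue.

```shell
#!/bin/sh
# Prints NAME and DiskPressure status for every node using kubectl's
# custom-columns output with a JSONPath filter on the node conditions.
# (Hypothetical helper name; run against the workload cluster context.)
node_disk_pressure() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,DISK_PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status'
}

# Reads that two-column output on stdin and prints only the names of nodes
# whose DiskPressure condition is currently True.
nodes_under_pressure() {
  awk 'NR > 1 && $2 == "True" { print $1 }'
}

# Example usage:
#   node_disk_pressure | nodes_under_pressure
```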

This issue can impact any number of nodes in the affected cluster.

Existing pods that started successfully while the pause image was still present are unaffected by this issue until they are recreated.

Resolution

Please open a ticket with VMware by Broadcom Technical Support, referencing this KB, for assistance in recovering the pause image and adding a label that prevents garbage collection of the affected images.
NOTE: Because this is a potential garbage collection issue that is fixed by adding a label, every node in the cluster (including any nodes added in the future) needs the label added to the corresponding pause and docker registry images.

Additional Information

A fix for this issue was implemented in TKR v1.32 for vSphere 8.0 and all later TKR versions.

The affected cluster still shows a Ready status of True despite the missing pause image, because the health checks find that the CNI (Antrea or Calico) is Healthy.