Control Plane Failures Due to Missing Local Pause Container Image

Article ID: 398499

Products

VMware vSphere Kubernetes Service

Issue/Introduction

In a vSphere Kubernetes Service cluster, core control plane components (including etcd, kube-apiserver, kube-scheduler, and antrea) fail to start because a required container image is missing. Affected pods log the following error:

Failed to do request: Head "http://localhost:5000/v2/vmware.io/pause/manifests/<version>": dial tcp <localhost IP>:5000: connect: connection refused

Additionally, reviewing the kubelet logs (journalctl -xeu kubelet) may reveal the following error:

Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to get sandbox image \"localhost:5000/vmware.io/pause:<version>\": failed to pull image.

These errors indicate that the container runtime is attempting to resolve a locally tagged image (localhost:5000/vmware.io/pause:<version>) that is no longer available.
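
To confirm the diagnosis on a node, the image reference can be extracted from the kubelet log and compared with the local image store. The following is a minimal sketch, not an official procedure: the function names are hypothetical helpers, and it assumes the node exposes journalctl and the ctr client.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from the product.

# Print the first locally tagged pause image reference found on stdin.
pause_ref_from_log() {
  grep -oE 'localhost:5000/vmware\.io/pause:[A-Za-z0-9._-]+' | head -n 1
}

# Print only the tag of an image reference (the <version> placeholder).
pause_version() {
  printf '%s\n' "${1##*:}"
}

# Succeed if the reference in $1 appears in an image listing read from stdin.
image_listed() {
  grep -qF -- "$1"
}
```

On an affected node, usage might look like this (assuming sudo access):

    ref=$(sudo journalctl -u kubelet --no-pager | pause_ref_from_log)
    sudo ctr -n k8s.io images ls -q | image_listed "$ref" || echo "missing: $ref"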

 

Environment

VMware vSphere with Tanzu

VMware vSphere Kubernetes Service

Cause

The pause:<version> image is designed to be loaded directly from the node's local image cache rather than downloaded over the network. Probable causes for its disappearance include:

  • Unexpected Shutdown: A sudden power loss or system crash can interrupt the node while it is saving data. If the system restarts before the image files are permanently written to the storage drive, the local image might be lost.

  • Automated Disk Cleanup: The kubelet continuously monitors available disk space on the node. If the storage drive becomes too full, it starts an automated cleanup process (image garbage collection) that deletes unused images to free up room, and this cleanup can remove the pause image. See "Kubelet garbage-collects the pause image during routine, #81756".
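
To judge whether disk cleanup is a plausible cause, disk usage can be compared with the kubelet image garbage collection threshold. This is a rough sketch under two assumptions: container images live under /var/lib/containerd (the upstream containerd default), and the node uses the upstream kubelet default of 85 percent for imageGCHighThresholdPercent. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Print the used-space percentage (digits only) of the filesystem holding $1.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage $1 is at or above threshold $2 (both integer percentages).
over_gc_threshold() {
  [ "$1" -ge "$2" ]
}
```

For example:

    used=$(disk_used_pct /var/lib/containerd)
    over_gc_threshold "$used" 85 && echo "disk usage ${used}% is at or above the assumed GC threshold"

For the unexpected-shutdown case, journalctl --list-boots shows when the node was last restarted.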

 

Resolution

  1. Establish an SSH session into the affected control plane node(s).

  2. Pull the required container image by executing the following command (replace <version> with the version noted in the error logs):
    sudo crictl pull registry.k8s.io/pause:<version>

  3. Re-tag the image locally in the k8s.io containerd namespace by executing the following command:
    sudo ctr -n k8s.io image tag registry.k8s.io/pause:<version> localhost:5000/vmware.io/pause:<version>
    OR, if the crictl build on the node provides an image tag subcommand (upstream crictl does not):
    sudo crictl -n k8s.io image tag registry.k8s.io/pause:<version> localhost:5000/vmware.io/pause:<version>

  4. Verify that the affected pods restart and services recover.

  5. If the control plane containers (such as etcd or kube-apiserver) do not automatically recover after the image is retagged, restart the container runtime and kubelet services by executing the following commands:
    sudo systemctl restart containerd
    sudo systemctl restart kubelet
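
The steps above can be sketched as a small script. It is an illustration, not a supported tool: the function names and the DRY_RUN switch are hypothetical, and the tag step uses ctr in the k8s.io namespace. With DRY_RUN=1 the commands are only printed, which allows a review before anything runs on the node.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; DRY_RUN is a switch invented for this sketch.

# Build the upstream and local image references for a given pause version.
upstream_ref() { printf 'registry.k8s.io/pause:%s\n' "$1"; }
local_ref()    { printf 'localhost:5000/vmware.io/pause:%s\n' "$1"; }

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Steps 2 and 3: pull the upstream image, then tag it with the local name.
restore_pause() {
  run sudo crictl pull "$(upstream_ref "$1")" &&
  run sudo ctr -n k8s.io image tag "$(upstream_ref "$1")" "$(local_ref "$1")"
}

# Step 4: list images in the k8s.io namespace and the running containers.
verify_pause() {
  run sudo ctr -n k8s.io images ls -q &&
  run sudo crictl ps
}

# Step 5: restart the container runtime, then the kubelet.
restart_runtime() {
  run sudo systemctl restart containerd &&
  run sudo systemctl restart kubelet
}
```

For example, DRY_RUN=1 followed by restore_pause <version> prints the pull and tag commands for that version. Unsetting DRY_RUN executes them.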

Additional Information

vSphere Supervisor Workload Cluster Pods are stuck in ContainerCreating or CrashLoopBackOff due to Missing vmware Pause Image