After clearing space from a full root partition, Pods display as NotReady and core control plane pods (e.g., etcd, kube-apiserver) are stuck in Pending, ContainerCreating, or CrashLoopBackOff

Article ID: 437844

Products

VMware vSphere Kubernetes Service

Issue/Introduction

etcd may report one node as isolated.

After connecting to that node, you may see errors such as the following:

systemctl status kubelet:
"RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc
"Failed to create sandbox for pod" err="rpc error: code = Unknown desc = fa
"CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc =
"Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"capi-cont

journalctl -u containerd:
failed, error" error="failed to get sandbox image \"docker.io/vmware/pause:1.28.3\": failed to pull image \"docker.io/vmware/pause:1.28.3\": failed to pull and unpack image \"docker.io/vmware/pause:1.28.3\": failed to resolve reference \"docker.io/vmware/pause:1.28.3\": failed to do request: Head \"https://registry-1.docker.io/v2/vmware/pause/manifests/1.28.3\": read tcp read: connection reset by peer"

A curl request from the broken node to the registry also fails, which means the image cannot be pulled manually:
curl -v https://<registry from errors>
* Recv failure: Connection reset by peer
* OpenSSL SSL_connect: Connection reset by peer in connection to <registry>
* Closing connection 0
curl: (35) Recv failure: Connection reset by peer
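
To confirm exactly which image reference has to be restored, the reference can be extracted from the containerd errors instead of being read by eye. The following is a minimal sketch: the function name `extract_sandbox_images` is hypothetical, and it assumes the log lines use the `failed to get sandbox image \"<reference>\"` wording shown in the excerpt above.

```shell
# Hypothetical helper (not part of any VMware tooling).
# Reads log text on stdin and prints each unique image reference that
# appears in a "failed to get sandbox image" error.
extract_sandbox_images() {
  # Match the phrase, an optional backslash, the opening quote, and the
  # reference itself; then strip everything up to and including the quote.
  grep -oE 'failed to get sandbox image \\?"[^"\\]+' \
    | sed 's/.*"//' \
    | sort -u
}

# Possible usage on the affected node:
#   journalctl -u containerd --no-pager | extract_sandbox_images
```

The printed reference is the value to use wherever the resolution steps call for the exact image tag.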

Environment

vSphere Kubernetes Service

Cause

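Missing pause image: When disk usage exceeds 85%, the kubelet deletes cached images to reclaim space, and the pause image can be among them. Air-gapped or network-isolated nodes cannot re-download it from external registries, so new pod sandboxes cannot be created even after disk space has been freed.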
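
To check whether a node is still above the cleanup threshold, the usage of the filesystem can be compared against it. This is only a sketch: the helper names `fs_used_percent` and `above_gc_threshold` are hypothetical, the default of 85 is taken from this article, and a cluster may set a different value through the kubelet `imageGCHighThresholdPercent` setting. It also assumes the container images live on the root partition.

```shell
# Hypothetical helper: print the used percentage (digits only) of the
# filesystem that holds the given path, using POSIX-format df output.
fs_used_percent() {
  df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when the usage ($1) is strictly above the
# threshold ($2). The threshold defaults to 85, the figure in this article.
above_gc_threshold() {
  [ "$1" -gt "${2:-85}" ]
}

# Possible usage on a node:
#   if above_gc_threshold "$(fs_used_percent /)"; then
#     echo "root partition is above the image cleanup threshold"
#   fi
```

If the node is still above the threshold, a restored image may be removed again, so it is worth freeing more space before restoring it.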

Resolution

  1. Export the image from a healthy control plane node:
    SSH into an unaffected node where the image is still present. Use the ctr utility to
    export the exact image tag referenced in the error logs.
    ctr -n k8s.io images export pause.tar docker.io/vmware/pause:<version>

  2. Transfer the file to the broken node:
    scp pause.tar root@<broken node ip>:/root/

  3. Import the image on the broken node:
    SSH into the failing node and import the image into the k8s.io namespace.
    ctr -n k8s.io images import pause.tar

  4. Restart the container services:
    systemctl restart containerd
    systemctl restart kubelet
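
Before exporting in step 1, and again after importing in step 3, it can help to confirm that the exact image reference is present in the k8s.io namespace. The sketch below assumes the list of references comes from `ctr -n k8s.io images ls -q`; the function name `has_image_ref` is hypothetical.

```shell
# Hypothetical helper: succeed when the exact image reference given as $1
# appears as a whole line in the list of references read from stdin.
has_image_ref() {
  # -F: treat the reference as a literal string, not a regex
  # -x: require the whole line to match, so a different tag does not pass
  # -q: print nothing, report the result through the exit status only
  grep -Fxq -- "$1"
}

# Possible usage (replace <version> with the tag from the error logs):
#   ctr -n k8s.io images ls -q | has_image_ref "docker.io/vmware/pause:<version>" \
#     && echo "pause image present"
```

Matching the whole line keeps a different tag of the same image from being mistaken for the one the kubelet is asking for.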