containerd failing with error messages "Failed to run CRI service" and "failed to recover state"

Article ID: 313100


Updated On:

Products

VMware vSphere ESXi, VMware vSphere with Tanzu

Issue/Introduction

Symptoms:

Pods are not running on the node, and "crictl ps" reports an error:

# crictl ps
connect: connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded

 

The containerd logs show that the CRI service is not running:

level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve sandbox name
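The containerd logs can be reviewed directly on the node with journalctl (assuming the node uses systemd), for example:

journalctl -u containerd --no-pager | grep -i cri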

 


Environment

VMware vSphere 7.0 with Tanzu

Cause

There is corrupt data in the containerd directory, and this is preventing the CRI service from starting.

Resolution

See the workaround below.

Workaround:
The following steps can be executed to resolve the issue.
The easiest option is to recreate the node as shown below; however, this is not recommended if you are experiencing the problem on all Control Plane nodes, as it can result in data loss.
kubectl config use-context <Supervisor Cluster>
kubectl get machine -n <namespace>
kubectl delete machine <Machine Name> -n <namespace>
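For example, with hypothetical namespace and machine names (substitute the values returned by the get command):

kubectl get machine -n demo-namespace
kubectl delete machine demo-cluster-control-plane-abcde -n demo-namespace

The Supervisor Cluster will then provision a replacement node for the deleted machine.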


If you are experiencing the problem on all Control Plane nodes, then execute the steps below to restore containerd.

Pause cluster reconciliation
kubectl config use-context <Supervisor Cluster>
kubectl patch cluster <Workload Cluster> -n <namespace> --type merge -p '{"spec":{"paused": true}}'
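To confirm reconciliation is paused, the paused flag on the Cluster object can be checked, for example:

kubectl get cluster <Workload Cluster> -n <namespace> -o jsonpath='{.spec.paused}'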




Clean up the containerd directory
Stop containerd and kubelet
systemctl stop containerd
systemctl status containerd
systemctl stop kubelet 
systemctl status kubelet


Move the existing /var/lib/containerd directory aside and recreate it
mv /var/lib/containerd /var/lib/containerd_old
mkdir /var/lib/containerd
chmod 711 /var/lib/containerd 
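Optionally, verify that the original data is preserved and the new empty directory is in place:

ls -ld /var/lib/containerd /var/lib/containerd_old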


Reboot the VM to remove any stale containerd processes
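For example, from a root shell on the node:

reboot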

Stop kubelet
systemctl stop kubelet 
systemctl status kubelet


Restore the pause and docker-registry images.
The images will need to be exported from another cluster node:
ctr -n k8s.io images list | grep -e pause -e docker-registry
ctr -n k8s.io images export pause.tar <Pause Image Ref>
ctr -n k8s.io images export docker-registry.tar <Docker Registry Image Ref>

Copy these tarballs to the impacted cluster node (for example with scp, as shown below) and import them locally:
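A minimal copy example, assuming SSH access between the nodes (the user and address are placeholders):

scp pause.tar docker-registry.tar <user>@<impacted node IP>:/root/

Run the import commands from the directory containing the tarballs.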
ctr -n k8s.io images import pause.tar
ctr -n k8s.io images import docker-registry.tar
ctr -n k8s.io images list


Restart kubelet
systemctl start kubelet 
systemctl status kubelet

Confirm the docker-registry and Kubernetes pods have restarted
crictl ps
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get pods -A


Repeat the above steps on the remaining cluster nodes.

Resume cluster reconciliation once all nodes have been restored
kubectl config use-context <Supervisor Cluster>
kubectl patch cluster <Workload Cluster> -n <namespace> --type merge -p '{"spec":{"paused": false}}'
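To confirm the workload cluster is reconciling again, the Cluster and Machine objects can be checked from the Supervisor context, for example:

kubectl get cluster <Workload Cluster> -n <namespace>
kubectl get machine -n <namespace>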