vSphere Kubernetes Cluster Node Containerd Failing with error messages "Failed to run CRI service" and "Failed to recover state"

Article ID: 313100

Updated On:

Products

VMware vSphere with Tanzu
VMware vSphere 7.0 with Tanzu

Issue/Introduction

While connected to the affected vSphere Kubernetes cluster context, the following symptoms are present:

  • Pods are unhealthy or not running on the affected node:
    • kubectl get pods -A -o wide | grep <affected node name>
  • The affected node is in NotReady state:
    • kubectl get nodes
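  • Example output for the NotReady symptom (the node name, age, and version below are illustrative and will differ per environment):
    • NAME                     STATUS     ROLES    AGE   VERSION
    • gc-workers-abc12-xyz34   NotReady   <none>   45d   v1.22.9+vmware.1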

 

While connected via SSH directly to the affected node, the following symptoms are present:

  • Checking for containers running on the node returns the following error message:
    • crictl ps
    • connect: connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
  • The logs for the containerd system service show error messages similar to the following, indicating that the CRI service is not running:
    • journalctl -xeu containerd
    • level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve sandbox name"


Environment

VMware vSphere 7.0 with Tanzu

Cause

Corrupted data in the containerd data directory (/var/lib/containerd) is preventing the CRI service from starting.

Without a functioning CRI service, pods and containers cannot run on the affected node.

Resolution

NOTE: This KB is tailored for containerd and CRI service issues on a vSphere Kubernetes cluster, also known as a guest cluster.

If this KB article's symptoms match container issues in the Supervisor cluster, please reach out to VMware by Broadcom Support referencing this KB article for assistance in fixing the Supervisor cluster's containerd.

 

IMPORTANT: The following steps require that another node in the guest cluster is in a healthy state. If there are no nodes in a healthy state, please reach out to VMware by Broadcom Support referencing this KB article.

 

The containerd data directory will need to be recreated on the affected node within the guest cluster.

  1. Connect to the Supervisor cluster context as root
  2. Pause the affected guest cluster:
    • kubectl patch cluster <affected cluster> -n <affected cluster namespace> --type merge -p '{"spec":{"paused": true}}'
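    • Optionally, verify that the pause took effect; the following check should print "true":
    • kubectl get cluster <affected cluster> -n <affected cluster namespace> -o jsonpath='{.spec.paused}'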
  3. Identify the affected node's IP address:
    • kubectl get vm -o wide -n <affected cluster namespace>
  4. SSH into the affected node as breakglass user vmware-system-user, using the IP address identified in the previous step:
    • ssh vmware-system-user@<affected node IP>
  5. Establish root privileges:
    • sudo su
  6. Stop system services for containerd and kubelet:
    • systemctl stop containerd
    • systemctl stop kubelet
  7. Confirm containerd and kubelet services were stopped:
    • systemctl status containerd
    • systemctl status kubelet
  8. Take a backup of the containerd directory:
    • mv /var/lib/containerd /var/lib/containerd_old
  9. Create a new containerd directory and grant it proper permissions:
    • mkdir /var/lib/containerd
    • chmod 711 /var/lib/containerd
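    • Optionally, verify the new directory (expected mode drwx--x--x, owned by root):
    • ls -ld /var/lib/containerd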
  10. Reboot the affected node to clean up any potentially stale containerd processes:
    • reboot now
  11. From the Supervisor cluster, identify a healthy node's IP address in the guest cluster:
    • kubectl get vm -o wide -n <affected cluster namespace>
  12. SSH into the healthy node as breakglass user vmware-system-user
  13. Establish root privileges:
    • sudo su
  14. Locate the docker-registry and pause images on the healthy node:
    • ctr -n k8s.io images list | grep -e pause -e docker-registry
  15. Export the docker-registry and pause images as tar files on the healthy node:
    • ctr -n k8s.io images export pause.tar <Pause Image Ref>
    • ctr -n k8s.io images export docker-registry.tar <Docker Registry Image Ref>
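    • As a convenience, the image references can be captured into shell variables first; this sketch assumes the grep patterns match only the images found in step 14:
    • PAUSE_REF=$(ctr -n k8s.io images list -q | grep pause | head -1)
    • REGISTRY_REF=$(ctr -n k8s.io images list -q | grep docker-registry | head -1)
    • ctr -n k8s.io images export pause.tar "$PAUSE_REF"
    • ctr -n k8s.io images export docker-registry.tar "$REGISTRY_REF"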
  16. Move the above tar files onto the affected node.
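    • For example, using scp from the healthy node (the destination path is illustrative):
    • scp pause.tar docker-registry.tar vmware-system-user@<affected node IP>:/home/vmware-system-user/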
  17. SSH back into the affected node as vmware-system-user and re-establish root privileges:
    • sudo su
  18. Stop the kubelet system service:
    • systemctl stop kubelet
    • systemctl status kubelet
  19. Import the tar files as images on the affected node:
    • ctr -n k8s.io images import pause.tar
    • ctr -n k8s.io images import docker-registry.tar
  20. Confirm that the images were imported successfully:
    • ctr -n k8s.io images list
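    • Both images should appear when the output is filtered as in step 14:
    • ctr -n k8s.io images list | grep -e pause -e docker-registry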
  21. Start kubelet system service:
    • systemctl start kubelet
    • systemctl status kubelet
  22. Confirm that containers are now running on the affected node:
    • crictl ps
  23. Repeat the above steps for each node affected by this containerd issue.
  24. Once the missing images have been imported on all affected nodes, unpause the guest cluster from the Supervisor cluster:
    • kubectl patch cluster <affected cluster> -n <affected cluster namespace> --type merge -p '{"spec":{"paused": false}}'
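
Once the cluster is unpaused, recovery can be confirmed from the affected guest cluster context; this is a minimal check mirroring the symptoms above:

  • kubectl get nodes
  • kubectl get pods -A -o wide | grep <affected node name>

The previously affected nodes should report a Ready status, and their pods should return to a Running state.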