vSphere Kubernetes Cluster Node Containerd Failing with error messages "Failed to run CRI service" and "Failed to recover state"

Article ID: 313100

Updated On:

Products

VMware vSphere with Tanzu
VMware vSphere 7.0 with Tanzu

Issue/Introduction

While connected to the affected vSphere Kubernetes cluster context, the following symptoms are present:

  • Pods are unhealthy or not running on the affected node:
    • kubectl get pods -A -o wide | grep <affected node name>
  • The affected node is in NotReady state:
    • kubectl get nodes
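  • Example output for the NotReady symptom (the node name, age, and version below are illustrative and will differ per environment):
    • NAME                     STATUS     ROLES    AGE   VERSION
    • gc-workers-abc12-xyz34   NotReady   <none>   45d   v1.22.9+vmware.1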

 

While connected via SSH directly to the affected node, the following symptoms are present:

  • Checking for containers running on the node returns the following error message:
    • crictl ps
    • connect: connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
  • The logs for the containerd system service show error messages similar to the following, indicating that the CRI service is not running:
    • journalctl -xeu containerd
    • level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve sandbox name"


Environment

VMware vSphere 7.0 with Tanzu

Cause

Corrupted data in the containerd data directory (/var/lib/containerd) is preventing the CRI service from starting.

Without a functioning CRI service, pods and containers cannot run on the affected node.

Resolution

NOTE: This KB is tailored for containerd and CRI service issues on a vSphere Kubernetes cluster, also known as a guest cluster.

If this KB article's symptoms match container issues in the Supervisor cluster, please reach out to VMware by Broadcom Support referencing this KB article for assistance in fixing the Supervisor cluster's containerd.

 

IMPORTANT: The following steps require that another node in the guest cluster is in a healthy state. If there are no nodes in a healthy state, please reach out to VMware by Broadcom Support referencing this KB article.

 

The containerd data directory will need to be recreated on the affected node within the guest cluster.

  1. Connect to the Supervisor cluster context as root
  2. Pause the affected guest cluster:
    • kubectl patch cluster <affected cluster> -n <affected cluster namespace> --type merge -p '{"spec":{"paused": true}}'
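    • Optionally, verify that the pause took effect; the following check should print "true":
    • kubectl get cluster <affected cluster> -n <affected cluster namespace> -o jsonpath='{.spec.paused}'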
  3. Identify the affected node's IP address:
    • kubectl get vm -o wide -n <affected cluster namespace>
  4. SSH into the affected node as breakglass user vmware-system-user, using the IP address identified in the previous step:
    • ssh vmware-system-user@<affected node IP>
  5. Establish root privileges:
    • sudo su
  6. Stop system services for containerd and kubelet:
    • systemctl stop containerd
    • systemctl stop kubelet
  7. Confirm containerd and kubelet services were stopped:
    • systemctl status containerd
    • systemctl status kubelet
  8. Take a backup of the containerd directory:
    • mv /var/lib/containerd /var/lib/containerd_old
  9. Create a new containerd directory and grant it proper permissions:
    • mkdir /var/lib/containerd
    • chmod 711 /var/lib/containerd
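    • Optionally, verify the new directory (expected mode drwx--x--x, owned by root):
    • ls -ld /var/lib/containerd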
  10. Reboot the affected node to clean up any potentially stale containerd processes:
    • reboot now
  11. From the Supervisor cluster, identify a healthy node's IP address in the guest cluster:
    • kubectl get vm -o wide -n <affected cluster namespace>
  12. SSH into the healthy node as breakglass user vmware-system-user
  13. Establish root privileges:
    • sudo su
  14. Locate the docker-registry and pause images on the healthy node:
    • ctr -n k8s.io images list | grep -e pause -e docker-registry
  15. Export the docker-registry and pause images as tar files on the healthy node:
    • ctr -n k8s.io images export pause.tar <Pause Image Ref>
    • ctr -n k8s.io images export docker-registry.tar <Docker Registry Image Ref>
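    • As a convenience, the image references can be captured into shell variables first; this sketch assumes the grep patterns match only the images found in step 14:
    • PAUSE_REF=$(ctr -n k8s.io images list -q | grep pause | head -1)
    • REGISTRY_REF=$(ctr -n k8s.io images list -q | grep docker-registry | head -1)
    • ctr -n k8s.io images export pause.tar "$PAUSE_REF"
    • ctr -n k8s.io images export docker-registry.tar "$REGISTRY_REF"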
  16. Move the above tar files onto the affected node.
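    • For example, using scp from the healthy node (the destination path is illustrative):
    • scp pause.tar docker-registry.tar vmware-system-user@<affected node IP>:/home/vmware-system-user/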
  17. SSH back into the affected node as vmware-system-user and re-establish root privileges:
    • sudo su
  18. Stop the kubelet system service:
    • systemctl stop kubelet
    • systemctl status kubelet
  19. Import the tar files as images on the affected node:
    • ctr -n k8s.io images import pause.tar
    • ctr -n k8s.io images import docker-registry.tar
  20. Confirm that the images were imported successfully:
    • ctr -n k8s.io images list
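    • Both images should appear when the output is filtered as in step 14:
    • ctr -n k8s.io images list | grep -e pause -e docker-registry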
  21. Start kubelet system service:
    • systemctl start kubelet
    • systemctl status kubelet
  22. Confirm that containers are now running on the affected node:
    • crictl ps
  23. Repeat the above steps for each node affected by this containerd issue.
  24. Once the missing images have been imported on all affected nodes, unpause the guest cluster from the Supervisor cluster:
    • kubectl patch cluster <affected cluster> -n <affected cluster namespace> --type merge -p '{"spec":{"paused": false}}'
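
Once the cluster is unpaused, recovery can be confirmed from the affected guest cluster context; this is a minimal check mirroring the symptoms above:

  • kubectl get nodes
  • kubectl get pods -A -o wide | grep <affected node name>

The previously affected nodes should report a Ready status, and their pods should return to a Running state.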