While connected to the affected vSphere Kubernetes cluster context, the following symptoms are present:

Pods scheduled to the affected node are not running:
kubectl get pods -A -o wide | grep <affected node name>

The affected node may report a NotReady status:
kubectl get nodes
While SSH'd directly into the affected node, the following symptoms are present:
Listing containers fails because crictl cannot reach the containerd socket:
crictl ps
connect: connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
The containerd service logs report a fatal CRI error:
journalctl -xeu containerd
level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve sandbox name"
Corrupted data in the containerd directory is preventing the CRI service from starting.
Without a functioning CRI service, pods and containers cannot run on the affected node.
NOTE: This KB article is tailored for containerd and CRI service issues on a vSphere Kubernetes cluster, also known as a guest cluster.
If these symptoms instead match container issues in the Supervisor cluster, please reach out to VMware by Broadcom Support, referencing this KB article, for assistance in fixing the Supervisor cluster's containerd.
IMPORTANT: The following steps require that another node in the guest cluster is in a healthy state. If there are no nodes in a healthy state, please reach out to VMware by Broadcom Support referencing this KB article.
The containerd state directory will need to be recreated on the affected node within the guest cluster.
From the Supervisor cluster context, pause reconciliation of the affected guest cluster:
kubectl patch cluster <affected cluster> -n <affected cluster namespace> --type merge -p '{"spec":{"paused": true}}'
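The pause patch can be sanity-checked before it is applied. A minimal sketch, assuming the hypothetical names my-cluster and my-namespace in place of the affected cluster and its Supervisor namespace (the kubectl lines are shown commented because they require a live Supervisor cluster context):

```shell
# Hypothetical placeholders; substitute the affected cluster name and its
# Supervisor namespace.
CLUSTER=my-cluster
NS=my-namespace

# Merge patch that pauses Cluster API reconciliation for the guest cluster.
PATCH='{"spec":{"paused": true}}'

# Validate the payload is well-formed JSON before applying it:
echo "$PATCH" | python3 -c 'import json,sys; json.load(sys.stdin); print("valid")'

# Apply the patch and confirm the paused flag took effect:
# kubectl patch cluster "$CLUSTER" -n "$NS" --type merge -p "$PATCH"
# kubectl get cluster "$CLUSTER" -n "$NS" -o jsonpath='{.spec.paused}'
```

The jsonpath query should print true once the patch has been applied; the same check with the flag set back to false is useful after the final unpause step.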
While still in the Supervisor cluster context, list the cluster's node VMs and note their IP addresses:
kubectl get vm -o wide -n <affected cluster namespace>
SSH into the affected node and elevate to root:
sudo su
Stop the containerd and kubelet services, then confirm that both have stopped:
systemctl stop containerd
systemctl stop kubelet
systemctl status containerd
systemctl status kubelet
Move the corrupted containerd state directory aside, preserving it for later inspection, and recreate it with the expected permissions:
mv /var/lib/containerd /var/lib/containerd_old
mkdir /var/lib/containerd
chmod 711 /var/lib/containerd
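The move-aside-and-recreate sequence can be sketched generically on a scratch path to show what mode 711 looks like; on the node the real directory is /var/lib/containerd, and containerd and kubelet must already be stopped before it is moved:

```shell
# Local sketch on a temporary path standing in for /var/lib/containerd.
DIR=$(mktemp -d)/containerd
mkdir -p "$DIR"

# Preserve the (corrupted) state directory rather than deleting it:
mv "$DIR" "${DIR}_old"

# Recreate an empty state directory for containerd to repopulate:
mkdir "$DIR"

# Mode 711: full access for root, traverse-only for group and other.
chmod 711 "$DIR"
stat -c '%a' "$DIR"   # prints 711
```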
Reboot the affected node:
reboot now
After the reboot, return to the Supervisor cluster context and identify a healthy node in the same guest cluster:
kubectl get vm -o wide -n <affected cluster namespace>
SSH into the healthy node and elevate to root:
sudo su
Locate the pause and docker-registry images on the healthy node and export each one to a tarball:
ctr -n k8s.io images list | grep -e pause -e docker-registry
ctr -n k8s.io images export pause.tar <Pause Image Ref>
ctr -n k8s.io images export docker-registry.tar <Docker Registry Image Ref>
Copy pause.tar and docker-registry.tar from the healthy node to the affected node (for example, with scp).
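The export-and-transfer hop can be sketched as a short script. The image refs, node IP, and SSH user below are hypothetical stand-ins for the values returned by the images list and get vm commands; the ctr and scp lines are shown commented because they require the live nodes:

```shell
# Run on a HEALTHY node. Substitute the exact refs printed by
# `ctr -n k8s.io images list | grep -e pause -e docker-registry`
# and the affected node's IP from `kubectl get vm -o wide`.
PAUSE_REF='localhost:5000/vmware.io/pause:3.6'                 # hypothetical
REGISTRY_REF='localhost:5000/vmware.io/docker-registry:2.8'    # hypothetical
NODE_IP='10.0.0.12'                                            # hypothetical

# Export each image to a tarball, then copy both archives to the affected
# node so they can be imported there (vmware-system-user is typically the
# SSH user on guest cluster nodes):
#   ctr -n k8s.io images export pause.tar "$PAUSE_REF"
#   ctr -n k8s.io images export docker-registry.tar "$REGISTRY_REF"
#   scp pause.tar docker-registry.tar vmware-system-user@"$NODE_IP":/tmp/
printf 'export %s and %s, then copy to %s\n' "$PAUSE_REF" "$REGISTRY_REF" "$NODE_IP"
```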
Back on the affected node as root, stop the kubelet service and confirm it has stopped:
systemctl stop kubelet
systemctl status kubelet
Import the two image tarballs into containerd and confirm that the images are present:
ctr -n k8s.io images import pause.tar
ctr -n k8s.io images import docker-registry.tar
ctr -n k8s.io images list
Start the kubelet service and confirm it is running:
systemctl start kubelet
systemctl status kubelet
Verify that containerd's CRI service is responding and containers are running:
crictl ps
Finally, from the Supervisor cluster context, resume reconciliation of the guest cluster:
kubectl patch cluster <affected cluster> -n <affected cluster namespace> --type merge -p '{"spec":{"paused": false}}'