containerd tries to stop a container but fails with errors such as:
level=error msg="get state for <CONTAINER ID>" error="context deadline exceeded: unknown"
level=error msg="Failed to handle backOff event &TaskExit{ContainerID:<CONTAINER ID>:<CONTAINER ID>,Pid:43417,ExitStatus:0,ExitedAt:2024-07-15 22:03:15.966586413 +0000 UTC,XXX_unrecognized:[],} for <CONTAINER ID>" error="failed to handle container TaskExit event: failed to stop container: context deadline exceeded: unknown"
The output of "crictl ps" shows the container in the Running state, but checking the PID associated with the container confirms that the container process no longer exists.
crictl ps
crictl inspect <container> | grep -i pid
ps -fe | grep <pid>
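Note that "ps -fe | grep <pid>" also matches the grep process itself and any line that happens to contain the same digits. A more direct check can be sketched as follows, assuming a Linux node with /proc mounted. The function name pid_alive is hypothetical and is not part of crictl or containerd.

```shell
# pid_alive PID: print "running" if a process with that PID exists, else "not found".
pid_alive() {
    pid=$1
    # kill -0 sends no signal; it only tests whether the PID exists and can be signalled.
    # The /proc check covers processes that exist but belong to another user.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo "running"
    else
        echo "not found"
    fi
}

# Example: pid_alive 43417   (the Pid value reported in the TaskExit event)
```

If this reports "not found" for a container that "crictl ps" still lists as Running, the node matches the symptom described here.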
TKGm cluster with containerd
While there can be a number of reasons why a container cannot stop, it has been observed that when the number of open file descriptors reaches the system limit, containerd gets into a state in which it can no longer stop containers.
Check the number of open file descriptors
lsof 2>/dev/null | awk '{print "PID="$2 " COMMAND=" $1}'| uniq -c| sort -n -r | awk '{sum+=$1;}END{print sum;}'
Identify the processes consuming the most file descriptors
lsof 2>/dev/null | awk '{print "PID="$2 " COMMAND=" $1}'| uniq -c| sort -n -r | head -n 20
Check whether the max file limit has been reached by searching the kernel log for "file-max limit". See the Linux sysctl documentation.
dmesg -T
Max file limit can be checked with
sysctl -a | grep file-
If the file limit has been reached, a reboot of the worker node is required.
To prevent the issue from recurring, increase the max file limit to meet the workload requirements.
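As a complement to the lsof counts, the kernel reports its own figures in /proc/sys/fs/file-nr as three fields: allocated file handles, allocated but unused handles, and the maximum (fs.file-max). The following is a minimal sketch that turns those fields into a usage percentage and shows, in comments only, one way the limit might be raised. The function name file_handle_usage, the placeholder NEW_LIMIT, and the file name 99-file-max.conf are assumptions for illustration; an appropriate value depends on the workload.

```shell
# file_handle_usage ALLOCATED UNUSED MAX
# Takes the three fields of /proc/sys/fs/file-nr and prints how much of the
# system-wide limit (fs.file-max) is currently in use.
file_handle_usage() {
    allocated=$1
    max=$3
    echo "allocated=$allocated max=$max used=$(( allocated * 100 / max ))%"
}

# On a Linux node, report the live values.
if [ -r /proc/sys/fs/file-nr ]; then
    # Word splitting of the file contents into three arguments is intentional.
    file_handle_usage $(cat /proc/sys/fs/file-nr)
fi

# Raising the limit (NEW_LIMIT is a placeholder, not a recommended value):
#   sysctl -w fs.file-max=NEW_LIMIT                                  # lasts until the next reboot
#   echo "fs.file-max = NEW_LIMIT" > /etc/sysctl.d/99-file-max.conf  # persists across reboots
#   sysctl --system                                                  # reload all sysctl config files
```

A usage figure close to 100% suggests the node is at or near the limit. On nodes managed by a cluster lifecycle tool, manual changes like these may be lost when the node is recreated, so the setting may also need to be applied through the node template or cluster configuration.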