Containerd fails to stop container with error "failed to handle container TaskExit event: failed to stop container"

Article ID: 373638

Products

VMware Tanzu Kubernetes Grid Management
VMware Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid Plus
VMware Tanzu Kubernetes Grid Plus 1.x
VMware Tanzu Kubernetes Grid 1.x

Issue/Introduction

Containerd tries to stop a container but fails with error:

level=error msg="get state for <CONTAINER ID>" error="context deadline exceeded: unknown"
level=error msg="Failed to handle backOff event &TaskExit{ContainerID:<CONTAINER ID>:<CONTAINER ID>,Pid:43417,ExitStatus:0,ExitedAt:2024-07-15 22:03:15.966586413 +0000 UTC,XXX_unrecognized:[],} for <CONTAINER ID>" error="failed to handle container TaskExit event: failed to stop container: context deadline exceeded: unknown"

The output of "crictl ps" shows the container in the Running state, but checking the PID associated with the container confirms that the container process no longer exists.

crictl ps

crictl inspect <container> | grep -i pid

ps -fe | grep <pid>
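
The last check can also be scripted. The following is a minimal sketch, assuming a Linux node with /proc mounted; pid_is_alive is a hypothetical helper name, and 43417 is simply the Pid value from the sample log above.

```shell
# pid_is_alive (hypothetical helper, not part of crictl):
# returns 0 if a process with the given PID exists, 1 otherwise.
pid_is_alive() {
  # Reject empty or non-numeric input
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  # Every running process has a directory under /proc
  [ -d "/proc/$1" ]
}

# Usage, with the PID reported by "crictl inspect":
#   if pid_is_alive 43417; then echo "process still running"; else echo "process gone"; fi
```

If crictl reports the container as Running while this check reports the process as gone, the node matches the symptom described here.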

Environment

TKGm cluster with containerd

Cause

While there can be a number of reasons why a container cannot be stopped, it has been observed that when the number of open file descriptors reaches the system limit, containerd gets into a state where it cannot stop containers.

Check the number of open file descriptors

lsof 2>/dev/null | awk '{print "PID="$2 " COMMAND=" $1}'| uniq -c| sort -n -r | awk '{sum+=$1;}END{print sum;}'

Identify the processes consuming the most file descriptors

lsof 2>/dev/null | awk '{print "PID="$2 " COMMAND=" $1}'| uniq -c| sort -n -r | head -n 20
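
lsof also lists entries that are not descriptors (for example cwd, txt and mem rows), so its totals can overstate usage. As a cross-check, the descriptors of each process can be counted directly from /proc. This is a minimal sketch, assuming a Linux node and root privileges (without root, processes owned by other users report 0); count_fds and top_fd_consumers are hypothetical helper names.

```shell
# count_fds (hypothetical helper): print the number of open file
# descriptors for one PID, or 0 if the PID cannot be read.
count_fds() {
  echo $(( $(ls "/proc/$1/fd" 2>/dev/null | wc -l) ))
}

# top_fd_consumers (hypothetical helper): print "count pid command"
# for the N processes holding the most descriptors (default 20).
top_fd_consumers() {
  for dir in /proc/[0-9]*; do
    pid="${dir#/proc/}"
    comm=$(cat "$dir/comm" 2>/dev/null)
    echo "$(count_fds "$pid") $pid $comm"
  done | sort -rn | head -n "${1:-20}"
}

# Usage:
#   top_fd_consumers 10
```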

Check whether the max file limit has been reached by searching the kernel log for "file-max limit". See the Linux sysctl documentation.

dmesg -T | grep -i "file-max limit"

The max file limit can be checked with:

sysctl -a | grep file-
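
The fs.file-nr value printed by the command above holds three numbers: allocated file handles, unused handles, and the maximum (the same value as fs.file-max). The sketch below turns those numbers into a percentage of the limit in use. It assumes the three-field format of /proc/sys/fs/file-nr; file_table_usage is a hypothetical helper name.

```shell
# file_table_usage (hypothetical helper): read "allocated unused max"
# from stdin and print the percentage of the limit currently in use.
file_table_usage() {
  read -r allocated unused max
  echo $(( allocated * 100 / max ))
}

# Usage on a node:
#   file_table_usage < /proc/sys/fs/file-nr
```

A result close to 100 suggests the node is about to hit, or has already hit, the limit.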

NOTE: On Photon OS, the default max file limit depends on the amount of RAM: fs.file-max = Total RAM in MB x 100
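
As a worked example of the formula in the note, the sketch below computes the expected default for a given amount of RAM. photon_default_file_max is a hypothetical helper name, and the 8192 MB node size is only an illustration.

```shell
# photon_default_file_max (hypothetical helper): apply the rule
# fs.file-max = Total RAM in MB x 100 from the note above.
photon_default_file_max() {
  echo $(( $1 * 100 ))
}

# A node with 8192 MB of RAM:
photon_default_file_max 8192   # prints 819200

# Usage on a node (MemTotal in /proc/meminfo is reported in kB):
#   photon_default_file_max $(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) / 1024 ))
```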

Resolution

If the file limit has been reached, a reboot of the worker node is required.

To prevent a recurrence of the issue, increase the max file limit to meet the workload requirements.
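
One possible way to raise the limit is a sysctl.d drop-in file. The sketch below only generates the configuration line; the value 2097152 and the file name 99-file-max.conf are placeholders rather than recommendations, and render_file_max_conf is a hypothetical helper name. Whether a manual change survives node re-creation depends on how the cluster is managed, so treat this as an illustration, not the supported procedure.

```shell
# render_file_max_conf (hypothetical helper): print a sysctl.d entry
# for the given limit; fails on empty or non-numeric input.
render_file_max_conf() {
  case "$1" in
    ''|*[!0-9]*) echo "usage: render_file_max_conf <positive integer>" >&2; return 1 ;;
  esac
  printf 'fs.file-max = %s\n' "$1"
}

# Example (run as root on the worker node; value and file name are placeholders):
#   render_file_max_conf 2097152 > /etc/sysctl.d/99-file-max.conf
#   sysctl --system    # reload all sysctl configuration files
```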
