TKGi - Pods stuck in terminating status - KillContainer context deadline exceeded

Article ID: 409504

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  • Pods are stuck in Terminating status for an extended period of time:

    NAMESPACE    NAME         READY     STATUS          RESTARTS        AGE
    ###          ########     0/22      Terminating     0               5d6h

  • Kubelet logs on the worker node (/var/vcap/sys/log/kubelet/kubelet.stderr.log) show errors like the following (here for a weblogic pod):

    E0902 10:01:05.794087 9679 kubelet.go:2032] [failed to "KillContainer" for "weblogic" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPod#######" for "######################################" with KillPod######Error: "rpc error: code = DeadlineExceeded desc = failed to stop container \"######################################\": an error occurs during waiting for container \"######################################\" to be killed: wait container


    and (for a dynatrace-oneagent pod):

    E0902 10:01:54.048022   11090 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"KillContainer\" for \"dynatrace-oneagent\" with KillContainerError: \"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \\\"######################################\\\" to be killed: wait container \\\"######################################\\\": context deadline exceeded\"" pod="dynatrace/dynakube-oneagent-abc12" podUID="########-####-####-####-##################"




  • This condition may lead to TKGI cluster upgrade failures when stopping worker node jobs, specifically the containerd job:

    Task 1234  | 15:32:49 | L stopping jobs: worker/########-####-####-####-########1234 (1) (00:03:56)
                            L Error: Action Failed get_task: Task ########-####-####-####-########5678 result: Stopping Monitored Services: Stopping services '[containerd]' errored

  • The containerd-shim process for the terminating container remains stuck while "containerd_ctl stop" runs against it, leaving a stale containerd task.
    • This can be identified by running the following commands (a consolidated sketch follows this list):

      • crictl ps | grep <pod_name>        # Example: crictl ps | grep dynatrace-oneagent




      • ps -ef | grep containerd-shim


        Example showing the related containerd-shim process for POD ID tc123dq:

        ps -ef  | grep containerd-shim
        root      296150       1  0 Sep09 ?        00:05:30 /var/vcap/data/packages/containerd/78b921b6df42e5acdcefc9d099a31042f680857c/bin/containerd-shim-runc-v2 -namespace k8s.io -id tc123dq97448be4e030270d6073004fd4047c7350797f45a21ce257c0d -address /var/vcap/sys/run/containerd/containerd.sock
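To tie these commands together, below is a minimal sketch that maps a stuck pod to its containerd-shim process and the stale containerd task. The pod name is a placeholder (for example, dynakube-oneagent-abc12 from the log excerpt above), the availability of the ctr binary on the worker's PATH is an assumption, and the socket path is the one shown in the shim command line:

    # Get the short POD ID of the stuck pod; <pod_name> is illustrative
    POD_ID=$(crictl pods | grep <pod_name> | awk '{print $1}')

    # The shim's -id argument begins with this POD ID, so grep for it
    ps -ef | grep containerd-shim | grep "${POD_ID}"

    # Cross-check the stale task directly against containerd (assumes ctr is on the PATH)
    ctr --address /var/vcap/sys/run/containerd/containerd.sock --namespace k8s.io tasks ls | grep "${POD_ID}"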

Environment

The issue is observed on TKGI v1.20 and earlier versions.

Cause

There are known issues with runc-shim and containerd that can cause processes to hang. See GitHub containerd issue #8847.

These fixes for runc-shim are included in v1.7.22 and v1.6.36 of containerd.
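To confirm whether a worker node is running an affected containerd release (older than v1.7.22 on the 1.7 line, or v1.6.36 on the 1.6 line), one option is ctr version, run against the socket path shown in the shim command line above; the exact location of the ctr binary on a TKGI worker may vary:

    # Prints both the ctr client version and the containerd server (daemon) version
    ctr --address /var/vcap/sys/run/containerd/containerd.sock version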

Resolution

If you are seeing pods stuck in Terminating status, upgrade to TKGI v1.21 or later, which includes containerd v1.7.23.

Workaround

Using the crictl ps and ps -ef commands listed in the Issue/Introduction section, identify the process ID (PID) of the stuck containerd-shim process. Once you have the PID, use kill to terminate the process so that containerd can shut down gracefully:

kill -9 296150        # PID of the containerd-shim process from the ps -ef example above
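
After killing the shim, the following sketch verifies the cleanup; the POD ID and pod name are the illustrative values used earlier in this article:

    # Confirm the stuck containerd-shim process is gone
    ps -ef | grep containerd-shim | grep tc123dq

    # Confirm the container is no longer listed (the pod should finish terminating)
    crictl ps -a | grep <pod_name>

With the stale shim removed, the containerd job should be able to stop cleanly, allowing the cluster upgrade to proceed.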