TKGi - Pods stuck in terminating status - KillContainer context deadline exceeded

Article ID: 409504

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  • Pods are stuck in Terminating status for an extended period of time:

    NAMESPACE    NAME         READY     STATUS          RESTARTS        AGE
    ###          ########     0/22      Terminating     0               5d6h

  • Kubelet logs on the worker node (/var/vcap/sys/log/kubelet/kubelet.stderr.log) show errors like the following (here for a weblogic pod):

    E0902 10:01:05.794087 9679 kubelet.go:2032] [failed to "KillContainer" for "weblogic" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPod#######" for "######################################" with KillPod######Error: "rpc error: code = DeadlineExceeded desc = failed to stop container \"######################################\": an error occurs during waiting for container \"######################################\" to be killed: wait container


    and (for a dynatrace-oneagent pod):

    E0902 10:01:54.048022   11090 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"KillContainer\" for \"dynatrace-oneagent\" with KillContainerError: \"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \\\"######################################\\\" to be killed: wait container \\\"######################################\\\": context deadline exceeded\"" pod="dynatrace/dynakube-oneagent-abc12" podUID="########-####-####-####-##################"




  • This condition may lead to TKGI cluster upgrade failures when stopping worker node jobs, specifically the containerd job:

    Task 1234  | 15:32:49 | L stopping jobs: worker/########-####-####-####-########1234 (1) (00:03:56)
                            L Error: Action Failed get_task: Task ########-####-####-####-########5678 result: Stopping Monitored Services: Stopping services '[containerd]' errored

  • The containerd-shim process for the terminating container remains stuck while "containerd_ctl stop" runs against it, leaving a stale containerd task.
    • This can be identified by running the following commands (a consolidated sketch follows this list):

      • crictl ps | grep <pod_name>        # Example: crictl ps | grep dynatrace-oneagent




      • ps -ef | grep containerd-shim


        Example showing the related containerd-shim process for POD ID tc123dq:

        ps -ef  | grep containerd-shim
        root      296150       1  0 Sep09 ?        00:05:30 /var/vcap/data/packages/containerd/78b921b6df42e5acdcefc9d099a31042f680857c/bin/containerd-shim-runc-v2 -namespace k8s.io -id tc123dq97448be4e030270d6073004fd4047c7350797f45a21ce257c0d -address /var/vcap/sys/run/containerd/containerd.sock
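To tie these commands together, below is a minimal sketch that maps a stuck pod to its containerd-shim process and the stale containerd task. The pod name is a placeholder (for example, dynakube-oneagent-abc12 from the log excerpt above), the availability of the ctr binary on the worker's PATH is an assumption, and the socket path is the one shown in the shim command line:

    # Get the short POD ID of the stuck pod; <pod_name> is illustrative
    POD_ID=$(crictl pods | grep <pod_name> | awk '{print $1}')

    # The shim's -id argument begins with this POD ID, so grep for it
    ps -ef | grep containerd-shim | grep "${POD_ID}"

    # Cross-check the stale task directly against containerd (assumes ctr is on the PATH)
    ctr --address /var/vcap/sys/run/containerd/containerd.sock --namespace k8s.io tasks ls | grep "${POD_ID}"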

Environment

The issue is observed on TKGI v1.20 and earlier versions.

Cause

There are known issues with runc-shim and containerd that can cause processes to hang. See GitHub containerd issue #8847.

These fixes for runc-shim are included in v1.7.22 and v1.6.36 of containerd.
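To confirm whether a worker node is running an affected containerd release (older than v1.7.22 on the 1.7 line, or v1.6.36 on the 1.6 line), one option is ctr version, run against the socket path shown in the shim command line above; the exact location of the ctr binary on a TKGI worker may vary:

    # Prints both the ctr client version and the containerd server (daemon) version
    ctr --address /var/vcap/sys/run/containerd/containerd.sock version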

Resolution

If you are seeing pods stuck in Terminating status, upgrade to TKGI v1.21 or later, which includes containerd v1.7.23.

Workaround

Using the crictl ps and ps -ef commands listed in the Issue/Introduction section, identify the process ID (PID) of the stuck containerd-shim process. Once you have the PID, use kill to terminate the process so that containerd can shut down gracefully:

kill -9 296150        # PID of the containerd-shim process from the ps -ef example above
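
After killing the shim, the following sketch verifies the cleanup; the POD ID and pod name are the illustrative values used earlier in this article:

    # Confirm the stuck containerd-shim process is gone
    ps -ef | grep containerd-shim | grep tc123dq

    # Confirm the container is no longer listed (the pod should finish terminating)
    crictl ps -a | grep <pod_name>

With the stale shim removed, the containerd job should be able to stop cleanly, allowing the cluster upgrade to proceed.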