Inaccurate Container Status Reported By Containerd in vSphere Supervisor Workload Cluster


Article ID: 409408


Products

Tanzu Kubernetes Runtime

Issue/Introduction

When viewing containers on a vSphere Supervisor Workload Cluster node, containerd incorrectly reports the status of a container as Running even though its underlying process is no longer present on the node.

This can leave nodes stuck in a Deleting or cordoned (SchedulingDisabled) state because Kubernetes does not recognize that the container has stopped.


From the Workload Cluster context, one or more of the following symptoms are observed:

  • The corresponding pod for the affected container(s) still shows as Running or Terminating:
    kubectl get pods -n <pod namespace> -o wide


  • If the node is stuck in Deleting, it shows as SchedulingDisabled (illustrative output below):
    kubectl get nodes
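
Illustrative output only; the pod and node names below are placeholders, not taken from a real environment:

    kubectl get pods -n <pod namespace> -o wide
    NAME          READY   STATUS        RESTARTS   AGE   NODE
    example-pod   1/1     Terminating   0          12d   workload-node-1

    kubectl get nodes
    NAME              STATUS                     ROLES    AGE   VERSION
    workload-node-1   Ready,SchedulingDisabled   <none>   30d   v1.30.1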


While SSHed directly into the node where the pod and its container(s) are running:

  • The incorrect container status is reported in the output of crictl ps directly on the node (a check to confirm the mismatch follows this list):
    crictl ps


  • Containerd logs report errors similar to the following:
    journalctl -xeu containerd
    
    MON DD HH:MM:SS <node name> containerd[988]: time="MON DD HH:MM:SS.ssssssZ" level=info msg="StopPodSandbox for \"<container ID>\""
    MON DD HH:MM:SS <node name> containerd[988]: time="MON DD HH:MM:SS.ssssssZ" level=error msg="StopPodSandbox for \"<container ID>\" failed" error="rpc error: code = DeadlineExceeded desc = failed to stop container \"<container ID>\": an error occurs during waiting for container \"<container ID>\" to be killed: wait container \"<container ID>\": context deadline exceeded"
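
To confirm that the reported Running status is stale, compare the PID containerd records for the container against the processes actually present on the node. This is a minimal sketch, assuming jq is installed on the node and using <container ID> as a placeholder:

    PID=$(crictl inspect <container ID> | jq -r '.info.pid')
    ps -p "${PID}" || echo "process ${PID} not found; containerd's Running status is stale"

On an affected node, crictl ps still lists the container as Running even though the ps check fails.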


Environment

vSphere Supervisor

vSphere Kubernetes Release (VKR) v1.30.1 or v1.29.4

Containerd v1.6.31
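
The containerd version in use can be confirmed from the CONTAINER-RUNTIME column of kubectl get nodes -o wide, or directly on the node:

    kubectl get nodes -o wide
    containerd --version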

Cause

This is caused by a race condition between containerd's handling of container Exit events and Exec operations; when the two coincide, the container's exit can go unrecorded, leaving its reported status stale.

Reference: https://github.com/containerd/containerd/issues/10589. The regression was introduced in containerd v1.6.29 by https://github.com/containerd/containerd/pull/9927.

As a result, containerd shows the container in a Running state even though its process no longer exists on the node.

This issue is present in the following vSphere Kubernetes Releases (VKRs), which ship containerd v1.6.31:

  • VKR v1.30.1

  • VKR v1.29.4


Resolution

This issue was fixed in Containerd v1.6.36: https://github.com/containerd/containerd/pull/10676

VKRs that include containerd v1.6.36 or later are not affected.

Upgrade the affected workload cluster to a VKR that includes containerd v1.6.36 or later.

The VKR Release Notes list the containerd version included with each release:

VMware VKR Release Notes
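
To see which VKRs are available for upgrade, they can be listed from the Supervisor context; kubectl get tkr is a common short form, though the exact columns vary by release:

    kubectl get tanzukubernetesreleases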


Workaround

Restart containerd directly on the affected node; this causes containerd to re-evaluate and correctly report the status of all containers on the node:

systemctl restart containerd
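
After the restart, the stale container should no longer be reported as Running, and a node that was stuck Deleting or cordoned should be able to proceed. This can be confirmed with the same commands used above:

    crictl ps
    kubectl get nodes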
