kubelet shows "rpc error: code = DeadlineExceeded desc = context deadline exceeded" and pods in Init state

Article ID: 298657


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Issue and Symptoms:

Pods become stuck in Init status

From the output of:
kubectl describe pod XXX 

You may see the following:
Warning  FailedCreatePodSandBox  93s (x8 over 29m)  kubelet, 97011e0a-f47c-4673-ace7-d6f74cde9934  Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal   SandboxChanged          92s (x8 over 29m)  kubelet, 97011e0a-f47c-4673-ace7-d6f74cde9934  Pod sandbox changed, it will be killed and re-created.
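
To quickly find all pods affected by this condition, you can filter on the pod state (a minimal sketch; the grep pattern is only an example and matches the Init and ContainerCreating states):
kubectl get pods --all-namespaces | grep -E 'Init|ContainerCreating'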


From kubelet.stderr.log, you would see errors like the following for the affected containers:
E0114 14:57:13.656196    9838 remote_runtime.go:128] StopPodSandbox "ca05be4d6453ae91f63fd3f240cbdf8b34377b3643883075a6f5e05001d3646b" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
...
E0114 14:57:13.656256    9838 kuberuntime_manager.go:901] Failed to stop sandbox {"docker" "ca05be4d6453ae91f63fd3f240cbdf8b34377b3643883075a6f5e05001d3646b"}
...
W0114 14:57:30.151650    9838 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "ca05be4d6453ae91f63fd3f240cbdf8b34377b3643883075a6f5e05001d3646b"
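
To confirm the same errors directly on a worker node, you can search the kubelet log (a sketch; the log directory below assumes the standard BOSH job log layout on TKGI worker nodes):
grep -E 'DeadlineExceeded|Failed to stop sandbox|cannot find network namespace' /var/vcap/sys/log/kubelet/kubelet.stderr.log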


You can also validate the status of the node-agent-hyperbus by running the following nsxcli command from the node (as root):
sudo -i
/var/vcap/jobs/nsx-node-agent/bin/nsxcli

"at the nsx-cli prompt, enter": get node-agent-hyperbus status

Expected output:
HyperBus status: Healthy

In this scenario you would see the following error instead:
% An internal error occurred
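
If you need to run this check across several worker nodes, the same command can be issued non-interactively with bosh ssh (a sketch only; the deployment name is a placeholder, and piping the command into nsxcli over stdin is an assumption that may not hold on every version):
bosh -d <service-instance-deployment> ssh worker -c 'echo "get node-agent-hyperbus status" | sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli'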
 


Additional Details:

This condition causes a loop of DEL (delete) requests to be sent to the nsx-node-agent process.
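
The repeated DEL requests can usually be observed in the nsx-node-agent logs on the affected worker (a sketch; the exact log file names under /var/vcap/sys/log/nsx-node-agent/ are an assumption and may differ by version):
grep -i ' del ' /var/vcap/sys/log/nsx-node-agent/*.log | tail -50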


Environment

Product Version: 1.9

Resolution

Restarting the nsx-node-agent process will work around this issue (a scripted equivalent of these steps follows the list below):


-- Use bosh ssh to access the worker node

-- sudo -i

-- monit restart nsx-node-agent

-- Wait for nsx-node-agent to restart:  watch monit summary
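
The same workaround can be applied to every worker in one pass from the operations VM (a sketch; the deployment name is a placeholder and the full monit path assumes a standard BOSH stemcell):
bosh -d <service-instance-deployment> ssh worker -c 'sudo /var/vcap/bosh/bin/monit restart nsx-node-agent'
bosh -d <service-instance-deployment> ssh worker -c 'sudo /var/vcap/bosh/bin/monit summary'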


NOTE: As of February 1, 2020, this issue is being tracked under internal ID PKS-1010 and remains open. Please open a support case if you have further questions.