CSI-controller Pod stuck in Terminating/ContainerCreating with kubelet FailedKillPod and FailedCreatePodSandBox failures

Article ID: 435491


Products

VMware vSphere Kubernetes Service

Issue/Introduction

Workload cluster pods were unable to transition out of the ContainerCreating or Terminating states due to underlying Container Network Interface (CNI) failures. Specifically, the pod sandbox RPC calls issued by kubelet were failing, most heavily impacting the csi-controller deployment.

Pod status inspection revealed csi-controller pods stuck in Terminating and ContainerCreating states.
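For example, the affected pods can be listed with a command along these lines (the vmware-system-csi namespace is typical for the vSphere CSI controller in VKS workload clusters; adjust if the deployment runs elsewhere):

kubectl get pods -n vmware-system-csi -o wide | grep -E 'Terminating|ContainerCreating'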

Kubelet logs/events displayed sandbox termination failures:

Warning FailedKillPod 2m3s (x2421 over 8h) kubelet error killing pod: failed to "KillPodSandbox" for "xxx" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"xxx\": plugin type=\"linkerd-cni\" name=\"linkerd-cni\" failed (delete): failed to find plugin \"linkerd-cni\" in path [/opt/cni/bin]"

Kubelet logs/events displayed sandbox creation failures:

Warning FailedCreatePodSandBox 50s (x513 over 113m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "xxx": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
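Both event types can be listed cluster-wide with a field selector, for example:

kubectl get events -A --field-selector reason=FailedKillPod
kubectl get events -A --field-selector reason=FailedCreatePodSandBox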

Environment

VKS 3.4.0 (Kubernetes v1.33)

Cause

The primary cause appeared to be a disconnection between the Antrea CNI controllers and the Kubernetes API server, leading to a loss of the leader election lease and a subsequent failure to program pod network sandboxes. A secondary issue was a segmentation fault in the linkerd-cni binary on one of the control plane nodes.
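The secondary linkerd-cni issue can be confirmed directly on the affected node; the /opt/cni/bin path comes from the kubelet error above, and SSH access to the node is assumed. Checking that the binary exists, and grepping the kernel log for segfault records, is one quick sketch:

ls -l /opt/cni/bin/linkerd-cni
dmesg | grep -i segfault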

Inspection of the vmware-system-antrea interworking pod logs confirmed API and etcd communication timeouts, preventing the renewal of the controller resource lock:

E0321 09:09:15.186276 1 leaderelection.go:429] Failed to update lock optimistically: Put "https://##.##.#.#:443/apis/coordination.k8s.io/v1/namespaces/vmware-system-antrea/leases/my-lock": read tcp ###.##.##.###:#####->##.##.#.#:443: read: connection reset by peer
E0321 23:08:30.979235 1 leaderelection.go:429] Failed to update lock optimistically: etcdserver: request timed out
E0322 00:12:11.030448 1 leaderelection.go:429] ... dial tcp ##.##.#.#:443: connect: connection refused
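These entries can be pulled from the interworking pods with something like the following; the deployment name interworking is an assumption and the selector should be adjusted to match the cluster:

kubectl logs -n vmware-system-antrea deploy/interworking --all-containers --since=24h | grep leaderelection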

Resolution

Restart the Antrea controller and agent pods in the workload cluster to force the API connection to be re-established:

kubectl delete pods -n vmware-system-antrea -l app=antrea
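Watch the Antrea pods return to a Running state before proceeding:

kubectl get pods -n vmware-system-antrea -w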


Verify the status of the csi-controller pods and restart them if required.
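If a restart is needed, a rollout restart avoids deleting pods one by one; the deployment name vsphere-csi-controller and the vmware-system-csi namespace are typical for VKS but should be confirmed in the cluster before running this:

kubectl -n vmware-system-csi rollout restart deployment vsphere-csi-controller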

Restarting the Antrea pods flushes stale connection states and forces the re-initialization of the CNI configuration daemon on all nodes. Following the restart, the kubelet was able to successfully process the pending RPC calls, and the csi-controller pods transitioned to a Running state, restoring cluster scheduling capabilities.