vSphere Kubernetes Cluster Pods stuck in ContainerCreating state after vSphere HA Failover event

Products

VMware vSphere 7.0 with Tanzu

Issue/Introduction

This article provides steps to scope the problem then provide a workaround for resolution.

Symptoms:

1. In vSphere 7.0 U3, after an HA failover or reboot of a TKGS Worker Node, pods will show stuck in ContainerCreating state.

2. This condition is specifically seen when the TKGS Guest Cluster has Worker Nodes configured to use /var/lib/containerd ephemeral volumes. We will not see this condition on Worker Nodes created with no ephemeral storage.

3. If we show the pod output in wide format (command below), the pods stuck in ContainerCreating state will all be attempting startup on the node that was failed over by vSphere HA:

kubectl get pods -A -o wide | grep ContainerCreating

4. When describing the hung pods, we see errors similar to the below where the pause image version may vary based on your environment:

kubectl describe pod <problem_pod_name> -n <namespace_pod_resides_in>

Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 66s (x32345 over 4d20h) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost:5000/vmware.io/pause:3.4.1": failed to pull image "localhost:5000/vmware.io/pause:3.4.1": failed to pull and unpack image "localhost:5000/vmware.io/pause:3.4.1": failed to resolve reference "localhost:5000/vmware.io/pause:3.4.1": failed to do request: Head "http://localhost:5000/v2/vmware.io/pause/manifests/3.4.1": dial tcp 127.0.0.1:5000: connect: connection refused

5. Most importantly, when SSH'ed into the Worker Node directly, you see no containers running when running the crictl ps command.

Environment

VMware vSphere 7.0 with Tanzu

Cause

Containerd creates an overlay filesystem mounted in the root volume by default if there is no ephemeral mount point already identified. In this failure condition, containerd fails to restart after the ephemeral mount is completed. This leads to a state where containerd references a filesystem that no longer has reference to the container images it used prior to kernel reboot.

We see a resulting failure to reference the sandbox location in order to start the kube-proxy service, upon which the CNI and all other system containers depend. This causes a condition where kubelet is started and the worker node incorrectly "appears" healthy to the Control Plane nodes, leading to scheduling attempts that cannot be completed.

This is resolved in TKR v1.23.8---vmware.3-tkg.1 and higher

Resolution

For any worker node stuck in this state where its pods remain in ContainerCreating state, the containerd service will need to be restarted.

To apply the workaround:

----------------------------------------------------
Identify the Problem Node:

Connect to the affected guest cluster's context
Identify the pods stuck in ContainerCreating state:
- ```
kubectl get pods -A -o wide | grep -v Run
```
Note the Worker Node on which the pods stuck in ContainerCreating are being scheduled from the above command

Restart Containerd on Problem Node:

Connect to the Supervisor cluster as root

Identify the problem node's IP address:

kubectl get vm -o wide -n <affected cluster's namespace>

Connect into the problem node as breakglass user vmware-system-user:
- Documentation:SSH to Tanzu Kubernetes Cluster Nodes as the System User Using a Password
Once logged into the Problem Node, enable root privilege:
- ```
sudo su
```
Confirm there are no containers running on the node (this command should return an empty output):
- ```
crictl ps
```
If there are no containers running, restart containerd:
- Note: If there are containers running and containerd is restarted, the containers will also be restarted.
  Do not restart containerd if you see containers running. If any containers are running, the environment is experiencing a different problem than this KB.
- ```
systemctl restart containerd
```
Wait a few moments, then check if pods are starting post-restart of containerd:
- ```
crictl ps
```

Additional Information

Impact/Risks:
Pods will continue to start and run on Worker Nodes that have not been restarted, however, this can lead to a condition where certain pods are incorrectly scheduled on the partially started Worker Node that was rebooted. This might impact workloads dependent on which pods are stuck in ContainerCreating state.