By default, if no dedicated ephemeral mount point has been identified, containerd creates its overlay filesystem on the root volume. In this failure condition, containerd fails to restart after the ephemeral mount is completed, leaving it referencing a filesystem that no longer contains the container images it used prior to the kernel reboot.
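One way to confirm this state on an affected node is to check which filesystem backs containerd's data directory. This is a hedged sketch; it assumes the default containerd root of /var/lib/containerd, which may differ in your configuration:
findmnt /var/lib/containerd    # shows the source device and mount point backing containerd's root
df -h /var/lib/containerd      # shows whether that path sits on the root volume or on the ephemeral mount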
As a result, containerd cannot reference the sandbox location needed to start the kube-proxy service, on which the CNI and all other system containers depend. Because kubelet still starts, the worker node incorrectly "appears" healthy to the Control Plane nodes, leading to scheduling attempts that can never complete.
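A hedged way to observe this symptom from a workstation with cluster access (the kube-system namespace and kube-proxy pod naming are assumptions and can vary by distribution) is to compare the node's reported status with the state of its system pods:
kubectl get nodes                                             # the affected worker may still report Ready
kubectl get pods -n kube-system -o wide | grep kube-proxy     # the kube-proxy pod on the affected node remains stuck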
To apply the workaround:
----------------------------------------------------
Identify the Problem Node (pods that are not in a Running state point to the affected worker):
kubectl get pods -A -o wide | grep -v Run

Restart Containerd on the Problem Node (locate the node's VM, connect to it, for example over SSH, and restart the service; see the verification example after these steps):
kubectl get vm -o wide -n <affected cluster's namespace>
sudo su
crictl ps
systemctl restart containerd
crictl ps
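If the restart succeeds, the second crictl ps should again list the system containers. As a hedged follow-up check from a workstation with cluster access, the pods that were stuck should drop out of the non-Running list as they return to Running:
kubectl get pods -A -o wide | grep -v Run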