This article provides steps to scope the problem and then provides a workaround for resolution.
Symptoms:
1. In vSphere 7.0 U3, after an HA failover or a reboot of a TKGS Worker Node, pods become stuck in the ContainerCreating state.
2. This condition is seen specifically when the TKGS Guest Cluster's Worker Nodes are configured with an ephemeral volume mounted at /var/lib/containerd (see the configuration sketch below). Worker Nodes created without ephemeral storage do not exhibit this condition.
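For reference, the ephemeral volume in question is typically defined in the worker node section of the TanzuKubernetesCluster specification, which can be inspected with the command below. The excerpt that follows is only a sketch of the relevant section; the names and capacity are placeholders and the exact field layout depends on the TanzuKubernetesCluster API version in use:
- kubectl get tanzukubernetescluster <guest_cluster_name> -n <supervisor_namespace> -o yaml
    volumes:
    - name: containerd
      mountPath: /var/lib/containerd
      capacity:
        storage: <ephemeral_volume_size>
      storageClass: <storage_class_name>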
3. Listing the pods in wide format (command below) shows that the pods stuck in ContainerCreating state are all attempting to start on the node that was failed over by vSphere HA:
- kubectl get pods -A -o wide | grep ContainerCreating
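Example output (abbreviated; the pod and node names are placeholders, and the header row is shown here only for readability since grep filters it from the actual output). Note that every stuck pod lists the same NODE, the Worker Node that was failed over:
NAMESPACE     NAME                 READY   STATUS              RESTARTS   AGE     IP       NODE
<namespace>   <problem_pod_name>   0/1     ContainerCreating   0          4d20h   <none>   <failed_over_worker_node_name>
<namespace>   <problem_pod_name>   0/2     ContainerCreating   0          4d20h   <none>   <failed_over_worker_node_name>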
4. Describing the hung pods (command below) shows events similar to the following; the pause image version may vary based on your environment:
- kubectl describe pod <problem_pod_name> -n <namespace_pod_resides_in>
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 66s (x32345 over 4d20h) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost:5000/vmware.io/pause:3.4.1": failed to pull image "localhost:5000/vmware.io/pause:3.4.1": failed to pull and unpack image "localhost:5000/vmware.io/pause:3.4.1": failed to resolve reference "localhost:5000/vmware.io/pause:3.4.1": failed to do request: Head "http://localhost:5000/v2/vmware.io/pause/manifests/3.4.1": dial tcp 127.0.0.1:5000: connect: connection refused
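The "connection refused" to localhost:5000 means the node-local registry that serves the pause image is not reachable on the affected node. As an optional check from an SSH session on the Worker Node, querying the registry's standard /v2/ endpoint should fail to connect on the affected node (on a healthy node it returns an HTTP response):
- curl -v http://localhost:5000/v2/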
5. Most importantly, when SSH'd directly into the affected Worker Node, running the crictl ps command shows no containers running:
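- crictl ps
On an affected node this returns only the column header with no container rows listed (sudo may be required depending on the user you SSH in as).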