vSphere Supervisor Workload Cluster node containers cannot start due to missing volumeMounts

Article ID: 423487

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After a power outage, one or more workload clusters are unhealthy and several virtual services are not coming up. These virtual services are run by pods within the affected workload clusters.

From the Supervisor cluster context, the affected workload clusters have a configuration where volumeMounts are defined for /var/lib/containerd and/or /var/lib/kubelet.

kubectl describe cluster <cluster name> -n <cluster namespace>

 
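
To narrow down the describe output, the volume definitions can also be searched for directly in the cluster specification. The following is a minimal sketch; the exact field names that carry the volume definitions vary by ClusterClass version, so the filter is intentionally broad:

# From the Supervisor cluster context (sketch; field names vary by ClusterClass version)
kubectl get cluster <cluster name> -n <cluster namespace> -o yaml | grep -i -B 2 -A 6 volume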

From within the workload cluster context, the affected workload cluster nodes show pods as running per kubectl, but no containers are actually running on the node:

kubectl get pods -A -o wide | grep <node name>
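
As an alternative to filtering with grep, pods scheduled to a specific node can be listed with a field selector. This is a sketch; substitute the affected node name:

kubectl get pods -A -o wide --field-selector spec.nodeName=<node name>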
  • SSH into the affected workload cluster node:
    • See Connect to the TKG Service Cluster Control Plane as a Kubernetes Administrator

    • List all running containers on the node. This article assumes that this command returns no output:
      crictl ps

       

    • There are no volumeMount entries for /var/lib/containerd in the below output (a more targeted mount check is sketched after this list):
      df -h

       

    • Because the /var/lib/containerd volumeMount is missing, no container images are present on the node, and the command below returns empty output:
      crictl images

       

    • If the /var/lib/kubelet volumeMount is missing, the kubelet system service will be failing or stuck in the loaded state, with an error message similar to the one below indicating that kubelet cannot start without its configuration yaml stored in this volumeMount:
      systemctl status kubelet
      
      journalctl -xeu kubelet
      
      msg="Failed to load kubelet config file, path: /var/lib/kubelet/config.yaml", error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file \"/var/lib/kubelet/config.yaml\"

Environment

vSphere Supervisor

This issue can occur on any workload cluster, regardless of whether the cluster is managed by Tanzu Mission Control (TMC).

Workload cluster configured with a volumeMount for /var/lib/containerd and/or /var/lib/kubelet

ClusterClass 3.2 and below

Cause

Containers rely on the containerd system service, which manages containers and container images, and on the kubelet service, which manages pods and node health.

If the volumeMount for containerd fails to mount successfully, the container images stored in /var/lib/containerd are unavailable on the node, and containers will be unable to pull their images in order to start.

If the volumeMount for kubelet fails to mount successfully, kubelet will be unable to start due to missing kubelet configuration files stored in /var/lib/kubelet.

Prior to VKS Supervisor Service 3.3 and ClusterClass 3.3, workload cluster nodes rely on Linux kernel logic to manage these volumeMounts.
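
These volumeMounts are typically backed by additional disks attached to the node VM. On a node where the mount operation failed, the backing disk may still be attached but not mounted. The following is a sketch for inspecting this from the node; device names and sizes will vary per environment:

# List block devices and their mount points; a disk without a MOUNTPOINT may be an unmounted node volume
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT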

Resolution

Workaround

The workaround below assumes that the affected workload cluster nodes successfully mounted the volumeMounts when they were initially created.

This workaround may not be effective on workload cluster nodes that never mounted the volumes successfully, or if there is an issue with the persistent volume itself.

 

Restart the affected workload cluster nodes to re-trigger the volumeMount mount operation:

  1. Confirm that the affected workload cluster nodes are expected to have a volumeMount for /var/lib/containerd and/or /var/lib/kubelet

  2. SSH into the affected workload cluster node (see Connect to the TKG Service Cluster Control Plane as a Kubernetes Administrator).
  3. Confirm that there are issues with the volumeMounts or that the volumeMounts are not listed:
    df -h

     

  4. Perform a reboot of the node:
    reboot now

     

  5. Post-reboot, SSH into the affected workload cluster node and check that volumeMounts are mounted:
    df -h

     

  6. If containerd volumeMounts are still missing, restart containerd:
    systemctl restart containerd

     

  7. Confirm that containers are running or starting on the affected node (an additional service check is sketched after these steps):
    crictl ps

     

  8. Perform the above steps on each affected workload cluster node.
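
Optionally, after the workaround, the state of the relevant services and mounts can be verified in one pass on the node. This is a sketch using standard systemd, df, and crictl commands:

# Both services should report "active"
systemctl is-active containerd kubelet

# Confirm the mounts are present and containers are starting
df -h | grep -E "containerd|kubelet"
crictl ps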

 

Resolution

It is recommended to upgrade the affected workload clusters to the builtin-generic ClusterClass 3.3 or higher.

This ClusterClass version requires VKS Supervisor Service version 3.3 or higher.
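
Before planning the upgrade, the ClusterClass currently referenced by a workload cluster can be confirmed from the Supervisor cluster context. This is a sketch assuming the cluster uses a Cluster API managed topology:

kubectl get cluster <cluster name> -n <cluster namespace> -o jsonpath='{.spec.topology.class}'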

 

Additional Information

See also: New Machine Stuck Provisioned State due to Incorrect Volume Mapping/Disk Swap in vSphere Supervisor Workload Cluster with Volume Mount(s)