After a power outage, one or more workload clusters are unhealthy and several virtual services do not come up. These virtual services are backed by pods running within the affected workload clusters.
From the Supervisor cluster context, the affected workload clusters are configured with volumeMounts for /var/lib/containerd and/or /var/lib/kubelet. This can be confirmed by describing the cluster:
kubectl describe cluster <cluster name> -n <cluster namespace>
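Depending on the clusterClass in use, these volumeMounts typically appear as node volume variables in the Cluster topology. As an illustrative sketch only (the variable name nodePoolVolumes and the sizes shown are assumptions and may differ in your environment), the cluster specification can be inspected with:
kubectl get cluster <cluster name> -n <cluster namespace> -o yaml
and would contain entries similar to:
- name: nodePoolVolumes
  value:
  - name: containerd
    mountPath: /var/lib/containerd
    capacity:
      storage: 50Gi
  - name: kubelet
    mountPath: /var/lib/kubelet
    capacity:
      storage: 50Gi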
From within the workload cluster context, kubectl reports the pods on the affected nodes as Running, but no containers are actually running on the node itself:
kubectl get pods -A -o wide | grep <node name>
crictl ps
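Exited containers can also be listed for completeness; on an affected node this listing is typically empty as well:
crictl ps -a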
df -h
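To focus on the affected paths, the df output can be filtered. Assuming these volumes are normally mounted as dedicated filesystems, they will be absent from the output on an affected node:
df -h | grep -E '/var/lib/(containerd|kubelet)'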
crictl images
systemctl status kubelet
journalctl -xeu kubelet
msg="Failed to load kubelet config file, path: /var/lib/kubelet/config.yaml", error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file \"/var/lib/kubelet/config.yaml\"
vSphere Supervisor
This issue can occur on any workload cluster, regardless of whether it is managed by Tanzu Mission Control (TMC)
Workload cluster configured with volumeMounts for /var/lib/containerd and/or /var/lib/kubelet
ClusterClass 3.2 and below
Containers on a node depend on two system services: containerd, which manages containers and container images, and kubelet, which manages pods and node health.
If the volumeMount for /var/lib/containerd fails to mount successfully, the container images stored in /var/lib/containerd are unavailable on the node, and containers are unable to pull their images and start.
If the volumeMount for /var/lib/kubelet fails to mount successfully, kubelet is unable to start because its configuration files stored in /var/lib/kubelet are missing.
Prior to VKS Supervisor Service 3.3 and clusterClass 3.3, workload cluster nodes rely on Linux kernel logic to manage these volumeMounts.
The workaround below assumes that the affected workload cluster nodes successfully mounted these volumeMounts when the nodes were originally created.
It may not be effective on nodes whose volumes never mounted successfully, or if there is an issue with the persistent volume itself.
Restart the affected workload cluster nodes to re-trigger the volumeMount operation.
SSH to an affected node and record the current mounts:
df -h
Reboot the node:
reboot now
Once the node is back up, confirm that /var/lib/containerd and /var/lib/kubelet are mounted:
df -h
Restart containerd on the node:
systemctl restart containerd
Confirm that containers are running on the node again:
crictl ps
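Once containerd and kubelet are healthy again, the pods scheduled on the node should return to a running state. This can be verified with the same checks used in the symptoms above, first on the node:
systemctl status kubelet
and then from the workload cluster context:
kubectl get pods -A -o wide | grep <node name>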
It is recommended to upgrade the affected workload clusters to use builtin-generic-clusterClass 3.3 or higher.
This clusterClass version requires VKS supervisor service version 3.3 or higher.
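As a reference point (resource names can vary by environment), the clusterClass a cluster currently uses and the clusterClasses available in its namespace can be checked from the Supervisor cluster context:
kubectl get cluster <cluster name> -n <cluster namespace> -o jsonpath='{.spec.topology.class}'
kubectl get clusterclass -n <cluster namespace>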