Creating a new VKS cluster with Ubuntu nodes gets stuck, or, after scaling up a worker node-pool, the new Ubuntu worker machines/nodes remain stuck in the NotReady state.
The affected VKS cluster node-pools have additional volumes such as /var/lib/containerd and/or /var/lib/kubelet.
kubectl describe cluster -n <namespace> <cluster name>
- name: containerd
mountPath: /var/lib/containerd
...
- name: kubelet
mountPath: /var/lib/kubelet
After connecting over SSH to one of the stuck NotReady worker nodes, the following symptoms are observed:
cat /var/log/cloud-init-output.log
[YYYY-MM-DD HH:MM:SS] {"time":"...","level":"ERROR","msg":"error applying: error applying task mount-var-lib-<volume>: error mounting disk to [/var/lib/<volume>]: could not format disk at [/dev/disk/by-path/-part1] to ext4: mke2fs 1.47.0
error mounting disk to [/var/lib/<volume>]: could not format partition [/dev/disk/by-path/...] to ext4: mke2fs 1.47.0 No such file or directory while setting up superblock: exit status 1
The file /dev/disk/by-path/-part1 does not exist and no size was specified.\n: exit status 1"}
systemctl status kubelet
journalctl -xeu kubelet
"command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
The unit kubelet.service has entered the 'failed' state with result 'exit-code'.
systemd[1]: kubelet.service: Scheduled restart job
Environment
vSphere Supervisor
vSphere Kubernetes Service (VKS) 3.5.0 and higher
Cause
During the initial cloud-init bootstrapping of a new Ubuntu node, the system fails to format the new partition dedicated to the configured additional volume. This is caused by the underlying mkfs utility failing when the device node or its symlink is not yet fully populated or stabilized in the OS.
Because the volume mounting failed during cloud-init, the kubelet configuration files are never generated.
The kubelet service will repeatedly fail to start, reporting that config.yaml is missing.
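To confirm that a given node hit this failure rather than some other bootstrap problem, the cloud-init output log can be searched for the formatting error shown in the symptoms. The following is a minimal sketch, not an official diagnostic: the function name `has_format_error` is hypothetical, and the pattern assumes the error text matches the log excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VKS): succeeds if the given log file
# contains the "could not format" mount error quoted in the symptoms.
# Defaults to the cloud-init output log path used earlier in this article.
has_format_error() {
  local log_file="${1:-/var/log/cloud-init-output.log}"
  # Matches both variants seen in the log excerpt:
  #   "... could not format disk at [...]" and "... could not format partition [...]"
  grep -Eq 'error mounting disk to \[[^]]*\]: could not format (disk|partition)' \
    "$log_file" 2>/dev/null
}
```

On an affected node, `has_format_error && echo "format error found"` would print the message; on a node that failed for another reason it prints nothing.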
Resolution
This is a known issue that will be fixed in an upcoming version of VKS.
Workaround
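Instruct the system to remediate and recreate the stuck NotReady machines: confirm that the cluster is not paused, list the NotReady machines, then annotate each of them for remediation.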
kubectl get cluster -n <VKS cluster namespace> <VKS cluster name> -o yaml | grep -i paused
kubectl get machines -n <VKS cluster namespace> | grep -i NotReady
kubectl annotate machine -n <VKS cluster namespace> <NotReady machine> 'cluster.x-k8s.io/remediate-machine=""'
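When several machines are stuck, the list and annotate steps above can be combined into a loop. This is a sketch under assumptions, not a supported script: the function name `remediate_notready_machines` is hypothetical, it reuses the same NotReady filter and annotation as the commands above, and it assumes the machine name is the first column of the `kubectl get machines` output. Review the list of NotReady machines before running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: annotate every machine in the given namespace whose
# `kubectl get machines` line matches NotReady, so that it is remediated.
# Argument 1: the VKS cluster namespace.
remediate_notready_machines() {
  local namespace="$1"
  local machine
  kubectl get machines -n "$namespace" --no-headers \
    | grep -i NotReady \
    | awk '{print $1}' \
    | while read -r machine; do
        # Same annotation as the manual command above.
        kubectl annotate machine -n "$namespace" "$machine" \
          'cluster.x-k8s.io/remediate-machine=""'
      done
}
```

Usage: `remediate_notready_machines <VKS cluster namespace>`. If no machine matches, the loop body never runs and nothing is annotated.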