VKS Cluster Ubuntu Worker Nodes stuck NotReady when configured with Additional Volume

Article ID: 437461

Products

Tanzu Kubernetes Runtime

VMware vSphere Kubernetes Service

Issue/Introduction

Creating a new VKS cluster with Ubuntu nodes gets stuck, or, after scaling up a worker node pool, the new Ubuntu worker machines/nodes remain stuck in the NotReady state.

The node pools of the affected VKS cluster are configured with additional volumes such as /var/lib/containerd and/or /var/lib/kubelet. These appear in the cluster description:

kubectl describe cluster -n <namespace> <cluster name>

- name: containerd
  mountPath: /var/lib/containerd
...
- name: kubelet
  mountPath: /var/lib/kubelet

When connected over SSH to one of the stuck NotReady worker nodes, the following symptoms are observed:

The cloud-init output log shows errors similar to the following when mounting the additional volume's partition, where <volume> is the additional volume configured above:

cat /var/log/cloud-init-output.log

[YYYY-MM-DD HH:MM:SS] {"time":"...","level":"ERROR","msg":"error applying: error applying task mount-var-lib-<volume>: error mounting disk to [/var/lib/<volume>]: could not format disk at [/dev/disk/by-path/-part1] to ext4: mke2fs 1.47.0

error mounting disk to [/var/lib/<volume>]: could not format partition [/dev/disk/by-path/...] to ext4: mke2fs 1.47.0 No such file or directory while setting up superblock: exit status 1

The file /dev/disk/by-path/-part1 does not exist and no size was specified.\n: exit status 1"}
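To check for this symptom without reading the whole log, the error text can be searched for directly. The following is a minimal sketch, assuming the message wording matches the excerpt above; the function name has_volume_format_error is hypothetical and not part of VKS or cloud-init.

```shell
# Hypothetical helper (not part of VKS or cloud-init).
# Succeeds (exit 0) if the given log contains the additional-volume
# mount error shown above; the log path defaults to the file used above.
has_volume_format_error() {
  log_file="${1:-/var/log/cloud-init-output.log}"
  # -F treats the pattern as a fixed string, so "[" is matched literally.
  grep -q -F 'error mounting disk to [/var/lib/' "$log_file"
}
```

On an affected node, has_volume_format_error && echo "additional volume failed to mount" would print the message if the error is present in the log.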

The kubelet service is unhealthy and restarts repeatedly, with the following errors reporting that the config.yaml file is missing:

systemctl status kubelet

journalctl -xeu kubelet

"command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
 
The unit kubelet.service has entered the 'failed' state with result 'exit-code'. 
systemd[1]: kubelet.service: Scheduled restart job
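The number of failed kubelet start attempts can be estimated by counting this error in the journal. The following is a minimal sketch, assuming the journal lines contain the same "failed to load kubelet config file" text as the excerpt above; the function name count_config_load_failures is hypothetical.

```shell
# Hypothetical helper: reads log lines on stdin and prints how many of
# them contain the kubelet config load error shown above.
count_config_load_failures() {
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so "|| true" keeps the function's exit status at 0.
  grep -c 'failed to load kubelet config file' || true
}

# Example use on an affected node (not executed here):
#   journalctl -u kubelet --no-pager | count_config_load_failures
```

A count that keeps increasing between runs is consistent with the restart loop described above.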

Environment

vSphere Supervisor

vSphere Kubernetes Service (VKS) 3.5.0 and higher

Cause

During the initial cloud-init bootstrapping of a new Ubuntu node, the system fails to format the new partition dedicated to the configured additional volume. This is caused by the underlying mkfs utility failing when the device node or its symlink is not yet fully populated or stabilized in the OS.

Because the volume mounting failed during cloud-init, the kubelet configuration files are never generated.

The kubelet service will repeatedly fail to start, reporting that config.yaml is missing.

Resolution

This is a known issue that will be fixed in an upcoming version of VKS.

Workaround

Instruct the system to remediate and recreate the stuck NotReady machines.

Connect to the Supervisor cluster context.
Confirm that the corresponding VKS cluster is not paused. The command below should return paused: false:
```
kubectl get cluster -n <VKS cluster namespace> <VKS cluster name> -o yaml | grep -i paused
```

Retrieve the names of the stuck NotReady machines:

kubectl get machines -n <VKS cluster namespace> | grep -i NotReady

Annotate the machines so that the system will remediate and recreate them:

kubectl annotate machine -n <VKS cluster namespace> <NotReady machine> 'cluster.x-k8s.io/remediate-machine=""'

Monitor the recreation of the nodes.
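When several machines are stuck, the retrieve and annotate steps above can be combined. The following is a minimal sketch, assuming the Supervisor context is already selected and that the machine name is the first column of the kubectl get machines output; the function names list_notready_machines and remediate_notready_machines are hypothetical.

```shell
# Hypothetical helper: prints the names of machines whose row in
# "kubectl get machines" contains "NotReady" (case-insensitive).
list_notready_machines() {
  namespace="$1"
  kubectl get machines -n "$namespace" --no-headers \
    | grep -i 'NotReady' \
    | awk '{print $1}'
}

# Hypothetical helper: applies the remediation annotation from the step
# above to every machine returned by list_notready_machines.
remediate_notready_machines() {
  namespace="$1"
  list_notready_machines "$namespace" | while read -r machine; do
    kubectl annotate machine -n "$namespace" "$machine" \
      'cluster.x-k8s.io/remediate-machine=""'
  done
}
```

Running list_notready_machines <VKS cluster namespace> first shows which machines would be annotated before remediate_notready_machines is called. Recreation can then be followed with kubectl get machines -n <VKS cluster namespace> -w.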
