New Machine Stuck in Provisioned State due to Incorrect Volume Mapping/Disk Swap in vSphere Supervisor Workload Cluster with Volume Mount(s)

Article ID: 389012


Products

vSphere with Tanzu
VMware vSphere Kubernetes Service
Tanzu Kubernetes Runtime

Issue/Introduction

In a vSphere Supervisor environment, a workload cluster that uses one or more volume mounts for persistent storage has volume mounts assigned that are incorrectly mapped, indicating a disk swap.

This occurs when a new node is created in a nodepool that uses one or more volume mounts, and it can also affect control plane nodes if a volume mount is specified in the cluster YAML.

It can result in a machine stuck in the Provisioned state and volume mount sizing errors for the new node or machine.

 

The symptoms differ depending on whether there is a single volume mount or multiple volume mounts.

 

Multiple Volume Mounts

While connected to the Supervisor cluster context, the following symptoms are observed:

  • When describing the affected Workload cluster, the affected node's nodepool has one or more volume mounts with different storage sizes assigned, which will vary by environment:
    kubectl describe cluster <affected cluster name> -n <affected cluster namespace>

    volumes:
    - name: containerd
      mountPath: /var/lib/containerd
      capacity:
        storage: 150Gi
    - name: kubelet
      mountPath: /var/lib/kubelet
      capacity:
        storage: 50Gi
  • New nodes may remain stuck in the Provisioned state with NodeDiskPressure (see the sample listing at the end of this list).
    • This may not affect all nodes of a cluster; some nodes may provision successfully.

  • Nodes may fail to start up properly after reboot or rolling redeployment.

  • Describing the node's virtual machine shows an error message similar to the following, where <volume mount name> is one of the volume names specified in the nodepool as per above, such as kubelet or containerd:
    kubectl describe vm <node stuck provisioned> -n <namespace>

    Warning CreateOrUpdateFailure ##m (x# over ##m) vmware-system-vmop/vmware-system-vmop-controller-manager-<id>/virtualmachine-controller persistent volume:<node stuck provisioned>-<volume mount name> is not attached to the VM.

  • The vmop controller logs show error messages similar to the following, where <volume mount name> is one of the volume names specified in the nodepool as per above, such as kubelet or containerd:
    kubectl logs -n vmware-system-vmop <vmware-system-vmop-controller-manager pod>
    
    "Failed to reconcile VirtualMachine" err="persistent volume: <node stuck provisioned>-<volume mount name> not attached to VM" logger="VirtualMachine"
    "Reconciler error" err="persistent volume: <node stuck provisioned>-<volume mount name> not attached to VM" controller="virtualmachine" 
     

While connected to the Workload cluster's context, the following symptoms may be observed:

  • Describing an affected node shows NodeDiskPressure.

  • Pods that use the affected volume mounts are Evicted with NodeDiskPressure as a result of the aforementioned disk swap.
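
  • Evicted pods can be located from the Workload cluster context with a command like the following (a general sketch; the grep filter is illustrative):
    kubectl get pods -A | grep -i Evicted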

 

While connected via SSH to one of the worker nodes in the nodepool with multiple volume mounts, the following symptoms are observed:

  • The disks are mapped to different mounts and have different sizes than the configuration set in the nodepool, indicating an incorrect mapping issue:
    • The size values per mount are incorrect, or a mount may show 100% usage due to the aforementioned disk swap.

    • Although the below example shows multiple volume mounts, the incorrect mapping issue can also occur when there is only one volume mount.
    • In the below example, containerd (originally 150G) has been incorrectly mapped to kubelet's disk (originally 50G) and vice-versa.
      df -h | grep -Ei 'sdc|sdb'

      Filesystem      Size    Used    Avail   Use%   Mounted on
      /dev/sdb1       ###G    ###k    50G     #%     /var/lib/containerd
      /dev/sdc1       ###G    ###k    150G    #%     /var/lib/kubelet
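
  • To cross-check which physical disk backs each mount, the block devices can be listed and compared against the nodepool configuration. A hedged sketch (device names and sizes vary by environment):
      lsblk -o NAME,SIZE,MOUNTPOINT

      NAME     SIZE   MOUNTPOINT
      sdb       50G
      └─sdb1    50G   /var/lib/containerd
      sdc      150G
      └─sdc1   150G   /var/lib/kubelet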

Single Volume Mount

While connected to the Supervisor cluster context, the following symptoms are observed:

  • The new machine is stuck in the Provisioned state but has a ProviderID, and the corresponding VM is poweredOn with an IP address assigned (see the sample output after this list):
    kubectl get vm,machine -o wide -n <workload cluster namespace>
  • When describing the affected Workload cluster, the affected node's nodepool has one volume mount; the size and name will vary by environment:
    kubectl describe cluster <affected cluster name> -n <affected cluster namespace>

    volumes:
    - name: containerd
      mountPath: <volume mount path>
      capacity:
        storage: <volume mount size>
  • Describing the node's virtual machine shows an error message similar to the following, where <volume mount name> is the volume mount specified in the nodepool as per above.
    • However, the volume mount may show as attached.

      kubectl describe vm <node stuck provisioned> -n <namespace>

      powerState: PoweredOn
      uniqueID: <vm id>
      volumes:
      - attached: true
        diskUUID: <UUID>
        name: <vm name>-<volume mount name>

      Events:
      Warning CreateOrUpdateFailure ##m (x# over ##m) vmware-system-vmop/vmware-system-vmop-controller-manager-<id>/virtualmachine-controller persistent volume:<node stuck provisioned>-<volume mount name> is not attached to the VM.
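
  • A hypothetical sample of the VM and machine listing referenced above (output abbreviated; column names vary by product version):
      kubectl get vm,machine -o wide -n <workload cluster namespace>

      NAME                                             POWERSTATE   AGE
      virtualmachine.vmoperator.vmware.com/<vm name>   poweredOn    50m

      NAME                                      CLUSTER     NODENAME   PROVIDERID       PHASE         AGE
      machine.cluster.x-k8s.io/<machine name>   <cluster>              vsphere://<id>   Provisioned   50m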

 

While connected to the Workload cluster's context, the following symptoms are observed:

  • The new node is not present in the list of nodes:

    kubectl get nodes
    

 

While connected via SSH to the node stuck in the Provisioned state, the following symptoms are observed:

  • The desired volume mount does not appear in the output of the following command; instead, a different mount is associated with /dev/sdb1:
    df -h

    Filesystem      Size    Used    Avail   Use%   Mounted on
    /dev/sdb1       ###G    ###k    ##G     #%     <different mount>
    • The intended device for a nodepool's volume mount is /dev/sdb1.

    • For example, a new node may fail to reach Running state if the different mount using /dev/sdb1 is /boot/efi.


  • Checking cloud-init-output logs shows an error message similar to the following, indicating that the desired volume mount could not be mounted on /dev/sdb1:

    cat /var/log/cloud-init-output.log

    [YYYY-MM-DD HH:MM:SS] + mount -t ext4 /dev/sdb1 <volume mount name>
    [YYYY-MM-DD HH:MM:SS] mount: <volume mount name>: /dev/sdb1 already mounted on <different mount>
  • Kubelet and containerd are not running:
    systemctl status kubelet
    
    systemctl status containerd
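
    • When cloud-init fails before starting these services, the status output typically resembles the following sketch (hedged; unit descriptions and file paths vary by TKR version):

      ● kubelet.service - kubelet: The Kubernetes Node Agent
         Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; ...)
         Active: inactive (dead)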
    

 

For details on volume mount support, please see the following KB: Support for Node Volume Mounts in Workload clusters

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).

Cause

Prior to TKG Service 3.3.0, vSphere with Tanzu relied on the Linux kernel to recognize disks.

However, the Linux kernel handles drive letter assignment asynchronously, so the order in which devices are enumerated is not guaranteed. Adding more CPUs to a node can make the issue more likely to occur.

This can lead to volumes in clusters with nodepools becoming incorrectly mapped, effectively swapping the disk order when /dev/sd* device names are assigned.

In extreme cases, the root disk may be mounted as /dev/sdb, where /dev/sdb1 is incorrectly formatted as an extra volume.

 

This issue is not limited to scenarios where there are multiple volume mounts in the same nodepool.

Incorrect mapping can also occur when there is only a single volume mount for the nodepool.

In this scenario, the device intended for the single volume mount has already been assigned to another mount.

This causes cloud-init to fail at this stage, before cloud-init starts kubelet and containerd on the node.

Kubelet and containerd are responsible for container and pod management on the node.

 


Resolution

Resolution

Upgrade to TKG Service 3.3.0 and use the builtin-generic-3.3 ClusterClass to provision node pools with multiple volume mounts.

In TKG Service 3.3.0, the builtin-generic-3.3 ClusterClass uses PCI and SCSI device mappings to assign and format disks consistently.
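
For illustration, PCI/SCSI-based device paths identify each disk by its bus address, which is stable across boots, rather than by the enumeration-order-dependent /dev/sd* name. On a Linux node, these mappings can be inspected as follows (a hedged sketch; the PCI addresses are illustrative):

    ls -l /dev/disk/by-path/

    pci-0000:03:00.0-scsi-0:0:0:0       -> ../../sda
    pci-0000:03:00.0-scsi-0:0:1:0       -> ../../sdb
    pci-0000:03:00.0-scsi-0:0:1:0-part1 -> ../../sdb1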

 

Rebasing to the builtin-generic-3.3 ClusterClass is necessary for a Cluster that is using the tanzukubernetes class or an older ClusterClass version.

The cluster's class can be updated to the latest builtin ClusterClass version in an existing workload cluster through a workload cluster TKR upgrade:

  1. Connect to the Supervisor cluster context.

  2. Edit the desired cluster to upgrade:
    kubectl edit cluster <cluster name> -n <namespace>

  3. Remove the following annotation (see the example after these steps):
    kubernetes.vmware.com/skip-auto-cc-rebase
  4. Follow the workload cluster upgrade steps to update the TKR version and trigger a rolling redeployment of all nodes in the upgrading cluster.
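
If present, the annotation appears under the cluster's metadata similar to the following sketch (the annotation value shown is illustrative and may differ by environment):

    metadata:
      annotations:
        kubernetes.vmware.com/skip-auto-cc-rebase: "true"   # illustrative value; remove this annotation, then save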

 

Workaround

There is no effective solution without upgrading the TKG Service version as noted above.

 

For multiple volume mounts:

Actions must be taken to minimize the impact caused by the incorrect mapping/disk swap.

One recommendation is to set all volume mounts to use the same storage size so that, if a disk swap occurs, each mount still receives the intended capacity.
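
A minimal sketch of such a nodepool volume configuration (the names, paths, and 100Gi size are illustrative):

    volumes:
    - name: containerd
      mountPath: /var/lib/containerd
      capacity:
        storage: 100Gi    # same size for every volume mount in the nodepool
    - name: kubelet
      mountPath: /var/lib/kubelet
      capacity:
        storage: 100Gi    # matching size, so a swap does not change capacity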

 

For single volume mount:

Allow the system to recreate the node stuck in the Provisioned state.

The system will recreate stuck Provisioned nodes once the node reaches an age of 120 minutes.
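
Recreation can be monitored from the Supervisor cluster context, for example:

    kubectl get machine -n <workload cluster namespace> -w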

Additional Information

Windows Server 2022 is not affected by this issue.