In a vSphere Supervisor environment, a workload cluster that uses one or more volume mounts for persistent storage can have its volume mounts incorrectly mapped, indicating a disk swap.
This occurs when a new node is created in a node pool that uses one or more volume mounts, and it can also affect control plane nodes if a volume mount is specified in the cluster YAML.
It can result in a machine stuck in Provisioned state and volume mount sizing errors for the new node or machine.
The symptoms differ depending on whether there is a single volume mount or multiple volume mounts.
For multiple volume mounts, while connected to the Supervisor cluster context, the following symptoms are observed:
kubectl describe cluster <affected cluster name> -n <affected cluster namespace>
volumes:
- name: containerd
mountPath: /var/lib/containerd
capacity:
storage: 150Gi
- name: kubelet
mountPath: /var/lib/kubelet
capacity:
storage: 50Gi
kubectl describe vm <node stuck provisioned> -n <namespace>
Warning CreateOrUpdateFailure ##m (x# over ##m)
vmware-system-vmop/vmware-system-vmop-controller-manager-<id>/virtualmachine-controller persistent volume:<node stuck provisioned>-<volume mount name> is not attached to the VM.
kubectl logs -n vmware-system-vmop <vmware-system-vmop-controller-manager pod>
"Failed to reconcile VirtualMachine" err="persistent volume: <node stuck provisioned>-<volume mount name> not attached to VM" logger="VirtualMachine"
"Reconciler error" err="persistent volume: <node stuck provisioned>-<volume mount name> not attached to VM" controller="virtualmachine"
While connected to the Workload cluster's context, the following symptoms may be observed:
While SSHed into one of the worker nodes in the node pool with multiple volume mounts, the following symptoms are observed:
df -h | grep -Ei 'sdc|sdb'
Filesystem Size Used Avail Use% Mounted On
/dev/sdb1 ###G ###k 50G #% /var/lib/containerd
/dev/sdc1 ###G ###k 150G #% /var/lib/kubelet
Note that the sizes are reversed relative to the cluster specification above: the containerd mount shows roughly 50G instead of 150Gi, and the kubelet mount shows roughly 150G instead of 50Gi, confirming that the disks were swapped.
For a single volume mount, while connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get vm,machine -o wide -n <workload cluster namespace>
kubectl describe cluster <affected cluster name> -n <affected cluster namespace>
volumes:
- name: containerd
mountPath: <volume mount path>
capacity:
storage: <volume mount size>
kubectl describe vm <node stuck provisioned> -n <namespace>
powerState: PoweredOn
uniqueID: <vm id>
volumes:
- attached: true
diskUUID: <UUID>
name: <vm name>-<volume mount name>
Events:
Warning CreateOrUpdateFailure ##m (x# over ##m)
vmware-system-vmop/vmware-system-vmop-controller-manager-<id>/virtualmachine-controller persistent volume:<node stuck provisioned>-<volume mount name> is not attached to the VM.
While connected to the Workload cluster's context, the following symptoms are observed:
kubectl get nodes
While SSHed into the node stuck in Provisioned state, the following symptoms are observed:
df -h
Filesystem Size Used Avail Use% Mounted On
/dev/sdb1 ###G ###k ##G #% <different mount>
cat /var/log/cloud-init-output.log
[YYYY-MM-DD HH:MM:SS] + mount -t ext4 /dev/sdb1 <volume mount name>
[YYYY-MM-DD HH:MM:SS] mount: <volume mount name>: /dev/sdb1 already mounted on <different mount>
systemctl status kubelet
systemctl status containerd
Both services are inactive because cloud-init failed before reaching the step that starts them.
For details on volume mount support, please see the following KB: Support for Node Volume Mounts in Workload clusters
vSphere 7.0 with Tanzu
vSphere 8.0 with Tanzu
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
Prior to TKG Service 3.3.0, vSphere with Tanzu relies on the Linux kernel to recognize disks.
However, the Linux kernel assigns drive letters asynchronously, so the /dev/sd* names are not guaranteed to follow the order in which the disks are defined. Adding more CPUs to a node can make this race condition more likely.
This can lead to volumes in clusters with node pools becoming incorrectly mapped, effectively swapping the disk order when /dev/sd* devices are assigned by name.
In extreme cases, the root disk may be enumerated as /dev/sdb, where /dev/sdb1 is then incorrectly formatted as an extra volume.
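To see how the kernel actually mapped the devices on an affected node, standard Linux tooling can help; these commands are illustrative and not part of the original procedure:
lsblk -o NAME,SIZE,SERIAL,MOUNTPOINT
Comparing the reported sizes and serial numbers against the capacities declared in the cluster specification shows which /dev/sd* name each volume actually received.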
This issue is not limited to scenarios where there are multiple volume mounts in the same node pool.
Incorrect mapping can also occur when there is only a single volume mount for the node pool.
In this scenario, the device intended for the single volume mount has already been mounted at another mount point.
This causes cloud-init to fail at the mount step, which runs before cloud-init starts kubelet and containerd on the node.
kubelet and containerd are responsible for pod and container management on the node, so the node never joins the cluster and remains in Provisioned state.
Windows Server 2022 is not affected by this issue.
Upgrade to TKG Service 3.3.0 and use the builtin-generic-3.3 ClusterClass to provision node pools with multiple volume mounts.
In TKG Service 3.3.0, the builtin-generic-3.3 ClusterClass uses PCI and SCSI device mappings to assign and format disks consistently.
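As an illustration of such a mapping (paths vary by hardware; this listing is a hypothetical example, not output from the affected environment), the /dev/disk/by-path symlinks name each disk by its PCI and SCSI address and resolve to the same physical disk regardless of the order in which /dev/sd* names were assigned:
ls -l /dev/disk/by-path/
pci-0000:03:00.0-scsi-0:0:1:0 -> ../../sdb
pci-0000:03:00.0-scsi-0:0:2:0 -> ../../sdc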
Rebasing to the builtin-generic-3.3 ClusterClass is necessary for a cluster that is currently using the tanzukubernetescluster class or an older ClusterClass version.
The rebase can be performed on an existing workload cluster by updating its class to the latest builtin ClusterClass version through a workload cluster TKR upgrade:
kubectl edit cluster <cluster name> -n <namespace>
If the cluster carries the kubernetes.vmware.com/skip-auto-cc-rebase annotation, the automatic rebase to the newer ClusterClass is skipped; remove the annotation to allow the rebase to proceed.
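As a minimal sketch, assuming the builtin-generic-3.3 ClusterClass name given above and the standard Cluster API topology fields, the relevant parts of the Cluster object would look like this after the rebase:
metadata:
  annotations: {}  # kubernetes.vmware.com/skip-auto-cc-rebase removed so the automatic rebase is allowed
spec:
  topology:
    class: builtin-generic-3.3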
There is no effective workaround without upgrading the TKG Service version as noted above.
For multiple volume mounts:
Actions can be taken to minimize the impact caused by the incorrect mapping/disk swap.
One recommendation is to set all volume mounts to the same storage size, so that a swapped assignment has no effect on sizing; see the sketch below.
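As a minimal sketch of this recommendation, reusing the volumes section shown in the symptoms above (names and sizes are illustrative):
volumes:
- name: containerd
  mountPath: /var/lib/containerd
  capacity:
    storage: 150Gi
- name: kubelet
  mountPath: /var/lib/kubelet
  capacity:
    storage: 150Gi
With identical capacities, a swapped device assignment still produces correctly sized mounts.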
For a single volume mount:
Allow the system to recreate the node stuck in Provisioned state.
The system will recreate stuck Provisioned nodes once the node age exceeds 120 minutes.
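While waiting, the age of the stuck machine can be checked from the Supervisor cluster context with the same listing used in the symptoms above; the AGE column indicates how close the machine is to the 120-minute threshold:
kubectl get machine -o wide -n <workload cluster namespace>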