Support for Node Volume Mounts for TKG Clusters in Supervisor

Article ID: 319405


Products

VMware vSphere ESXi
VMware vSphere with Tanzu

Issue/Introduction

Requirements to use Node Volume Mounts

-Requires a TKR version of 1.17 or later.
-Requires a minimum of vCenter 7.0U2 and Supervisor Cluster version 1.19.1.
-This feature can only be used on TKG Clusters in Supervisor.

Limitations

-Node volume mounts on Control Plane Nodes cannot be changed after the cluster has been deployed. If you need additional space on the control plane nodes, you MUST redeploy the TKC with the desired node volume mount (see the spec sketch after this list).
-Not all node volume mount locations have been tested; some do not work at all or have unintended consequences, as outlined below.
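
Because control plane node volume mounts must be in place at creation time, they are declared in the TanzuKubernetesCluster spec used to deploy the cluster. The example below is a minimal sketch assuming the v1alpha3 TanzuKubernetesCluster API; the cluster name, namespace, VM class, storage policy, TKR name, and volume size are placeholders that must be replaced with values valid in your environment.

apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: tkc-example                          # placeholder cluster name
  namespace: example-namespace               # placeholder vSphere Namespace
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: best-effort-medium            # placeholder VM class
      storageClass: example-storage-policy   # placeholder storage policy
      tkr:
        reference:
          name: v1.23.8---vmware.3-tkg.1     # placeholder TKR name
      volumes:
      - name: containerd                     # control plane node volume mount, set at deploy time
        mountPath: /var/lib/containerd
        capacity:
          storage: 32Gi                      # placeholder size
    nodePools:
    - name: workers
      replicas: 3
      vmClass: best-effort-medium            # placeholder VM class
      storageClass: example-storage-policy   # placeholder storage policy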

Fully Supported Mount Locations

-These node volume mounts are fully supported by VMware. 

/var/lib/containerd  -  Increases the space available for cached container images. For example, if a deployment uses very large images, deployments can take a long time because the node constantly has to evict and re-pull images due to limited disk space on the node.
/var/lib/kubelet  -  Increases the space available for ephemeral container storage. For example, if containers require a large amount of ephemeral storage, pods can report errors about limited disk space.

NOTE: Both of the above node volume mounts CAN be added to control plane nodes. However, since control plane nodes are not designed to handle large workloads, we recommend reducing the footprint on the control plane nodes rather than increasing the storage size to accommodate larger images or pods. If this is unavoidable, nothing prevents these node volume mounts from being added to control plane nodes.
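
As a sketch of how these mounts might be declared on worker nodes, using the same assumed v1alpha3 spec layout and placeholder values as the example above, the nodePools portion of spec.topology could look like this:

    nodePools:
    - name: workers
      replicas: 3
      vmClass: best-effort-medium            # placeholder VM class
      storageClass: example-storage-policy   # placeholder storage policy
      volumes:
      - name: containerd
        mountPath: /var/lib/containerd       # additional space for cached container images
        capacity:
          storage: 64Gi                      # placeholder size
      - name: kubelet
        mountPath: /var/lib/kubelet          # additional space for ephemeral container storage
        capacity:
          storage: 64Gi                      # placeholder size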
 

Not Recommended Mount Locations

VMware does not recommend placing node volume mounts in these locations. The reason each location is not recommended is noted below.

/var/lib/etcd  -  This is often assumed to increase the total space available to etcd, but that is not the case. It creates a node volume mount for the etcd directory, but it does not tell etcd to use additional space during initialization. This means that regardless of the size of the node volume mount, once etcd hits its 2GB maximum it will start failing writes with "out of disk space" errors. On the bright side, the Kubernetes etcd database is designed to be small, and even very large Kubernetes deployments should not reach this limit. When etcd does run out of space, the cause is typically a runaway process creating and deleting thousands of CRD objects every minute. The other reason we do not recommend an etcd node volume mount is that it will prevent cluster creation if the PVC backing the node volume mount fails to create, or takes too long to create due to slow storage.

NOTE: There is a known issue with node volume mounts on etcd for single control plane node VMs that CAN cause cluster data loss. If any existing single control plane clusters currently have an etcd node volume mount, they should be scaled out to 3 control plane nodes to avoid data loss.
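
A sketch of the relevant change when scaling out to three control plane nodes, assuming the same v1alpha3 spec layout as the earlier example; edit the cluster and update the replica count:

spec:
  topology:
    controlPlane:
      replicas: 3   # scale out from 1 to 3 control plane nodes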

Unsupported Mount Locations

-During the node volume mount creation process, everything is moved from the existing directory into a temporary folder, the additional storage is then configured and mounted to the node, and finally everything is moved from the temporary folder into the new storage. Any running service that is actively using files in the node volume mount location will not function during this process, so any directory used by core system processes is not supported. This includes, but is not limited to, the following directories. Please use your best judgement when deciding where to use node volume mounts.

/ (root)
/var
/var/lib
/etc 


Environment

VMware vSphere 8.0 with Tanzu
VMware vSphere 7.0 with Tanzu

Resolution

Known Issues with Node Volume Mounts

TKGS Volume Mounts are not mounted after reboot on vSphere 7.0U2 or earlier (323441)
-Issue is fixed in vSphere 7.0U3 and all versions of 8.0

Pods stuck in ContainerCreating state on TKGS Guest Clusters after a vSphere HA failover event (319371)
-Issue is fixed in TKR v1.23.8---vmware.3-tkg.1

TKGS Cluster stuck in upgrading state with error "spec.kubeadmConfigSpec.mounts: Forbidden: cannot be modified"
-Please open a support case with Broadcom for assistance, referencing this issue