This KB article documents steps to take prior to workload cluster upgrades meeting the following criteria:
- vSphere Supervisor originally deployed on vSphere 7 or 8.0 GA, then later upgraded directly to vSphere 8.0u2 (or higher), skipping vSphere 8.0u1
- Workload cluster upgrade from KR 1.32.x to KR 1.33.x
- Workload cluster upgrade including the clusterClass rebase to clusterClass v3.3.0 or higher
If these steps are not taken, the workload cluster upgrade can become stuck while creating the control plane nodes on the desired KR 1.33.x version. A quick way to check a cluster against these criteria is shown below.
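One way to perform this check, assuming the cluster uses a topology (clusterClass) based definition as VKS clusters do, is to read the clusterClass and current Kubernetes version from the Cluster resource on the Supervisor (the namespace and cluster names below are placeholders):
# Show the clusterClass and current Kubernetes version of a workload cluster (run against the Supervisor)
kubectl get cluster <cluster-name> -n <namespace> -o jsonpath='{.spec.topology.class}{"  "}{.spec.topology.version}{"\n"}'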
Previous versions of vSphere Supervisor (formerly known as vSphere with Tanzu / Tanzu Kubernetes Grid Service) managed KubeadmControlPlane resources directly, identifying themselves only as the field manager "manager".
From vSphere 8.0 U1 onward, KubeadmControlPlane resources are managed by Kubernetes Cluster API. Kubernetes tracks which components "own" which fields on resources through Server-Side Apply. This ownership tracking acts as a ledger that records which component controls each piece of data, preventing different components from accidentally overwriting each other's changes. When multiple components set the same field to the same value, they share "co-ownership" of that field. In such cases, it is not sufficient for just one component to stop managing the field; the other owner(s) will continue to maintain their ownership and the value.
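For reference, this ownership information is stored under metadata.managedFields on each resource and is hidden from kubectl output by default. It can be displayed on the Supervisor with a command along the following lines (the resource and namespace names are placeholders):
# Display managedFields (hidden by default) for a KubeadmControlPlane resource
kubectl get kubeadmcontrolplane <kcp-name> -n <namespace> --show-managed-fields -o yaml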
When components disagree on a field's value, the Kubernetes API server rejects the conflicting request to prevent components from overwriting each other; alternatively, a component can use "force-ownership" to take control of the field. However, force-ownership is typically used sparingly, as it can disrupt automated management by other components.
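For illustration only (this is not a step in this KB), force-ownership roughly corresponds to the --force-conflicts flag of a Server-Side Apply request, for example:
# Generic Server-Side Apply example; without --force-conflicts the API server rejects conflicting field updates,
# with it the applier forcibly takes ownership of the conflicting fields
kubectl apply --server-side --field-manager=<component-name> --force-conflicts -f <manifest>.yaml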
Earlier versions of Kubernetes Cluster API included cleanup logic to remove old field ownership records during the transition to Server-Side Apply, and this cleanup was included in vSphere 8.0 U1. However, this transition path was later removed from Cluster API. As a result, if vSphere 8.0 U1 is not included in the upgrade path, legacy field ownership entries (with API version v1beta1 and manager name "manager") remain on KubeadmControlPlane resources. These legacy entries create co-ownership situations that prevent the Cluster API topology reconciler from updating certain fields.
Important: The topology reconciler does not use force-ownership to resolve these conflicts; it is designed to respect field ownership boundaries and will not forcibly take control, so that it does not disrupt other components' management. As a result, when legacy ownership entries block the reconciler from updating fields needed for an upgrade, those fields are left unchanged and the upgrade can stall.
Note: This issue occurs specifically on vSphere 8.0u2 or higher environments where the upgrade to vSphere 8.0u1 was skipped.
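One way to check whether a KubeadmControlPlane carries such legacy entries is to filter its managedFields for the legacy field manager; entries reported with apiVersion controlplane.cluster.x-k8s.io/v1beta1 and no subresource are the problematic ones (names below are placeholders):
# List managedFields entries owned by the legacy "manager" field manager (apiVersion, operation, subresource)
kubectl get kubeadmcontrolplane <kcp-name> -n <namespace> --show-managed-fields \
  -o jsonpath='{range .metadata.managedFields[?(@.manager=="manager")]}{.apiVersion}{" "}{.operation}{" "}{.subresource}{"\n"}{end}'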
Resolution
This issue is fixed in vSphere Kubernetes Service 3.5.0, and the workaround does not need to be applied if environments are upgraded to VKS 3.5.0 prior to cluster upgrades.
See Additional Information for steps to take before upgrading workload clusters meeting the criteria in the Issue/Introduction.
Workaround
If this issue has been encountered and the workload cluster upgrade is stuck, see the following KB article which matches the symptoms you are encountering:
error applying: error applying task <service-name>.mount: unit with key <mount-name>.mount could not be enabled
Unit /run/systemd/generator/<mount-name>.mount is transient or generated.
Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
See KB: Workload Cluster Upgrade Stuck to builtin-generic-v3.3.0 clusterclass due to Volume Mount Conflicts
crictl ps -a --name kube-apiserver
CONTAINER IMAGE CREATED STATE NAME
<api-server-container-id> <IMAGE ID> <creation time> Exited kube-apiserver
crictl logs <api-server-container-id>
Error: unknown flag: --cloud-provider
To avoid this issue for future workload cluster upgrades in vSphere Supervisor environments that skipped vSphere 8.0u1, the attached script can be run on any Supervisor control plane VM before initiating a workload cluster upgrade. This script will clean up the legacy managedFields entries described in the Cause section of this KB.
chmod +x remove-legacy-managed-fields.py
# Produce a report of affected resources (dry-run mode)
./remove-legacy-managed-fields.py -A --dry-run
# Fix a specific cluster (safest option)
./remove-legacy-managed-fields.py --namespace <namespace> --cluster <cluster-name>
# Fix all clusters in a specific namespace
./remove-legacy-managed-fields.py --namespace <namespace>
# Fix all clusters across all namespaces (requires confirmation)
./remove-legacy-managed-fields.py -A
# Generate a JSON report
./remove-legacy-managed-fields.py -A --dry-run --report report.json
# Skip confirmation prompts (WARNING: Use with caution!)
./remove-legacy-managed-fields.py --namespace <namespace> --yes
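Once the script has been run, the dry-run report can be repeated to confirm that no affected resources remain, for example:
# Re-run the report; it should no longer list any affected resources
./remove-legacy-managed-fields.py -A --dry-run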
Notes:
What the script removes:
The script only removes problematic managedFields entries that prevent upgrades.
Specifically, it removes entries where ALL of these conditions are met:
- The apiVersion is the legacy v1beta1 version (controlplane.cluster.x-k8s.io/v1beta1)
- The manager name is "manager"
- The entry applies to the main resource rather than a subresource such as status
What the script does NOT remove:
ManagedFields entries that do not meet all of the above conditions are normal and expected, and will NOT be removed by the script. For example, this type of entry (a status subresource update by the same manager) is normal and will be preserved:
- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
  manager: manager
  operation: Update
  subresource: status
The script is designed to be surgical: it only removes the specific v1beta1 main resource entries that cause upgrade conflicts.
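As a final sanity check after running the script, the managedFields can be inspected again; for a cleaned-up KubeadmControlPlane, the only remaining entries for the field manager "manager" should be subresource entries like the status example above (names below are placeholders):
# After cleanup, this should print only subresource values such as "status" for the "manager" field manager
kubectl get kubeadmcontrolplane <kcp-name> -n <namespace> --show-managed-fields \
  -o jsonpath='{.metadata.managedFields[?(@.manager=="manager")].subresource}{"\n"}'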