Steps to Perform Before Upgrading to ClusterClass 3.3.0 or KR 1.32.X to 1.33.X

Article ID: 414483

Products

VMware vSphere Kubernetes Service

Issue/Introduction

This KB article documents steps to take prior to workload cluster upgrades meeting the following criteria:

  • Workload cluster upgrade from KR v1.32.x to v1.33.x
  • Workload cluster upgrade including a ClusterClass rebase to ClusterClass v3.3.0 or higher

If these steps are not taken, the workload cluster upgrade can become stuck while creating control plane nodes on the desired KR v1.33.x version.

Environment

vSphere Supervisor

Originally deployed on vSphere 7 or 8.0 GA, then later upgraded directly to vSphere 8.0u2 (or higher), skipping vSphere 8.0u1

Workload cluster upgrade from KR 1.32.x to KR 1.33.x

Workload cluster upgrade including the ClusterClass rebase to ClusterClass v3.3.0 or higher

Cause

In previous versions of vSphere Supervisor (formerly known as vSphere with Tanzu / Tanzu Kubernetes Grid Service), the platform managed KubeadmControlPlane resources directly, identifying itself only as "manager" in field ownership records.

From vSphere 8.0 U1, KubeadmControlPlane resources are now managed by Kubernetes Cluster API. Kubernetes tracks which components "own" which fields on resources through Server-Side Apply. This ownership tracking acts as a ledger that records which component controls each piece of data, preventing different components from accidentally overwriting each other's changes. When multiple components set the same field to the same value, they can share "co-ownership" of that field. In such cases, it's not sufficient for just one component to stop managing the field; the other owner(s) will continue to maintain their ownership and the value.

When components disagree on a field's value, the Kubernetes API server either rejects the request (to prevent conflicts), or one component can use "force-ownership" to take control. However, force-ownership is typically used sparingly as it can disrupt automated management by other components.
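As a generic illustration of this behavior (the manager names and manifest files below are hypothetical, not specific to this issue), Server-Side Apply rejects an apply that conflicts with another manager's field unless force-ownership is requested:

    # First manager applies a manifest and takes ownership of its fields
    kubectl apply --server-side --field-manager=manager-a -f example.yaml

    # A second manager applying a different value for the same field is rejected:
    kubectl apply --server-side --field-manager=manager-b -f example-changed.yaml
    #   error: Apply failed with 1 conflict: conflict with "manager-a": .data.key

    # Force-ownership transfers the field to manager-b, overriding manager-a
    kubectl apply --server-side --field-manager=manager-b --force-conflicts -f example-changed.yaml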

Former versions of Kubernetes Cluster API included cleanup logic to remove old field ownership records during the transition to Server-Side Apply. This cleanup was included in vSphere 8.0 U1. However, this transition path was later removed from Cluster API. As a result, if vSphere 8.0 U1 is not included in the upgrade path, legacy field ownership entries (with API version v1beta1 and manager name manager) remain on KubeadmControlPlane resources. These legacy entries create co-ownership situations that prevent the Cluster API topology reconciler from updating certain fields.
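To check whether a KubeadmControlPlane resource still carries these legacy entries, the field ownership records can be listed from the Supervisor cluster context (a minimal example; the resource and namespace names are placeholders):

    # Print the manager, apiVersion, operation, and subresource of every field owner
    kubectl get kubeadmcontrolplane <kcp-name> -n <namespace> \
      -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.apiVersion}{"\t"}{.operation}{"\t"}{.subresource}{"\n"}{end}'

A legacy entry appears as manager "manager" with apiVersion controlplane.cluster.x-k8s.io/v1beta1 and no subresource.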

Important: The topology reconciler does not use force-ownership to resolve these conflicts; it is designed to respect field ownership boundaries and will not forcibly take control, to avoid disrupting other components' management. As a result, when legacy ownership entries block fields needed for an upgrade, the topology reconciler leaves those fields unchanged and the upgrade can stall.

Resolution

Note: This issue occurs specifically on vSphere 8.0u2 or higher environments where the upgrade to vSphere 8.0u1 was skipped.

This issue is fixed in vSphere Kubernetes Service (VKS) 3.5.0. The workaround does not need to be applied if the environment is upgraded to VKS 3.5.0 before cluster upgrades.

See Additional Information for steps to take before upgrading workload clusters meeting the criteria in the Issue/Introduction.

 

Workaround

If this issue has been encountered and the workload cluster upgrade is stuck, see the KB article below that matches the symptoms you are encountering:

  • A new control plane node on the desired KR version is stuck in the Provisioning or Provisioned phase.
    The /var/log/cloud-init-output.log file on the new control plane node shows:
    error applying: error applying task <service-name>.mount: unit with key <mount-name>.mount could not be enabled
    Unit /run/systemd/generator/<mount-name>.mount is transient or generated.
    Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)

    See KB: Workload Cluster Upgrade Stuck to builtin-generic-v3.3.0 clusterclass due to Volume Mount Conflicts

  • The new control plane node on the desired KR version is stuck in the Provisioned phase.
    On this new control plane node, the kube-apiserver container is in an Exited state, and its logs show an unknown flag error for --cloud-provider:
    crictl ps -a --name kube-apiserver
    CONTAINER                  IMAGE       CREATED          STATE    NAME
    <api-server-container-id>  <image-id>  <creation-time>  Exited   kube-apiserver

    crictl logs <api-server-container-id>
    Error: unknown flag: --cloud-provider

    See KB: Workload Cluster Upgrade from KR 1.32.x to 1.33.1 Stalls on First Control Plane Node Provisioned
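To determine which of the symptoms above applies, the phase of the new control plane Machine can be checked from the Supervisor cluster context (a minimal sketch; the namespace is a placeholder):

    # The PHASE column shows whether the new Machine is stuck Provisioning or Provisioned
    kubectl get machines -n <namespace>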

Additional Information

Future Workload Cluster Upgrades

To avoid this issue for future workload cluster upgrades in a vSphere Supervisor environment that skipped vSphere 8.0u1, the attached script can be run on any Supervisor control plane VM before initiating a workload cluster upgrade. The script cleans up the legacy field ownership entries described in the Cause section of this KB.

  1. Upload the attached script to a Supervisor control plane VM

  2. SSH to that Supervisor control plane VM
  3. Change permissions on the uploaded script to executable:
    chmod +x remove-legacy-managed-fields.py

     

  4. See below for example commands using the script:
    # Produce a report of affected resources (dry-run mode)
    ./remove-legacy-managed-fields.py -A --dry-run
    
    # Fix a specific cluster (safest option)
    ./remove-legacy-managed-fields.py --namespace <namespace> --cluster <cluster-name>
    
    # Fix all clusters in a specific namespace
    ./remove-legacy-managed-fields.py --namespace <namespace>
    
    # Fix all clusters across all namespaces (requires confirmation)
    ./remove-legacy-managed-fields.py -A
    
    # Generate a JSON report
    ./remove-legacy-managed-fields.py -A --dry-run --report report.json
    
    # Skip confirmation prompts (WARNING: Use with caution!)
    ./remove-legacy-managed-fields.py --namespace <namespace> --yes

 

Notes:

  • The script uses kubectl-style arguments (-A for all namespaces, like kubectl -A)
  • Each resource is marked with an annotation after cleanup to prevent re-processing
  • The script is idempotent, so it is safe to run multiple times
  • Use --dry-run to see what would be changed without making modifications
  • Use --verbose for detailed logging output
  • When processing multiple workload clusters, confirmation prompts can be skipped with the "--yes" flag
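After running the script, the cleanup can be verified by re-running it in dry-run mode, which should report no remaining affected resources:

    # Confirm nothing is left to clean up (dry-run makes no changes)
    ./remove-legacy-managed-fields.py -A --dry-run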

 

What the script removes:

The script only removes problematic managedFields entries that prevent upgrades.

Specifically, it removes entries where ALL of these conditions are met:

  • apiVersion is controlplane.cluster.x-k8s.io/v1beta1
  • manager is manager OR before-first-apply
  • operation is Update
  • subresource is empty (not a subresource update)
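For reference, the same selection criteria can be expressed as a filter over a resource's managedFields. A minimal sketch, assuming jq is available (resource and namespace names are placeholders):

    # List only the legacy entries matching all four conditions above
    kubectl get kubeadmcontrolplane <kcp-name> -n <namespace> -o json \
      | jq '[.metadata.managedFields[]
             | select(.apiVersion == "controlplane.cluster.x-k8s.io/v1beta1"
                      and (.manager == "manager" or .manager == "before-first-apply")
                      and .operation == "Update"
                      and ((.subresource // "") == ""))]'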

 

What the script does NOT remove:

The following types of managedFields entries are normal and expected, and will NOT be removed by the script:

  • Entries with a newer apiVersion (not v1beta1)
  • Entries for the status subresource (subresource: status)
  • Entries that only manage metadata.finalizers or metadata.ownerReferences

For example, this type of entry is normal and will be preserved:

- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
  manager: manager
  operation: Update
  subresource: status
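By contrast, an entry of the following shape (same apiVersion and manager, but with no subresource field) matches all of the removal conditions above and would be cleaned up:

- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
  manager: manager
  operation: Update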

The script is designed to be surgical: it removes only the specific v1beta1 main-resource entries that cause upgrade conflicts.

Attachments

remove-legacy-managed-fields.py