Steps to Perform Before Upgrading to ClusterClass 3.3.0 or KR 1.32.X to 1.33.X
Article ID: 414483

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

This KB article documents steps to take prior to workload cluster upgrades meeting the following criteria:

  • Workload cluster upgrade from KR v1.32.x to v1.33.x
  • Workload cluster upgrade including clusterClass rebase to clusterClass v3.3.0 or higher

If these steps are not taken, the workload cluster upgrade can become stuck while creating control plane nodes on the desired KR 1.33.X version.

Environment

vSphere Supervisor

Originally deployed on vSphere 7 or 8.0 GA, then later upgraded directly to vSphere 8.0u2 (or higher), skipping vSphere 8.0u1

Workload cluster upgrade from KR 1.32.x to KR 1.33.x

Workload cluster upgrade including the clusterClass rebase to clusterClass v3.3.0 or higher

Cause

Previous versions of vSphere Supervisor (formerly known as vSphere with Tanzu / Tanzu Kubernetes Grid Service) would directly manage KubeadmControlPlane resources, identifying itself only as "manager".

Starting with vSphere 8.0 U1, KubeadmControlPlane resources are managed by the Kubernetes Cluster API. Kubernetes tracks which components "own" which fields on a resource through Server-Side Apply. This ownership tracking acts as a ledger that records which component controls each piece of data, preventing components from accidentally overwriting each other's changes. When multiple components set the same field to the same value, they share "co-ownership" of that field. In that case, it is not sufficient for just one component to stop managing the field; the remaining owner(s) continue to maintain their ownership and the value.

When components disagree on a field's value, the Kubernetes API server either rejects the request (to prevent conflicts), or one component can use "force-ownership" to take control. However, force-ownership is typically used sparingly as it can disrupt automated management by other components.
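This reject-or-force behavior can be sketched with a toy model (illustrative only; the hypothetical function below is not a real API, and the actual API server's Server-Side Apply tracks ownership per field path and supports true shared ownership):

```python
# Toy model of Server-Side Apply conflict handling (illustrative only; the
# real API server tracks ownership per field path, not per whole field name,
# and records multiple co-owners rather than a single owner).

class Conflict(Exception):
    pass

def apply_field(owners, values, field, manager, value, force=False):
    """Apply `value` to `field` on behalf of `manager`."""
    current = owners.get(field)
    if current is not None and current != manager and values.get(field) != value:
        if not force:
            # Mirrors the API server rejecting a conflicting apply request.
            raise Conflict(f"{field} is owned by {current}")
        # Force-ownership: take the field over from `current`.
    owners[field] = manager
    values[field] = value

owners, values = {}, {}
apply_field(owners, values, "replicas", "capi-topology", 3)
apply_field(owners, values, "replicas", "manager", 3)   # same value: no conflict
try:
    apply_field(owners, values, "replicas", "capi-topology", 5)  # disagreement
except Conflict as e:
    print("rejected:", e)
apply_field(owners, values, "replicas", "capi-topology", 5, force=True)
print(values["replicas"])  # -> 5
```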

Earlier versions of Kubernetes Cluster API included cleanup logic that removed old field ownership records during the transition to Server-Side Apply. This cleanup shipped in vSphere 8.0 U1 but was later removed from Cluster API. As a result, if vSphere 8.0 U1 is skipped in the upgrade path, legacy field ownership entries (with API version v1beta1 and manager name "manager") remain on KubeadmControlPlane resources. These legacy entries create co-ownership situations that prevent the Cluster API topology reconciler from updating certain fields.

Important: The topology reconciler does not use force-ownership to resolve these conflicts; it is designed to respect field ownership boundaries and will not forcibly take control. This is by design, to avoid disrupting other components' management. As a result, when legacy ownership entries block the reconciler from updating the fields needed for an upgrade, the reconciler leaves those fields unchanged and the upgrade stalls.
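A legacy entry can be spotted in a resource's managedFields. The sketch below (a hypothetical helper, not part of the attached script) models the detection criteria described above against entry shapes like those returned by kubectl with --show-managed-fields:

```python
# Hypothetical sketch: identify legacy Server-Side Apply ownership entries on a
# KubeadmControlPlane, per the criteria in the Cause section. Entry shapes
# mirror the Kubernetes managedFields schema.

def is_legacy_entry(entry: dict) -> bool:
    """True if this managedFields entry is a leftover from the pre-SSA manager."""
    return (
        entry.get("manager") in ("manager", "before-first-apply")
        and entry.get("apiVersion", "").endswith("v1beta1")
        and entry.get("subresource") is None  # status entries are normal
    )

managed_fields = [
    # Legacy entry left behind when vSphere 8.0u1 (and its cleanup) was skipped
    {"manager": "before-first-apply", "operation": "Update",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1"},
    # Normal status entry that should be left alone
    {"manager": "manager", "operation": "Update",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
     "subresource": "status"},
    # Healthy topology-reconciler ownership
    {"manager": "capi-topology", "operation": "Apply",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1"},
]

legacy = [e for e in managed_fields if is_legacy_entry(e)]
print([e["manager"] for e in legacy])  # -> ['before-first-apply']
```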

Resolution

Note: This issue occurs specifically on vSphere 8.0u2 or higher environments where the upgrade to vSphere 8.0u1 was skipped.

 


This issue is fixed in vSphere Kubernetes Service (VKS) 3.5.0; however, it may still appear in some edge cases. Fixes for those edge cases will ship in a future VKS release.

The workaround does not need to be applied if environments are upgraded to VKS 3.5.0 prior to cluster upgrades.

See Additional Information for steps to take before upgrading workload clusters meeting the criteria in the Issue/Introduction.

 

Workaround

If this issue has been encountered and the workload cluster upgrade is stuck, see the KB article below that matches the symptoms you are encountering:

  • A new control plane node on the desired KR version is stuck Provisioning or Provisioned.
    The /var/log/cloud-init-output.log on the new control plane node shows:
    error applying: error applying task <service-name>.mount: unit with key <mount-name>.mount could not be enabled
    Unit /run/systemd/generator/<mount-name>.mount is transient or generated.
    Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)

    See KB: Workload Cluster Upgrade Stuck to builtin-generic-v3.3.0 clusterclass due to Volume Mount Conflicts

  • The new control plane node on the desired KR version is stuck Provisioned.
    On this new control plane node, the kube-apiserver container has failed and is in the Exited state, and its logs show an unknown flag error for cloud-provider:
    crictl ps -a --name kube-apiserver
    CONTAINER                 IMAGE        CREATED       STATE       NAME
    <api-server-container-id> <IMAGE ID>  <creation time>  Exited     kube-apiserver
    
    crictl logs <api-server-container-id>
    Error: unknown flag: --cloud-provider

    See KB: Workload Cluster Upgrade from KR 1.32.x to 1.33.1 Stalls on First Control Plane Node Provisioned

Additional Information

Future Workload Cluster Upgrades

To avoid this issue for future workload cluster upgrades in vSphere Supervisor environments that skipped vSphere 8.0u1, run the attached script on any Supervisor control plane VM before initiating a workload cluster upgrade. The script mitigates the problematic fields described in the Cause section of this KB.

  1. Upload the attached script to a Supervisor control plane VM

  2. SSH to that Supervisor control plane VM
  3. See below for example commands for using the script:
    # Produce a report of affected resources (dry-run mode)
    python mitigate-managed-fields.py --namespace <affected cluster namespace> --cluster <affected cluster name> --dry-run
    
    # Produce a report of affected resources (dry-run mode) with verbosity
    python mitigate-managed-fields.py --namespace <affected cluster namespace> --cluster <affected cluster name> --dry-run --verbose
    
    # Fix a specific cluster (safest option)
    python mitigate-managed-fields.py --namespace <affected cluster namespace> --cluster <affected cluster name>
    
    # Generate a JSON report
    python mitigate-managed-fields.py --namespace <affected cluster namespace> --cluster <affected cluster name> --dry-run --report report.json

 

What the script removes:

The script only mitigates problematic managedFields entries that prevent upgrades. Specifically, it:

  • Removes any entry with manager: before-first-apply, if present
  • Adds an entry with manager: capi-topology, if managedFields is empty

 

What the script does NOT remove:

The following types of managedFields entries are normal and expected, and will NOT be removed by the script:

  • Entries with a newer apiVersion (not v1beta1)
  • Entries for the status subresource (subresource: status)
  • Entries that only manage metadata.finalizers or metadata.ownerReferences

For example, this type of entry is normal and will be preserved:

- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
  manager: manager
  operation: Update
  subresource: status

The script is designed to be surgical: it only removes the specific entries that cause upgrade conflicts.
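The remove/add rules above can be sketched as follows (a hypothetical simplification, not the attached script's actual implementation; the real script also inspects apiVersion, subresource, and the managed paths before deciding):

```python
# Hypothetical sketch of the mitigation rules: drop `before-first-apply`
# entries, keep everything else (status subresource entries, newer-apiVersion
# entries, finalizers/ownerReferences-only entries), and seed a
# `capi-topology` entry when nothing remains.

def mitigate(managed_fields):
    kept = [e for e in managed_fields if e.get("manager") != "before-first-apply"]
    if not kept:
        # An empty managedFields list is replaced with a capi-topology entry,
        # matching the behavior described above.
        kept.append({"manager": "capi-topology", "operation": "Apply",
                     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1"})
    return kept

fields = [
    {"manager": "before-first-apply", "operation": "Update",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1"},
    {"manager": "manager", "operation": "Update",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
     "subresource": "status"},
]
print([e["manager"] for e in mitigate(fields)])  # -> ['manager']
print([e["manager"] for e in mitigate([{"manager": "before-first-apply"}])])
# -> ['capi-topology']
```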

Attachments

mitigate-managed-fields.py