In a vSphere Supervisor environment, an upgrade of an existing workload cluster from KR 1.32.x to 1.33.x may stall because the first new control plane VM fails to initialize. In this situation, the commands below show one control plane node stuck in the Provisioned state at the target version (in the examples below, v1.33.1+vmware.1-fips).
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get machine -n <affected cluster namespace> | grep <affected workload cluster-name>
cluster-namespace <control plane node 1> affected workload cluster-name Running #d#h v1.32.3+vmware.1-fips
cluster-namespace <control plane node 2> affected workload cluster-name Provisioned #m v1.33.1+vmware.1-fips
kubectl get machine -n <affected cluster namespace> | grep <affected workload cluster-name>
cluster-namespace <control plane node A> affected workload cluster-name Running #d#h v1.32.3+vmware.1-fips
cluster-namespace <control plane node B> affected workload cluster-name Running #d#h v1.32.3+vmware.1-fips
cluster-namespace <control plane node C> affected workload cluster-name Running #d#h v1.32.3+vmware.1-fips
cluster-namespace <control plane node D> affected workload cluster-name Provisioned #m v1.33.1+vmware.1-fips
Note: Workload cluster upgrades always begin with the control plane nodes and will not proceed to the worker nodes until all control plane nodes have successfully upgraded.
kubectl get kcp -n <affected cluster namespace>
NAME CLUSTER INITIALIZED API SERVER AVAILABLE REPLICAS READY UPDATED UNAVAILABLE AGE VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/<kcp name> affected workload cluster-name true true 4 3 1 1 2y357d v1.33.1+vmware.1-fips
On the stuck control plane node, kubeadm messages similar to the following are observed:
[2026-03-23 13:54:23] I0323 13:54:23.285970 4155 kubelet.go:74] attempting to download the KubeletConfiguration from ConfigMap "kubelet-config"
[2026-03-23 13:54:23] W0323 13:54:23.291008 4155 reset.go:141] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get node registration: failed to get corresponding node: nodes "<control plane node D>" not found.
While SSH'd into the stuck new control plane node on the desired KR 1.33 version:
The kube-apiserver container is in an Exited state, and the following error is observed in its logs:
crictl ps -a --name kube-apiserver
CONTAINER IMAGE CREATED STATE NAME
<api-server-container-id> <IMAGE ID> <creation time> Exited kube-apiserver
crictl logs <api-server-container-id>
Error: unknown flag: --cloud-provider
In this situation, crictl ps does not show any running containers, and the kubelet service is failing to start:
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (exited) since DAY YYYY-MM-DD HH:MM:SS UTC; s/min ago
vSphere Supervisor
VKS 3.4 or above
VKS 3.4 or above
The --cloud-provider API server flag was removed starting with Kubernetes v1.33.0.
If this flag is still present in a cluster’s configuration, the API server will fail to initialize due to clashing fields in the KubeadmControlPlane resource. A detailed explanation is given in KB: Steps to Perform Before Upgrading to ClusterClass 3.3.0 or KR 1.32.X to 1.33.X
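To illustrate the failure mode: kubeadm renders each apiServer extraArgs entry as a --key=value command-line flag, and a v1.33 kube-apiserver rejects any flag that has been removed, exiting at startup. The following Python sketch is illustrative only (the function names and the removed-flag set are assumptions for this example, not kubeadm's actual code):

```python
# Illustrative sketch -- not kubeadm's actual code. It shows how an extraArgs
# mapping from a KubeadmControlPlane spec becomes kube-apiserver CLI flags,
# and why an entry removed in v1.33 causes the container to exit at startup.

# Assumed set of flags no longer accepted by kube-apiserver in v1.33.
REMOVED_IN_1_33 = {"cloud-provider"}

def render_flags(extra_args: dict) -> list:
    """Render an extraArgs mapping into --key=value CLI flags."""
    return [f"--{key}={value}" for key, value in sorted(extra_args.items())]

def rejected_flags(extra_args: dict) -> list:
    """Return the flags a v1.33 kube-apiserver would reject as unknown."""
    return [f"--{key}" for key in sorted(extra_args) if key in REMOVED_IN_1_33]

extra_args = {
    "allow-privileged": "true",
    "cloud-provider": "external",  # the deprecated entry causing the failure
}

print(render_flags(extra_args))
print(rejected_flags(extra_args))
```

Any non-empty result from the rejection check corresponds to the "Error: unknown flag: --cloud-provider" message seen in the container logs above.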
This issue is fixed in vSphere Kubernetes Service 3.5.0, and the procedure below does not need to be carried out if the service is upgraded to v3.5.0 before the cluster is upgraded.
To resolve this issue, remove the deprecated --cloud-provider=external flag from the cluster configuration after the upgrade has stalled.
All commands below are executed from the supervisor cluster context.
Procedure
Connect to the Supervisor cluster context.
Identify the cluster that is stuck and note its name:
kubectl get cluster <affected cluster name> -n <affected cluster namespace>
Download the "mitigate-managed-fields" script attached to this KB and move it to one of the Supervisor control plane VMs.
SSH to the Supervisor Control Plane VM that has the script:
See "How to SSH into Supervisor Control Plane VMs" in KB Troubleshooting vSphere Supervisor Control Plane VMs
Run the script in dry-run mode first to preview the changes it will make:
python mitigate-managed-fields.py --namespace <affected cluster namespace> --cluster <affected cluster name> --dry-run -v
Run the script again without the --dry-run flag to apply the changes:
python mitigate-managed-fields.py --namespace <affected cluster namespace> --cluster <affected cluster name> -v
Identify the affected cluster's KubeadmControlPlane (kcp) resource:
kubectl get kcp -n <affected cluster namespace>
kubectl edit kcp -n <affected cluster namespace> <affected cluster name>-<kcp id>
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          admission-control-config-file: /etc/kubernetes/extra-config/admission-control-config.yaml
          allow-privileged: "true"
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          audit-log-maxsize: "100"
          audit-log-path: /var/log/kubernetes/kube-apiserver.log
          audit-policy-file: /etc/kubernetes/extra-config/audit-policy.yaml
          authentication-token-webhook-config-file: /webhook/webhook-config.yaml
          authentication-token-webhook-version: v1
          client-ca-file: /etc/ssl/certs/extensions-tls.crt
          cloud-provider: external
Delete the cloud-provider: external entry from the extraArgs section, then save and close the editor.
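The manual edit amounts to removing one key from a nested mapping. The following Python sketch expresses that same change as a plain data-structure edit, using the field paths shown in the YAML excerpt above; it is a hypothetical illustration, not a substitute for the documented procedure:

```python
# Hypothetical illustration of the manual `kubectl edit kcp` change:
# remove the deprecated cloud-provider entry from the apiServer extraArgs
# of a KubeadmControlPlane object represented as a nested dict.
import copy

def drop_cloud_provider(kcp: dict) -> dict:
    """Return a copy of the KCP object with the apiServer extraArgs
    'cloud-provider' entry removed, if present."""
    patched = copy.deepcopy(kcp)
    extra_args = (
        patched.get("spec", {})
        .get("kubeadmConfigSpec", {})
        .get("clusterConfiguration", {})
        .get("apiServer", {})
        .get("extraArgs", {})
    )
    extra_args.pop("cloud-provider", None)  # no-op if already absent
    return patched

kcp = {
    "spec": {"kubeadmConfigSpec": {"clusterConfiguration": {"apiServer": {
        "extraArgs": {
            "allow-privileged": "true",
            "cloud-provider": "external",
        }}}}}
}

patched = drop_cloud_provider(kcp)
```

All other extraArgs entries are left untouched; only the deprecated key is removed, matching the edit described above.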
First, identify the machine:
kubectl get machine -n <workload cluster namespace> | grep Provisioned
cluster-namespace <stuck provisioned machine> affected workload cluster-name Provisioned #m v1.33.1+vmware.1-fips
Then, annotate the machine to instruct the system to recreate it:
kubectl annotate machine -n <workload cluster namespace> <stuck provisioned machine> 'cluster.x-k8s.io/remediate-machine=""'
Confirm that the control plane node recreates and reaches Running state:
kubectl get machine -n <workload cluster namespace>
(In some instances, the stuck Machine resource is deleted shortly after the kcp is edited, before there is time to annotate the machine for remediation.)
After the above steps are completed, the upgrade progresses as expected.
Preventing the issue on other clusters on vSphere Kubernetes Service v3.4.0 and below
Follow the steps in KB: Steps to Perform Before Upgrading to ClusterClass 3.3.0 or KR 1.32.X to 1.33.X