A workload cluster upgrade is stuck: the new control plane machine on the desired KR version remains in the Provisioning or Provisioned state.
Upgrades begin by upgrading all control plane machines to the desired KR version.
Note: Upgrades do not progress to upgrading the worker machines until all control plane machines are healthy on the desired KR version.
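One way to see where the rollout is stalled is to list only the control plane machines and compare their phase and version. This is a general sketch that assumes the standard Cluster API control plane machine label is present on the machines:
kubectl get machines -n <affected workload cluster namespace> -l cluster.x-k8s.io/control-plane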
When connected over SSH directly to the stuck Provisioning/Provisioned control plane machine, error messages similar to the following are present in the cloud-init output log:
cat /var/log/cloud-init-output.log
error applying: error applying task .mount: unit with key .mount could not be enabled
Unit /run/systemd/generator/.mount is transient or generated.
Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
See the following documentation: SSH to TKG Service Cluster Nodes as the System User Using a Password
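While on the node, the conflicting mount unit and the failed cloud-init stage can be inspected further. For example (a general sketch; cloud-final.service is assumed to be the cloud-init stage that runs the user scripts on the node image):
systemctl list-units --type=mount --all
journalctl -u cloud-final.service --no-pager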
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get cluster -n <affected workload cluster namespace>
kubectl get machines -n <affected workload cluster namespace>
kubectl describe machine -n <workload cluster namespace> <stuck Provisioning node name>
* VirtualMachineProvisioned: Provisioning
* NodeHealthy: Waiting for VSphereMachine to report spec.providerID
machinehealthcheck-controller Machine <workload cluster namespace>/<stuck Provisioning node name> has unhealthy Node
kubectl get kcp -n <affected workload cluster namespace>
kubectl describe kcp -n <affected workload cluster namespace> <workload cluster kcp name>
Message: Rolling 3 replicas with outdated spec (1 replicas up to date)
Message: * Machine <stuck Provisioning control plane node>
* EtcdMemberHealthy: Waiting for a Node with spec.providerID vsphere://<provider ID> to exist
* Etcd member <stuck Provisioning control plane node> does not have a corresponding Machine
Message: Scaling down control plane to 3 replicas (actual 4)
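The relevant machine conditions can also be pulled directly, without the full describe output, for example:
kubectl get machine <stuck Provisioning node name> -n <workload cluster namespace> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'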
Environment
vSphere Supervisor
Originally deployed on vSphere 7 or 8.0 GA, then later upgraded directly to vSphere 8.0u2 (or higher), skipping vSphere 8.0u1
Workload Cluster using or upgrading to builtin-generic-v3.3.0 (or higher) clusterClass
Cause
This issue is related to legacy field ownership conflicts from Server-Side Apply. Further details can be found in the KB article Workload Cluster Upgrade Stuck on New Node Stuck Provisioning due to Legacy Fields.
In this scenario, the workload cluster's KubeadmControlPlane resource contains a legacy mount definition under spec.kubeadmConfigSpec.mounts in addition to the equivalent mount configuration delivered by the newer clusterClass.
This dual configuration occurs because Server-Side Apply retains ownership of the legacy field when the Supervisor is upgraded directly from vSphere 7 or 8.0 GA to vSphere 8.0u2 (or higher) without passing through vSphere 8.0u1.
The systemd error occurs because the mount unit is defined in two different ways, which creates a conflict: cloud-init attempts to enable the mount unit while systemd already manages it as a transient or generated unit under /run/systemd/generator/.
Note: This issue occurs specifically on vSphere 8.0u2 or higher environments where the upgrade to vSphere 8.0u1 was skipped.
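To confirm which manager owns the legacy field on an affected cluster, the managed fields on the KubeadmControlPlane object can be reviewed and checked for an f:mounts entry under f:kubeadmConfigSpec (a general sketch; field manager names vary by environment):
kubectl get kcp <workload cluster kcp name> -n <workload cluster namespace> -o yaml --show-managed-fields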
Resolution
This issue is fixed in vSphere Kubernetes Service v3.5.0.
Once the environment is on VKS v3.5.0, the workaround below does not need to be applied to workload clusters that have not yet been upgraded.
For Future Workload Cluster Upgrades
See the Additional Information section of the following KB article:
Workload Cluster Upgrade Stuck on New Node Stuck Provisioning due to Legacy Fields
For Workload Clusters Stuck Upgrading Affected By This Issue - Workaround
Temporarily disable the control plane MachineHealthCheck on the workload cluster:
kubectl patch cluster.v1beta1.cluster.x-k8s.io <workload cluster name> -n <workload cluster namespace> --type=merge -p='{"spec":{"topology":{"controlPlane":{"machineHealthCheck":{"enable":false}}}}}'
Verify that the MachineHealthCheck is now disabled in the cluster topology:
kubectl get cluster.v1beta1.cluster.x-k8s.io <workload cluster name> -n <workload cluster namespace> -o yaml | grep -A2 machineHealthCheck
List the KubeadmControlPlane (KCP) objects in the namespace to identify the affected cluster's KCP:
kubectl get kcp -n <workload cluster namespace>
Confirm that the legacy mounts field is present on the KCP:
kubectl get kcp <kcp name> -n <workload cluster namespace> -o jsonpath='{.spec.kubeadmConfigSpec.mounts}'
Take a backup of the KCP object before modifying it:
kubectl get kcp <kcp name> -n <workload cluster namespace> -o yaml > kcp_backup.yaml
Remove the legacy mounts field from the KCP:
kubectl patch kcp <kcp name> -n <workload cluster namespace> --type=json -p='[{"op": "remove", "path": "/spec/kubeadmConfigSpec/mounts"}]'
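The removal can be verified by re-running the same jsonpath query; depending on the kubectl version, an empty result or a jsonpath "is not found" error indicates the field has been removed:
kubectl get kcp <kcp name> -n <workload cluster namespace> -o jsonpath='{.spec.kubeadmConfigSpec.mounts}'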
Re-enable the control plane MachineHealthCheck on the workload cluster:
kubectl patch cluster <workload cluster name> -n <workload cluster namespace> --type=merge -p='{"spec":{"topology":{"controlPlane":{"machineHealthCheck":{"enable":true}}}}}'
Annotate the stuck machine to mark it for remediation:
kubectl annotate machine -n <workload cluster namespace> <stuck provisioned machine name> 'cluster.x-k8s.io/remediate-machine=""'
Note: The machine may be updated automatically before the annotation needs to be applied.
Monitor the machines and virtual machines until the upgrade progresses:
watch kubectl get machine,vm -o wide -n <workload cluster namespace>
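After the stuck machine is replaced, the same commands used above can confirm that all control plane machines report the desired KR version and a Running phase, at which point the worker machines begin upgrading:
kubectl get machines -n <workload cluster namespace>
kubectl get kcp <kcp name> -n <workload cluster namespace>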