When upgrading a Tanzu Kubernetes Cluster (TKC) from legacy TKr versions (e.g., 1.26.x, 1.27.x) to TKG 2.0 TKr (e.g., 1.27.11), the cluster may fail to reach a "Ready" state.
Symptoms include:
New nodes are stuck in a "Provisioning" state and do not reach "Running."
The NodeName (status.nodeRef) is not populated on the Machine object.
The Node object is missing the spec.providerID field.
TKC status reports Ready: False and ClusterBootstrapReadyCondition remains in a reconciling state.
The guest-cluster-cloud-provider pod in the vmware-system-cloud-provider namespace logs the following error signature: YYYY-MM-DDT HH:MM:SS Error trying to find VM: Unauthorized
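The symptoms above can be checked from the command line. The following is a minimal sketch, not an official diagnostic: the helper names nodes_missing_provider_id and has_unauthorized_signature are hypothetical, and the kubectl invocations in the comments assume the context is already set to the affected guest cluster.

```shell
# Hypothetical helpers for spotting the symptoms described above.

# Print the names of nodes whose spec.providerID is empty.
# Expects "name<TAB>providerID" lines on stdin.
nodes_missing_provider_id() {
  awk -F'\t' '$1 != "" && $2 == "" { print $1 }'
}

# Exit 0 if the log text on stdin contains the error signature.
has_unauthorized_signature() {
  grep -q 'Error trying to find VM: Unauthorized'
}

# Usage (commented out so the snippet can be sourced without a cluster):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}' \
#   | nodes_missing_provider_id
# kubectl logs -n vmware-system-cloud-provider <pod-name> \
#   | has_unauthorized_signature && echo "Unauthorized signature found"
```

Both helpers only filter text on stdin, so they make no changes to the cluster.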
VMware vSphere Kubernetes Service
VMware Tanzu Kubernetes Grid (TKG)
During the migration from the legacy addonsController to the TKG 2.0 addonsController, the name of the ProviderServiceAccount changes from <cluster-name>-ccm to <vsphere-cluster-name>-ccm.
Because the vSphereCluster object name contains a unique suffix, the new account name differs from the old one. The guest-cluster-cloud-provider pod, however, keeps using the old, cached token from the previous secret, and authentication fails with an "Unauthorized" error before the Cluster API Provider vSphere (CAPV) can reconcile the new account details and update the secret.
A permanent resolution is included in TKr version 1.29 and later. For clusters on version 1.28 or earlier, the following workaround is required:
Restart the Cloud Provider Pod: Delete the existing guest-cluster-cloud-provider pod to force it to pick up the new token.
Switch the context to the affected guest cluster.
Retrieve the pod name: kubectl get pod -n vmware-system-cloud-provider
Delete the pod: kubectl delete pod <pod-name> -n vmware-system-cloud-provider
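The retrieve-and-delete steps above can be combined into one helper. This is a sketch under stated assumptions: the function name restart_cloud_provider_pod is hypothetical, the pod name is assumed to contain the string guest-cluster-cloud-provider, and the pod's controller is expected to recreate it after deletion.

```shell
# Hypothetical helper: delete the guest-cluster-cloud-provider pod(s)
# so that the replacement pod starts with the new token.
# Assumes the kubectl context already points at the affected guest cluster.
restart_cloud_provider_pod() {
  ns="vmware-system-cloud-provider"
  # "-o name" prints entries such as "pod/<pod-name>".
  kubectl get pod -n "$ns" -o name \
    | grep 'guest-cluster-cloud-provider' \
    | while read -r pod; do
        kubectl delete -n "$ns" "$pod" </dev/null
      done
}

# Usage (commented out so the snippet can be sourced without a cluster):
# restart_cloud_provider_pod
# kubectl get pod -n vmware-system-cloud-provider
```

After the new pod is running, its log can be checked again for the "Unauthorized" signature.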
Remove Stale Nodes: If a node is in a NotReady state and does not have a corresponding Virtual Machine in vCenter, manually delete the node object from the cluster: kubectl delete node <node-name>
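To find candidates for this step, the NotReady nodes can be listed first. The sketch below uses a hypothetical helper, not_ready_nodes, that filters the default kubectl get nodes table. It deliberately deletes nothing, because whether a Virtual Machine still exists in vCenter has to be confirmed manually.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# starts with "NotReady" (this also matches "NotReady,SchedulingDisabled").
# Expects the output of "kubectl get nodes --no-headers" on stdin.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage (commented out so the snippet can be sourced without a cluster):
# kubectl get nodes --no-headers | not_ready_nodes
# After confirming in vCenter that a listed node has no Virtual Machine:
# kubectl delete node <node-name>
```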
Cluster health can be further verified by checking for Antrea-ReconcileFailed or Vsphere-Pv-Csi-ReconcileFailed conditions on the TKC object, which often result from the initial authentication failure of the Cloud Provider Interface.
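One way to look for these conditions is to filter the TKC object's YAML. This is a sketch only: the helper name reconcile_failures is hypothetical, and the commented command assumes the context is set to the Supervisor namespace that holds the TKC and that the tkc short name is available there.

```shell
# Hypothetical helper: print any lines that mention the Antrea or
# vSphere CSI reconcile-failure conditions named above.
# Exits non-zero when no matching line is found.
reconcile_failures() {
  grep -E '(Antrea|Vsphere-Pv-Csi)-ReconcileFailed'
}

# Usage (commented out so the snippet can be sourced without a cluster):
# kubectl get tkc <cluster-name> -n <namespace> -o yaml | reconcile_failures \
#   || echo "No reconcile-failure conditions found"
```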