Tanzu Kubernetes Cluster (TKC) upgrade from TKG 1.x to 2.0 stuck in "Ready: False"

Article ID: 434893

Products

VMware vSphere Kubernetes Service

Issue/Introduction

When upgrading a Tanzu Kubernetes Cluster (TKC) from legacy TKr versions (e.g., 1.26.x, 1.27.x) to TKG 2.0 TKr (e.g., 1.27.11), the cluster may fail to reach a "Ready" state.

Symptoms include:

- New nodes are stuck in a "Provisioning" state and do not reach "Running".
- The NodeName (status.nodeRef) is not populated on the Machine object.
- The Node object is missing the spec.providerID field.
- TKC status reports Ready: False and ClusterBootstrapReadyCondition remains in a reconciling state.
- The following log signature is found in the guest-cluster-cloud-provider pod within the vmware-system-cloud-provider namespace: YYYY-MM-DDT HH:MM:SS Error trying to find VM: Unauthorized
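
The symptoms above can be checked from the command line. The following is a minimal sketch rather than official tooling: the function names are hypothetical, the Machine check assumes the Supervisor context and the Cluster API resource name machines.cluster.x-k8s.io, and the other two checks assume the guest cluster context.

```shell
#!/bin/sh
# Sketch only: helper functions for confirming the listed symptoms.
# Function names are illustrative and not part of any VMware tooling.

# Print guest cluster nodes whose spec.providerID is empty.
# Run in the guest cluster context.
nodes_missing_provider_id() {
  kubectl get nodes \
    -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID \
    --no-headers |
    awk '$2 == "<none>" || $2 == "" { print $1 }'
}

# Print Machine objects (name and phase) that have no status.nodeRef.
# Run in the Supervisor context; $1 is the namespace that holds the cluster.
machines_missing_node_ref() {
  kubectl get machines.cluster.x-k8s.io -n "$1" \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name \
    --no-headers |
    awk '$3 == "<none>" || $3 == "" { print $1, $2 }'
}

# Search the logs of every pod in the cloud provider namespace for the
# "Unauthorized" signature. Run in the guest cluster context.
cloud_provider_unauthorized_logs() {
  for pod in $(kubectl get pod -n vmware-system-cloud-provider -o name); do
    kubectl logs -n vmware-system-cloud-provider "$pod" |
      grep "Error trying to find VM: Unauthorized"
  done
}
```

After sourcing the file, calling nodes_missing_provider_id or cloud_provider_unauthorized_logs in the guest cluster context, and machines_missing_node_ref <namespace> in the Supervisor context, should list the affected objects if the cluster is hitting this issue.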

Environment

VMware vSphere Kubernetes Service

VMware Tanzu Kubernetes Grid (TKG)

Cause

During the migration from the legacy addonsController to the TKG 2.0 addonsController, the name of the ProviderServiceAccount is changed from <cluster-name>-ccm to <vsphere-cluster-name>-ccm.

Because the vSphereCluster object name contains a unique suffix, the new ProviderServiceAccount name does not match the old one. The guest-cluster-cloud-provider pod continues to use an old, cached token from the previous secret, so authentication fails with an "Unauthorized" error before the Cluster API Provider vSphere (CAPV) can reconcile the new account details and update the secret.
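
To see which account name is currently in place, the ProviderServiceAccount objects can be listed from the Supervisor. This is a hedged sketch: the function name is hypothetical, and it assumes the objects live in the namespace that holds the cluster and can be queried by kind name with kubectl.

```shell
#!/bin/sh
# Sketch only: list ProviderServiceAccount objects whose name ends in "-ccm".
# Run in the Supervisor context; $1 is the namespace that holds the cluster.
# Assumption: the resource is queryable as "providerserviceaccount".
list_ccm_provider_service_accounts() {
  kubectl get providerserviceaccount -n "$1" -o name | grep -- '-ccm$'
}
```

A result named after the vSphereCluster object (with its suffix) rather than the bare cluster name would be consistent with the rename described above.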

Resolution

A permanent resolution is included in TKr version 1.29 and later. For clusters on version 1.28 or earlier, the following workaround is required:

1. Restart the Cloud Provider Pod: Delete the existing guest-cluster-cloud-provider pod to force it to pick up the new token.
   - Switch the context to the affected guest cluster.
   - Retrieve the pod name:

     kubectl get pod -n vmware-system-cloud-provider
   - Delete the pod:

     kubectl delete pod <pod-name> -n vmware-system-cloud-provider
2. Remove Stale Nodes: If nodes are in a NotReady state and do not have a corresponding Virtual Machine in vCenter:
   - Manually delete the node object from the cluster:

     kubectl delete node <node-name>
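
The workaround steps can be scripted roughly as follows. This is a sketch under assumptions, not a supported tool: the function names are hypothetical, the pod name is assumed to contain guest-cluster-cloud-provider, and both functions expect the guest cluster context. Stale nodes are only listed, not deleted, because each one still has to be checked against vCenter by hand.

```shell
#!/bin/sh
# Sketch only: scripted form of the workaround. Run in the guest cluster context.

# Delete the guest-cluster-cloud-provider pod(s) so the replacement pod
# reads the new token.
restart_cloud_provider_pods() {
  for pod in $(kubectl get pod -n vmware-system-cloud-provider -o name |
    grep guest-cluster-cloud-provider); do
    kubectl delete -n vmware-system-cloud-provider "$pod"
  done
}

# Print nodes whose STATUS column contains NotReady, for manual review.
# Nothing is deleted here: confirm in vCenter that no matching VM exists first.
list_not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ { print $1 }'
}
```

Once a node from list_not_ready_nodes is confirmed to have no Virtual Machine in vCenter, it can be removed with the kubectl delete node command shown in the steps.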

Additional Information

Cluster health can be further verified by checking for Antrea-ReconcileFailed or Vsphere-Pv-Csi-ReconcileFailed conditions on the TKC object, which often result from the initial authentication failure of the Cloud Provider Interface.
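
One way to inspect those conditions is to print the TKC status conditions and filter for reconcile failures. The sketch below assumes the Supervisor context, the tkc short name for TanzuKubernetesCluster, and conditions stored under status.conditions; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: inspect TKC conditions from the Supervisor context.
# $1 is the TKC name, $2 is the namespace that holds it.

# Print every condition as: type, status, reason (tab separated).
tkc_conditions() {
  kubectl get tkc "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# Keep only lines that mention ReconcileFailed
# (for example Antrea-ReconcileFailed or Vsphere-Pv-Csi-ReconcileFailed).
tkc_reconcile_failures() {
  tkc_conditions "$1" "$2" | grep 'ReconcileFailed'
}
```

If tkc_reconcile_failures prints nothing after the workaround has been applied, no ReconcileFailed conditions remain on the TKC object.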
