vSphere Kubernetes Cluster Upgrade Stuck due to Cluster Upgrade Started before Supervisor Cluster vSphere 7 to vSphere 8 Migration has Completed

Article ID: 383765

Updated On:

Products

VMware vSphere Kubernetes Service, vSphere with Tanzu, Tanzu Kubernetes Runtime

Issue/Introduction

After an upgrade from vSphere 7 to vSphere 8 is initiated, both the Supervisor cluster upgrade and a manually initiated vSphere Kubernetes cluster upgrade are stuck and do not progress.

 

While connected to the Supervisor cluster, the following symptoms are present:

  • The TKC object shows the following label and annotation:
    • kubectl describe tkc <tkc name> -n <cluster namespace>

      Labels:       run.tanzu.vmware.com/migrate-tkc=

      Annotations:  run.tanzu.vmware.com/tkc-upgrade-from: <TKR VERSION>

    • The above "migrate-tkc" label indicates that the TKC is mid-migration.

  • The cluster and TKC object both show the desired upgrade version but the corresponding machine, kcp and machinedeployment objects still show the previous version:
    • kubectl get tkc,machine,kcp,md -n <cluster namespace>

  • The affected cluster's corresponding KCP object shows an error message similar to the below:
    • kubectl describe kcp <kcp name> -n <cluster namespace>

      "Failed to get VirtualMachineImage <ob-virtualmachineimage-name>: VirtualMachineImage.vmoperator.vmware.com "<ob-virtualmachineimage-name>" not found"

  • Virtual machine images have been recreated as cluster virtual machine images (cvmi):
    • kubectl get cvmi -A

  • The Supervisor cluster upgrade is incomplete and keeps retrying, but repeatedly gets stuck at the utkgClusterMigration step:
    • /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' | sort

      "failed - utkgClusterMigration" or "pending - utkgClusterMigration"

  • There are wcpcluster, wcpmachine, wcpmachinetemplate objects present in the Supervisor cluster:
    • kubectl get wcpcluster,wcpmachine,wcpmachinetemplate -A
    • NOTE: The presence of these wcp objects indicates that the migration has not yet completed for the corresponding vSphere Kubernetes cluster(s). The wcp objects may still be present even if the corresponding TKC object does not have the "migrate-tkc" label.

  • The kapp-controller packageinstall (pkgi) associated with the affected vSphere Kubernetes cluster is in ReconcileFailed state:
    • kubectl get pkgi -n <affected cluster namespace>

      NAMESPACE            NAME                                                            PACKAGE NAME                      PACKAGE VERSION               DESCRIPTION
      <cluster namespace>  packageinstall.packaging.carvel.dev/my-cluster-kapp-controller  kapp-controller.tanzu.vmware.com  X.XX.X+vmware.X-tkg.X-vmware  Reconcile failed: Error (see .status.usefulErrorMessage for details)
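
The label check and the wcp object check from the symptoms above can be combined into a single pass. The following is a minimal shell sketch, assuming kubectl is already connected to the Supervisor cluster. The function names are hypothetical and are not part of any VMware tooling; only the resource names and the "migrate-tkc" label come from this article, and the kubectl options used (-A, -l, --no-headers) are standard.

```shell
# Sketch only: function names are hypothetical, not VMware-provided tooling.
MIGRATE_LABEL="run.tanzu.vmware.com/migrate-tkc"

# Print TKCs in all namespaces that still carry the migrate-tkc label.
list_migrating_tkcs() {
    kubectl get tkc -A -l "$MIGRATE_LABEL" --no-headers 2>/dev/null
}

# Print any pre-migration wcp objects that are still present.
list_wcp_objects() {
    kubectl get wcpcluster,wcpmachine,wcpmachinetemplate -A --no-headers 2>/dev/null
}

# Summarize both indicators; returns 1 when either one is present.
migration_summary() {
    local tkcs wcp
    tkcs="$(list_migrating_tkcs)"
    wcp="$(list_wcp_objects)"
    if [ -z "$tkcs" ] && [ -z "$wcp" ]; then
        echo "no migration indicators found"
        return 0
    fi
    echo "migration indicators present"
    if [ -n "$tkcs" ]; then
        printf 'TKCs labeled migrate-tkc:\n%s\n' "$tkcs"
    fi
    if [ -n "$wcp" ]; then
        printf 'wcp objects:\n%s\n' "$wcp"
    fi
    return 1
}
```

Calling migration_summary prints the matching objects, if any. As noted above, wcp objects can remain even when no TKC carries the label, which is why the sketch checks both.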

 


Depending on when the vSphere Kubernetes cluster upgrade was initiated before the migration became stuck, the following symptoms may also be present:

  • The tkg-controller-manager pods may be in CrashLoopBackOff state with errors similar to the below where the noted TKR version is the desired TKR:
    • "failed to get controlplane TKR for TKC cluster from supervisor ... TKR v#.##.#---vmware.#" "parsing semantic version from string '': could not parse \"\" as version" 

  • The upgrade-ctl-compupgrade.log shows an error similar to the below where v#.##.# is the desired TKR version:
    • "Failed to run command kubectl label overwrite tkc for my-cluster, could not find spec.distribution.fullVersion; Component upgrade failed for v#.##.#"
    • Note: This log file is present on only one of the three Supervisor control plane VMs.
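
Because the log file exists on only one control plane VM, a check for the error above needs to distinguish "file not on this VM" from "error not found". The following is a minimal shell sketch of that check. The function name is hypothetical, and the log path is passed as an argument because this article does not state the directory that holds upgrade-ctl-compupgrade.log.

```shell
# Sketch only: scan_compupgrade_log is a hypothetical helper name.
# Usage: scan_compupgrade_log <path to upgrade-ctl-compupgrade.log>
# Returns 0 if the error is found (matching lines are printed),
#         1 if the file exists but the error is absent,
#         2 if the file is not present on this VM.
scan_compupgrade_log() {
    local log="$1"
    if [ ! -f "$log" ]; then
        echo "log not present on this VM: $log"
        return 2
    fi
    # -F treats the pattern as a fixed string, so the dots match literally.
    if grep -F "could not find spec.distribution.fullVersion" "$log"; then
        return 0
    fi
    echo "no fullVersion error found in $log"
    return 1
}
```

A return code of 2 suggests repeating the check on the other Supervisor control plane VMs.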

 

Environment

VMware vSphere 8.0 with Tanzu

This issue can occur on a vSphere Kubernetes cluster regardless of whether it is managed by Tanzu Mission Control (TMC).

Cause

Upgrades from vSphere 7 to vSphere 8 undergo a migration of wcp objects into vsphere objects and virtualmachineimages into clustervirtualmachineimages (cvmi).

As part of the migration, the associated components are updated to reference the newly created objects.

In addition to the above, the TKR naming conventions change and can lead to a TKR version mismatch when initiating the vSphere Kubernetes cluster upgrade before the migration has completed. This can also occur when the cluster upgrade is initiated from Tanzu Mission Control (TMC) as TMC pulls the TKR version data from the vSphere Kubernetes environment.

However, if a vSphere Kubernetes cluster upgrade is started before the migration completes, both the Supervisor cluster upgrade and the vSphere Kubernetes cluster upgrade become stuck. The system prioritizes completing the cluster upgrade, but it cannot do so because cluster components still reference pre-migration objects that the migration process has already replaced.

When upgrading a vSphere Kubernetes cluster from a legacy TKR for vSphere 7 to a TKR for vSphere 8, the kapp-controller package is automatically installed on the vSphere Kubernetes cluster. However, the kapp-controller pkgi will be stuck in ReconcileFailed state until the cluster upgrade completes.

Documentation ("Upgrading from any vCenter Server release to any vCenter Server 8.x release"): https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/8-0/using-tkg-service-with-vsphere-supervisor/updating-tkg-service-clusters/understanding-the-rolling-update-model-for-tkg-service-clusters.html

Resolution

Please open a ticket with VMware by Broadcom support, referencing this KB article, for assistance in reverting the affected cluster's upgrade and completing the migration.

Once the migration is completed, the Supervisor upgrade will complete and the environment will stabilize. The vSphere Kubernetes cluster upgrade can then be restarted and should complete successfully.
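
Before restarting the cluster upgrade, it may help to confirm that the migration step is no longer reported as failed or pending. The following is a minimal shell sketch that reads the status lines produced by the upgrade-ctl.py and jq pipeline shown in the Issue/Introduction section. The function name is hypothetical, and because this article documents only the "failed" and "pending" values, any other status is echoed as-is rather than interpreted.

```shell
# Sketch only: report_migration_step is a hypothetical helper name.
# Reads status lines on stdin, for example:
#   "failed - utkgClusterMigration"
# Typical use on a Supervisor control plane VM:
#   /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status \
#     | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' \
#     | report_migration_step
report_migration_step() {
    local line
    # Accept the line with or without the trailing quote that jq emits.
    line="$(grep -E ' - utkgClusterMigration"?$' | head -n 1)"
    case "$line" in
        "")
            echo "utkgClusterMigration not found in status output"
            return 2 ;;
        *failed*|*pending*)
            echo "utkgClusterMigration not complete: $line"
            return 1 ;;
        *)
            echo "utkgClusterMigration reported as: $line"
            return 0 ;;
    esac
}
```

A return code of 1 indicates that the migration is still blocked and the cluster upgrade should not be restarted yet.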

Additional Information

Beginning in TKG Service 3.2.1 and 3.3.0, the system will prevent starting vSphere Kubernetes cluster upgrades before the vSphere 7 to vSphere 8 migration completes.
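
The version threshold above can be expressed as a simple comparison. The following is a minimal shell sketch, assuming the TKG Service version is already known and supplied as a plain X.Y.Z string (how to obtain it is not covered in this article). The function names are hypothetical, and the comparison relies on the version sort (-V) option of GNU sort.

```shell
# Sketch only: function names are hypothetical.
# version_ge A B: succeeds when version A is greater than or equal to B,
# using GNU sort -V ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# has_upgrade_guard VERSION: succeeds when VERSION is 3.2.1 or later,
# which covers both 3.2.1 and 3.3.0 as named in this article.
has_upgrade_guard() {
    version_ge "$1" "3.2.1"
}
```

For example, has_upgrade_guard 3.2.0 fails, while has_upgrade_guard 3.3.0 succeeds.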

TKG Service Documentation: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere-supervisor/8-0/using-tkg-service-with-vsphere-supervisor/installing-and-upgrading-the-tkg-service.html