Workload Cluster Upgrade from v1.28.15 to v1.29.4 Stuck and Cluster Manually Unpaused

Article ID: 415631

Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

According to the release notes, upgrading a VKS cluster from VKr v1.28.15 to VKr v1.29.4 is not supported. Doing so results in a back-in-time upgrade for some packages, and the fixes available in the v1.28.15 patch are not available in v1.29.4.

If you have already initiated an upgrade from v1.28.15 to v1.29.4 but have not manually unpaused the cluster, follow the KB article "vSphere 8.0 Supervisor Workload Cluster Upgrade Stuck with No Nodes on Desired Upgraded Version" to revert the cluster to v1.28.15.

This KB covers the scenario where the cluster was manually unpaused, which leads to a v1.29.4 control plane rollout that gets stuck in NotReady status.

While connected to the Supervisor cluster context, one or more of the following symptoms are observed:

  • The workload cluster's control plane nodes are on the desired version of v1.29.4 but are continuously recreated in a loop:
    kubectl get machines -n <cluster namespace>

    The worker nodes are still on the old version, v1.28.15.

  • The clusterbootstrap object still shows the previous package versions and the VKr version of v1.28.15:
    kubectl get clusterbootstrap -n <cluster namespace>
    antrea.tanzu.vmware.com.1.13.3+vmware.3-tkg.2-vmware   vsphere-pv-csi.tanzu.vmware.com.3.1.0+vmware.1-tkg.6-vmware   vsphere-cpi.tanzu.vmware.com.1.28.0+vmware.1-tkg.2-vmware   kapp-controller.tanzu.vmware.com.0.50.0+vmware.2-tkg.1-vmware   v1.28.15---vmware.3-fips-vkr.3


  • Both the API server and the tanzu-addons-controller-manager pod show log entries similar to the following, indicating a kapp-controller downgrade error:
    ClusterBootstrap.run.tanzu.vmware.com <cluster name> is invalid: spec.kapp.refName: Invalid value: \"kapp-controller.tanzu.vmware.com.0.50.0+vmware.1-tkg.1-vmware\": package downgrade is not allowed, original version: 0.50.0+vmware.2-tkg.1-vmware, updated version 0.50.0+vmware.1-tkg.1-vmware

     

  • If the KB article "vSphere 8.0 Supervisor Workload Cluster Upgrade Stuck with No Nodes on Desired Upgraded Version" was followed, or the validatingwebhookconfiguration for clusterbootstrap was deleted, the above error may no longer appear. Instead, the kapp-controller PackageInstall (PKGI) for the cluster may report the following error:
    status:
      conditions:
      - message: Error (see .status.usefulErrorMessage for details)
        status: "True"
        type: ReconcileFailed
      friendlyDescription: 'Reconcile failed: Error (see .status.usefulErrorMessage for
        details)'
      lastAttemptedVersion: 0.50.0+vmware.2-tkg.1-vmware
      observedGeneration: 2
      usefulErrorMessage: |-
        Stopped installing matched version '0.50.0+vmware.1-tkg.1-vmware' since last attempted version '0.50.0+vmware.2-tkg.1-vmware' is higher.
        hint: Add annotation packaging.carvel.dev/downgradable: "" to PackageInstall to proceed with downgrade
      version: 0.50.0+vmware.1-tkg.1-vmware
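When collecting logs for a support case, it can help to pull the two versions out of the webhook denial message shown in the symptoms above. The following is a minimal Python sketch, not part of any VMware tooling. The function name `parse_downgrade_error` is hypothetical, and the pattern is modeled only on the sample message quoted in this article, so the wording may differ in other releases.

```python
import re

# Assumption: the pattern mirrors the sample denial message quoted in this
# article. The colon after "updated version" is optional because the sample
# omits it.
DOWNGRADE_RE = re.compile(
    r"package downgrade is not allowed, "
    r"original version: (?P<original>\S+?), "
    r"updated version:? (?P<updated>\S+)"
)


def parse_downgrade_error(log_line):
    """Return (original, updated) versions, or None if the line does not match."""
    match = DOWNGRADE_RE.search(log_line)
    if match is None:
        return None
    return match.group("original"), match.group("updated")


# Sample message taken from the symptom above (JSON escaping removed).
sample = (
    "ClusterBootstrap.run.tanzu.vmware.com <cluster name> is invalid: "
    "spec.kapp.refName: Invalid value: "
    '"kapp-controller.tanzu.vmware.com.0.50.0+vmware.1-tkg.1-vmware": '
    "package downgrade is not allowed, "
    "original version: 0.50.0+vmware.2-tkg.1-vmware, "
    "updated version 0.50.0+vmware.1-tkg.1-vmware"
)

print(parse_downgrade_error(sample))
```

A line that does not contain the downgrade message returns `None`, so the helper can be applied to every line of a saved log without extra filtering.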

Environment

VKS v3.3.2 and below

Cause

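Upgrading a VKS cluster from v1.28.15 to v1.29.4 effectively downgrades the kapp-controller package, because v1.28.15 was released after v1.29.4 and carries a newer kapp-controller package (0.50.0+vmware.2-tkg.1-vmware) than the one v1.29.4 requests (0.50.0+vmware.1-tkg.1-vmware).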
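To illustrate why this counts as a downgrade, the sketch below compares the two kapp-controller package versions that appear in the error messages in this article. It is an illustration only. The actual comparison logic used by the webhook and by kapp-controller is not described here, and the ordering assumed below (upstream version, then the `vmware` build number, then the `tkg` revision) is an assumption. The names `version_key` and `is_downgrade` are hypothetical.

```python
import re

# Assumption: versions follow the shape seen in this article,
# e.g. "0.50.0+vmware.2-tkg.1-vmware".
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\+vmware\.(\d+)-tkg\.(\d+)")


def version_key(version):
    """Turn a package version string into a tuple of integers for comparison."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized package version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_downgrade(installed, requested):
    """Return True when the requested version sorts below the installed one."""
    return version_key(requested) < version_key(installed)


installed = "0.50.0+vmware.2-tkg.1-vmware"  # shipped with v1.28.15
requested = "0.50.0+vmware.1-tkg.1-vmware"  # requested by v1.29.4

print(is_downgrade(installed, requested))  # True
```

The upstream version (0.50.0) is identical in both strings. Only the `vmware` build number differs, which is why the change is easy to miss when looking at Kubernetes versions alone.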

During the upgrade to v1.29.4, the addon-manager paused the cluster to upgrade the addons but failed after detecting the version downgrade. As a result, the cluster remains in a paused state.

If the cluster is manually unpaused, the upgrade proceeds to v1.29.4 without upgrading the corresponding kapp-controller addon. The rollout then fails because the non-upgraded kapp-controller addon is incompatible with the upgraded node(s).

Manually unpausing a cluster is an unsupported action.

Resolution

Reach out to VMware by Broadcom Technical Support for assistance and reference this KB article.

Additional Information

This type of “back-in-time upgrade” should not occur in VKS 3.3.3 or later, as a webhook constraint was introduced to prevent it.