TKC Upgrade from v1.28.7 to v1.29.4 stuck due to "found pre-existing kapp-controller in the workload cluster before initiating upgrade; requires user remediation"

Article ID: 403742


Products

VMware vSphere Kubernetes Service

Issue/Introduction

Upgrading a TKC from v1.28.7 (legacy KR) to v1.29.4 (non-legacy KR) becomes stuck and does not progress.

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • The TKC resource shows READY:False
    # kubectl get tkc -A
    NAMESPACE   NAME           CONTROL PLANE   WORKER   KUBERNETES RELEASE NAME           AGE    READY
    <namespace>     <cluster name>  #               #        v1.29.4---vmware.3-fips.1-tkg.1   ###d   False

     

  • The corresponding Cluster resource is still on the previous version 1.28.7:
    # kubectl get cluster -A
    NAMESPACE   NAME           CLUSTERCLASS             PHASE         AGE    VERSION
    <namespace>     <cluster name>   builtin-generic-v#.#.#   Provisioned  ###d   v1.28.7+vmware.1-fips.1

     

  • Describing the TKC resource shows an error message similar to the below:
    # kubectl -n <namespace> describe tkc <cluster name>
    Conditions:
      Last Transition Time:  YYYY-MM-DDTHH:MM:SSZ
      Message:   Error in fetching ClusterBootstrap
      Reason:    ClusterBootstrapFailed
      Severity:  Warning
      Status:    False
    Events:
      Type    Reason   Age                   From                                                                                             Message
      ----    ------   ----                  ----                                                                                             -------
      Normal  Warning  ##s (x## over #m##s)  svc-tkg-domain-c<id>/svc-tkg-domain-c<id>-tkg-controller/tanzukubernetescluster-spec-controller  found pre-existing kapp-controller in the workload cluster before initiating upgrade; requires user remediation

     

  • The system pod vmware-system-tkg-controller-manager logs report the error "Could not resolve KR/OSImage", similar to the following:
    kubectl -n svc-tkg-domain-c<id> logs deployment/vmware-system-tkg-controller-manager
    E0701 HH:MM:SS.sssss       1 regeneratetkrdata_controller.go:97] "regenerating TKR_DATA" err=
      Could not resolve KR/OSImage
      Missing compatible KR/OSImage for the cluster
      Control Plane, filters: {k8sVersionPrefix: v1.28.7+vmware.1-fips.1, osImageSelector: os-name=photon,tkr.tanzu.vmware.com/standard}
      MachineDeployment <MachineDeployment NAME>, filters: {k8sVersionPrefix: v1.28.7+vmware.1-fips.1, osImageSelector: os-name=<os>}

Environment

vCenter 8.0u3

vSphere Supervisor

VKS 3.4.0 or lower

Cause

The upgrade process from legacy KR to non-legacy KR fails when an existing kapp-controller is present in the legacy TKC.

On legacy KRs, kapp-controller could optionally be installed manually in workload clusters.

All non-legacy KRs include kapp-controller and install it automatically in the workload cluster.

As a result, if kapp-controller was manually installed in the workload cluster before the upgrade to a non-legacy KR, this pre-existing kapp-controller error appears.

The error indicates that the system-installed kapp-controller included with non-legacy KRs is failing to install: it attempts to assume ownership of kapp-controller-owned objects that are still owned by the manually installed kapp-controller.
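Before upgrading, a quick pre-check from the workload cluster context can indicate whether this KB applies. This is a minimal sketch: the namespace (tkg-system) and deployment name are assumptions based on a typical manual kapp-controller install, so adjust them to wherever kapp-controller was actually deployed.

```shell
# Hypothetical pre-check: a manually installed kapp-controller usually runs as
# a Deployment, e.g. in the tkg-system (or kapp-controller) namespace.
# Capture its listing from the workload cluster context:
#   listing=$(kubectl -n tkg-system get deploy kapp-controller --no-headers --ignore-not-found)
needs_remediation() {
  # $1: the captured deployment listing (empty if the deployment is not found)
  if [ -n "$1" ]; then
    echo "pre-existing kapp-controller found: remediation required"
  else
    echo "no pre-existing kapp-controller: not affected by this KB"
  fi
}
needs_remediation "kapp-controller   1/1     1            1           200d"
needs_remediation ""
```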

Issue Sequence:

  1. The upgrade validation fails because the kapp-controller already exists in the target legacy TKC
  2. As a result, the upgrade is blocked, and the Cluster resource is not updated
  3. The Cluster resource continues to reference the legacy KR v1.28.7
  4. The ClusterBootstrap resource required for non-legacy KR is never created, causing the upgrade process to become stuck

KR v1.28.7 is the latest and final legacy release.

After applying the workaround in this KB, this issue will not reappear in subsequent upgrades.

Resolution

This workaround involves identifying whether a clusterbootstrap exists and applying an annotation that transfers ownership from the manually installed kapp-controller to the system-installed kapp-controller.

 

Clusterbootstrap Workaround

  1. Connect to the Supervisor cluster context

  2. Confirm if a clusterbootstrap exists for the affected workload cluster. If it exists, it will have the same name as the affected workload cluster:
    kubectl get clusterbootstrap -n <affected workload cluster namespace>

     

  3. If the clusterbootstrap does not exist, proceed with the next steps.
    Otherwise, if the clusterbootstrap does exist, skip to the section "Transfer Kapp-Controller Ownership" below.


  4. Create a placeholder for the clusterbootstrap:
    1. Create the YAML file for the placeholder clusterbootstrap: 
      Note:
      This YAML assumes that the desired KR version is v1.29.4.

      cat > placeholder-clusterbootstrap.yaml << EOF
      apiVersion: run.tanzu.vmware.com/v1alpha3
      kind: ClusterBootstrap
      metadata:
        name: "<affected workload cluster>"
        namespace: "<affected workload cluster namespace>"
        annotations:
          tkg.tanzu.vmware.com/add-missing-fields-from-tkr: "v1.29.4---vmware.3-fips.1-tkg.1"
      spec:
        paused: true
      EOF


    2. Use kubectl to apply the placeholder clusterbootstrap and create it in paused state:
      kubectl apply -f placeholder-clusterbootstrap.yaml

       

    3. Confirm that the placeholder clusterbootstrap was created successfully:
      kubectl get -n <affected workload cluster namespace> clusterbootstrap <affected workload cluster>

       

  5. Update the placeholder clusterBootstrap's status to trigger remediation:
    1. Define a temporary timestamp variable:

      CREATION_TIMESTAMP=$(kubectl get clusterbootstrap <affected workload cluster> -n <affected workload cluster namespace> -o jsonpath='{.metadata.creationTimestamp}')

       

    2. Update the clusterbootstrap status:
      kubectl patch clusterbootstrap <affected workload cluster> -n <affected workload cluster namespace> --subresource=status --type='merge' --patch="{\"status\":{\"conditions\":[{\"status\":\"False\",\"type\":\"Kapp-Controller-Workaround\",\"lastTransitionTime\":\"$CREATION_TIMESTAMP\"}]}}"
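To confirm the patch landed, the status conditions can be inspected. The sketch below is hedged: a sample status document stands in for the live object, and the jsonpath query shown in the comment is the form the live check would take.

```shell
# The live check would be:
#   kubectl -n <affected workload cluster namespace> get clusterbootstrap <affected workload cluster> \
#     -o jsonpath='{.status.conditions[?(@.type=="Kapp-Controller-Workaround")].status}'
# which should print: False
# Here, a sample status document stands in for the live object:
sample='{"status":{"conditions":[{"status":"False","type":"Kapp-Controller-Workaround"}]}}'
echo "$sample" | grep -o '"type":"Kapp-Controller-Workaround"'
```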

       

  6. The control plane node upgrade will now be triggered. Monitor the rolling upgrade of the control plane nodes until it completes (PHASE: Running):
    kubectl -n <affected workload cluster namespace> get machines | grep <affected workload cluster>
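The "all nodes on the desired version" check in this monitoring step can be sketched as a small helper. The cluster name, machine names, and column layout below are illustrative assumptions (typical `kubectl get machines` output ends with a VERSION column).

```shell
# Sketch (column layout assumed: NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION).
#   listing=$(kubectl -n <affected workload cluster namespace> get machines --no-headers | grep <affected workload cluster>)
all_on_version() {
  # succeed only if every machine's last column equals the target version ($2)
  ! echo "$1" | awk '{print $NF}' | grep -qv "^$2$"
}
listing='cp-abc   mycluster   cp-abc   vsphere://1   Running   10m   v1.29.4+vmware.3-fips.1
np-xyz   mycluster   np-xyz   vsphere://2   Running   10m   v1.28.7+vmware.1-fips.1'
all_on_version "$listing" "v1.29.4+vmware.3-fips.1" && echo "all on target" || echo "still rolling"
```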



  7. Do not skip this step: After all control plane nodes are on the desired version and stabilized in Running state, unpause the placeholder clusterBootstrap:
    1. Confirm that the clusterbootstrap is paused:

      kubectl -n <affected workload cluster namespace> get clusterbootstrap <affected workload cluster> -o yaml | grep -i pause
          paused: true

       

    2. Unpause the placeholder clusterbootstrap:
      kubectl -n <affected workload cluster namespace> patch clusterbootstrap <affected workload cluster> --type=merge --patch '{"spec":{"paused":false}}'

       This will generate a kapp-controller for the non-legacy KR.


Transfer Kapp-Controller Ownership

The below steps will be performed from the Supervisor cluster context.

  1. Confirm the status of the affected cluster's kapp-controller PKGI. This resource is expected to show "Reconcile failed":
    kubectl -n <affected workload cluster namespace> get pkgi <affected workload cluster>-kapp-controller
    NAME                               PACKAGE NAME                       PACKAGE VERSION                DESCRIPTION
    <affected workload cluster>-kapp-controller   kapp-controller.tanzu.vmware.com   0.50.0+vmware.1-tkg.1-vmware   Reconcile failed: Error (see .status.usefulErrorMessage for details)

     

  2. Transfer ownership from the manually installed kapp-controller to the system-installed kapp-controller included with non-legacy KRs:
    1. Create a YTT overlay YAML file named "kapp-edit-ytt.yaml":

      cat > kapp-edit-ytt.yaml <<EOF
      #@ load("@ytt:overlay", "overlay")
      #@overlay/match by=overlay.subset({"kind":"Deployment", "metadata": {"name": "kapp-controller", "namespace":"tkg-system"}})
      ---
      metadata:
        annotations:
          #@overlay/match missing_ok=True
          kapp.k14s.io/update-strategy: fallback-on-replace
      EOF
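For reference, the overlay adds the annotation below to the rendered kapp-controller Deployment. kapp's fallback-on-replace update strategy falls back to deleting and recreating a resource when an in-place update fails, which is what allows the system-installed package to take over the objects owned by the manual install. The fragment is an abbreviated sketch of the rendered result, not a file to apply:

```yaml
# Effect of the overlay on the rendered Deployment (abbreviated):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kapp-controller
  namespace: tkg-system
  annotations:
    kapp.k14s.io/update-strategy: fallback-on-replace  # delete-and-recreate if update fails
```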

       

    2. Use kubectl to create a generic secret resource using the above kapp-edit-ytt.yaml file:
      kubectl -n <affected workload cluster namespace> create secret generic kapp-edit-ytt --from-file=kapp-edit-ytt.yaml

       

    3. Confirm that the generic secret resource was successfully created:
      kubectl -n <affected workload cluster namespace> get secrets kapp-edit-ytt
      
      NAME            TYPE     DATA   AGE
      kapp-edit-ytt   Opaque   1      #s

       

  3. Add the below annotation so that the above YTT overlay is applied to the affected cluster's kapp-controller PKGI:
    kubectl -n <affected workload cluster namespace> annotate pkgi <affected workload cluster>-kapp-controller ext.packaging.carvel.dev/ytt-paths-from-secret-name.0=kapp-edit-ytt

     

  4. Confirm that the kapp-controller PKGI for the affected workload cluster now shows "Reconcile succeeded":
    kubectl -n <affected workload cluster namespace> get pkgi <affected workload cluster>-kapp-controller
    
    NAME                               PACKAGE NAME                       PACKAGE VERSION                DESCRIPTION
    <affected workload cluster>-kapp-controller   kapp-controller.tanzu.vmware.com   0.50.0+vmware.1-tkg.1-vmware   Reconcile succeeded

     

  5. The upgrade will now progress and proceed to the worker node pools:
    kubectl -n <affected workload cluster namespace> get machines | grep <affected workload cluster>

    The TKC will report READY: True once all nodes have stabilized on the desired version:

    kubectl -n <affected workload cluster namespace> get tkc <affected workload cluster>
    
    NAMESPACE   NAME           CONTROL PLANE   WORKER   KUBERNETES RELEASE NAME           AGE    READY
    <affected workload cluster namespace>  <affected workload cluster>   #              #        v1.29.4---vmware.3-fips.1-tkg.1   ###d   True

     

  6. Clean up the added overlay annotation, the kapp-edit-ytt secret resource, and the YAML files:
    kubectl -n <affected workload cluster namespace> annotate pkgi <affected workload cluster>-kapp-controller ext.packaging.carvel.dev/ytt-paths-from-secret-name.0-
    
    kubectl -n <affected workload cluster namespace> delete secret kapp-edit-ytt
    
    rm kapp-edit-ytt.yaml
    
    rm placeholder-clusterbootstrap.yaml
 

Additional Information