TKC Upgrade from v1.28.7 to v1.29.4 stuck due to "found pre-existing kapp-controller in the workload cluster before initiating upgrade; requires user remediation"

Article ID: 403742


Products

VMware vSphere Kubernetes Service

Issue/Introduction

Upgrading a TKC from v1.28.7 (legacy KR) to v1.29.4 (non-legacy KR) becomes stuck and does not progress.

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • The TKC resource shows READY:False
    # kubectl get tkc -A
    NAMESPACE   NAME           CONTROL PLANE   WORKER   KUBERNETES RELEASE NAME           AGE    READY
    <namespace>     <cluster name>  #               #        v1.29.4---vmware.3-fips.1-tkg.1   ###d   False

     

  • The corresponding Cluster resource is still on the previous version 1.28.7:
    # kubectl get cluster -A
    NAMESPACE   NAME           CLUSTERCLASS             PHASE         AGE    VERSION
    <namespace>     <cluster name>   builtin-generic-v#.#.#   Provisioned  ###d   v1.28.7+vmware.1-fips.1

     

  • Describing the TKC resource shows an error message similar to the below:
    # kubectl -n <namespace> describe tkc <cluster name>
    Conditions:
      Last Transition Time:  YYYY-MM-DDTHH:MM:SSZ
      Message:   Error in fetching ClusterBootstrap
      Reason:    ClusterBootstrapFailed
      Severity:  Warning
      Status:    False
    Events:
      Type    Reason   Age                   From                                                                                             Message
      ----    ------   ----                  ----                                                                                             -------
      Normal  Warning  ##s (x## over #m##s)  svc-tkg-domain-c<id>/svc-tkg-domain-c<id>-tkg-controller/tanzukubernetescluster-spec-controller  found pre-existing kapp-controller in the workload cluster before initiating upgrade; requires user remediation

     

  • The system pod vmware-system-tkg-controller-manager logs report the error "Could not resolve KR/OSImage", similar to the following:
    kubectl -n svc-tkg-domain-c<id> logs deployment/vmware-system-tkg-controller-manager
    E0701 HH:MM:SS.sssss       1 regeneratetkrdata_controller.go:97] "regenerating TKR_DATA" err=
      Could not resolve KR/OSImage
      Missing compatible KR/OSImage for the cluster
      Control Plane, filters: {k8sVersionPrefix: v1.28.7+vmware.1-fips.1, osImageSelector: os-name=photon,tkr.tanzu.vmware.com/standard}
      MachineDeployment <MachineDeployment NAME>, filters: {k8sVersionPrefix: v1.28.7+vmware.1-fips.1, osImageSelector: os-name=<os>}

Environment

vCenter 8.0u3

vSphere Supervisor

VKS 3.4.0 or lower

Cause

The upgrade process from legacy KR to non-legacy KR fails when an existing kapp-controller is present in the legacy TKC.

On legacy KRs, kapp-controller could optionally be installed manually in workload clusters.

All non-legacy KRs include kapp-controller and install it automatically in the workload cluster.

As a result, if kapp-controller was manually installed in the workload cluster before the upgrade to a non-legacy KR, this pre-existing kapp-controller error appears.

The error indicates that the system-installed kapp-controller included with non-legacy KRs is failing to install: it attempts to assume ownership of kapp-controller-owned objects that are still owned by the manually installed kapp-controller.
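Before upgrading, a quick pre-check from the workload cluster context can indicate whether this KB applies. This is a minimal sketch: the namespace (tkg-system) and deployment name are assumptions based on a typical manual kapp-controller install, so adjust them to wherever kapp-controller was actually deployed.

```shell
# Hypothetical pre-check: a manually installed kapp-controller usually runs as
# a Deployment, e.g. in the tkg-system (or kapp-controller) namespace.
# Capture its listing from the workload cluster context:
#   listing=$(kubectl -n tkg-system get deploy kapp-controller --no-headers --ignore-not-found)
needs_remediation() {
  # $1: the captured deployment listing (empty if the deployment is not found)
  if [ -n "$1" ]; then
    echo "pre-existing kapp-controller found: remediation required"
  else
    echo "no pre-existing kapp-controller: not affected by this KB"
  fi
}
needs_remediation "kapp-controller   1/1     1            1           200d"
needs_remediation ""
```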

Issue Sequence:

  1. The upgrade validation fails because the kapp-controller already exists in the target legacy TKC
  2. As a result, the upgrade is blocked, and the Cluster resource is not updated
  3. The Cluster resource continues to reference the legacy KR v1.28.7
  4. The ClusterBootstrap resource required for non-legacy KR is never created, causing the upgrade process to become stuck

KR v1.28.7 is the latest and final legacy release.

After applying the workaround in this KB, this issue will not reappear in subsequent upgrades.

Resolution

This workaround involves identifying whether a clusterbootstrap exists and applying an annotation that transfers ownership from the manually installed kapp-controller to the system-installed kapp-controller.

 

Clusterbootstrap Workaround

  1. Connect to the Supervisor cluster context

  2. Confirm if a clusterbootstrap exists for the affected workload cluster. If it exists, it will have the same name as the affected workload cluster:
    kubectl get clusterbootstrap -n <affected workload cluster namespace>

     

  3. If the clusterbootstrap does not exist, proceed with the next steps.
    Otherwise, if the clusterbootstrap does exist, skip to the section "Transfer Kapp-Controller Ownership" below.


  4. Create a placeholder for the clusterbootstrap:
    1. Create the YAML file for the placeholder clusterbootstrap: 
      Note:
      This YAML assumes that the desired KR version is v1.29.4.

      cat > placeholder-clusterbootstrap.yaml << EOF
      apiVersion: run.tanzu.vmware.com/v1alpha3
      kind: ClusterBootstrap
      metadata:
        name: "<affected workload cluster>"
        namespace: "<affected workload cluster namespace>"
        annotations:
          tkg.tanzu.vmware.com/add-missing-fields-from-tkr: "v1.29.4---vmware.3-fips.1-tkg.1"
      spec:
        paused: true
      EOF


    2. Use kubectl to apply the placeholder clusterbootstrap and create it in paused state:
      kubectl apply -f placeholder-clusterbootstrap.yaml

       

    3. Confirm that the placeholder clusterbootstrap was created successfully:
      kubectl get -n <affected workload cluster namespace> clusterbootstrap <affected workload cluster>

       

  5. Update the placeholder clusterBootstrap's status to trigger remediation:
    1. Define a temporary timestamp variable:

      CREATION_TIMESTAMP=$(kubectl get clusterbootstrap <affected workload cluster> -n <affected workload cluster namespace> -o jsonpath='{.metadata.creationTimestamp}')

       

    2. Update the clusterbootstrap status:
      kubectl patch clusterbootstrap <affected workload cluster> -n <affected workload cluster namespace> --subresource=status --type='merge' --patch="{\"status\":{\"conditions\":[{\"status\":\"False\",\"type\":\"Kapp-Controller-Workaround\",\"lastTransitionTime\":\"$CREATION_TIMESTAMP\"}]}}"
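To confirm the patch landed, the status conditions can be inspected. The sketch below is hedged: a sample status document stands in for the live object, and the jsonpath query shown in the comment is the form the live check would take.

```shell
# The live check would be:
#   kubectl -n <affected workload cluster namespace> get clusterbootstrap <affected workload cluster> \
#     -o jsonpath='{.status.conditions[?(@.type=="Kapp-Controller-Workaround")].status}'
# which should print: False
# Here, a sample status document stands in for the live object:
sample='{"status":{"conditions":[{"status":"False","type":"Kapp-Controller-Workaround"}]}}'
echo "$sample" | grep -o '"type":"Kapp-Controller-Workaround"'
```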

       

  6. The control plane node upgrade will now be triggered. Monitor the rolling upgrade of the control plane nodes until it completes (PHASE: Running):
    kubectl -n <affected workload cluster namespace> get machines | grep <affected workload cluster>
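The "all nodes on the desired version" check in this monitoring step can be sketched as a small helper. The cluster name, machine names, and column layout below are illustrative assumptions (typical `kubectl get machines` output ends with a VERSION column).

```shell
# Sketch (column layout assumed: NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION).
#   listing=$(kubectl -n <affected workload cluster namespace> get machines --no-headers | grep <affected workload cluster>)
all_on_version() {
  # succeed only if every machine's last column equals the target version ($2)
  ! echo "$1" | awk '{print $NF}' | grep -qv "^$2$"
}
listing='cp-abc   mycluster   cp-abc   vsphere://1   Running   10m   v1.29.4+vmware.3-fips.1
np-xyz   mycluster   np-xyz   vsphere://2   Running   10m   v1.28.7+vmware.1-fips.1'
all_on_version "$listing" "v1.29.4+vmware.3-fips.1" && echo "all on target" || echo "still rolling"
```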



  7. Do not skip this step: After all control plane nodes are on the desired version and stabilized in Running state, unpause the placeholder clusterBootstrap:
    1. Confirm that the clusterbootstrap is paused:

      kubectl -n <affected workload cluster namespace> get clusterbootstrap <affected workload cluster> -o yaml | grep -i pause
          paused: true

       

    2. Unpause the placeholder clusterbootstrap:
      kubectl -n <affected workload cluster namespace> patch clusterbootstrap <affected workload cluster> --type=merge --patch '{"spec":{"paused":false}}'

       This will generate a kapp-controller for the non-legacy KR.


Transfer Kapp-Controller Ownership

The below steps will be performed from the Supervisor cluster context.

  1. Confirm the status of the affected cluster's kapp-controller PKGI. This resource is expected to show "Reconcile failed":
    kubectl -n <affected workload cluster namespace> get pkgi <affected workload cluster>-kapp-controller
    NAME                               PACKAGE NAME                       PACKAGE VERSION                DESCRIPTION
    <affected workload cluster>-kapp-controller   kapp-controller.tanzu.vmware.com   0.50.0+vmware.1-tkg.1-vmware   Reconcile failed: Error (see .status.usefulErrorMessage for details)

     

  2. Transfer ownership from the manually installed kapp-controller to the system-installed kapp-controller included with non-legacy KRs:
    1. Create a YTT overlay YAML file named "kapp-edit-ytt.yaml":

      cat > kapp-edit-ytt.yaml <<EOF
      #@ load("@ytt:overlay", "overlay")
      #@overlay/match by=overlay.subset({"kind":"Deployment", "metadata": {"name": "kapp-controller", "namespace":"tkg-system"}})
      ---
      metadata:
        annotations:
          #@overlay/match missing_ok=True
          kapp.k14s.io/update-strategy: fallback-on-replace
      EOF
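For reference, the overlay adds the annotation below to the rendered kapp-controller Deployment. kapp's fallback-on-replace update strategy falls back to deleting and recreating a resource when an in-place update fails, which is what allows the system-installed package to take over the objects owned by the manual install. The fragment is an abbreviated sketch of the rendered result, not a file to apply:

```yaml
# Effect of the overlay on the rendered Deployment (abbreviated):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kapp-controller
  namespace: tkg-system
  annotations:
    kapp.k14s.io/update-strategy: fallback-on-replace  # delete-and-recreate if update fails
```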

       

    2. Use kubectl to create a generic secret resource using the above kapp-edit-ytt.yaml file:
      kubectl -n <affected workload cluster namespace> create secret generic kapp-edit-ytt --from-file=kapp-edit-ytt.yaml

       

    3. Confirm that the generic secret resource was successfully created:
      kubectl -n <affected workload cluster namespace> get secrets kapp-edit-ytt
      
      NAME            TYPE     DATA   AGE
      kapp-edit-ytt   Opaque   1      #s

       

  3. Add the below annotation so that the above YTT overlay is applied to the affected cluster's kapp-controller PKGI:
    kubectl -n <affected workload cluster namespace> annotate pkgi <affected workload cluster>-kapp-controller ext.packaging.carvel.dev/ytt-paths-from-secret-name.0=kapp-edit-ytt

     

  4. Confirm that the kapp-controller PKGI for the affected workload cluster now shows "Reconcile succeeded":
    kubectl -n <affected workload cluster namespace> get pkgi <affected workload cluster>-kapp-controller
    
    NAME                               PACKAGE NAME                       PACKAGE VERSION                DESCRIPTION
    <affected workload cluster>-kapp-controller   kapp-controller.tanzu.vmware.com   0.50.0+vmware.1-tkg.1-vmware   Reconcile succeeded

     

  5. The upgrade will now progress and proceed to the worker node pools:
    kubectl -n <affected workload cluster namespace> get machines | grep <affected workload cluster>

    The TKC will report READY: True once all nodes have stabilized on the desired version:

    kubectl -n <affected workload cluster namespace> get tkc <affected workload cluster>
    
    NAMESPACE   NAME           CONTROL PLANE   WORKER   KUBERNETES RELEASE NAME           AGE    READY
    <affected workload cluster namespace>  <affected workload cluster>   #              #        v1.29.4---vmware.3-fips.1-tkg.1   ###d   True

     

  6. Clean up the added overlay annotation, the kapp-edit-ytt secret resource, and the YAML files:
    kubectl -n <affected workload cluster namespace> annotate pkgi <affected workload cluster>-kapp-controller ext.packaging.carvel.dev/ytt-paths-from-secret-name.0-
    
    kubectl -n <affected workload cluster namespace> delete secret kapp-edit-ytt
    
    rm kapp-edit-ytt.yaml
    
    rm placeholder-clusterbootstrap.yaml
 

Additional Information