Upgrading a TKC from v1.28.7 (legacy TKr) to v1.29.4 (non-legacy TKr) gets stuck.
The TKC resource shows READY: False.
# kubectl get tkc -A
NAMESPACE   NAME           CONTROL PLANE   WORKER   KUBERNETES RELEASE NAME           AGE    READY
test-ns     test-cluster   3               5        v1.29.4---vmware.3-fips.1-tkg.1   568d   False
The VERSION value is not propagated to the Cluster resource; it still references v1.28.7.
# kubectl get cluster -A
NAMESPACE   NAME           CLUSTERCLASS             PHASE         AGE    VERSION
test-ns     test-cluster   builtin-generic-v3.1.0   Provisioned   568d   v1.28.7+vmware.1-fips.1
The TKC resource reports the following error condition.
# kubectl -n test-ns describe tkc test-cluster
Conditions:
  Last Transition Time:  20YY-MM-DDT10:03:31Z
  Message:               Error in fetching ClusterBootstrap
  Reason:                ClusterBootstrapFailed
  Severity:              Warning
  Status:                False
Events:
  Type    Reason   Age                   From                                                                                             Message
  ----    ------   ----                  ----                                                                                             -------
  Normal  Warning  50s (x25 over 4m33s)  svc-tkg-domain-cXXXX/svc-tkg-domain-cXXXX-tkg-controller/tanzukubernetescluster-spec-controller  found pre-existing kapp-controller in the workload cluster before initiating upgrade; requires user remediation
The vmware-system-tkg-controller-manager pod logs report "Could not resolve KR/OSImage".
# kubectl -n svc-tkg-domain-cXXXX logs deployment/vmware-system-tkg-controller-manager
E0701 HH:MM:SS.sssss 1 regeneratetkrdata_controller.go:97] "regenerating TKR_DATA" err=
Could not resolve KR/OSImage
Missing compatible KR/OSImage for the cluster
Control Plane, filters: {k8sVersionPrefix: v1.28.7+vmware.1-fips.1, osImageSelector: os-name=photon,tkr.tanzu.vmware.com/standard}
MachineDeployment <MD NAME>, filters: {k8sVersionPrefix: v1.28.7+vmware.1-fips.1, osImageSelector: os-name=photon}
VMware vSphere Kubernetes Service - VC8U3
The upgrade from a legacy TKr to a non-legacy TKr fails when a pre-existing kapp-controller is present in the legacy TKC.
Issue Sequence:
1. The upgrade validation fails because kapp-controller already exists in the target legacy TKC.
2. As a result, the upgrade is blocked and the Cluster resource is not updated.
3. The Cluster resource continues to reference the legacy TKr v1.28.7.
4. The ClusterBootstrap resource required for the non-legacy TKr is never created, causing the upgrade process to become stuck.
TKr v1.28.7 is the last legacy release, so after applying the workaround in this KB the issue does not recur in subsequent upgrades.
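Before applying the workaround, you can confirm you are hitting this condition by looking for the pre-existing kapp-controller Deployment from inside the workload cluster. This is a minimal sketch, assuming your kubeconfig context points at the workload (TKC) cluster; the tkg-system namespace and Deployment name match the ytt overlay used later in this KB.

```shell
# Run against the workload (TKC) cluster context, not the Supervisor.
# If this Deployment exists, the upgrade validation will find the
# pre-existing kapp-controller and block the legacy -> non-legacy upgrade.
kubectl -n tkg-system get deployment kapp-controller
```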
1. Log in to the Supervisor node via SSH
2. Create a dummy ClusterBootstrap resource to bypass the upgrade validation
# Set the values
VSPHERE_NAMESPACE="test-ns"
TARGET_CLUSTER="test-cluster"
TKR_NAME="v1.29.4---vmware.3-fips.1-tkg.1" # Upgrade destination
cat > dummy-clusterbootstrap.yaml << EOF
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: ClusterBootstrap
metadata:
  name: $TARGET_CLUSTER
  namespace: $VSPHERE_NAMESPACE
  annotations:
    tkg.tanzu.vmware.com/add-missing-fields-from-tkr: $TKR_NAME
spec:
  paused: true
EOF
kubectl apply -f dummy-clusterbootstrap.yaml
3. Update the ClusterBootstrap status to trigger reconciliation
CREATION_TIMESTAMP=$(kubectl get clusterbootstrap $TARGET_CLUSTER -n $VSPHERE_NAMESPACE -o jsonpath='{.metadata.creationTimestamp}')
kubectl patch clusterbootstrap $TARGET_CLUSTER -n $VSPHERE_NAMESPACE --subresource=status --type='merge' --patch="{\"status\":{\"conditions\":[{\"status\":\"False\",\"type\":\"Kapp-Controller-Workaround\",\"lastTransitionTime\":\"$CREATION_TIMESTAMP\"}]}}"
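The escaped JSON in the patch above is easy to mistype. As a sanity check, the same payload can be built in a readable heredoc and validated locally before patching; this is a sketch, with a sample timestamp standing in for the cluster's real creationTimestamp.

```shell
# Build the status patch in a readable heredoc and verify it is valid JSON.
# The sample timestamp stands in for the real $CREATION_TIMESTAMP value.
CREATION_TIMESTAMP="2024-01-01T00:00:00Z"
PATCH=$(cat <<EOF
{"status":{"conditions":[{"status":"False","type":"Kapp-Controller-Workaround","lastTransitionTime":"$CREATION_TIMESTAMP"}]}}
EOF
)
# Pretty-print; exits non-zero if the JSON is malformed.
echo "$PATCH" | python3 -m json.tool
```

The validated $PATCH string can then be passed to `kubectl patch ... --patch="$PATCH"` unchanged.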
4. The control plane node upgrade is triggered. Wait until the rolling upgrade of the control plane nodes completes (PHASE: Running)
kubectl -n $VSPHERE_NAMESPACE get ma -w | grep $TARGET_CLUSTER
Confirm that the worker node upgrade is stuck: the newly created node remains in PHASE: Provisioning instead of transitioning to Running.
5. Generate the PackageInstall (pkgi) <TARGET_CLUSTER>-kapp-controller for the non-legacy TKr by unpausing the ClusterBootstrap
kubectl -n $VSPHERE_NAMESPACE patch clusterbootstrap $TARGET_CLUSTER --type=merge --patch '{"spec":{"paused":false}}'
Confirm the result. The PackageInstall DESCRIPTION shows "Reconcile failed".
kubectl -n $VSPHERE_NAMESPACE get pkgi ${TARGET_CLUSTER}-kapp-controller
#> NAME                               PACKAGE NAME                       PACKAGE VERSION                DESCRIPTION
#> <TARGET_CLUSTER>-kapp-controller   kapp-controller.tanzu.vmware.com   0.50.0+vmware.1-tkg.1-vmware   Reconcile failed: Error (see .status.usefulErrorMessage for details)
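The full failure detail can be read from the field the DESCRIPTION column points at; a sketch, using the .status.usefulErrorMessage path named in the error message above:

```shell
# Inspect why the PackageInstall reconcile failed; the expected error at this
# stage relates to the pre-existing kapp-controller in the workload cluster.
kubectl -n $VSPHERE_NAMESPACE get pkgi ${TARGET_CLUSTER}-kapp-controller \
  -o jsonpath='{.status.usefulErrorMessage}'
```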
6. Create a secret containing a ytt overlay that lets the package replace the pre-installed kapp-controller in the legacy TKC
cat > kapp-edit-ytt.yaml <<EOF
#@ load("@ytt:overlay", "overlay")
#@overlay/match by=overlay.subset({"kind":"Deployment", "metadata": {"name": "kapp-controller", "namespace":"tkg-system"}})
---
metadata:
  annotations:
    #@overlay/match missing_ok=True
    kapp.k14s.io/update-strategy: fallback-on-replace
EOF
kubectl -n $VSPHERE_NAMESPACE create secret generic kapp-edit-ytt --from-file=kapp-edit-ytt.yaml
7. Annotate the PackageInstall so it applies the ytt overlay from the secret
kubectl -n $VSPHERE_NAMESPACE annotate pkgi ${TARGET_CLUSTER}-kapp-controller ext.packaging.carvel.dev/ytt-paths-from-secret-name.0=kapp-edit-ytt
Check the result. DESCRIPTION will show "Reconcile succeeded".
kubectl -n $VSPHERE_NAMESPACE get pkgi ${TARGET_CLUSTER}-kapp-controller
#> NAME                               PACKAGE NAME                       PACKAGE VERSION                DESCRIPTION
#> <TARGET_CLUSTER>-kapp-controller   kapp-controller.tanzu.vmware.com   0.50.0+vmware.1-tkg.1-vmware   Reconcile succeeded
8. The worker node rolling upgrade resumes
kubectl -n $VSPHERE_NAMESPACE get ma -w | grep $TARGET_CLUSTER
9. The TKC becomes READY: True after 10-15 minutes
kubectl -n $VSPHERE_NAMESPACE get tkc $TARGET_CLUSTER -w
#> NAMESPACE   NAME           CONTROL PLANE   WORKER   KUBERNETES RELEASE NAME           AGE    READY
#> test-ns     test-cluster   3               5        v1.29.4---vmware.3-fips.1-tkg.1   568d   True
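As a final check, confirm that the version from the symptom section has propagated to the Cluster resource; a sketch, reusing the variables set in step 2:

```shell
# The Cluster resource should now report the non-legacy Kubernetes version
# in its VERSION column, instead of the legacy v1.28.7+vmware.1-fips.1
# seen before the workaround was applied.
kubectl -n $VSPHERE_NAMESPACE get cluster $TARGET_CLUSTER
```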
10. Cleanup
kubectl -n $VSPHERE_NAMESPACE annotate pkgi ${TARGET_CLUSTER}-kapp-controller ext.packaging.carvel.dev/ytt-paths-from-secret-name.0-
kubectl -n $VSPHERE_NAMESPACE delete secret kapp-edit-ytt
rm kapp-edit-ytt.yaml
rm dummy-clusterbootstrap.yaml