When a cluster upgrade is interrupted by an infrastructure issue or other factor, the upgrade can stall. If you then examine the cluster object, you will see that it is labeled with the upgraded Tanzu Kubernetes Release (TKR) version, while the nodes in the cluster are still running the previous release.
tanzu cluster list -o yaml
- name: clustertest
  namespace: default
  status: running
  plan: prod
  controlplane: 1/1
  workers: 3/3
  kubernetes: v1.28.7+vmware.1
  roles: []
  labels:
    tanzuKubernetesRelease: v1.28.7---vmware.1
    tkg.tanzu.vmware.com/cluster-name: clustertest
kubectl get nodes
NAME                          STATUS   ROLES           AGE   VERSION
clustertest-control-plane-1   Ready    control-plane   10d   v1.28.3+vmware.1
clustertest-worker-1          Ready    <none>          10d   v1.28.3+vmware.1
clustertest-worker-2          Ready    <none>          10d   v1.28.3+vmware.1
clustertest-worker-3          Ready    <none>          10d   v1.28.3+vmware.1
This issue can occur when a Tanzu Kubernetes Grid (TKGm) upgrade is interrupted partway through. When this happens, the cluster may also report an 'upgradeStalled' status.
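To see how far the rollout progressed before it stalled, you can also inspect the Cluster API machine objects from the management cluster context. This is a generic Cluster API check rather than a TKG-specific one, and the output columns vary by Cluster API version; the namespace below matches the example cluster:
# Shows whether replacement machines were created for the upgrade and their current phase
kubectl get machines -n default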
Once the root cause of the interruption has been resolved, a potential workaround involves updating the cluster object to reflect the current version that the nodes are running. This should allow you to trigger the upgrade process again.
1). From the management cluster context, list cluster objects:
kubectl get cluster
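If the workload cluster was created in a namespace other than default, list across all namespaces instead (a standard kubectl flag):
# List Cluster API cluster objects in every namespace
kubectl get cluster -A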
2). Take a backup of the cluster object before making any changes:
kubectl get cluster clustertest -o yaml > clusterObjectBackup.yaml
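To confirm the backup captured the label you are about to change, you can search the file for the TKR label (the filename comes from the previous command):
# Show the current tanzuKubernetesRelease label recorded in the backup
grep -n tanzuKubernetesRelease clusterObjectBackup.yaml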
3). Edit the cluster object and update the labels.tanzuKubernetesRelease value to the TKR version the nodes are currently running (that is, the version they were using before the attempted upgrade):
kubectl edit cluster clustertest
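Using the example above, where the nodes are still on v1.28.3, the labels section of the cluster object would be edited to read as follows (values are taken from the sample output; substitute your own pre-upgrade TKR):
labels:
  tanzuKubernetesRelease: v1.28.3---vmware.1
  tkg.tanzu.vmware.com/cluster-name: clustertest
Note that label values use --- in place of + because + is not a valid character in Kubernetes label values. If you prefer a non-interactive change, the same edit can be made with kubectl label (--overwrite replaces the existing value):
# Set the TKR label back to the version the nodes are running
kubectl label cluster clustertest tanzuKubernetesRelease=v1.28.3---vmware.1 --overwrite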
4). After updating the cluster object, you should be able to trigger the upgrade process again.
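For example, retry the upgrade from the machine where the tanzu CLI is configured, then confirm that the node versions converge on the target release (the exact upgrade invocation can vary by TKG version; tanzu cluster upgrade is the usual entry point):
# Retry the interrupted upgrade
tanzu cluster upgrade clustertest
# After the rollout completes, all nodes should report the target version
kubectl get nodes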