After initiating a vSphere Kubernetes Release (VKR) upgrade on a vSphere Kubernetes Service (VKS) cluster, the upgrade is stuck and not progressing.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get cluster,clusterbootstrap -n <affected vks cluster namespace>
kubectl get kcp,md,machines -n <affected vks cluster namespace>
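The Supervisor-side check above can be narrowed to the fields that tend to matter during a stalled rollout. The following is a minimal sketch, assuming the Machine objects expose the standard Cluster API fields `.status.phase` and `.spec.version`. The function name `list_machine_versions` is hypothetical and not part of the product.

```shell
# Hypothetical helper (not a product command): print each Machine's name,
# phase, and Kubernetes version for one VKS cluster namespace.
# Run it while connected to the Supervisor cluster context.
list_machine_versions() {
  ns="$1"
  if [ -z "$ns" ]; then
    echo "usage: list_machine_versions <affected vks cluster namespace>" >&2
    return 1
  fi
  # Assumes the standard Cluster API Machine fields .status.phase and .spec.version.
  kubectl get machines -n "$ns" \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,VERSION:.spec.version'
}
```

Calling `list_machine_versions <affected vks cluster namespace>` should make it easier to see which machines remain on the old version and which, if any, were created on the new one.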
While connected to the affected VKS cluster's context, the following symptoms are observed:
kubectl get pkgi -A
kubectl describe pkgi -n <pkgi namespace> <pkgi name>
Stopped installing matched version '<version A>' since last attempted version '<version B>' is higher. hint: Add annotation packaging.carvel.dev/downgradable: "" to PackageInstall to proceed with downgrade
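To record which two package versions are in conflict, the message above can be parsed as text. This is a small sketch that assumes the message keeps the wording shown above. The function name `parse_pkgi_versions` is hypothetical, and the sketch only reads the message; it does not modify the PackageInstall.

```shell
# Hypothetical helper: extract the two versions from the PackageInstall message.
# Prints "<matched version> <last attempted version>" on one line,
# or prints nothing if the message does not have the expected wording.
parse_pkgi_versions() {
  printf '%s\n' "$1" | sed -n \
    "s/.*matched version '\([^']*\)' since last attempted version '\([^']*\)'.*/\1 \2/p"
}
```

Pass the message copied from the `kubectl describe pkgi` output as a single quoted argument. The first value printed is the version the system is trying to install and the second is the higher version it last attempted.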
kubectl get pods -A
A pod that must run on a node with the desired VKR has a Node-Selector that names the VKR version, similar to the following:
kubectl describe pod -n <pod namespace> <pod name>
Node-Selectors:
run.tanzu.vmware.com/tkr=v#.##.##---vmware.#-fips-vkr.#
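The Node-Selector shown above can be compared against the labels the nodes currently carry. The following is a rough sketch, assuming the nodes are labeled with the same `run.tanzu.vmware.com/tkr` key and that the affected pods are unschedulable and therefore in the Pending phase. The function names are hypothetical.

```shell
# Label key taken from the Node-Selectors example above.
TKR_LABEL="run.tanzu.vmware.com/tkr"

# Hypothetical helper: list every node with an extra column showing the
# value of the VKR label (assumes nodes carry this label key).
show_node_vkr() {
  kubectl get nodes -L "$TKR_LABEL"
}

# Hypothetical helper: list pods in the Pending phase across all namespaces
# (assumes the affected pods cannot be scheduled and are therefore Pending).
show_pending_pods() {
  kubectl get pods -A --field-selector=status.phase=Pending -o wide
}
```

Run both while connected to the affected VKS cluster's context. If every node reports the old VKR value while the pending pods select the new one, that matches the mismatch described in this article.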
vSphere Supervisor
VKS Cluster
Unsupported actions performed during a VKS cluster VKR upgrade have left the system stuck partway through the rolling redeployment of the nodes in the affected cluster. At the same time, the VKR upgrade has already triggered an update to the package versions within the affected VKS cluster.
As a result, the system keeps the nodes on the old VKR version but tries to create system pods using images that are available only on the new VKR version, which produces this stuck state.
DISCLAIMER: Because this issue is caused by unsupported actions performed in the environment, the internal steps to resolve it are not guaranteed to work, and a full redeployment of the VKS cluster may be necessary.
Reach out to VMware by Broadcom Technical Support referencing this KB article.