How to downgrade VKS clusterbootstrap

Article ID: 432141

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After initiating a vSphere Kubernetes Release (VKR) upgrade on a vSphere Kubernetes Service (VKS) cluster, the upgrade becomes stuck and does not progress.

While connected to the Supervisor cluster context, the following symptoms are observed:

  • The cluster and clusterbootstrap have been updated to the desired VKR version:
    kubectl get cluster,clusterbootstrap -n <affected vks cluster namespace>

  • However, the nodes and the objects that manage them, the kubeadmcontrolplane (KCP) and machinedeployment (MD), are still on the old VKR version:
    kubectl get kcp,md,machines -n <affected vks cluster namespace>
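
The version comparison above can also be done in one pass. The following read-only shell sketch is an illustration, not a supported tool. It assumes the upstream Cluster API field paths spec.topology.version (Cluster), spec.version (KubeadmControlPlane and Machine) and spec.template.spec.version (MachineDeployment), plus the cluster.x-k8s.io/cluster-name label. It compares Kubernetes version strings on the assumption that they track the VKR. The function names are hypothetical; verify the field paths against your Supervisor release before relying on the output.

```shell
#!/usr/bin/env bash
# Read-only sketch: list Cluster API objects whose Kubernetes version differs
# from the version requested on the Cluster.
# ASSUMPTION: upstream Cluster API field paths and the
# cluster.x-k8s.io/cluster-name label apply to your Supervisor release.

# Reads "KIND/NAME VERSION" lines on stdin and prints every line whose
# VERSION differs from the desired version given as $1.
find_version_mismatches() {
  awk -v want="$1" 'NF >= 2 && $2 != want { print $1 " " $2 }'
}

# Prints "KIND/NAME VERSION" for the KCP, MachineDeployments and Machines
# that belong to cluster $2 in namespace $1. Runs only "kubectl get".
collect_versions() {
  local ns="$1" cluster="$2"
  local sel="cluster.x-k8s.io/cluster-name=${cluster}"
  kubectl get kcp -n "$ns" -l "$sel" \
    -o jsonpath='{range .items[*]}kcp/{.metadata.name} {.spec.version}{"\n"}{end}'
  kubectl get md -n "$ns" -l "$sel" \
    -o jsonpath='{range .items[*]}md/{.metadata.name} {.spec.template.spec.version}{"\n"}{end}'
  kubectl get machines -n "$ns" -l "$sel" \
    -o jsonpath='{range .items[*]}machine/{.metadata.name} {.spec.version}{"\n"}{end}'
}

# Hypothetical entry point: run while connected to the Supervisor context.
check_supervisor() {
  local ns="$1" cluster="$2" desired
  desired=$(kubectl get cluster "$cluster" -n "$ns" -o jsonpath='{.spec.topology.version}')
  echo "Cluster ${cluster} requests version: ${desired}"
  collect_versions "$ns" "$cluster" | find_version_mismatches "$desired"
}
```

Example usage, with hypothetical namespace and cluster names: source the script, then run check_supervisor my-namespace my-vks-cluster. Each line printed is an object that is still on a version other than the one the Cluster requests. In this issue, that output is expected for the KCP, MachineDeployments and Machines.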

While connected to the affected VKS cluster's context, the following symptoms are observed:

  • PackageInstalls (PKGI) are in the ReconcileFailed state with an error message similar to the following:
    kubectl get pkgi -A
    
    kubectl describe pkgi -n <pkgi namespace> <pkgi name>
    
    Stopped installing matched version '<version A>' since last attempted version '<version B>' is higher. hint: Add annotation packaging.carvel.dev/downgradable: "" to PackageInstall to proceed with downgrade

  • New system pods in the VKS cluster are in the Pending, ErrImagePull, or ImagePullBackOff state.
    These pods are trying to run on nodes with the desired VKR and use an image version that is only available in the desired VKR version:
    kubectl get pods -A

    A pod that targets a node with the desired VKR has a Node-Selector naming that VKR version, similar to the following:

    kubectl describe pod -n <pod namespace> <pod name>
    
    Node-Selectors:
              run.tanzu.vmware.com/tkr=v#.##.##---vmware.#-fips-vkr.#

  • For each system pod in one of the above states, another replica of the same workload is in the Running state on the old VKR version.
    System pods follow a rolling redeployment pattern: a new pod is created with the new version and images, and the older pod is deleted only after the new pod becomes healthy.
    • Do not delete healthy system pods in this scenario.
      A deleted system pod is recreated on the new version, becomes stuck in the same state, and causes downtime for that system.
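
The checks above can be combined into a single read-only pass from the affected VKS cluster's context. This shell sketch makes two assumptions that should be verified in your environment. First, kapp-controller records the reconcile error in the .status.usefulErrorMessage field of each PackageInstall. Second, nodes carry the same run.tanzu.vmware.com/tkr label that appears in the pod Node-Selectors. The function names are hypothetical. The sketch only reads resources and never deletes, restarts or annotates anything, in keeping with the warning not to delete healthy system pods.

```shell
#!/usr/bin/env bash
# Read-only sketch for the affected VKS cluster context.
# ASSUMPTIONS: kapp-controller stores the reconcile error in
# .status.usefulErrorMessage, and nodes carry the run.tanzu.vmware.com/tkr
# label used in pod Node-Selectors. Nothing here modifies any resource.

# Reads "NAMESPACE NAME MESSAGE..." lines on stdin (one PackageInstall per
# line) and prints NAMESPACE/NAME for each one stopped by the downgrade guard.
find_blocked_pkgis() {
  grep -F "since last attempted version" | awk '{ print $1 "/" $2 }'
}

# $1: VKR requested by a pod Node-Selector.
# $2: newline-separated VKR labels found on nodes.
# Prints $1 only if no node label matches it exactly.
vkr_without_nodes() {
  printf '%s\n' "$2" | grep -Fqx -- "$1" || printf '%s\n' "$1"
}

# Hypothetical entry point: run while connected to the VKS cluster context.
check_vks_cluster() {
  local node_vkrs
  echo "PackageInstalls blocked by the downgrade guard:"
  kubectl get pkgi -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name} {.status.usefulErrorMessage}{"\n"}{end}' \
    | find_blocked_pkgis

  # VKR label values currently present on nodes.
  node_vkrs=$(kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.run\.tanzu\.vmware\.com/tkr}{"\n"}{end}')

  echo "VKRs requested by pod Node-Selectors but found on no node:"
  kubectl get pods -A \
    -o jsonpath='{range .items[*]}{.spec.nodeSelector.run\.tanzu\.vmware\.com/tkr}{"\n"}{end}' \
    | sort -u | while read -r vkr; do
        if [ -n "$vkr" ]; then
          vkr_without_nodes "$vkr" "$node_vkrs"
        fi
      done
}
```

Example usage: source the script, then run check_vks_cluster. A VKR listed under the second heading is requested by at least one pod but provided by no node, which matches the stuck state described above.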

Environment

vSphere Supervisor

VKS Cluster

Cause

Unsupported actions performed during the VKR upgrade of a VKS cluster have left the system stuck while it tries to complete a rolling redeployment of the nodes in the affected cluster. Meanwhile, the VKR upgrade has already triggered an update to the package versions within the affected VKS cluster.

As a result, the nodes remain on the old VKR version while the system tries to create system pods that use images available only on the new VKR version, which produces this stuck state.

Resolution

DISCLAIMER: Because this issue is caused by unsupported actions performed in the environment, the internal steps to resolve it are not guaranteed to work, and a full redeployment of the VKS cluster may be necessary.

Reach out to VMware by Broadcom Technical Support referencing this KB article.