VKS Cluster Upgrade Stuck error reconciling the Cluster topology

Article ID: 433032


Products

Tanzu Kubernetes Runtime

VMware vSphere Kubernetes Service

Issue/Introduction

A vSphere Kubernetes Service (VKS) cluster upgrade is stuck with a Cluster topology reconcile error.

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • The affected VKS cluster and clusterbootstrap are on the desired vSphere Kubernetes Release (VKR) version:
    kubectl get cluster,clusterbootstrap -n <affected cluster namespace> <affected cluster name>

     

  • Describing the affected VKS cluster shows an error similar to the following:
    conditions:
      - lastTransitionTime: "YYYY-MM-DDTHH:MM:SSZ"
        message: '* TopologyReconciled: error reconciling the Cluster topology: failed
          to create patch helper for KubeadmControlPlane <namespace>/<cluster name>-<kcp id>:
          server side apply dry-run failed for modified object: KubeadmControlPlane.controlplane.cluster.x-k8s.io
          "<cluster name>-<kcp id>" is invalid: [spec.kubeadmConfigSpec.initConfiguration.nodeRegistration.kubeletExtraArgs:
          Invalid value: "array": kubeletExtraArgs name must be unique, spec.kubeadmConfigSpec.joinConfiguration.nodeRegistration.kubeletExtraArgs:
          Invalid value: "array": kubeletExtraArgs name must be unique]'

     

  • However, other cluster objects are still on the old VKR version:
    kubectl get kcp,md,machines -n <affected VKS cluster namespace>
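The uniqueness rule quoted in the error above can be illustrated with a short sketch. The check mirrors the API server validation that each entry in a kubeletExtraArgs list must have a unique name; the sample entries below are hypothetical, not taken from an affected cluster:

```python
def kubelet_extra_args_unique(args):
    """Return True if every 'name' in a kubeletExtraArgs list is unique.

    Mirrors the "kubeletExtraArgs name must be unique" validation quoted
    in the Cluster condition above.
    """
    names = [entry["name"] for entry in args]
    return len(names) == len(set(names))

# Hypothetical sample data: a duplicated entry of the kind produced when
# field ownership is lost and a re-applied list is appended rather than merged.
duplicated = [
    {"name": "cloud-provider", "value": "external"},
    {"name": "cloud-provider", "value": "external"},
]
valid = [
    {"name": "cloud-provider", "value": "external"},
    {"name": "read-only-port", "value": "0"},
]
```

With the duplicated list the validation fails, which is why the server-side apply dry-run in the condition message rejects the modified KubeadmControlPlane.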

Environment

vSphere Supervisor

VKS Cluster

VKS 3.5 or 3.6

Cause

A Kubernetes object in the environment has managedFields entries under both the v1beta1 and v1beta2 API versions, but the conversion webhook was unable to convert the object, which can be caused by issues such as network unavailability. If an in-flight write operation on the object (such as a VKR upgrade) is interrupted in this state, kube-apiserver can drop all managedFields on that object.

This is a known upstream Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/136919
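As a rough illustration of the precondition described above, the following sketch flags an object whose managedFields entries span more than one apiVersion (for example, both v1beta1 and v1beta2). The sample entries are hypothetical:

```python
def mixed_api_versions(managed_fields):
    """Return True if the managedFields entries span more than one
    apiVersion, the mixed-version state described in the Cause above.
    """
    versions = {entry.get("apiVersion") for entry in managed_fields}
    return len(versions) > 1

# Hypothetical managedFields entries on a KubeadmControlPlane object:
mixed = [
    {"manager": "capi-controller",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1"},
    {"manager": "kubectl",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta2"},
]
consistent = [
    {"manager": "capi-controller",
     "apiVersion": "controlplane.cluster.x-k8s.io/v1beta2"},
]
```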

Resolution

  1. Connect to the Supervisor cluster context.

  2. Note down the name of the affected VKS cluster and VKR version:
    kubectl get cluster -n <affected VKS cluster namespace>
    
    kubectl describe cluster -n <affected VKS cluster namespace> <affected VKS cluster name>

     

  3. Document the name of the KubeadmControlPlane (KCP) for the affected VKS cluster:
    kubectl get kcp -n <affected vks cluster namespace>
    
    NAME                                                 CLUSTER
    <affected vks cluster name>-<kcp id>       <affected vks cluster name>

     

  4. Check whether the following command returns any output for the managedFields of the affected VKS cluster's KCP:
    kubectl get kcp --show-managed-fields -o jsonpath='{.metadata.managedFields}' -n <affected vks cluster namespace> <affected vks cluster name>-<kcp id>


  5. Confirm whether there is an entry for manager: before-first-apply in the KCP YAML for the affected VKS cluster:
    kubectl get kcp -o yaml --show-managed-fields -n <affected vks cluster namespace> <affected vks cluster name>-<kcp id>  | grep "before-first-apply"
    manager: before-first-apply

     

  6. If Step 4 returns no output, or Step 5 shows that manager: before-first-apply is present, proceed with the rest of this KB.


  7. Download the script attached to this KB article and move it to one of the Supervisor Control Plane VMs.


  8. SSH to the Supervisor Control Plane VM:
    See "How to SSH into Supervisor Control Plane VMs" in KB Troubleshooting vSphere Supervisor Control Plane VMs


  9. Perform a dry-run of the attached script:
    python mitigate-managed-fields.py -n <affected vks cluster namespace> --cluster <affected vks cluster name> --dry-run

     

  10. If the dry-run completes successfully, run the attached script without the dry-run flag:
    python mitigate-managed-fields.py -n <affected vks cluster namespace> --cluster <affected vks cluster name>

     

  11. Check that the VKS cluster no longer shows the Cluster topology error from before:
    kubectl describe cluster -n <affected vks cluster namespace> <affected vks cluster name>

     

  12. The VKS cluster's upgrade should now progress, provided there are no other environmental issues.
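The checks in Steps 4 and 5 that gate this procedure can be sketched as a small helper, assuming the KCP object has already been fetched as a Python dict (this helper is illustrative only and is not part of the attached script):

```python
def needs_mitigation(kcp):
    """Return True if the KCP object is in the broken state covered by this KB:
    either it has no managedFields at all (Step 4 returns no output), or its
    fields are owned by the synthetic 'before-first-apply' manager (Step 5).
    """
    managed_fields = kcp.get("metadata", {}).get("managedFields") or []
    if not managed_fields:
        return True
    return any(entry.get("manager") == "before-first-apply"
               for entry in managed_fields)

# Hypothetical KCP objects illustrating the three cases:
healthy = {"metadata": {"managedFields": [{"manager": "capi-controller"}]}}
dropped = {"metadata": {"managedFields": []}}
degraded = {"metadata": {"managedFields": [{"manager": "before-first-apply"}]}}
```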

Attachments

mitigate-managed-fields.py