vSphere Supervisor Workload Cluster Upgrade to KR v1.31.1 Stuck due to Custom Antrea Resources


Article ID: 384095


Products

VMware vSphere Kubernetes Service
vSphere with Tanzu

Issue/Introduction

A workload cluster is stuck upgrading to KR v1.31.1.

 

While connected to the Supervisor cluster context, one or more of the following symptoms are observed:

  • New control plane nodes are created and reach Running state on the desired upgrade version, but are continuously recreated every 10 to 15 minutes:
    kubectl get machine -n <workload cluster namespace>
    <workload cluster namespace>   machine.cluster.x-k8s.io/<new node name>   <workload cluster>   vsphere://<vsphere id>   Running   10m   <KR v1.31.1 version>
    In this scenario, the workload cluster's worker node pools have not yet upgraded to the desired version because the workload cluster's control plane nodes are not all healthy.

  • A new node was created on the desired upgrade version but remains stuck in Provisioned state:
    kubectl get machine -n <workload cluster namespace>
    <workload cluster namespace>   machine.cluster.x-k8s.io/<new node name>   <workload cluster>   vsphere://<vsphere id>   Provisioned   ##m   <KR v1.31.1 version>
  • Describing the affected cluster shows one or more errors similar to the following:
    kubectl describe cluster -n <workload cluster namespace> <workload cluster name>
    
    * NodeHealthy:
      * Node.Ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
 
The error example below is specific to the Tier CRD. If a different deprecated CRD is in use, a corresponding error for that CRD will be present (a quick way to check a CRD's stored versions is sketched after the example):
status:
  conditions:
  - lastTransitionTime: "YYYY-MM-DDTHH:MM:SSZ"
    message: |-
      kapp: Error: update customresourcedefinition/tiers.crd.antrea.io (apiextensions.k8s.io/v1) cluster:
      Updating resource customresourcedefinition/tiers.crd.antrea.io (apiextensions.k8s.io/v1) cluster:
      API server says:
      CustomResourceDefinition.apiextensions.k8s.io "tiers.crd.antrea.io" is invalid: status.storedVersions[0]:
      Invalid value: "v1alpha1": must appear in spec.versions (reason: Invalid)
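
To confirm whether a deprecated stored version is still recorded on the CRD named in the error, the CRD's status can be inspected directly. This is a minimal check assuming the Tier CRD from the example above; substitute the CRD named in your error message:

    kubectl get crd tiers.crd.antrea.io -o jsonpath='{.status.storedVersions}'
    # Output such as ["v1alpha1"] indicates a stored version that Antrea 2.1 no longer serves.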

 

While connected to the affected workload cluster context, the following symptoms are observed:

  • One or more antrea pods are in 1/2 Running, ImagePullBackOff, ErrImagePull, or CrashLoopBackOff state:
    kubectl get pods -A | grep antrea
    NAMESPACE     NAME                     READY   STATUS
    kube-system   antrea-agent-<id-1>      0/2     Init:ErrImagePull
    kube-system   antrea-agent-<id-2>      0/2     Init:ImagePullBackOff
    kube-system   antrea-agent-<id-3>      1/2     Running
    kube-system   antrea-controller-<id>   0/1     CrashLoopBackOff


  • When viewing the logs of the antrea-controller pod stuck in CrashLoopBackOff, an error message similar to the following is present:
    kubectl logs -n kube-system <antrea-controller-pod>
    
    Starting Antrea Controller (version v1.15.1-ea6613a)
    Error running controller: failed to clean up the deprecated APIServices: apiservices.apiregistration.k8s.io "v1beta1.networking.antrea.tanzu.vmware.com" is forbidden: User "system:serviceaccount:kube-system:antrea-controller" cannot delete resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
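
The forbidden error above refers to leftover legacy *.antrea.tanzu.vmware.com APIService objects. As a quick, hedged check (exact names can vary by cluster), any remaining legacy APIServices can be listed with:

    kubectl get apiservices | grep antrea
    # Entries ending in antrea.tanzu.vmware.com are the deprecated legacy APIServices referenced in the controller log.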

Environment

vSphere Supervisor 8.0

vSphere Supervisor 9.0

VKS Service 3.2.0 and higher

Workload Cluster upgrading to KR v1.31.1

Cause

In vSphere Supervisor, KR v1.31.1 includes Antrea version 2.1, which removes the following advanced Antrea CRD versions from earlier releases.

This issue can occur if any of these advanced APIs were in use in the workload cluster prior to it being upgraded to KR v1.31.1 (a quick check is sketched after the table):

CRD                    CRD version   Introduced in   Deprecated in   Removed in
ClusterGroup           v1alpha2      v1.0.0          v1.1.0          v2.0.0
ClusterGroup           v1alpha3      v1.1.0          v1.13.0         v2.0.0
ClusterNetworkPolicy   v1alpha1      v1.0.0          v1.13.0         v2.0.0
Egress                 v1alpha2      v1.0.0          v1.13.0         v2.0.0
ExternalEntity         v1alpha1      v0.10.0         v0.11.0         v2.0.0
ExternalIPPool         v1alpha2      v1.8.0          v1.13.0         v2.0.0
Group                  v1alpha3      v1.8.0          v1.13.0         v2.0.0
NetworkPolicy          v1alpha1      v1.0.0          v1.13.0         v2.0.0
Tier                   v1alpha1      v1.0.0          v1.13.0         v2.0.0
Traceflow              v1alpha1      v1.0.0          v1.13.0         v2.0.0
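
The following is a minimal sketch to check which of the CRDs in the table still record a removed version in status.storedVersions on the workload cluster; it assumes the standard crd.antrea.io CRD names:

    # Run in the affected workload cluster context; CRDs that are not installed are skipped silently
    for crd in clustergroups clusternetworkpolicies egresses externalentities externalippools groups networkpolicies tiers traceflows; do
      kubectl get crd ${crd}.crd.antrea.io -o jsonpath='{.metadata.name}{": "}{.status.storedVersions}{"\n"}' 2>/dev/null
    done
    # Any CRD still listing a v1alpha* version in storedVersions is affected by this issue.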

Resolution

Initial Checks

  1. Connect to the affected workload cluster context
    Note: It may not be possible to use the workload cluster context if the control plane nodes of the affected cluster are recreating due to this issue.
    In that scenario, the steps below need to be performed over SSH on a control plane node of the affected cluster (one way to retrieve the SSH credentials is sketched after this list).

  2. Check if there are any antrea-pre-upgrade jobs or pods in the cluster:
    kubectl get pods -A | grep antrea-pre
    
  3. If there are antrea-pre-upgrade pods in the cluster, see Workaround A - Troubleshoot Antrea-Pre-Upgrade Job below.
    VMware by Broadcom Engineering implemented an antrea-pre-upgrade job to automatically fix this Antrea CRD deprecation issue.

  4. If there are no antrea-pre-upgrade job pods in the cluster, follow Workaround B - CRD Migration using antctl below.
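
If the workload cluster context cannot be used, one way to reach a control plane node is over SSH with the credentials stored in the Supervisor. This is a minimal sketch assuming the standard <workload cluster name>-ssh-password secret and the vmware-system-user account; names may vary by release:

    # Run against the Supervisor cluster context to retrieve the node SSH password
    kubectl get secret -n <workload cluster namespace> <workload cluster name>-ssh-password -o jsonpath='{.data.ssh-passwordkey}' | base64 -d
    # SSH to a control plane node using the decoded password
    ssh vmware-system-user@<control plane node IP>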

 

Workaround A - Troubleshoot Antrea-Pre-Upgrade Job

  1. Connect to the affected workload cluster context

  2. Check the status of the antrea-pre-upgrade pods and job:
    kubectl get pods,jobs -A | grep antrea-pre
  3. If there are any antrea-pre-upgrade pods that did not run to completion, view the logs from the failed antrea-pre-upgrade pod:
    kubectl logs -n vmware-system-antrea <antrea-pre-upgrade-pod name>

    Failed antrea-pre-upgrade pods can be cleaned up without issue (a cleanup example is sketched after this list).

  4. If the antrea-pre-upgrade job is in a failed state, describe it for details on why it failed:
    kubectl describe job -n vmware-system-antrea <antrea-pre-upgrade-job name>
  5. Confirm the status of the antrea app:
    kubectl get app -n vmware-system-tkg | grep antrea
    
    kubectl describe app -n vmware-system-tkg <workload cluster name>-antrea
  6. Check if the antrea application shows ReconcileFailed or the antrea pre-upgrade job shows Failed with the following errors:
    usefulErrorMessage: |-
        kapp: Error: waiting on reconcile job/antrea-pre-upgrade-job (batch/v1) namespace: vmware-system-antrea:
          Finished unsuccessfully (Failed with reason BackoffLimitExceeded: Job has reached the specified backoff limit)
  7. If the antrea application or antrea pre-upgrade-job shows the above Back Off Limit errors, proceed to Workaround B below.
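
As referenced in Step 3, failed antrea-pre-upgrade pods can be removed safely; a minimal cleanup example, with the pod name as a placeholder:

    kubectl delete pod -n vmware-system-antrea <failed antrea-pre-upgrade pod name>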

 

Workaround B - CRD Migration using antctl

  1. Connect to the affected workload cluster context as a user with administrator or root privileges
    Alternatively, SSH into one of the control plane nodes.

  2. List all antrea pods:
    kubectl get pods -A | grep antrea
  3. Pull antctl from one of the antrea-agent pods:
    kubectl cp <antrea-agent-pod>:/usr/local/bin/antctl antctl -n kube-system

    If the above command does not work, the antctl CLI can be downloaded from the link in Additional Information.

  4. Confirm that antctl was pulled successfully and change its file permissions to be readable and executable:
    ls -ltr
    chmod 555 antctl
  5. Locate the antrea package install and its namespace:
    kubectl get pkgi -A | grep antrea
    
  6. Pause the antrea package install app:
    kubectl patch pkgi <workload cluster name>-antrea -n <antrea namespace> --type merge -p '{"spec":{"paused": true}}'
    
  7. Back up the following antrea webhooks:
    kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io crdvalidator.antrea.io -o yaml > antrea-vwhc-backup.yaml
    kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io crdmutator.antrea.io -o yaml > antrea-mwhc-backup.yaml
    
  8. Delete the backed up webhooks:
    CAUTION: Only delete the backed up antrea webhooks. Deletion of other webhooks will cause potentially irrecoverable issues.
    kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io crdvalidator.antrea.io
    kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io crdmutator.antrea.io
  9. Use the Antrea CLI tool antctl to manually migrate objects from the old Antrea CRD versions to the new ones:
    ./antctl upgrade api-storage --dry-run
    ./antctl upgrade api-storage
  10. Un-pause the antrea package install app which was paused in Step 6:
    kubectl patch pkgi <workload cluster name>-antrea -n <antrea namespace> --type merge -p '{"spec":{"paused": false}}' 
  11. If there is an antrea-pre-upgrade job and it is still failing, locate the antrea-pre-upgrade job, take a backup of it, and delete it to allow the antrea app to recreate it:
    kubectl get job -n vmware-system-antrea | grep antrea
    
    kubectl get job -n vmware-system-antrea antrea-pre-upgrade-job -o yaml > antrea-pre-upgrade-job-backup.yaml
    
    kubectl delete job -n vmware-system-antrea antrea-pre-upgrade-job
  12. Trigger the antrea application to immediately reconcile and recreate the antrea-pre-upgrade-job (if applicable):
    kubectl patch app <workload cluster name>-antrea -n <antrea namespace> --type='merge' -p '{"spec":{"syncPeriod":"9m"}}'

    The above command harmlessly changes the syncPeriod of the antrea application, which causes an immediate reconciliation because a change was made to the app.
    If multiple reconciliations are needed, this value can be toggled back and forth between 9m and 10m (see the sketch after this list).

  13. If applicable, check that the antrea-pre-upgrade-job is recreated, creates an antrea-pre-upgrade pod, and that both run successfully to completion:
    kubectl get jobs,pods -n vmware-system-antrea
  14. The upgrade will progress once all antrea pods have stabilized:
    kubectl get pods -A | grep antrea
  15. Once all antrea pods are upgraded to the KR v1.31 version (done automatically as part of the KR upgrade), the antrea app and pkgi will show healthy in the ReconcileSucceeded state:
    kubectl get app,pkgi -A | grep antrea
  16. The antrea webhooks deleted in Step 8 are expected to be recreated automatically once antrea is healthy (a verification command is included in the sketch after this list).
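
As referenced in Steps 12 and 16, the sketch below toggles the syncPeriod back to trigger another reconciliation (if needed) and verifies that the antrea webhooks deleted in Step 8 were recreated; the 10m value is illustrative:

    # Toggle the syncPeriod back to force another reconciliation of the antrea app
    kubectl patch app <workload cluster name>-antrea -n <antrea namespace> --type='merge' -p '{"spec":{"syncPeriod":"10m"}}'
    # Confirm the antrea webhooks have been recreated
    kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep antrea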

Additional Information

There is an Antrea CLI tool called antctl which migrates objects from the old CRDs to the new CRDs.

Alternatively, it can be downloaded at the bottom of the following page under Assets: https://github.com/antrea-io/antrea/releases/tag/v2.1.0
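
As a hedged example, antctl can be downloaded directly onto a control plane node and made executable; the asset name below assumes a Linux x86_64 node and may differ from what is listed under Assets on the release page:

    curl -LO https://github.com/antrea-io/antrea/releases/download/v2.1.0/antctl-linux-x86_64
    mv antctl-linux-x86_64 antctl
    chmod 555 antctl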

---------

KR v1.31.1 Release Notes

KR v1.31.4 Release Notes