vSphere Supervisor Workload Cluster Upgrade to KR v1.31.1 Stuck due to Custom Antrea Resources


Article ID: 384095


Products

VMware vSphere Kubernetes Service
vSphere with Tanzu

Issue/Introduction

A workload cluster is stuck upgrading to KR v1.31.1.

 

While connected to the Supervisor cluster context, one or more of the following symptoms are observed:

  • New control plane nodes are created and reach Running state on the desired upgrade version, but are continuously recreated every 10 to 15 minutes:
    kubectl get machine -n <workload cluster namespace>
    <workload cluster namespace>   machine.cluster.x-k8s.io/<new node name>   <workload cluster>   vsphere://<vsphere id>   Running   10m   <KR v1.31.1 version>
    In this scenario, the workload cluster's worker node pools have not yet upgraded to the desired version because the workload cluster's control plane nodes are not all healthy.

  • A new node was created on the desired upgrade version but remains stuck in Provisioned state:
    kubectl get machine -n <workload cluster namespace>
    <workload cluster namespace>   machine.cluster.x-k8s.io/<new node name>   <workload cluster>   vsphere://<vsphere id>   Provisioned   ##m   <KR v1.31.1 version>
  • Describing the affected cluster shows one or more errors similar to the following:
    kubectl describe cluster -n <workload cluster namespace> <workload cluster name>
    
    * NodeHealthy:
      * Node.Ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
 
The error example below is specific to the Tier CRD. If a different deprecated CRD is in use, a corresponding error for that CRD will be present (a quick way to check a CRD's stored versions is sketched after the example):
status:
  conditions:
  - lastTransitionTime: "YYYY-MM-DDTHH:MM:SSZ"
    message: |-
      kapp: Error: update customresourcedefinition/tiers.crd.antrea.io (apiextensions.k8s.io/v1) cluster:
      Updating resource customresourcedefinition/tiers.crd.antrea.io (apiextensions.k8s.io/v1) cluster:
      API server says:
      CustomResourceDefinition.apiextensions.k8s.io "tiers.crd.antrea.io" is invalid: status.storedVersions[0]:
      Invalid value: "v1alpha1": must appear in spec.versions (reason: Invalid)
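
To confirm whether a deprecated stored version is still recorded on the CRD named in the error, the CRD's status can be inspected directly. This is a minimal check assuming the Tier CRD from the example above; substitute the CRD named in your error message:

    kubectl get crd tiers.crd.antrea.io -o jsonpath='{.status.storedVersions}'
    # Output such as ["v1alpha1"] indicates a stored version that Antrea 2.1 no longer serves.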

 

While connected to the affected workload cluster context, the following symptoms are observed:

  • One or more antrea pods are in 1/2 Running, ImagePullBackOff, ErrImagePull, or CrashLoopBackOff state:
    kubectl get pods -A | grep antrea
    NAMESPACE     NAME                     READY   STATUS
    kube-system   antrea-agent-<id-1>      0/2     Init:ErrImagePull
    kube-system   antrea-agent-<id-2>      0/2     Init:ImagePullBackOff
    kube-system   antrea-agent-<id-3>      1/2     Running
    kube-system   antrea-controller-<id>   0/1     CrashLoopBackOff


  • When viewing the logs of the antrea-controller pod stuck in CrashLoopBackOff, an error message similar to the following is present:
    kubectl logs -n kube-system <antrea-controller-pod>
    
    Starting Antrea Controller (version v1.15.1-ea6613a)
    Error running controller: failed to clean up the deprecated APIServices: apiservices.apiregistration.k8s.io "v1beta1.networking.antrea.tanzu.vmware.com" is forbidden: User "system:serviceaccount:kube-system:antrea-controller" cannot delete resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
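
The forbidden error above refers to leftover legacy *.antrea.tanzu.vmware.com APIService objects. As a quick, hedged check (exact names can vary by cluster), any remaining legacy APIServices can be listed with:

    kubectl get apiservices | grep antrea
    # Entries ending in antrea.tanzu.vmware.com are the deprecated legacy APIServices referenced in the controller log.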

Environment

vSphere Supervisor 8.0

vSphere Supervisor 9.0

VKS Service 3.2.0 and higher

Workload Cluster upgrading to KR v1.31.1

Cause

In vSphere Supervisor, KR v1.31.1 includes Antrea version 2.1, which removes the following advanced Antrea CRD versions from earlier releases.

This issue can occur if any of these advanced APIs were in use in the workload cluster prior to it being upgraded to KR v1.31.1 (a quick check is sketched after the table):

CRD                    CRD version   Introduced in   Deprecated in   Removed in
ClusterGroup           v1alpha2      v1.0.0          v1.1.0          v2.0.0
ClusterGroup           v1alpha3      v1.1.0          v1.13.0         v2.0.0
ClusterNetworkPolicy   v1alpha1      v1.0.0          v1.13.0         v2.0.0
Egress                 v1alpha2      v1.0.0          v1.13.0         v2.0.0
ExternalEntity         v1alpha1      v0.10.0         v0.11.0         v2.0.0
ExternalIPPool         v1alpha2      v1.8.0          v1.13.0         v2.0.0
Group                  v1alpha3      v1.8.0          v1.13.0         v2.0.0
NetworkPolicy          v1alpha1      v1.0.0          v1.13.0         v2.0.0
Tier                   v1alpha1      v1.0.0          v1.13.0         v2.0.0
Traceflow              v1alpha1      v1.0.0          v1.13.0         v2.0.0
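
The following is a minimal sketch to check which of the CRDs in the table still record a removed version in status.storedVersions on the workload cluster; it assumes the standard crd.antrea.io CRD names:

    # Run in the affected workload cluster context; CRDs that are not installed are skipped silently
    for crd in clustergroups clusternetworkpolicies egresses externalentities externalippools groups networkpolicies tiers traceflows; do
      kubectl get crd ${crd}.crd.antrea.io -o jsonpath='{.metadata.name}{": "}{.status.storedVersions}{"\n"}' 2>/dev/null
    done
    # Any CRD still listing a v1alpha* version in storedVersions is affected by this issue.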

Resolution

Initial Checks

  1. Connect to the affected workload cluster context
    Note: It may not be possible to use the workload cluster context if the control plane nodes of the affected cluster are recreating due to this issue.
    In that scenario, the steps below need to be performed over SSH on a control plane node of the affected cluster (one way to retrieve the SSH credentials is sketched after this list).

  2. Check if there are any antrea-pre-upgrade jobs or pods in the cluster:
    kubectl get pods -A | grep antrea-pre
    
  3. If there are antrea-pre-upgrade pods in the cluster, see Workaround A - Troubleshoot Antrea-Pre-Upgrade Job below.
    VMware by Broadcom Engineering implemented an antrea-pre-upgrade job to automatically fix this Antrea CRD deprecation issue.

  4. If there are no antrea-pre-upgrade job pods in the cluster, follow Workaround B - CRD Migration using antctl below.
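
If the workload cluster context cannot be used, one way to reach a control plane node is over SSH with the credentials stored in the Supervisor. This is a minimal sketch assuming the standard <workload cluster name>-ssh-password secret and the vmware-system-user account; names may vary by release:

    # Run against the Supervisor cluster context to retrieve the node SSH password
    kubectl get secret -n <workload cluster namespace> <workload cluster name>-ssh-password -o jsonpath='{.data.ssh-passwordkey}' | base64 -d
    # SSH to a control plane node using the decoded password
    ssh vmware-system-user@<control plane node IP>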

 

Workaround A - Troubleshoot Antrea-Pre-Upgrade Job

  1. Connect to the affected workload cluster context

  2. Check the status of the antrea-pre-upgrade pods and job:
    kubectl get pods,jobs -A | grep antrea-pre
  3. If there are any antrea-pre-upgrade pods that did not run to completion, view the logs from the failed antrea-pre-upgrade pod:
    kubectl logs -n vmware-system-antrea <antrea-pre-upgrade-pod name>

    Failed antrea-pre-upgrade pods can be cleaned up without issue (a cleanup example is sketched after this list).

  4. If the antrea-pre-upgrade job is in a failed state, describe it for details on why it failed:
    kubectl describe job -n vmware-system-antrea <antrea-pre-upgrade-job name>
  5. Confirm the status of the antrea app:
    kubectl get app -n vmware-system-tkg | grep antrea
    
    kubectl describe app -n vmware-system-tkg <workload cluster name>-antrea
  6. Check if the antrea application shows ReconcileFailed or the antrea pre-upgrade job shows Failed with the following errors:
    usefulErrorMessage: |-
        kapp: Error: waiting on reconcile job/antrea-pre-upgrade-job (batch/v1) namespace: vmware-system-antrea:
          Finished unsuccessfully (Failed with reason BackoffLimitExceeded: Job has reached the specified backoff limit)
  7. If the antrea application or antrea pre-upgrade-job shows the above Back Off Limit errors, proceed to Workaround B below.
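
As referenced in Step 3, failed antrea-pre-upgrade pods can be removed safely; a minimal cleanup example, with the pod name as a placeholder:

    kubectl delete pod -n vmware-system-antrea <failed antrea-pre-upgrade pod name>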

 

Workaround B - CRD Migration using antctl

  1. Connect to the affected workload cluster context as a user with administrator or root privileges
    Alternatively, SSH into one of the control plane nodes.

  2. List all antrea pods:
    kubectl get pods -A | grep antrea
  3. Pull antctl from one of the antrea-agent pods:
    kubectl cp <antrea-agent-pod>:/usr/local/bin/antctl antctl -n kube-system

    If the above command does not work, the antctl CLI can be downloaded from the link in Additional Information.

  4. Confirm that antctl was pulled successfully and change its file permissions to be readable and executable:
    ls -ltr
    chmod 555 antctl
  5. Locate the antrea package install and its namespace:
    kubectl get pkgi -A | grep antrea
    
  6. Pause the antrea package install app:
    kubectl patch pkgi <workload cluster name>-antrea -n <antrea namespace> --type merge -p '{"spec":{"paused": true}}'
    
  7. Back up the following antrea webhooks:
    kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io crdvalidator.antrea.io -o yaml > antrea-vwhc-backup.yaml
    kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io crdmutator.antrea.io -o yaml > antrea-mwhc-backup.yaml
    
  8. Delete the backed up webhooks:
    CAUTION: Only delete the backed up antrea webhooks. Deletion of other webhooks will cause potentially irrecoverable issues.
    kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io crdvalidator.antrea.io
    kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io crdmutator.antrea.io
  9. Use the Antrea CLI tool antctl to manually migrate objects from the old Antrea CRD versions to the new ones:
    ./antctl upgrade api-storage --dry-run
    ./antctl upgrade api-storage
  10. Un-pause the antrea package install app which was paused in Step 6:
    kubectl patch pkgi <workload cluster name>-antrea -n <antrea namespace> --type merge -p '{"spec":{"paused": false}}' 
  11. If there is an antrea-pre-upgrade job and it is still failing, locate the antrea-pre-upgrade job, take a backup of it, and delete it to allow the antrea app to recreate it:
    kubectl get job -n vmware-system-antrea | grep antrea
    
    kubectl get job -n vmware-system-antrea antrea-pre-upgrade-job -o yaml > antrea-pre-upgrade-job-backup.yaml
    
    kubectl delete job -n vmware-system-antrea antrea-pre-upgrade-job
  12. Trigger the antrea application to immediately reconcile and recreate the antrea-pre-upgrade-job (if applicable):
    kubectl patch app <workload cluster name>-antrea -n <antrea namespace> --type='merge' -p '{"spec":{"syncPeriod":"9m"}}'

    The above command harmlessly changes the syncPeriod of the antrea application, which causes an immediate reconciliation because a change was made to the app.
    If multiple reconciliations are needed, this value can be toggled back and forth between 9m and 10m (see the sketch after this list).

  13. If applicable, check that the antrea-pre-upgrade-job is recreated, creates an antrea-pre-upgrade pod, and that both run successfully to completion:
    kubectl get jobs,pods -n vmware-system-antrea
  14. The upgrade will progress once all antrea pods have stabilized:
    kubectl get pods -A | grep antrea
  15. Once all antrea pods are upgraded to the KR v1.31 version (done automatically as part of the KR upgrade), the antrea app and pkgi will show healthy in the ReconcileSucceeded state:
    kubectl get app,pkgi -A | grep antrea
  16. The antrea webhooks deleted in Step 8 are expected to be recreated automatically once antrea is healthy (a verification command is included in the sketch after this list).
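
As referenced in Steps 12 and 16, the sketch below toggles the syncPeriod back to trigger another reconciliation (if needed) and verifies that the antrea webhooks deleted in Step 8 were recreated; the 10m value is illustrative:

    # Toggle the syncPeriod back to force another reconciliation of the antrea app
    kubectl patch app <workload cluster name>-antrea -n <antrea namespace> --type='merge' -p '{"spec":{"syncPeriod":"10m"}}'
    # Confirm the antrea webhooks have been recreated
    kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep antrea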

Additional Information

There is an Antrea CLI tool called antctl which migrates objects from the old CRDs to the new CRDs.

Alternatively, it can be downloaded at the bottom of the following page under Assets: https://github.com/antrea-io/antrea/releases/tag/v2.1.0
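
As a hedged example, antctl can be downloaded directly onto a control plane node and made executable; the asset name below assumes a Linux x86_64 node and may differ from what is listed under Assets on the release page:

    curl -LO https://github.com/antrea-io/antrea/releases/download/v2.1.0/antctl-linux-x86_64
    mv antctl-linux-x86_64 antctl
    chmod 555 antctl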

---------

KR v1.31.1 Release Notes

KR v1.31.4 Release Notes