VKS Cluster upgrade stuck from v1.33.x to v1.34.x due to antrea-controller pod failing



Article ID: 431022


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After initiating a workload cluster upgrade from VKR v1.33.x to v1.34.x, the upgrade stalls because the workload cluster's antrea-controller pod fails to start.

While connected to the affected workload cluster context, one or more of the following symptoms are observed:

  • Describing the new workload cluster node stuck in Provisioned state shows the following error indicating that the Antrea CNI failed to initialize:
    "Container runtime network not ready" 
    networkReady="NetworkReady=false reason:NetworkPluginNotReady 
    message:Network plugin returns error: cni plugin not initialized"


  • The antrea-controller pod is stuck in ImagePullBackOff or ErrImagePull state:
    kubectl get pods -n kube-system

    Describing the antrea-controller pod shows the following image pull error message:

    kubectl describe pod -n kube-system <antrea-controller-pod>
    
    Back-off pulling image "localhost:5000/tkg/packages/core/antrea@sha256:<sha>": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/tkg/packages/core/antrea@sha256:<sha>": failed to resolve reference "localhost:5000/tkg/packages/core/antrea@sha256:<sha>": localhost:5000/tkg/packages/core/antrea@sha256:<sha>: not found

     

  • The antrea-controller pod is stuck in ContainerCreating state with an error message similar to the following:
    Couldn't get configMap kube-system/antrea-config-ver-#: configmap "antrea-config-ver-#" not found.

     

     
  • The antrea-controller pod is stuck in Pending state due to the presence of other failed antrea-controller pods:
    kubectl get pods -n kube-system

     

  • There are multiple antrea-controller ReplicaSets, each with a desired count of 1:
    kubectl get replicaset -n kube-system | grep antrea
    
    NAME                       DESIRED   CURRENT   READY
    antrea-controller-<id-a>   1         #         #
    antrea-controller-<id-b>   1         #         #

     

  • Other system pods such as vsphere-csi-controller are stuck in ContainerCreating state because they depend on a healthy antrea-controller.
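The ReplicaSet symptom above can be checked in one pass. The sketch below runs against illustrative output (the ReplicaSet names and counts in the heredoc are hypothetical); on a live cluster, pipe the real `kubectl get replicaset -n kube-system` output into the same awk filter instead:

```shell
#!/bin/sh
# Hypothetical sample of "kubectl get replicaset -n kube-system" output;
# replace this heredoc with the live command on a real cluster.
sample_rs_output() {
cat <<'EOF'
NAME                        DESIRED   CURRENT   READY   AGE
antrea-controller-5d4f8b9   1         1         0       10m
antrea-controller-7c6a2e1   1         0         0       2m
coredns-6d8c4cb4d           2         2         2       30d
EOF
}

# Print antrea-controller ReplicaSets whose pods are not Ready;
# more than one matching row is consistent with the stuck upgrade above.
sample_rs_output | awk 'NR>1 && $1 ~ /^antrea-controller/ && $4 == 0 {print $1}'
```

If this prints two or more names, the cluster is exhibiting the multiple-ReplicaSet symptom described above.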

Environment

vSphere Supervisor

VKR upgrade from v1.33.x to v1.34.x

Cause

The antrea-controller is unable to start due to a race condition during its start-up.

This issue has been observed more frequently in workload clusters using a single control plane node.

Resolution

This issue will be resolved in a future VKS version.

 

Workaround

The antrea-controller deployment must be deleted manually so that the system can recreate it.

  1. Connect to the affected workload cluster's context.


  2. Take a backup of the antrea-controller deployment:
    kubectl get deployment -n kube-system antrea-controller -o yaml > antrea-controller-deploy-backup.yaml


  3. Delete the antrea-controller deployment:
    kubectl delete deployment -n kube-system antrea-controller

     

  4. Force a reconcile of the antrea PackageInstall (pkgi) so that the antrea-controller deployment is quickly recreated:
    kubectl patch pkgi -n vmware-system-tkg cluster-antrea --type='merge' -p '{"spec":{"syncPeriod":"9m"}}'

     

  5. Monitor for the deployment's recreation by the system:
    kubectl get deployment -n kube-system

     

  6. Confirm that the antrea-controller pod reaches and remains in Running state:
    kubectl get pods -n kube-system

     

  7. Verify that the workload cluster upgrade is now progressing.
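Steps 2 through 6 above can be collected into a single script. This is a sketch only, assuming the namespaces and object names shown in the steps; the `KUBECTL` variable is an illustrative convenience so the sequence can be exercised without a live cluster, and the function is defined but not invoked until you uncomment the final line:

```shell
#!/bin/sh
# Sketch of the workaround steps above; run in the affected workload
# cluster's context. KUBECTL is overridable for dry-run testing.
KUBECTL="${KUBECTL:-kubectl}"

recreate_antrea_controller() {
  # Step 2: back up the deployment before touching it.
  $KUBECTL get deployment -n kube-system antrea-controller -o yaml \
    > antrea-controller-deploy-backup.yaml || return 1

  # Step 3: delete the stuck deployment.
  $KUBECTL delete deployment -n kube-system antrea-controller || return 1

  # Step 4: patch the antrea PackageInstall to force a prompt reconcile,
  # which recreates the deployment.
  $KUBECTL patch pkgi -n vmware-system-tkg cluster-antrea \
    --type='merge' -p '{"spec":{"syncPeriod":"9m"}}' || return 1

  # Steps 5-6: check that the deployment and pod come back.
  $KUBECTL get deployment -n kube-system
  $KUBECTL get pods -n kube-system
}

# Uncomment to run against the current kubectl context:
# recreate_antrea_controller
```

Keep the backup file from step 2 until the upgrade completes; it is what the revert command below consumes.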

 

If the antrea-controller deployment needs to be reverted, restore it from the backup taken earlier:

kubectl apply -f antrea-controller-deploy-backup.yaml
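Whether the deployment was recreated by the system or reverted from the backup, a quick sanity check is that exactly one antrea-controller ReplicaSet remains with READY equal to DESIRED. The sketch below demonstrates the check against hypothetical healthy output; on a live cluster, pipe `kubectl get replicaset -n kube-system` into the same awk filter instead:

```shell
#!/bin/sh
# Hypothetical healthy "kubectl get replicaset -n kube-system" output;
# replace the heredoc with the live command on a real cluster.
healthy_rs_output() {
cat <<'EOF'
NAME                        DESIRED   CURRENT   READY   AGE
antrea-controller-9f2b1c7   1         1         1       5m
EOF
}

# Succeed only when a single antrea-controller ReplicaSet exists
# and its READY count matches its DESIRED count.
healthy_rs_output | awk '
  NR>1 && $1 ~ /^antrea-controller/ { total++; if ($4 == $2) ok++ }
  END { exit (total == 1 && ok == 1) ? 0 : 1 }
' && echo "antrea-controller healthy"
```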