VKS Cluster upgrade stuck from v1.33.x to v1.34.x due to antrea-controller pod failing



Article ID: 431022


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After initiating a workload cluster upgrade from VKR v1.33.x to v1.34.x, the upgrade stalls because the workload cluster's antrea-controller pod fails to start.

While connected to the affected workload cluster context, one or more of the following symptoms are observed:

  • Describing the new workload cluster node stuck in Provisioned state shows the following error indicating that the Antrea CNI failed to initialize:
    "Container runtime network not ready" 
    networkReady="NetworkReady=false reason:NetworkPluginNotReady 
    message:Network plugin returns error: cni plugin not initialized"


  • The antrea-controller pod is stuck in ImagePullBackOff or ErrImagePull state:
    kubectl get pods -n kube-system

    Describing the antrea-controller pod shows the following image pull error message:

    kubectl describe pod -n kube-system <antrea-controller-pod>
    
    Back-off pulling image "localhost:5000/tkg/packages/core/antrea@sha256:<sha>": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/tkg/packages/core/antrea@sha256:<sha>": failed to resolve reference "localhost:5000/tkg/packages/core/antrea@sha256:<sha>": localhost:5000/tkg/packages/core/antrea@sha256:<sha>: not found

     

  • The antrea-controller pod is stuck in ContainerCreating state with an error message similar to the following:
    Couldn't get configMap kube-system/antrea-config-ver-#: configmap "antrea-config-ver-#" not found.

     

     
  • The antrea-controller pod is stuck in Pending state due to the presence of other failed antrea-controller pods:
    kubectl get pods -n kube-system

     

  • There are multiple antrea-controller ReplicaSets, each with a desired count of 1:
    kubectl get replicaset -n kube-system | grep antrea
    
    NAME                       DESIRED   CURRENT   READY
    antrea-controller-<id-a>   1         #         #
    antrea-controller-<id-b>   1         #         #

     

  • Other system pods such as vsphere-csi-controller are stuck in ContainerCreating state because they depend on a healthy antrea-controller.
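The ReplicaSet symptom above can be checked in one pass. The sketch below runs against illustrative output (the ReplicaSet names and counts in the heredoc are hypothetical); on a live cluster, pipe the real `kubectl get replicaset -n kube-system` output into the same awk filter instead:

```shell
#!/bin/sh
# Hypothetical sample of "kubectl get replicaset -n kube-system" output;
# replace this heredoc with the live command on a real cluster.
sample_rs_output() {
cat <<'EOF'
NAME                        DESIRED   CURRENT   READY   AGE
antrea-controller-5d4f8b9   1         1         0       10m
antrea-controller-7c6a2e1   1         0         0       2m
coredns-6d8c4cb4d           2         2         2       30d
EOF
}

# Print antrea-controller ReplicaSets whose pods are not Ready;
# more than one matching row is consistent with the stuck upgrade above.
sample_rs_output | awk 'NR>1 && $1 ~ /^antrea-controller/ && $4 == 0 {print $1}'
```

If this prints two or more names, the cluster is exhibiting the multiple-ReplicaSet symptom described above.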

Environment

vSphere Supervisor

VKR upgrade from v1.33.x to v1.34.x

Cause

The antrea-controller is unable to start due to a race condition during its start-up.

This issue has been observed more frequently in workload clusters using a single control plane node.

Resolution

This issue will be resolved in a future VKS version.

 

Workaround

The antrea-controller deployment must be deleted manually so that the system can recreate it.

  1. Connect to the affected workload cluster's context.


  2. Take a backup of the antrea-controller deployment:
    kubectl get deployment -n kube-system antrea-controller -o yaml > antrea-controller-deploy-backup.yaml


  3. Delete the antrea-controller deployment:
    kubectl delete deployment -n kube-system antrea-controller

     

  4. Force a reconcile of the antrea PackageInstall (pkgi) so that the antrea-controller deployment is quickly recreated:
    kubectl patch pkgi -n vmware-system-tkg cluster-antrea --type='merge' -p '{"spec":{"syncPeriod":"9m"}}'

     

  5. Monitor for the deployment's recreation by the system:
    kubectl get deployment -n kube-system

     

  6. Confirm that the antrea-controller pod reaches and remains in Running state:
    kubectl get pods -n kube-system

     

  7. Verify that the workload cluster upgrade is now progressing.
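Steps 2 through 6 above can be collected into a single script. This is a sketch only, assuming the namespaces and object names shown in the steps; the `KUBECTL` variable is an illustrative convenience so the sequence can be exercised without a live cluster, and the function is defined but not invoked until you uncomment the final line:

```shell
#!/bin/sh
# Sketch of the workaround steps above; run in the affected workload
# cluster's context. KUBECTL is overridable for dry-run testing.
KUBECTL="${KUBECTL:-kubectl}"

recreate_antrea_controller() {
  # Step 2: back up the deployment before touching it.
  $KUBECTL get deployment -n kube-system antrea-controller -o yaml \
    > antrea-controller-deploy-backup.yaml || return 1

  # Step 3: delete the stuck deployment.
  $KUBECTL delete deployment -n kube-system antrea-controller || return 1

  # Step 4: patch the antrea PackageInstall to force a prompt reconcile,
  # which recreates the deployment.
  $KUBECTL patch pkgi -n vmware-system-tkg cluster-antrea \
    --type='merge' -p '{"spec":{"syncPeriod":"9m"}}' || return 1

  # Steps 5-6: check that the deployment and pod come back.
  $KUBECTL get deployment -n kube-system
  $KUBECTL get pods -n kube-system
}

# Uncomment to run against the current kubectl context:
# recreate_antrea_controller
```

Keep the backup file from step 2 until the upgrade completes; it is what the revert command below consumes.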

 

If the antrea-controller deployment needs to be reverted, restore it from the backup taken earlier:

kubectl apply -f antrea-controller-deploy-backup.yaml
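Whether the deployment was recreated by the system or reverted from the backup, a quick sanity check is that exactly one antrea-controller ReplicaSet remains with READY equal to DESIRED. The sketch below demonstrates the check against hypothetical healthy output; on a live cluster, pipe `kubectl get replicaset -n kube-system` into the same awk filter instead:

```shell
#!/bin/sh
# Hypothetical healthy "kubectl get replicaset -n kube-system" output;
# replace the heredoc with the live command on a real cluster.
healthy_rs_output() {
cat <<'EOF'
NAME                        DESIRED   CURRENT   READY   AGE
antrea-controller-9f2b1c7   1         1         1       5m
EOF
}

# Succeed only when a single antrea-controller ReplicaSet exists
# and its READY count matches its DESIRED count.
healthy_rs_output | awk '
  NR>1 && $1 ~ /^antrea-controller/ { total++; if ($4 == $2) ok++ }
  END { exit (total == 1 && ok == 1) ? 0 : 1 }
' && echo "antrea-controller healthy"
```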