TKG Service Cluster upgrade may get stuck due to Antrea pods preventing worker nodes from being drained

Article ID: 387807

Updated On:

Products

VMware vSphere with Tanzu

Issue/Introduction

When upgrading a Tanzu Kubernetes Grid (TKG) Service Cluster, there is a known issue with the Antrea Container Network Interface (CNI) where antrea-resource-init-xxxx-xxx pods may continue to be re-created on nodes that are in a cordoned state. TKG Service Cluster nodes are placed into a cordoned state before being drained and removed from the cluster during an upgrade. The issue occurs because the antrea-resource-init deployment includes tolerations that allow its pods to be scheduled onto cordoned nodes. Because the antrea-resource-init pods keep getting rescheduled onto the cordoned nodes, the nodes cannot be drained and the upgrade is blocked.
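
To confirm this on an affected cluster, one option is to inspect the tolerations defined in the antrea-resource-init deployment's pod template (the deployment runs in the kube-system namespace, as referenced in the Resolution section below):

    kubectl -n kube-system get deployment antrea-resource-init -o jsonpath='{.spec.template.spec.tolerations}'

A toleration matching the node.kubernetes.io/unschedulable taint that Kubernetes applies to cordoned nodes (or a blanket operator: Exists toleration) would allow these pods to be scheduled back onto cordoned nodes.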

The capi-controller-manager pod logs repeatedly show a new antrea-resource-init-xxxxxxx-xxx pod being evicted. The antrea-resource-init pod will have a different name in each log entry:

I1116 21:30:44.866941 1 machine_controller.go:867] "evicting pod kube-system/antrea-resource-init-xxxxxxx-xxx\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="<MACHINE_NAME>" namespace="<NAMESPACE>" name="<MACHINE_NAME>" reconcileID=<RECONCILE_ID> KubeadmControlPlane="<KUBEADM_CONTROL_PLANE>" Cluster="<CLUSTER>" Node="<NODE>"

There will also be a corresponding log entry showing that the node drain failed while waiting for the antrea-resource-init-xxxxxxx-xxx pod to be evicted:

E1116 21:31:06.273056 1 machine_controller.go:641] "Drain failed, retry in 20s" err="error when waiting for pod \"antrea-resource-init-xxxxxxx-xxx\" terminating: global timeout reached: 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="<MACHINE_NAME>" namespace="<NAMESPACE>" name="<MACHINE_NAME>" reconcileID=<RECONCILE_ID> KubeadmControlPlane="<KUBEADM_CONTROL_PLANE>" Cluster="<CLUSTER>" Node="<NODE>"

You can view the capi-controller-manager pod logs by performing the following:

  1. Log into the supervisor cluster using kubectl
  2. Get the svc-tkg-domain namespace:
    kubectl get ns | grep svc-tkg-domain
    svc-tkg-domain-<ID>
  3. View capi-controller-manager pod logs:
    kubectl logs -n svc-tkg-domain-<ID> -l name=capi-controller-manager -c manager -f
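
If the logs are noisy, one way to narrow them down to the relevant entries is to filter for the eviction and drain-failure messages shown above, for example:

    kubectl logs -n svc-tkg-domain-<ID> -l name=capi-controller-manager -c manager | grep -E "antrea-resource-init|Drain failed"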

Environment

Tanzu Kubernetes releases (TKRs) prior to v1.28
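
To check which TKR a cluster is running, one option (while logged into the supervisor cluster with kubectl) is to list the cluster in its vSphere Namespace; for clusters created through the TanzuKubernetesCluster API this could look like:

    kubectl get tanzukubernetescluster -n <VSPHERE_NAMESPACE>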

Resolution

  1. Log into the TKG Service Cluster that is being upgraded using kubectl (see Connect to a TKG Service Cluster as a vCenter Single Sign-On User with Kubectl)
  2. Scale the antrea-resource-init deployment to 0 replicas:
    kubectl -n kube-system scale deployment antrea-resource-init --replicas 0
  3. This should allow the node to be drained and the upgrade process to continue.
  4. Once the upgrade is complete, scale the antrea-resource-init deployment back to 1 replica:
    kubectl -n kube-system scale deployment antrea-resource-init --replicas 1
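
After scaling the deployment back up, you can verify that it has returned to the expected replica count and that no nodes are left cordoned once the upgrade finishes, for example:

    kubectl -n kube-system get deployment antrea-resource-init
    kubectl get nodes

The deployment should report 1/1 ready replicas, and no node should remain in a SchedulingDisabled state after the upgrade completes.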