When upgrading a Tanzu Kubernetes Grid (TKG) service cluster, there is a known issue with the Antrea Container Network Interface (CNI) where antrea-resource-init-xxxxxxx-xxx pods may continue to be re-created on nodes that are in a cordoned state. TKG service cluster nodes are placed into a cordoned state before being removed from the cluster, but the antrea-resource-init deployment includes tolerations that allow its pods to be scheduled onto cordoned nodes. Because the antrea-resource-init pods keep landing on the cordoned nodes, those nodes cannot be drained and the upgrade is blocked.
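To confirm that the deployment carries these tolerations, you can inspect its pod template. This is an illustrative check; the exact tolerations returned depend on the Antrea version shipped with your TKR:

```shell
# Show the tolerations on the antrea-resource-init pod template.
# A broad toleration (e.g. on node.kubernetes.io/unschedulable)
# is what allows pods onto cordoned nodes.
kubectl -n kube-system get deployment antrea-resource-init \
  -o jsonpath='{.spec.template.spec.tolerations}'
```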
The capi-controller-manager pod logs repeatedly show a new antrea-resource-init-xxxxxxx-xxx pod being evicted. The antrea-resource-init pod will have a different name in each log entry:
I1116 21:30:44.866941 1 machine_controller.go:867] "evicting pod kube-system/antrea-resource-init-xxxxxxx-xxx\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="<MACHINE_NAME>" namespace="<NAMESPACE>" name="<MACHINE_NAME>" reconcileID=<RECONCILE_ID> KubeadmControlPlane="<KUBEADM_CONTROL_PLANE>" Cluster="<CLUSTER>" Node="<NODE>"
There will be a corresponding log entry showing that the node drain fails while waiting for the antrea-resource-init-xxxxxxx-xxx pod to be evicted:
E1116 21:31:06.273056 1 machine_controller.go:641] "Drain failed, retry in 20s" err="error when waiting for pod \"antrea-resource-init-xxxxxxx-xxx\" terminating: global timeout reached: 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="<MACHINE_NAME>" namespace="<NAMESPACE>" name="<MACHINE_NAME>" reconcileID=<RECONCILE_ID> KubeadmControlPlane="<KUBEADM_CONTROL_PLANE>" Cluster="<CLUSTER>" Node="<NODE>"
You can view the capi-controller-manager pod logs as follows. First, locate the Supervisor namespace that contains the pod:

kubectl get ns | grep svc-tkg-domain
svc-tkg-domain-<ID>

Then tail the controller logs from that namespace:

kubectl logs -n svc-tkg-domain-<ID> -l name=capi-controller-manager -c manager -f
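While the drain is stuck, you can also check which nodes are cordoned and confirm that antrea-resource-init pods are still being scheduled onto them (a grep is used here rather than a label selector, since the deployment's pod labels may vary by Antrea version):

```shell
# Cordoned nodes show SchedulingDisabled in the STATUS column
kubectl get nodes

# Show which nodes the antrea-resource-init pods are currently running on
kubectl -n kube-system get pods -o wide | grep antrea-resource-init
```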
For clusters running a TKR prior to v1.28, the workaround is to temporarily scale the antrea-resource-init deployment down to zero replicas. With no pods being re-created on the cordoned nodes, the drain can complete and the upgrade proceeds:

kubectl -n kube-system scale deployment antrea-resource-init --replicas 0

Once the upgrade has completed, scale the deployment back up:

kubectl -n kube-system scale deployment antrea-resource-init --replicas 1
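After scaling back up, it may be worth verifying that the deployment reports its replica as ready and that its pod is Running again:

```shell
# READY should report 1/1 once the replacement pod is up
kubectl -n kube-system get deployment antrea-resource-init

# The pod should be in the Running state on a schedulable node
kubectl -n kube-system get pods -o wide | grep antrea-resource-init
```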