After Management Cluster Upgrade, CAPI Is Stuck in CrashLoopBackOff and DNS Resolution Is Failing

Article ID: 400088


Products

Tanzu Kubernetes Runtime

Issue/Introduction

After upgrading the management cluster, you may observe that the capi-controller-manager pod enters a CrashLoopBackOff state. Upon reviewing the pod logs, you will find that it is failing to communicate with vCenter.
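To confirm this, check the pod status and review the controller logs. The commands below are an example; in TKGm the controller typically runs in the capi-system namespace, but verify the namespace and pod name in your environment:

kubectl get pods -n capi-system

kubectl logs -n capi-system deploy/capi-controller-manager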

E0101 12:00:00.000000       1 controller.go:329] "Reconciler error" err="failed to create vCenter session: failed to create client: Post \"https://VCENTER-FQDN.local/sdk\": dial tcp: lookup VCENTER-FQDN.local on 198.51.100.1:53: read udp 198.51.100.1:40069->198.51.100.1:53: i/o timeout" controller="vspherevm" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="VSphereVM" VSphereVM="tkg-system/tkg-mgmt-58fjk-38jk9" namespace="tkg-system" name="tkg-mgmt-58fjk-38jk9" reconcileID=""

Additionally, the API server logs may show failures when attempting to communicate with multiple webhooks and services within the environment.

2025-01-01T12:00:00.000000000Z stderr F E0416 12:00:00.000000       1 cacher.go:479] cacher (clusters.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list cluster.x-k8s.io/v1alpha3, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 198.51.100.1:443: connect: connection refused; reinitializing...

2025-01-01T12:00:00.000000000Z stderr F Trace[1559110887]: ---"Objects listed" error:conversion webhook for run.tanzu.vmware.com/v1alpha3, Kind=TanzuKubernetesRelease failed: Post "https://tkr-conversion-webhook-service.tkg-system.svc:443/convert?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 30002ms 
 
2025-01-01T12:00:00.000000000Z stderr F E0416 10:37:16.288902       1 controller.go:146] Error updating APIService "v1alpha1.clientsecret.supervisor.pinniped.dev" with err: failed to download v1alpha1.clientsecret.supervisor.pinniped.dev: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable 
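A quick way to gauge the scope of the problem is to list the aggregated API services; any APIService backed by an unreachable webhook or service reports False in the AVAILABLE column. For example:

kubectl get apiservices | grep -i false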

Environment

TKGm 2.5.0+

Cause

Antrea enables Geneve tunnel checksum offload by default. However, in some environments, container networking throughput drops to nearly zero. A packet capture shows that the TCP three-way handshake succeeds, but the first MTU-sized data packet carries an incorrect checksum and is dropped on the receiving side. This can happen when the Kubernetes node VMs run on an overlay network and the underlay network cannot correctly process checksum offloading in a double-encapsulation scenario, or when the physical NIC has a bug in its checksum offloading.

The symptoms described above are typically caused by these dropped DNS queries: services and webhooks in the environment that rely on fully qualified domain names (FQDNs) cannot be resolved.
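You can confirm the DNS failure from inside the cluster by running a short-lived test pod. The image tag and FQDN below are examples; substitute values from your environment:

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup VCENTER-FQDN.local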

For more detail on the problem and how to confirm it, see the following KB article: TKGm cluster in vSphere creation fails using NSX-T v3.1.x and Photon 3 or Ubuntu with Linux Kernel 5.8 VMs.

This is a known issue described in the release notes for Antrea 1.8.0.

Resolution

To resolve this problem in an existing management cluster, modify the AntreaConfig object to enable the "disableUdpTunnelOffload" configuration parameter.

First, list the AntreaConfig objects in the tkg-system namespace:

ubuntu@jumpbox:~$ kubectl get AntreaConfig -n tkg-system
NAME                               TRAFFICENCAPMODE   DEFAULTMTU   ANTREAPROXY   ANTREAPOLICY   SECRETREF
mgmt-cluster-antrea-package        encap                           true          true           mgmt-cluster-antrea-data-values
v1.26.14---vmware.1-tiny.1-tkg.3   encap                           true          true
v1.26.14---vmware.1-tkg.4          encap                           true          true
v1.27.13---vmware.1-tiny.1-tkg.2   encap                           true          true
v1.27.15---vmware.1-tkg.2          encap                           true          true
v1.28.11---vmware.2-tkg.2          encap                           true          true
v1.28.9---vmware.1-tiny.1-tkg.2    encap                           true          true
v1.29.6---vmware.1-tkg.3           encap                           true          true
v1.30.2---vmware.1-tkg.1           encap                           true          true

Edit the object associated with the management cluster, in this case "mgmt-cluster-antrea-package", and set "disableUdpTunnelOffload" to true:

kubectl edit AntreaConfig -n tkg-system mgmt-cluster-antrea-package

...
spec:
  antrea:
    config:
      antreaProxy:
        proxyLoadBalancerIPs: true
      defaultMTU: ""
      disableTXChecksumOffload: false
      disableUdpTunnelOffload: true
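
Alternatively, the same change can be applied non-interactively with a merge patch. The object name below follows the example above; adjust it to match your cluster:

kubectl patch AntreaConfig mgmt-cluster-antrea-package -n tkg-system --type merge -p '{"spec":{"antrea":{"config":{"disableUdpTunnelOffload":true}}}}'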

After saving the change, restart the antrea-agent DaemonSet to apply it:

ubuntu@jumpbox:~$ kubectl get ds -n kube-system
NAME                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
antrea-agent                       2         2         2       2            2           kubernetes.io/os=linux   63d
kube-proxy                         2         2         2       2            2           kubernetes.io/os=linux   63d
vsphere-cloud-controller-manager   1         1         1       1            1           <none>                   63d

ubuntu@jumpbox:~$ kubectl rollout restart ds -n kube-system antrea-agent
daemonset.apps/antrea-agent restarted
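
Optionally, wait for the rollout to complete before verifying DNS:

kubectl rollout status ds antrea-agent -n kube-system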

After the pods have been restarted, DNS resolution in your cluster should be restored. Any pods still in a CrashLoopBackOff state may need to be manually restarted to ensure full recovery.
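
A pod can be restarted by deleting it so that its owning controller recreates it. For example (pod and namespace names will vary):

kubectl get pods -A | grep -i crashloop

kubectl delete pod <pod-name> -n <namespace>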