Cluster creation fails on TKG 2.1.1 when ANTREA_DISABLE_UDP_TUNNEL_OFFLOAD is set

Article ID: 313121

Updated On:

Products

VMware

Issue/Introduction

Symptoms:

The management cluster fails to install and stalls during cert-manager installation.

Antrea pods are not deployed, which leaves various pods, including cert-manager, core-dns, csi-controller and tanzu-capabilities-controller-manager, in Pending state.

The Antrea app and its corresponding data-values secret are not present in the cluster:

kubectl get apps -n tkg-system
NAMESPACE    NAME                         DESCRIPTION                                                                       SINCE-DEPLOY   AGE
tkg-system   mgmt-capabilities           Reconciling                                                                       3m21s          3m41s
tkg-system   mgmt-metrics-server         Reconcile failed: Deploying: Error (see .status.usefulErrorMessage for details)   47s            3m44s
tkg-system   mgmt-pinniped               Reconcile succeeded                                                               3m26s          3m42s
tkg-system   mgmt-secretgen-controller   Reconcile failed: Deploying: Error (see .status.usefulErrorMessage for details)   47s            3m43s
tkg-system   mgmt-tkg-storageclass       Reconcile succeeded                                                               3m32s          3m43s
tkg-system   mgmt-vsphere-cpi            Reconcile succeeded                                                               3m42s          3m44s
tkg-system   mgmt-vsphere-csi            Reconcile failed: Deploying: Error (see .status.usefulErrorMessage for details)   46s            3m44s

kubectl get secrets -n tkg-system | grep data-values
tkg-system    mgmt-capabilities-data-values       Opaque                           1      5m39s
tkg-system    mgmt-pinniped-data-values           Opaque                           1      5m39s
tkg-system    mgmt-tkg-storageclass-data-values   Opaque                           1      5m40s
tkg-system    mgmt-vsphere-cpi-data-values        Opaque                           1      5m41s
tkg-system    mgmt-vsphere-csi-data-values        Opaque                           1      5m41s
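
The check above can be sketched as a small shell helper that scans such a listing for an Antrea entry. This is an illustration only: the function name is hypothetical, and it assumes the Antrea app and secret follow the same <cluster name>-<addon> naming pattern as the other entries shown above (for example mgmt-antrea and mgmt-antrea-data-values).

```shell
# Hypothetical helper: reads a kubectl listing on stdin and exits 0 when it
# contains an entry named "<cluster>-antrea<suffix>", 1 otherwise.
# Assumption: Antrea entries are named like the other add-ons in the listing.
has_antrea_entry() {
  cluster="$1"   # cluster name prefix, e.g. "mgmt"
  suffix="$2"    # "" for apps, "-data-values" for secrets
  awk -v want="${cluster}-antrea${suffix}" '
    { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Usage against a live cluster (not executed here):
#   kubectl get apps -n tkg-system | has_antrea_entry mgmt "" \
#     || echo "Antrea app is missing"
#   kubectl get secrets -n tkg-system | has_antrea_entry mgmt "-data-values" \
#     || echo "Antrea data-values secret is missing"
```

A missing entry on both checks matches the symptom described here.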


For workload classy clusters, the same symptoms exist: pods are in Pending state because Antrea is not deployed, and the Antrea app and secret are not present.

For workload legacy clusters, some of the same symptoms exist but not all. Antrea pods are not created and consequently several pods are in Pending state.


Environment

VMware Tanzu Kubernetes Grid Plus 1.x

Cause

There is a known issue in TKG 2.1.1 where Antrea resources are not created when ANTREA_DISABLE_UDP_TUNNEL_OFFLOAD="true" is set in the cluster configuration file.
It has also been observed with other Antrea configuration file variables, such as ANTREA_NODEPORTLOCAL.

Resolution

This issue is resolved in TKG 2.1.2 and 2.2.

Workaround:

Classy cluster

Management Cluster
If the issue is encountered during the creation of a classy management cluster, the following steps can be performed to work around the issue.
Target the kind/bootstrap cluster and monitor for the creation of the antreaconfig object during management cluster creation.
NOTE: The bootstrap cluster kubeconfig is available in ~/.kube-tkg/tmp
export KUBECONFIG=<Bootstrap kube-config>
kubectl get antreaconfig -n tkg-system



Add the following labels to the antreaconfig object for the cluster, replacing <cluster name> with the management cluster name.
# kubectl edit antreaconfig -n tkg-system <Management cluster name>

  labels:
    tkg.tanzu.vmware.com/cluster-name: <cluster name>
    tkg.tanzu.vmware.com/package-name: antrea.tanzu.vmware.com.1.7.2---vmware.1-tkg.1-advanced

Restart tanzu-addons-controller-manager on the bootstrap cluster by deleting its pod. The management cluster installation will then continue.
kubectl delete pod <tanzu-addons-controller-manager-pod> -n tkg-system
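
The labeling and restart steps above can also be sketched non-interactively, using kubectl label in place of kubectl edit. This is a sketch under stated assumptions rather than a documented procedure: the function names are hypothetical, the label keys and the package name are the ones given in the steps above, and KUBECONFIG is assumed to already point at the bootstrap cluster.

```shell
# Package name taken from the label shown above.
PACKAGE_NAME="antrea.tanzu.vmware.com.1.7.2---vmware.1-tkg.1-advanced"

# Hypothetical helper: add both labels to the antreaconfig object that is
# named after the cluster.
label_antreaconfig() {
  cluster="$1"
  kubectl label antreaconfig "$cluster" -n tkg-system --overwrite \
    "tkg.tanzu.vmware.com/cluster-name=${cluster}" \
    "tkg.tanzu.vmware.com/package-name=${PACKAGE_NAME}"
}

# Hypothetical helper: delete every pod whose name contains
# tanzu-addons-controller-manager so that it is recreated.
restart_addons_manager() {
  kubectl get pods -n tkg-system -o name \
    | grep tanzu-addons-controller-manager \
    | while read -r pod; do kubectl delete "$pod" -n tkg-system; done
}

# Usage (not executed here):
#   export KUBECONFIG=<Bootstrap kube-config>
#   label_antreaconfig <Management cluster name>
#   restart_addons_manager
```

The --overwrite flag only makes the labeling step safe to rerun if a label with the same key already exists.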


Workload Cluster
When creating a workload cluster, you can add the same labels to the AntreaConfig object in the classy-cluster.yaml file.
# vi classy-cluster.yaml

apiVersion: cni.tanzu.vmware.com/v1alpha1
kind: AntreaConfig
metadata:
  labels:
    tkg.tanzu.vmware.com/cluster-name: work3
    tkg.tanzu.vmware.com/package-name: antrea.tanzu.vmware.com.1.7.2---vmware.1-tkg.1-advanced
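
Before creating the workload cluster, the edited manifest could be checked for both labels. The helper below is a hypothetical pre-flight check, not part of the TKG tooling. It is a plain text search rather than a YAML-aware check, so it does not confirm that the labels sit on the AntreaConfig object itself.

```shell
# Hypothetical helper: exits 0 only if the manifest contains the cluster-name
# label key and an Antrea package-name label.
manifest_has_antrea_labels() {
  manifest="$1"
  grep -qF 'tkg.tanzu.vmware.com/cluster-name:' "$manifest" &&
    grep -qF 'tkg.tanzu.vmware.com/package-name: antrea.tanzu.vmware.com' "$manifest"
}

# Usage (not executed here):
#   manifest_has_antrea_labels classy-cluster.yaml \
#     || echo "AntreaConfig labels are missing from classy-cluster.yaml"
```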


 

Plan based cluster

ANTREA_DISABLE_UDP_TUNNEL_OFFLOAD was introduced to avoid known issues with the underlay network.
Check KB https://kb.vmware.com/s/article/86496 and the NSX version to confirm whether the workaround mentioned in that KB is still required.

If so, disable UDP tunnel segmentation offload and UDP tunnel checksum segmentation offload on eth0 of each cluster node:
ethtool -K eth0 tx-udp_tnl-segmentation off && ethtool -K eth0 tx-udp_tnl-csum-segmentation off
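
After running the command, the result could be verified from the feature listing printed by ethtool -k (lowercase k). The helper below is a hypothetical sketch: it assumes the listing reports each feature as "<feature name>: on" or "<feature name>: off", and the function name is not part of any tool.

```shell
# Hypothetical helper: reads `ethtool -k <interface>` output on stdin and
# exits 0 only if both UDP tunnel offload features are reported as off.
udp_tunnel_offload_disabled() {
  awk '
    $1 == "tx-udp_tnl-segmentation:"      && $2 == "off" { seg = 1 }
    $1 == "tx-udp_tnl-csum-segmentation:" && $2 == "off" { csum = 1 }
    END { exit((seg && csum) ? 0 : 1) }
  '
}

# Usage on a cluster node (not executed here):
#   ethtool -k eth0 | udp_tunnel_offload_disabled \
#     && echo "UDP tunnel offload is disabled on eth0"
```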

For workload legacy clusters, the attached overlay file cksum.yaml can be copied to the overlay directory, e.g. ~/.config/tanzu/tkg/providers/ytt/03_customizations.

Attachments

cksum