Disable accelerated networking on the Tanzu Kubernetes grid management clusters and workload clusters on Azure

Article ID: 313115

Updated On:

Products

VMware

Issue/Introduction

Disable accelerated networking on an existing, running Tanzu Kubernetes Grid cluster.

Symptoms:
You may see errors similar to the following in the cloud-init logs for the cluster nodes:
 
[   56.628826] cloud-init[1365]: [2022-01-31 18:25:17] Cloud-init v. 20.4.1-0ubuntu1~20.04.1 finished at Mon, 31 Jan 2022 18:25:17 +0000. Datasource DataSourceAzure [seed=/dev/sr0].  Up 55.96 seconds
[  119.003842] mlx5_core f0f5:00:02.0 enP61685s1: Error cqe on cqn 0x330, ci 0x23d, sqn 0xc3, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[  119.004240] mlx5_core f0f5:00:02.0 enP61685s1: Error cqe on cqn 0x328, ci 0x25c, sqn 0xbb, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[  119.019588] mlx5_core f0f5:00:02.0 enP61685s1: ERR CQE on SQ: 0xbb
[  119.021177] mlx5_core f0f5:00:02.0 enP61685s1: Error cqe on cqn 0x32c, ci 0x29e, sqn 0xbf, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[  119.025962] mlx5_core f0f5:00:02.0 enP61685s1: ERR CQE on SQ: 0xc3
[  119.033815] mlx5_core f0f5:00:02.0 enP61685s1: ERR CQE on SQ: 0xbf
[  120.002547] mlx5_core f0f5:00:02.0 enP61685s1: Error cqe on cqn 0x324, ci 0xf3, sqn 0xb7, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[  120.010110] mlx5_core f0f5:00:02.0 enP61685s1: ERR CQE on SQ: 0xb7
[  120.032390] mlx5_core f0f5:00:02.0 enP61685s1: Error cqe on cqn 0x328, ci 0x26c, sqn 0xbb, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[  120.041160] mlx5_core f0f5:00:02.0 enP61685s1: ERR CQE on SQ: 0xbb
2022-01-31T18:39:47.188042Z INFO Daemon Agent WALinuxAgent-2.6.0.2 launched with command 'python3 -u bin/WALinuxAgent-2.6.0.2-py2.7.egg -run-exthandlers' is successfully running


Environment

VMware Tanzu Kubernetes Grid 1.x

Cause

The Tanzu Kubernetes cluster (TKG 1.3.0) could not scale because it is incompatible with Azure accelerated networking. Azure accelerated networking is enabled by default on most VM instance types with 4 or more vCPUs.
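To confirm whether a node's NIC currently has accelerated networking enabled, you can query it with the Azure CLI. The resource group and NIC names below are placeholders; substitute the values from your own deployment.

```shell
# Check whether accelerated networking is enabled on a node's NIC.
# "my-tkg-rg" and "workload-node-nic" are placeholder names.
az network nic show \
  --resource-group my-tkg-rg \
  --name workload-node-nic \
  --query enableAcceleratedNetworking \
  --output tsv
```

A result of "true" indicates the NIC is affected by this issue.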

Resolution

There is currently a workaround for this issue: disable accelerated networking on the Azure VMs. This is done by editing the Cluster API custom resources on the management cluster.

Workaround:
Note: Scale the cluster down to a stable/running state first.

Step 1:

Create a new AzureMachineTemplate that has "acceleratedNetworking: false".

kubectl config use-context <Mgmt Cluster Context>
kubectl get AzureMachineTemplate -A   
kubectl get AzureMachineTemplate workload-cluster-control-plane -o yaml > new-AzureMachineTemplate.yaml 

In the generated yaml, change the name of the AzureMachineTemplate (metadata.name) to a new name, and make sure accelerated networking is set to false --> "acceleratedNetworking: false". Save the template.
- Reference template spec (acceleratedNetworking setting only) to verify:
 
spec:
  template:
    spec:
      acceleratedNetworking: false   ## <-- here make sure acceleratedNetworking is set to false. 
      dataDisks:
      - diskSizeGB: 256
        lun: 0
        nameSuffix: etcddisk
      identity: None
      image:
        marketplace:
          offer: tkg-capi
          publisher: vmware-inc
 
Apply the file to create the new AzureMachineTemplate:
kubectl apply -f new-AzureMachineTemplate.yaml   


Note: Make sure you follow the same steps for the worker node template creation as well.
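As a sketch, the worker-node template can be regenerated the same way. The template and file names below are placeholders; use the names shown by "kubectl get AzureMachineTemplate -A" in your environment.

```shell
# Export the existing worker-node template (placeholder name).
kubectl get AzureMachineTemplate workload-cluster-md-0 -o yaml > new-worker-AzureMachineTemplate.yaml

# Edit the file: change metadata.name to a new name and set
# spec.template.spec.acceleratedNetworking to false, then apply it.
kubectl apply -f new-worker-AzureMachineTemplate.yaml
```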

 

Step 2:

Edit the KubeadmControlPlane (KCP) so that it points to the new AzureMachineTemplate.

kubectl get KubeadmControlPlane  
kubectl edit KubeadmControlPlane xxxxx-workload-control-plane

spec:
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AzureMachineTemplate
    name: workload-control-plane-new # <-- point the new azure machine template for the control plane here 
    namespace: default
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-config: /etc/kubernetes/azure.json
          cloud-provider: azure

After you edit and save the KubeadmControlPlane, new machines for the control plane nodes will roll out.
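You can watch the old control plane machines being replaced by the new ones. The namespace below is assumed to be default; adjust it to where your cluster objects live.

```shell
# Watch machines roll over; old control plane machines are deleted
# as replacements built from the new template become Running.
kubectl get machines -n default -w
```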
 

Step 3:

Now edit the MachineDeployment to point to the new AzureMachineTemplate for the worker nodes.

kubectl get machinedeployment -n <namespace>
kubectl edit machinedeployment  xxxxxx-workload-md-0

bootstrap:
  configRef:
    apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
    kind: KubeadmConfigTemplate
    name: workload-md-0
clusterName: workload
failureDomain: "1"
infrastructureRef:
  apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
  kind: AzureMachineTemplate
  name: xxxx-workload-md-0-new  # <-- point the new azure machine template for the worker nodes here and save

This will roll out and provision new machines for the worker nodes in the cluster.

You can now proceed to scale the cluster back up.
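For example, the worker count can be scaled back up with the Tanzu CLI. The cluster name and node count below are placeholders; substitute your own values.

```shell
# Scale the workload cluster to 3 worker nodes (placeholder values).
tanzu cluster scale workload-cluster --worker-machine-count 3
```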

Additional Information

Impact/Risks:
The TKG cluster could not scale because accelerated networking is enabled on the cluster's Azure VMs.