How to increase multus-cni DaemonSet resource limits when Pod creation fails intermittently

Products

VMware Telco Cloud Automation

Issue/Introduction

This document describes how to update the Multus resource limits. The procedure increases the memory request and limit on the "kube-multus" container of the multus-cni DaemonSet.

Symptoms:

TCA version 3.0. Pod creation fails intermittently on clusters running Kubernetes 1.26.8 with TKG 2.3.1.
The issue is not observed when the cluster is created with only the Calico add-on. It is observed only after the Multus add-on is added to the cluster.

Error : Feb 29 16:58:00 xyz-test-vmw-1-np2-cork-4rpkg-75c77cb44fxpdb88-7nhzp kubelet[1395]: E0229 16:58:00.068830 1395 remote_runtime.go:205] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"xy6041930e87e6f22f96f07809c178f614d6e10bc17f70dfa1c144daee8855d4\": plugin type=\"multus-shim\" name=\"multus-cni-network\" failed (delete): CmdDel (shim): failed to send CNI request: Post \"http://dummy/cni\": EOF" podSandboxID="xy6041930e87e6f22f96f07809c178f614d6e10bc17f70dfa1c144daee8855d4"

Environment

3.1
3.0

Cause

Users have experienced OOM-killed calico-ipam processes when using Multus with Calico in certain clusters (likely at higher scale), which causes intermittent failures when creating containers. The calico-ipam plugin was OOM-killed inside the multus-cni DaemonSet pod because the 50Mi memory limit was too low.
The log shows the limit being hit: "memory: usage 51200kB, limit 51200kB". The memory request and limit on Multus therefore need to be increased.
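The 51200kB figure in the log is the 50Mi limit expressed in kilobytes (50 x 1024). The following is a minimal sketch, assuming the OOM message keeps the "memory: usage <N>kB, limit <N>kB" shape quoted above, of how a logged line could be checked against the limit. The function names mi_to_kb and at_memory_limit are hypothetical helpers, not part of TCA or Multus.

```shell
# Convert a Mi quantity (for example 50 for 50Mi) to kB, the unit used in the log.
mi_to_kb() {
  echo $(( $1 * 1024 ))
}

# Read one log line on stdin and succeed if usage has reached the limit.
# Assumes the "memory: usage <N>kB, limit <N>kB" format; any other format fails.
at_memory_limit() {
  read -r line || true
  usage=$(printf '%s\n' "$line" | sed -n 's/.*usage \([0-9][0-9]*\)kB.*/\1/p')
  limit=$(printf '%s\n' "$line" | sed -n 's/.*limit \([0-9][0-9]*\)kB.*/\1/p')
  [ -n "$usage" ] && [ -n "$limit" ] && [ "$usage" -ge "$limit" ]
}
```

For example, echo "memory: usage 51200kB, limit 51200kB" | at_memory_limit succeeds, which matches a container running at its 50Mi limit.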

Resolution

Increase the memory request/limit on multus via the TCA UI

1. Log into the TCA Web UI.

2. Go to Infrastructure > CaaS Infrastructure.

3. Click the target workload cluster in the Cluster list.

4. Click Add-ons.

5. Click the three dots before the Multus add-on and click Edit.

6. Click the SAVE button on the Add-on Configuration dialog.

7. Click the NEXT button.

8. Click Custom Resources (CR) on the top.

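9. Edit the YAML in the right-hand pane to add the increased resources values, for example the limits (cpu: 300m, memory: 150Mi) and requests (cpu: 200m, memory: 100Mi) used in the command-line procedure later in this article.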

10. Click DEPLOY CHANGES at the bottom.

11. Wait for the add-on status to change to Provisioned.

Verification:

1. Log in as the admin user to the TCA-CP where the management cluster is deployed.

2. Run the following commands as root to SSH to the workload cluster:

su -

ssh capv@<workload cluster endpoint IP>

3. Check that the Multus pods have the new resources.

kubectl get pod -n kube-system -l name=multus -o jsonpath="{range .items[*]}{.spec.containers[*].resources}{'\n'}{end}"
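The output of the check above can also be validated with a small script rather than read by eye. This is a sketch only: it assumes kubectl prints each pod's resources as one JSON object per line (for example {"limits":{...},"requests":{...}}) and that each Multus pod has a single container. The function name all_have_memory_limit is a hypothetical helper.

```shell
# Succeed only if every non-empty line on stdin has the expected memory limit.
# $1 is the expected quantity, for example 150Mi.
# Assumes one JSON resources object per line, as described in the lead-in.
all_have_memory_limit() {
  want=$1
  seen=0
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    seen=1
    # Extract the contents of the "limits" object so that a matching
    # value under "requests" is not mistaken for the limit.
    limits=$(printf '%s\n' "$line" | sed -n 's/.*"limits":{\([^}]*\)}.*/\1/p')
    case $limits in
      *"\"memory\":\"$want\""*) ;;
      *) return 1 ;;
    esac
  done
  # Fail when no pod lines were seen at all.
  [ "$seen" -eq 1 ]
}
```

A possible use is to pipe the kubectl command from the step above into all_have_memory_limit 150Mi and check the exit status.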

Increase the memory request/limit on multus via the TCA-CP command line:

Note: Changes made through the command line are overwritten by any later update made through the TCA UI. After an upgrade, edit the Multus add-on in the UI as soon as possible.

1. Log in as the admin user to the TCA-CP where the management cluster is deployed.

2. Run the following commands as root to SSH to the management cluster:

su -

ssh capv@<management cluster endpoint IP>

3. Get the current multus values.yaml

kubectl -n <workload cluster name> get secret multus-tca-addon-secret -o "jsonpath={@.data.values\.yaml}"|base64 -d > multus.yaml

4. Add resources to multus.yaml

cat <<EOF >> multus.yaml
resources:
  limits:
    cpu: 300m
    memory: 150Mi
  requests:
    cpu: 200m
    memory: 100Mi
EOF

5. Apply the new values.yaml to the Multus secret.

VALUES_YAML=$(base64 -w0 multus.yaml)

kubectl patch secret -n <workload cluster name> multus-tca-addon-secret --patch '{"data":{"values.yaml":"'"$VALUES_YAML"'"}}'

6. Exit from the management cluster and SSH to the workload cluster.

exit

ssh capv@<workload cluster endpoint IP>

7. Check if multus pods have the new resources.

kubectl get pod -n kube-system -l name=multus -o jsonpath="{range .items[*]}{.spec.containers[*].resources}{'\n'}{end}"
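Steps 4 and 5 above can be wrapped in helpers so that the resources block is not appended twice if the procedure is repeated. The following is a sketch under stated assumptions: the resources key is assumed to sit at the top level of values.yaml with the usual two-space nesting, and the function names append_resources_once and build_secret_patch are hypothetical. base64 -w0 is the GNU coreutils form already used in step 5.

```shell
# Append the resources block to a values file only if no top-level
# "resources:" key exists yet. $1 is the file path (for example multus.yaml).
append_resources_once() {
  file=$1
  if grep -q '^resources:' "$file"; then
    return 0
  fi
  # Make sure the file ends with a newline so the new key starts on its own line.
  if [ -n "$(tail -c1 "$file")" ]; then
    echo >> "$file"
  fi
  cat <<'EOF' >> "$file"
resources:
  limits:
    cpu: 300m
    memory: 150Mi
  requests:
    cpu: 200m
    memory: 100Mi
EOF
}

# Print the patch body for the secret, with the file base64-encoded
# on a single line. $1 is the file path.
build_secret_patch() {
  printf '{"data":{"values.yaml":"%s"}}' "$(base64 -w0 "$1")"
}
```

With these helpers, the patch in step 5 could be issued as kubectl patch secret -n <workload cluster name> multus-tca-addon-secret --patch "$(build_secret_patch multus.yaml)" after running append_resources_once multus.yaml. If values.yaml already contains a resources key with the old values, edit that key by hand instead.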

Additional Information

Impact/Risks:
This issue was observed in Telco Cloud Automation 3.0. It applies only to Multus 4.0.1 or later and TKG 2.3.1 or later, and may also occur on TCA 3.1.