NAPP Envoy proxy (Contour) gets OOMKilled when deploying NAPP-hosted Malware Prevention Service VMs

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Deployment of Malware Prevention Service (MPS) VMs hosted within NAPP intermittently fails with an error in NSX.

In the vCenter UI, the OVF deployment task is cancelled automatically with the message "The task was canceled by a user."

In NAPP, the projectcontour-envoy pods restart, with OOMKilled as the reason for termination.

Environment

NAPP 4.2.0 and 4.2.0.1 when used with vCenter 8 update 3e and later.

NSX 9.0 + VC 9.0

Cause

Starting with NAPP 4.2.0, Malware Prevention Service (MPS) VM Images are hosted in the NAPP repository itself. When deploying MPS Service VMs, the bits are downloaded from the NAPP (which acts as the server for the files) to the vCenter's EAM (which is the client in this scenario) and then deployed to the required ESXi hosts in the environment.
In this workflow, NAPP and vCenter communicate through the proxy on each side, which is Envoy in both cases.

Due to configuration changes in the vCenter Envoy proxy, starting with vCenter 8 update 3e, HTTP/2 is the default protocol negotiated between the two proxies.

HTTP/2 has larger default memory requirements because of its 256MB default buffer allocations.

As a result, the NAPP-side proxy buffers a large amount of data while serving the MPS SVM bits, and the Envoy pods get OOMKilled.

The following indicators can be used to identify this issue:

The MPS service SVM deployment fails with an error seen in NSX.

In the vCenter UI, the OVF deployment task is cancelled automatically with the message "The task was canceled by a user."

In the EAM logs in vCenter (/var/log/eam/eam.log in VC), the bits are partially downloaded before an error is encountered:

2025-05-05T06:44:22.542Z | ERROR | VM-push-dispatcher-16 | UploadConnection.java | 202 | Upload failed.org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1,917,239,296; received: 289,759,250)
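To gauge how far the download got before the connection dropped, the expected and received byte counts can be extracted from the eam.log error. The following is a minimal sketch, not part of the official procedure; the function name eam_download_progress is hypothetical, and the pattern assumes the error format shown in the sample line above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads eam.log lines on stdin and reports, for each
# "Premature end" error, the expected size, the received size, and the
# percentage of the image that was transferred.
eam_download_progress() {
  grep 'Premature end of Content-Length delimited message body' |
    sed -E 's/.*expected: ([0-9,]+); received: ([0-9,]+).*/\1 \2/' |
    tr -d ',' |
    awk '{ printf "expected=%s received=%s percent=%.1f\n", $1, $2, 100 * $2 / $1 }'
}

# Example using the log line quoted above.
sample='2025-05-05T06:44:22.542Z | ERROR | VM-push-dispatcher-16 | UploadConnection.java | 202 | Upload failed.org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1,917,239,296; received: 289,759,250)'
printf '%s\n' "$sample" | eam_download_progress
# prints: expected=1917239296 received=289759250 percent=15.1

# On the vCenter appliance the same function could be fed the real log:
#   eam_download_progress < /var/log/eam/eam.log
```

A transfer that repeatedly stops after only a small fraction of the image is consistent with the proxy being killed partway through serving the file.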

In NAPP:

We see that the envoy pods are getting restarted

napp-k get pods -n projectcontour
NAME                                      READY   STATUS    RESTARTS        AGE
projectcontour-envoy-ck4t4                2/2     Running   1 (2d16h ago)   6d23h
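When the namespace contains several pods, the restarted ones can be filtered out of the listing. This is a small sketch that assumes the column layout shown above (NAME, READY, STATUS, RESTARTS, AGE); the function name restarted_pods is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `get pods` table output on stdin and prints the
# name and restart count of every pod that has restarted at least once.
restarted_pods() {
  # NR > 1 skips the header row; column 4 holds the restart count.
  awk 'NR > 1 && $4 + 0 > 0 { print $1, $4 }'
}

# Example using the listing shown above.
restarted_pods <<'EOF'
NAME                                      READY   STATUS    RESTARTS        AGE
projectcontour-envoy-ck4t4                2/2     Running   1 (2d16h ago)   6d23h
EOF
# prints: projectcontour-envoy-ck4t4 1

# Against a live NAPP cluster:
#   napp-k get pods -n projectcontour | restarted_pods
```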

When we describe the pod, we see OOMKilled as the reason for the last restart

napp-k describe pod projectcontour-envoy-ck4t4 -n projectcontour

    State:          Running
      Started:      Fri, 02 May 2025 16:30:13 +0000
    Last State:     Terminated
      Reason:       OOMKilled                 <----------- OOMKilled seen as the reason for last restart
      Exit Code:    1
      Started:      Thu, 01 May 2025 01:42:50 +0000
      Finished:     Fri, 02 May 2025 16:30:12 +0000
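The termination reason can also be pulled out of the describe output without reading through it by hand. The sketch below assumes the "Last State" / "Reason" layout shown above; the function name last_termination_reasons is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `describe pod` output on stdin and prints the
# Reason recorded under each "Last State" entry (one line per container
# that has a previous termination).
last_termination_reasons() {
  awk '
    /Last State:/          { in_last = 1; next }
    in_last && /Reason:/   { print $2; in_last = 0 }
  '
}

# Example using an excerpt of the output shown above.
last_termination_reasons <<'EOF'
    State:          Running
      Started:      Fri, 02 May 2025 16:30:13 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    1
EOF
# prints: OOMKilled

# Against a live NAPP cluster:
#   napp-k describe pod projectcontour-envoy-ck4t4 -n projectcontour | last_termination_reasons
```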

Resolution

To resolve this issue, provide a few additional configuration options for the Envoy proxy in NAPP to optimize serving large files, and allocate more memory to the pods.

Also note that in order not to overburden the NAPP proxy, SVM deployment should be triggered only on one cluster at a time.

The following edits are required in the NAPP proxy configuration (projectcontour):

1. Edit ConfigMap for projectcontour.

Config Map YAML

Add the cluster and listener configurations (marked with an arrow below) to the data->contour.yaml section of the projectcontour configmap.

root@nsx-mgr-0:~# napp-k edit configmap -n projectcontour projectcontour -o yaml

apiVersion: v1
data:
  contour.yaml: |-
    accesslog-format: envoy
    cluster:                                         # <------- Add cluster and listener configurations
      per-connection-buffer-limit-bytes: 65536
    listener:
      http2-max-concurrent-streams: 100
      per-connection-buffer-limit-bytes: 65536
    disablePermitInsecure: false
    tls:
      envoy-client-certificate:
...
...

2. Edit the projectcontour-contour deployment.

Deployment YAML

Set the memory request to 96Mi in the spec->template->spec->resources->requests section (marked with an arrow below).

root@nsx-mgr-0:~# napp-k edit deployment -n projectcontour projectcontour-contour -o yaml


apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
        resources:
          limits:
            memory: 256Mi
          requests:
            cpu: 40m
            memory: 96Mi           # <----- Increase the memory to 96Mi
        ...
...
...

3. Update the memory allocations in the daemonset for the projectcontour-envoy pods.

In the YAML below, update or add the lines marked with an arrow:

Update the spec->template->spec->containers (where command is 'envoy') ->resources section as shown.
Update the spec->template->spec->initContainers (where command is 'contour') ->resources section as shown.
Add the '- --overload-max-heap=335544320' argument to args where command is 'contour'.

root@nsx-mgr-0:~# napp-k edit daemonset -n projectcontour projectcontour-envoy -o yaml

apiVersion: apps/v1
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      affinity: {}
      automountServiceAccountToken: false
      containers:
      ...
      - args:
        - -c
        - /config/envoy.json
        - --service-cluster $(CONTOUR_NAMESPACE)
        - --service-node $(ENVOY_POD_NAME)
        - --log-level info
        command:
        - envoy
        ...
        resources:
          limits:
            memory: 500Mi                 # <---- Increase memory and CPU resources allocated
          requests:
            cpu: 200m
            memory: 300Mi
        ...
      initContainers:
      - args:
        - bootstrap
        - /config/envoy.json
        - --xds-address=some-address
        - --xds-port=some-port
        - --resources-dir=/config/resources
        - --envoy-cafile=/certificate/sample-ca.crt
        - --envoy-cert-file=/certificate/sample.crt
        - --envoy-key-file=/certificate/sample.key
        - --overload-max-heap=335544320      # <----- provide this additional argument
        command:
        - contour
        ...
        resources:
          limits:
            memory: 500Mi                   # <---- Increase memory and CPU resources allocated
          requests:
            cpu: 200m
            memory: 300Mi
        ...
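The --overload-max-heap value is given in bytes, while the container limits are given in Mi, so the two are easier to compare after conversion. The short sketch below does that arithmetic for the values used in this article; the function name bytes_to_mib is hypothetical, and the comparison assumes the heap threshold is meant to sit below the 500Mi container limit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: converts a byte count to whole MiB (1 MiB = 1024 * 1024 bytes).
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

heap_mib=$(bytes_to_mib 335544320)   # value passed to --overload-max-heap
limit_mib=500                        # memory limit set in the daemonset above

echo "overload-max-heap = ${heap_mib} MiB, $(( 100 * heap_mib / limit_mib ))% of the ${limit_mib}Mi limit"
# prints: overload-max-heap = 320 MiB, 64% of the 500Mi limit
```

If the memory limit is changed in a given environment, the same conversion can be used to check that the heap threshold still leaves headroom below it.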