NSX Intelligence is Down due to Out Of Memory issues in NTA POD `llanta-detectors-0` - `llanta-service` container

Article ID: 372913


Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

The health status of the NSX Intelligence feature is reported as DOWN in the UI.

The DOWN status may be caused by the llanta-detectors-0 pod running out of memory.

Running the commands below on an NSX Manager, while logged in as root over SSH, shows the following output snippets:


The llanta-detectors-0 pod is reported in the CrashLoopBackOff status:

root@nsx-mgr-0:~# napp-k get pods | grep llanta-detectors-0
NAME READY STATUS RESTARTS AGE
llanta-detectors-0 3/4 CrashLoopBackOff 16 (3m19s ago) 28h
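
A steadily increasing restart count indicates a recurring crash. Optionally, the recent events recorded for the pod can also be listed (standard kubectl options; the exact events will vary per environment):

napp-k get events --field-selector involvedObject.name=llanta-detectors-0 --sort-by='.lastTimestamp'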


The llanta-service container in the llanta-detectors-0 pod is reported as having crashed after running out of memory (OOM):

root@nsx-mgr-0:~# napp-k describe pod llanta-detectors-0
Name: llanta-detectors-0
Namespace: nsxi-platform
Priority: 0
Service Account: llanta-detectors-sa
Node: napp-cluster-default-workers-b7n6f-7689678688-czq49/40.40.0.56
Start Time: Sun, 11 Feb 2024 06:28:56 +0000
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled <<<<<<<<<<<<================
Exit Code: 137
Started: Mon, 12 Feb 2024 10:41:08 +0000
Finished: Mon, 12 Feb 2024 10:42:57 +0000
Ready: False
Restart Count: 16
...
Containers:
  ...
  llanta-service: <<<<<<<<<<<<================
    ...
    State:          Running
      Started:      Mon, 12 Feb 2024 10:41:08 +0000
    Last State:     Terminated
      Reason:       OOMKilled <<<<<<<<<<<<================


If a different container than the one covered in this article is reported as OOMKilled, please search for the KB article that covers that container.
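
To quickly confirm which container in the pod was last terminated for OOM (and therefore which KB article applies), the container statuses can be listed with a standard kubectl jsonpath query; this is shown only as an optional check and the output will vary per environment:

napp-k get pod llanta-detectors-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'

If needed, the logs of the previously crashed llanta-service instance can also be retrieved with:

napp-k logs llanta-detectors-0 -c llanta-service --previous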

 

Environment

This issue has been observed on ATP version 4.2.0, in particular in scenarios where workloads communicate in a mesh fashion across a high number of ports.

Cause

These containers process data and maintain an in-memory state that in some cases can grow to exceed the limits set for the container. The main factors contributing to the size of the state are:

  • Number of enabled detectors
  • Network topology
  • Flow rate, a close third contributor for this container

Resolution


Solution 1 - increase the limits

If you believe the environment has enough resources to allow higher limits for the detectors, we suggest increasing the limits for the containers. The memory pressure on the nodes can be inspected in the NSX UI under System → NSX Application Platform; the page provides a detailed summary of the available memory on each node.
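
A similar view is available from the command line (this assumes the Kubernetes metrics API is available in the NAPP cluster, which is normally the case when the UI reports node memory usage):

napp-k top nodes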

To increase the limits for the containers, patch the llanta-detectors statefulset resource via the following Kubernetes command:

LLANTA_SERVICE_LIMIT="<NEW_VALUE_LLANTA_SERVICE>" && napp-k patch statefulset llanta-detectors -p="{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"llanta-service\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_SERVICE_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_SERVICE_LIMIT\"}}}]}}}}"

 

For example, to set the memory limit to 12Gi for the llanta-service container, execute:

LLANTA_SERVICE_LIMIT="12Gi" && napp-k patch statefulset llanta-detectors -p="{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"llanta-service\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_SERVICE_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_SERVICE_LIMIT\"}}}]}}}}"

Delete the llanta-detectors-0 pod and let Kubernetes restart it with the new resource values:

napp-k delete pod llanta-detectors-0

NOTE: This command also aligns the requests for the containers to the limits. This ensures that the cluster has enough capacity to run the pod with the updated limits.

Finally, wait a few minutes and verify that the llanta-detectors-0 pod is running:

napp-k get pods | grep llanta-detectors-0

NOTE: If the pod fails to be scheduled, it can be examined via the following command:

napp-k describe pod llanta-detectors-0

Failed scheduling of the pod is indicated in the Events section of the output:

Warning  FailedScheduling  32s   default-scheduler  0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.

If the cluster does not have enough capacity to accommodate the new limits, the possible solutions are:

  • Repeat the patch command above with lower values for the limits
  • Increase the cluster capacity by adding worker nodes
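
To gauge how much memory is currently requested and allocatable on each node before choosing a new value, the node descriptions can also be inspected from the command line (standard kubectl output; the grep window is only a convenience and may need adjusting):

napp-k describe nodes | grep -A 8 "Allocated resources"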

Solution 2 - disable some of the detectors

If increasing the limits is not an option in the particular environment (for example, because the memory pressure on the nodes is already high), we suggest disabling some of the affected detectors in order to reduce the size of the state. This can be done incrementally, depending on the priority that each detector has in the specific environment.

Detectors can be disabled via the NSX Manager UI under Security → Suspicious Traffic → Detector Definitions, or Threat Detection & Response → Settings → NTA Detectors Definitions. Below is the suggested order in which to disable detectors (in decreasing order of memory requirements) if there are no specific priorities in your environment; a command to monitor the resulting memory usage is shown after the list:

  • Unusual Network Traffic Pattern
    • This detector typically represents 90% of the memory used by the llanta-service container
    • Versions older than 4.2.0 have a data retention bug that causes the memory requirement to grow linearly over time
  • Destination IP profiler
    • Versions older than 4.2.0 did not aggregate data aggressively enough, leading to high memory consumption in complex topologies
    • The pre-4.2.0 implementation also had slower performance, leading to intermittent slowdowns
  • Port Profiler or Server Port Profiler (they require a similar amount of memory internally)
  • Netflow Beaconing
  • DNS Tunneling or DGA (they require a similar amount of memory internally)
  • Unusual Data Upload/Download (minimal memory requirements, not likely to have any impact)
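
After disabling one or more detectors, the per-container memory usage of the pod can be monitored to confirm the effect (this assumes the Kubernetes metrics API is available in the NAPP cluster):

napp-k top pod llanta-detectors-0 --containers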

Solution 3 (ATP > 4.2.0 only) - reduce the limit for the memory guardrail for the llanta-service container

To reduce the memory guardrail threshold for the container, we need to modify the memory_limit_percentage value in the llanta-service-env-vars configmap. The default value for this field is 90.

Edit the configmap via the following command, substituting the new percentage value:

MEMORY_LIMIT_PERCENTAGE=<NEW_MEMORY_LIMIT_PERCENTAGE> && napp-k get cm llanta-service-env-vars -o yaml | sed "s/memory_limit_percentage:.*/memory_limit_percentage: $MEMORY_LIMIT_PERCENTAGE/" | napp-k replace -f -

For example, choosing a new value for the option and lowering it from 90 to 60:

MEMORY_LIMIT_PERCENTAGE=60 && napp-k get cm llanta-service-env-vars -o yaml | sed "s/memory_limit_percentage:.*/memory_limit_percentage: $MEMORY_LIMIT_PERCENTAGE/" | napp-k replace -f -
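
Optionally, confirm that the configmap now carries the new value:

napp-k get cm llanta-service-env-vars -o yaml | grep memory_limit_percentage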

Delete the llanta-detectors-0 pod and let Kubernetes restart it with the new configuration value:

napp-k delete pod llanta-detectors-0

NOTE: This workaround affects the amount of baseline data that is retained for the affected NTA detectors. When the guardrail is hit, the llanta-service container drops some of the baseline data to reclaim memory. We suggest keeping this solution as a last option in case increasing the limits for the pod is not possible.

Additional Information

If you are following Solution 1 to resolve this issue, you should also apply the steps in the related article NSX Intelligence is Down due to Out Of Memory issues in NTA POD `llanta-detectors-0` - `llanta-job*` container, otherwise that issue will likely be hit later.