NSX Intelligence is Down due to Out Of Memory issues in NTA POD `llanta-detectors-0` - `llanta-worker` container

Article ID: 372911


Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

The health status of the NSX Intelligence feature is reported as DOWN in the UI.

The DOWN status may be caused by the llanta-detectors-0  pod running out of memory.

Running the commands below on an NSX Manager, while logged in via an SSH session as root, will show output similar to the following snippets:


The llanta-detectors-0  pod is reported in the CrashLoopBackOff status:

root@nsx-mgr-0:~# napp-k get pods | grep llanta-detectors-0
NAME                 READY   STATUS             RESTARTS         AGE
llanta-detectors-0   3/4     CrashLoopBackOff   16 (3m19s ago)   28h


The llanta-worker container in the llanta-detectors-0 pod is reported as having crashed after running out of memory (OOM):

root@nsx-mgr-0:~# napp-k describe pod llanta-detectors-0
Name: llanta-detectors-0
Namespace: nsxi-platform
Priority: 0
Service Account: llanta-detectors-sa
Node: napp-cluster-default-workers-b7n6f-7689678688-czq49/40.40.0.56
Start Time: Sun, 11 Feb 2024 06:28:56 +0000
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled <<<<<<<<<<<<================
Exit Code: 137
Started: Mon, 12 Feb 2024 10:41:08 +0000
Finished: Mon, 12 Feb 2024 10:42:57 +0000
Ready: False
Restart Count: 16
...
Containers:
  ...
  llanta-worker:<<<<<<<<<<<<================
    ...
    State:          Running
      Started:      Mon, 12 Feb 2024 10:41:08 +0000
    Last State:     Terminated
      Reason:       OOMKilled <<<<<<<<<<<<================
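
To quickly identify which container in the pod was terminated for running out of memory, the last terminated reason of each container can also be listed with a jsonpath query (a convenience check; containers that have never been terminated print an empty reason):

napp-k get pod llanta-detectors-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'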


If a different container is reported than the one covered in this KB, please search for the KB that matches that container.

Environment

This issue has been observed on NAPP versions 4.1.2, 4.1.2.1, and 4.2.0 using NTA.

Cause

These containers process data and maintain an in-memory state that in some cases can grow to exceed the limits set for the container. The main factors contributing to the size of the state are:

  • Flow rate (the primary factor)
  • Network topology (a close secondary factor)
  • Number of enabled detectors (a close secondary factor)

Resolution

Review which of the affected NTA detectors listed below are enabled, as this determines which solution to choose:

Detectors can be reviewed via the NSX Manager UI under Security → Suspicious Traffic → Detector Definitions, or Threat Detection & Response → Settings → NTA Detectors Definitions

  • Data Upload/Download
  • Port Profiler
  • Server Port Profiler
  • Destination IP Profiler
  • Netflow Beaconing
  • Unusual Network Traffic Pattern
  • DNS Tunneling
  • Domain Generation Algorithm


Solution 1 (ATP < 4.2.0 only, no streaming NTA detectors enabled and no immediate need for these detectors) - disable the llanta pod


If none of the affected NTA detectors are enabled and there is no need to enable them on this setup, the llanta-detectors-0 pod does not need to be running. We suggest disabling the pod completely by scaling down the llanta-detectors statefulset to 0 replicas:

napp-k scale statefulset llanta-detectors --replicas=0

This will effectively bring down the llanta-detectors-0  pod, and suspend all processing of flows.
NOTE: In this version of ATP there is no automatic way of restoring the replicas if the customer wants to enable some of the affected NTA detectors. In order to restore the replicas, another manual step is needed:

napp-k scale statefulset llanta-detectors --replicas=1
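
In either case, the current replica count of the statefulset can be confirmed afterwards (an optional quick check):

napp-k get statefulset llanta-detectors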

Solution 2 (ATP < 4.2.0 only, no streaming NTA detectors enabled but solution 1 is not applicable) - reduce the number of flows that llanta-worker  polls at each iteration

If Solution 1 is not applicable for any reason (e.g., the customer does not want to completely disable the pod because they might want to enable some of the detectors later without performing further manual steps), we suggest reducing the amount of flows that the llanta-worker container polls from the messaging system at each iteration. This should relieve memory pressure on the worker and resolve the OOM issue.

In order to reduce the amount of flows that the llanta-worker container polls at each iteration, a configuration parameter can be set to control the consumer for the messaging system. In particular, the max_records configuration parameter (default: 1000) can be overridden in the llanta-worker-env-vars  configmap as follows:

Edit the configmap via

MAX_RECORDS=1 && napp-k get cm llanta-worker-env-vars -o yaml | sed "/max_records/d" | sed "s/broker_location:.*/&\n        max_records: $MAX_RECORDS/" | napp-k replace configmap llanta-worker-env-vars -f -
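
Optionally, confirm that the override was written to the configmap (a quick check, assuming the key is rendered as max_records, as in the sed expression above):

napp-k get cm llanta-worker-env-vars -o yaml | grep -n "max_records"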

After editing the configmap we need to restart the pod for the changes to take effect. This is done by running the below command:

napp-k delete pod llanta-detectors-0
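
Once recreated, the pod should eventually report all of its containers as ready (4/4 Running, rather than the 3/4 CrashLoopBackOff shown in the output above). This can be watched with:

napp-k get pod llanta-detectors-0 -w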

Solution 3 (streaming NTA detectors enabled) - reduce the aggregation timeout for the llanta-worker container

If streaming NTA detectors are enabled, we suggest reducing the aggregation timeout (default: 60 seconds) for the llanta-worker container to shrink the internal data structures used during processing. This is controlled by the aggregation_timeout_seconds parameter in the llanta-worker-env-vars configmap.

Edit the configmap via

AGGREGATION_TIMEOUT_SECONDS=1 && napp-k get cm llanta-worker-env-vars -o yaml | sed "/token_aggregator/d" | sed "/aggregation_timeout_seconds/d" | sed "s/WORKER_CLASS:.*/    token_aggregator:\n        aggregation_timeout_seconds: $AGGREGATION_TIMEOUT_SECONDS\n  &/" | napp-k replace configmap llanta-worker-env-vars -f -
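
As with Solution 2, the configmap can optionally be inspected afterwards to confirm that the token_aggregator override was written (assuming the keys are rendered as in the sed expression above):

napp-k get cm llanta-worker-env-vars -o yaml | grep -n -A 1 "token_aggregator"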

 

After editing the configmap we need to restart the pod for the changes to take effect. This is done by running the below command:

napp-k delete pod llanta-detectors-0

This option reduces the amount of time that the llanta-worker  container will accumulate flows for before sending them to the llanta-service container, and simplifies the aggregation of observed flows into batches while decreasing the memory footprint of the process. It is only applicable if streaming NTA detectors are enabled.

Solution 4 - increase the memory limits for the llanta-worker container

To increase the memory limit (and the matching request) for the llanta-worker container, patch the llanta-detectors statefulset, replacing <NEW_VALUE_LLANTA_WORKER> with the desired amount of memory:

LLANTA_WORKER_LIMIT="<NEW_VALUE_LLANTA_WORKER>" && napp-k patch statefulset llanta-detectors -p="{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"llanta-worker\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_WORKER_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_WORKER_LIMIT\"}}}]}}}}"

For example, to set the memory limit to 3Gi, execute:

LLANTA_WORKER_LIMIT="3Gi" && napp-k patch statefulset llanta-detectors -p="{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"llanta-worker\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_WORKER_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_WORKER_LIMIT\"}}}]}}}}"
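
Before restarting the pod, the patched values can be verified on the statefulset (an optional check using standard kubectl jsonpath; llanta-worker is the container name used in the patch above):

napp-k get statefulset llanta-detectors -o jsonpath='{.spec.template.spec.containers[?(@.name=="llanta-worker")].resources}{"\n"}'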

Finally, delete the llanta-detectors-0  pod and let Kubernetes restart it:

napp-k delete pod llanta-detectors-0

NOTE: This command also aligns the container's memory request to the new limit. This ensures that the cluster has enough capacity to run the pod with the updated limits. If the pod fails to be scheduled, it can be examined via the following command:

napp-k describe pod llanta-detectors-0

Failed scheduling of the pod is indicated in the Events section of the output:

Warning  FailedScheduling  32s   default-scheduler  0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.

If the cluster does not have enough capacity to accommodate the new limits, the possible solutions are (see the capacity check after this list):

  • Repeat the patch  command above with lower values for the limits
  • Increase the cluster capacity by adding worker nodes
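
Before choosing between these options, the current per-node memory allocation can be reviewed to gauge the available headroom ("Allocated resources" is a standard section of the kubectl describe node output):

napp-k describe nodes | grep -A 8 "Allocated resources"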