NSX Intelligence is Down due to Out Of Memory issues in NTA POD `llanta-detectors-0` - `llanta-job*` container

Article ID: 372914


Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

The health status of the NSX Intelligence feature is reported as DOWN in the UI.

The DOWN status may be caused by the llanta-detectors-0 pod running out of memory.

Run the following commands on an NSX Manager while logged in over SSH as root; you will see output snippets similar to the following:


The llanta-detectors-0 pod is reported in the CrashLoopBackOff status:

root@nsx-mgr-0:~# napp-k get pods | grep llanta-detectors-0
NAME                 READY   STATUS             RESTARTS         AGE
llanta-detectors-0   3/4     CrashLoopBackOff   16 (3m19s ago)   28h


One or more of the llanta-job-netflow-beaconing and llanta-job-time-series containers in the llanta-detectors-0 pod are reported as having crashed after running out of memory (OOM):

root@nsx-mgr-0:~# napp-k describe pod llanta-detectors-0
Name: llanta-detectors-0
Namespace: nsxi-platform
Priority: 0
Service Account: llanta-detectors-sa
Node: napp-cluster-default-workers-b7n6f-7689678688-czq49/40.40.0.56
Start Time: Sun, 11 Feb 2024 06:28:56 +0000
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled <<<<<<<<<<<<================
Exit Code: 137
Started: Mon, 12 Feb 2024 10:41:08 +0000
Finished: Mon, 12 Feb 2024 10:42:57 +0000
Ready: False
Restart Count: 16
...
Containers:
  ...
  llanta-job-time-series:  <<<<<<<<<<<<================
    ...
    State:          Running
      Started:      Mon, 12 Feb 2024 10:41:08 +0000
    Last State:     Terminated
      Reason:       OOMKilled <<<<<<<<<<<<================

If you see a container listed other than the ones covered by this KB, search for the KB article that references that container.
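
As a quicker alternative to reading through the full describe output, the last termination reason of every container in the pod can be listed with a single JSONPath query. The following is a minimal sketch using standard kubectl output formatting (it is not part of the documented procedure); containers that were not terminated show an empty reason:

napp-k get pod llanta-detectors-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'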

Environment

This issue has been observed on ATP version 4.2.0, particularly in scenarios where workloads communicate in a mesh fashion on a high number of ports.

Cause

These containers process data and maintain an in-memory state that in some cases can grow to exceed the limits set for the container. The main factors contributing to the size of the state are:

  • Number of enabled detectors
  • Network topology
  • Flow rate (a close third contributor for these containers)

Resolution

Solution 1 - increase the limits

If you believe the environment has enough resources to allow higher limits for the detectors, we suggest increasing the limits for the containers. The memory pressure on the nodes can be inspected in the NSX UI under System → NSX Application Platform. The page provides a detailed summary of the available memory on each node.
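
If you prefer the command line, a rough view of the same information can be obtained from the node objects themselves. The following is a minimal sketch using standard kubectl commands (assuming napp-k behaves as the usual kubectl wrapper; nodes are cluster-scoped, so the namespace it implies does not matter here), and the grep pattern is only illustrative:

napp-k describe nodes | grep -E -A 8 'Name:|Allocated resources:'

This prints each node's name followed by its Allocated resources summary, showing how much memory is already requested versus what is allocatable.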

To increase the limits for the containers, patch the llanta-detectors statefulset resource via the following Kubernetes command:

 

LLANTA_JOB_LIMIT="<NEW_VALUE_LLANTA_JOB>" && napp-k patch statefulset llanta-detectors -p="{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"llanta-job-netflow-beaconing\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_JOB_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_JOB_LIMIT\"}}}, {\"name\":\"llanta-job-time-series\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_JOB_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_JOB_LIMIT\"}}}]}}}}"

 

For example, to set the memory limit to 10Gi for the llanta-job-* containers, execute:

LLANTA_JOB_LIMIT="10Gi" && napp-k patch statefulset llanta-detectors -p="{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"llanta-job-netflow-beaconing\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_JOB_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_JOB_LIMIT\"}}}, {\"name\":\"llanta-job-time-series\", \"resources\":{\"limits\":{\"memory\": \"$LLANTA_JOB_LIMIT\"},\"requests\":{\"memory\": \"$LLANTA_JOB_LIMIT\"}}}]}}}}"
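
Before restarting the pod, you can optionally confirm that the statefulset template now carries the new memory limits. This is a minimal sketch using standard kubectl JSONPath output, not a step from the documented procedure:

napp-k get statefulset llanta-detectors -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'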

Delete the llanta-detectors-0 pod and let Kubernetes restart it with the new resource values:

napp-k delete pod llanta-detectors-0

NOTE: This command also aligns the requests for the containers to the limits. This ensures that the cluster has enough capacity to run the pod with the updated limits.

Finally, wait a few minutes and verify the llanta-detectors-0 pod is running:

napp-k get pods | grep llanta-detectors-0
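
To confirm that the containers are no longer being OOM-killed, the per-container restart counts can also be checked over time. This is a minimal sketch using standard kubectl JSONPath output, not part of the documented procedure:

napp-k get pod llanta-detectors-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'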

NOTE: If the pod fails to be scheduled, it can be examined via the following command:

napp-k describe pod llanta-detectors-0

Failed scheduling of the pod is indicated in the Events section of the output:

Warning  FailedScheduling  32s   default-scheduler  0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
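
When Insufficient memory is reported, the allocatable memory of each node can be listed to judge whether lower limits would fit or additional worker capacity is needed. This is a minimal sketch using standard kubectl custom columns, not part of the documented procedure:

napp-k get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory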

If the cluster does not have enough capacity to accommodate the new limits, the possible solutions are:

  • Repeat the patch command above with lower values for the limits
  • Increase the cluster capacity by adding worker nodes

Solution 2 - disable the detectors responsible for the jobs

If increasing the limits is not an option in the particular environment (for example, because the memory pressure on the nodes is already high), we suggest disabling the detectors that the two job containers are handling.

Detectors can be disabled via the NSX Manager UI under Security → Suspicious Traffic → Detector Definitions, or Threat Detection & Response → Settings → NTA Detectors Definitions. The detectors that can be disabled to relieve memory pressure on the llanta-job-* containers are:

  • Unusual Network Traffic Pattern
  • Netflow Beaconing

Additional Information

If you are following Solution 1 to solve this issue, you should also apply the steps in the related article "NSX Intelligence is Down due to Out Of Memory issues in NTA POD `llanta-detectors-0` - `llanta-service` container", otherwise that issue will likely be hit later.