NSX Intelligence llanta-detectors-0 Pod Crash Due to High Memory Usage

Article ID: 319819

Updated On:

Products

VMware NSX

Issue/Introduction

Symptoms:

The health status of the NSX Intelligence feature is reported as DOWN in the UI.

The following snippets show the commands used to check the pod status on an affected installation and the error conditions they report. The commands are run from the NSX Manager.

The llanta-detectors-0 pod is reported as being in the CrashLoopBackOff status:
root@nsx-mgr-0:~# napp-k get pods | grep llanta-detectors-0
NAME READY STATUS RESTARTS AGE
llanta-detectors-0 3/4 CrashLoopBackOff 16 (3m19s ago) 28h

The llanta-detectors-0 pod is reported as having crashed after running out of memory (OOM):
root@nsx-mgr-0:~# napp-k describe pod llanta-detectors-0
Name: llanta-detectors-0
Namespace: nsxi-platform
Priority: 0
Service Account: llanta-detectors-sa
Node: napp-cluster-default-workers-b7n6f-7689678688-czq49/40.40.0.56
Start Time: Sun, 11 Feb 2024 06:28:56 +0000
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 12 Feb 2024 10:41:08 +0000
Finished: Mon, 12 Feb 2024 10:42:57 +0000
Ready: False
Restart Count: 16
...
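
To identify which container(s) ran out of memory, the per-container termination reasons can be listed directly from the pod status. The following is only a sketch, assuming napp-k forwards standard kubectl arguments (such as -o jsonpath) to the underlying cluster:
root@nsx-mgr-0:~# napp-k get pod llanta-detectors-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
Any container showing OOMKilled in the second column is one that exceeded its memory limit.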


Environment

VMware NSX 4.1.0.2

Cause

Under high load conditions (typically around 500k flows per 5 minutes), and depending on the traffic being monitored, the llanta-detectors-0 pod may crash after running out of memory (OOM) and eventually end up in the CrashLoopBackOff status. Depending on the specific circumstances of the affected installation, the pod may recover automatically after some time or may remain in that error state.

We have identified two reasons that may lead to this situation.

First, in some conditions (depending on the particular traffic topology and mix), processing a high volume of flows may lead the llanta-detectors-0 pod to use more memory than the system's configuration allows. In this case, one or more of its containers will end up in the OOM condition.

Second, in some conditions, when the llanta-detectors-0 pod (more precisely, its llanta-worker container) reads network flow records from the Kafka queuing system, it fetches too many records at the same time and runs out of memory. Note that in this case the only container in the OOM condition is the llanta-worker container.

Resolution

The resolution to this problem is implemented as a series of optimizations that reduce the memory usage of the llanta-detectors-0 pod even under high flow rates and unfavorable traffic topology/mix conditions.

VERSIONS WHERE THIS IS A KNOWN ISSUE: 4.1.2

VERSION WHERE THIS IS FIXED: 4.2

Workaround:
To work around the issue with the overall memory usage of the llanta-detectors-0 pod, we recommend increasing the memory limits assigned to its containers using the following instructions (run on the NSX Manager):
  1. Execute napp-k edit statefulset llanta-detectors
  2. Manually modify the limits from 8Gi to 12Gi for the following containers: llanta-service, llanta-worker, llanta-job-time-series, and llanta-job-netflow-beaconing (see the example snippet after these steps)
  3. Execute napp-k delete pod llanta-detectors-0 to ensure the pod gets restarted with the new limits
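The exact layout of the StatefulSet specification may vary between versions; the following snippet is only an illustration of what the resources section of one of the listed containers should look like after the change:
...
- name: llanta-worker
  resources:
    limits:
      memory: 12Gi   # changed from 8Gi
...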

To work around the issue caused by fetching too many flow records, we recommend decreasing the maximum number of records fetched by the llanta-worker container using the following instructions (run on the NSX Manager):
  1. Execute napp-k edit configmap llanta-worker-env-vars
  2. Under the kafka key, add the option max_records: 1  (see below for an example)
  3. Execute `napp-k delete pod llanta-detectors-0` to ensure the pod gets restarted with the new configuration
Example ConfigMap with the added max_records configuration:
...
kafka:
  max_records: 1
  broker_location: kafka:9092
  consumer_group: 'llanta-detectors'
...
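
Once the pod has been deleted and recreated, the same command shown in the Symptoms section can be used to verify that it comes back up healthy; after a few minutes all of its containers should be ready (for example, 4/4 Running instead of 3/4 CrashLoopBackOff):
root@nsx-mgr-0:~# napp-k get pods | grep llanta-detectors-0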





Additional Information

Impact/Risks:
The impact of this error is that the health of the NSX Intelligence feature is reported as DOWN and that the following Suspicious Traffic Detectors are not functional: Data Upload/Download, Destination IP Profiler, DNS Tunneling, Domain Generation Algorithm (DGA), Netflow Beaconing, Port Profiler, Server Port Profiler, and Unusual Network Traffic Pattern.

RELEVANT LOG LOCATIONS: The support bundle for NAPP contains logs for all services running on NAPP, including those from the llanta-detectors-0 pod that are relevant to this issue. If investigating a live system, the logs generated by the llanta-detectors-0 pod are relevant; examples of the commands that can be used to inspect the pod are presented in the Symptoms section.
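As a sketch (assuming napp-k forwards the standard kubectl logs subcommand, and using the container names listed in the workaround), the logs of an individual container in the pod can be retrieved with a command such as:
root@nsx-mgr-0:~# napp-k logs llanta-detectors-0 -c llanta-worker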

STEPS TO REPRODUCE: The issue may occur under high flow rate conditions (e.g., 500k flows per 5 minutes). Depending on the specific characteristics of the traffic being monitored, such a flow rate may lead to high memory utilization in the llanta-detectors-0 pod.