The health status of the NSX Intelligence feature is reported as DOWN in the UI.
The DOWN status may be caused by the llanta-detectors-0 pod running out of memory.
Running the commands below on an NSX Manager, while logged in over SSH as root, will show output similar to the following snippets:
The llanta-detectors-0 pod is reported in the CrashLoopBackOff status:
|
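For illustration only (the article's exact command and output are elided above), the pod status can typically be checked as follows; the napp-k kubectl wrapper on the NSX Manager is an assumption based on a standard NAPP deployment:

  # List the detector pods and check their status (illustrative)
  napp-k get pods | grep llanta-detectors
  # Symptom: llanta-detectors-0 shows STATUS CrashLoopBackOff with a growing RESTARTS count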
The llanta-worker container in the llanta-detectors-0 pod is reported as having crashed after running out of memory (OOM):
|
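One hedged way to confirm the OOM kill, using standard kubectl output fields rather than the article's elided snippet:

  # Inspect the last terminated state of the llanta-worker container (illustrative)
  napp-k describe pod llanta-detectors-0
  # Under the llanta-worker container, look for:
  #   Last State:  Terminated
  #     Reason:    OOMKilled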
This issue has been observed on NAPP versions 4.1.2, 4.1.2.1, and 4.2.0 with NTA in use.
These containers process data and maintain an in-memory state that in some cases can grow to exceed the limits set for the container. The main factors contributing to the size of the state are:
Detectors can be reviewed via the NSX Manager UI under Security → Suspicious Traffic → Detector Definitions, or Threat Detection & Response → Settings → NTA Detectors Definitions.
If no affected NTA detectors are enabled and there is no need for these detectors to be enabled on the setup, there is no need for the llanta-detectors-0 pod to be running. We suggest disabling the pod completely by scaling down the llanta-detectors statefulset to 0 replicas:
|
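A minimal sketch of the scale-down, assuming the napp-k kubectl wrapper (the exact command from the article is elided above):

  # Scale the llanta-detectors statefulset down to 0 replicas (illustrative)
  napp-k scale statefulset llanta-detectors --replicas=0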
This will effectively bring down the llanta-detectors-0 pod and suspend all processing of flows.
NOTE: In this version of ATP there is no automatic way of restoring the replicas if the customer wants to enable some of the affected NTA detectors. In order to restore the replicas, another manual step is needed:
|
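The restore is the reverse operation; a sketch under the same assumptions:

  # Scale the statefulset back to 1 replica so llanta-detectors-0 is recreated (illustrative)
  napp-k scale statefulset llanta-detectors --replicas=1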
Solution 2: Reduce the amount of flows that llanta-worker polls at each iteration
If Solution 1 is not applicable for any reason, e.g., the customer does not want to completely disable the pod because they might want to enable some of the detectors without performing further manual steps, we suggest reducing the amount of flows that the llanta-worker container polls at each iteration from the messaging system. This should relieve memory pressure on the worker and resolve the OOM issue.
To reduce the amount of flows that the llanta-worker container polls at each iteration, a configuration parameter can be set to control the consumer for the messaging system. In particular, the max_records configuration parameter (default: 1000) can be overridden in the llanta-worker-env-vars configmap as follows:
Edit the configmap via
|
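A sketch of the override, assuming the value is set directly in the configmap data; the exact key name used by llanta-worker-env-vars may differ, and 500 is only an example value:

  # Open the configmap in an editor (illustrative)
  napp-k edit configmap llanta-worker-env-vars
  # In the data section, override the default of 1000 with a lower value, e.g.:
  #   max_records: "500"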
After editing the configmap we need to restart the pod for the changes to take effect. This is done by running the below command:
|
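Restarting is typically done by deleting the pod and letting the statefulset controller recreate it; a sketch:

  # Delete the pod; the statefulset recreates it with the updated configuration (illustrative)
  napp-k delete pod llanta-detectors-0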
Solution 3: Reduce the aggregation timeout for the llanta-worker container
If streaming NTA detectors are enabled, we suggest reducing the aggregation timeout (default: 60 seconds) for the llanta-worker container to reduce the size of the internal data structures during processing. To do this, set the aggregation_timeout_seconds parameter in the llanta-worker-env-vars configmap.
Edit the configmap via
|
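A sketch under the same assumptions as in Solution 2 (the exact key name may differ; 30 seconds is only an example value):

  # Open the configmap in an editor (illustrative)
  napp-k edit configmap llanta-worker-env-vars
  # Set an aggregation timeout lower than the default of 60 seconds, e.g.:
  #   aggregation_timeout_seconds: "30"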
After editing the configmap we need to restart the pod for the changes to take effect. This is done by running the below command:
|
This option reduces the amount of time that the llanta-worker container accumulates flows before sending them to the llanta-service container, and simplifies the aggregation of observed flows into batches while decreasing the memory footprint of the process. It is only applicable if streaming NTA detectors are enabled.
Solution 4: Increase the memory limits for the llanta-worker container
|
For example, to set the memory limit to 3Gi, execute:
|
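A hedged sketch of such a patch, assuming the memory request and limit of the llanta-worker container in the llanta-detectors statefulset are both raised to 3Gi; the patch shape is a standard strategic merge patch and is not quoted from the article, which may also cover other containers in the pod:

  # Raise the memory request and limit of the llanta-worker container to 3Gi (illustrative)
  # The default strategic merge patch only updates the container named here
  napp-k patch statefulset llanta-detectors --patch \
    '{"spec":{"template":{"spec":{"containers":[{"name":"llanta-worker","resources":{"requests":{"memory":"3Gi"},"limits":{"memory":"3Gi"}}}]}}}}'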
Finally, delete the llanta-detectors-0 pod and let Kubernetes restart it:
|
NOTE: This command also aligns the requests for the containers to the limits. This ensures that the cluster has enough capacity to run the pod with the updated limits. If the pod fails to be scheduled, it can be examined via the following command:
|
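kubectl describe is the usual tool for this; a sketch (the article's exact command is elided above):

  # Inspect scheduling events for the pod (illustrative)
  napp-k describe pod llanta-detectors-0
  # Look for Warning FailedScheduling events mentioning insufficient memory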
Failed scheduling of the pod is indicated in the Events section of the output:
|
If the cluster does not have enough capacity to accommodate the new limits, the possible solutions are:
Re-run the patch command above with lower values for the limits.