The health status of the NSX Intelligence feature is reported as DOWN in the UI.
The DOWN status may be caused by the llanta-detectors-0 pod running out of memory.
Running the commands below on an NSX Manager, in an SSH session as the root user, will show output snippets similar to the following:
The llanta-detectors-0 pod is reported in the CrashLoopBackOff status:
|
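As a sketch, assuming kubectl access to the NSX Application Platform cluster and that the platform pods run in the nsxi-platform namespace (adjust the namespace if it differs in your environment), the pod status can be listed with:

# Check the STATUS column; CrashLoopBackOff indicates the pod is crashing repeatedly
kubectl get pod llanta-detectors-0 -n nsxi-platform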
The llanta-service container in the llanta-detectors-0 pod is reported as having crashed after running out of memory (OOM):
|
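A sketch of how the OOM termination can be confirmed, again assuming the nsxi-platform namespace (the jsonpath expression below is illustrative):

# Show the last termination reason of the llanta-service container; "OOMKilled" confirms the OOM crash
kubectl get pod llanta-detectors-0 -n nsxi-platform \
  -o jsonpath='{.status.containerStatuses[?(@.name=="llanta-service")].lastState.terminated.reason}'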
If a different container is listed than the one covered in this KB, search for the KB that corresponds to that container.
This issue has been observed on ATP version 4.2.0, in particular in scenarios where workloads communicate in a mesh fashion over a high number of ports.
These containers process data and maintain an in-memory state that in some cases can grow to exceed the limits set for the container. The main factors contributing to the size of the state are:
If you believe the environment has enough resources to allow higher limits for the detectors, we suggest increasing the limits for the containers. The memory pressure on the nodes can be inspected in the NSX UI under System → NSX Application Platform. That page provides a detailed summary of the available memory on each node.
To increase the limits for the containers, patch the llanta-detectors StatefulSet resource via the following Kubernetes command:
For example, to set the memory limit to 12Gi for the llanta-service container, execute:
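A sketch of such a patch, assuming the nsxi-platform namespace (it sets both the memory limit and the request for the llanta-service container, in line with the note below):

# Patch the llanta-service container in the llanta-detectors StatefulSet to a 12Gi memory limit/request
kubectl patch statefulset llanta-detectors -n nsxi-platform --patch \
  '{"spec":{"template":{"spec":{"containers":[{"name":"llanta-service","resources":{"limits":{"memory":"12Gi"},"requests":{"memory":"12Gi"}}}]}}}}'

The default patch type (strategic merge) matches list entries by container name, so the other containers in the pod are left untouched.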
Delete the llanta-detectors-0 pod and let Kubernetes restart it with the new resource values:
|
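A sketch, assuming the nsxi-platform namespace:

# Delete the pod; the StatefulSet controller recreates it with the patched resource values
kubectl delete pod llanta-detectors-0 -n nsxi-platform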
NOTE: This command also aligns the requests for the containers to the limits. This ensures that the cluster has enough capacity to run the pod with the updated limits.
Finally, wait a few minutes and verify that the llanta-detectors-0 pod is running:
|
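For example (assuming the nsxi-platform namespace), check that the pod reports Running with all containers ready:

# STATUS should be Running and the READY column should show all containers ready
kubectl get pod llanta-detectors-0 -n nsxi-platform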
NOTE: If the pod fails to be scheduled, it can be examined via the following command:
|
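A sketch, assuming the nsxi-platform namespace:

# Inspect the pod; scheduling failures appear as FailedScheduling entries in the Events section
kubectl describe pod llanta-detectors-0 -n nsxi-platform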
Failed scheduling of the pod is indicated in the Events section of the output:
|
If the cluster does not have enough capacity to accommodate the new limits, the possible solutions are:
Re-run the patch command above with lower values for the limits.

If increasing the limits is not an option in the particular environment (i.e., the memory pressure on the nodes is already high), we suggest disabling some of the affected detectors in order to reduce the size of the state. This can be done incrementally, depending on the priority that each detector has in the specific environment.
Detectors can be disabled via the NSX Manager UI under Security → Suspicious Traffic → Detector Definitions, or Threat Detection & Response → Settings → NTA Detectors Definitions. Below is the suggested order in which to disable detectors (in decreasing order of memory requirements), if the customer does not have specific priorities:
… (llanta-service container)
… (llanta-service container)

To reduce the memory guardrail threshold for the container, we need to modify the memory_limit_percentage value in the llanta-service-env-vars configmap. The default value for this field is 90.
Edit the configmap via:
|
Choose the new value for the option (e.g., lowering it from 90 to 60):
|
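As a sketch, assuming the nsxi-platform namespace and that the key appears in the configmap data exactly as memory_limit_percentage, the value can be changed either interactively or with a merge patch:

# Option 1: edit interactively and change memory_limit_percentage from "90" to "60"
kubectl edit configmap llanta-service-env-vars -n nsxi-platform

# Option 2: apply the change non-interactively with a merge patch
kubectl patch configmap llanta-service-env-vars -n nsxi-platform \
  --type merge --patch '{"data":{"memory_limit_percentage":"60"}}'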
Delete the llanta-detectors-0 pod and let Kubernetes restart it with the new configuration value:
|
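As before, a sketch assuming the nsxi-platform namespace:

# Recreate the pod so the containers pick up the updated configmap value
kubectl delete pod llanta-detectors-0 -n nsxi-platform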
NOTE: This workaround affects the amount of baseline that is retained for the affected NTA detectors. When the guardrail is hit, the llanta-service container will drop some of the baseline to reclaim some memory. We suggest keeping this Solution as a last option in case increasing the limits for the pod is not possible.
If you are following Solution 1 to solve this issue, you should also apply the related KB "NSX Intelligence is Down due to Out Of Memory issues in NTA POD `llanta-detectors-0` - `llanta-job*` container", or that issue will likely be hit later.