The health status of the NSX Intelligence feature is reported as DOWN in the UI.
The DOWN status may be caused by the llanta-detectors-0 pod running out of memory.
Run the commands below on an NSX Manager while logged in as root over an SSH session; you will see the following output snippets:
The llanta-detectors-0 pod is reported in the CrashLoopBackOff status:
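For example (a sketch assuming the napp-k kubectl wrapper that NSX Manager exposes for the NSX Application Platform cluster; the ready count, restart count, and age below are illustrative):

```
# List the detector pod; STATUS shows CrashLoopBackOff while a container keeps crashing
napp-k get pods | grep llanta-detectors

# Illustrative output:
# llanta-detectors-0   2/3   CrashLoopBackOff   12 (2m ago)   3d4h
```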
One or more of the llanta-job-netflow-beaconing, llanta-job-time-series containers in the llanta-detectors-0 pod are reported as having crashed after running out of memory (OOM):
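The OOM kill can be confirmed by describing the pod (again assuming napp-k; the fields shown are the standard container status Kubernetes reports for an OOM-killed container):

```
napp-k describe pod llanta-detectors-0

# Illustrative container status for an OOM-killed container:
#     Last State:  Terminated
#       Reason:    OOMKilled
#       Exit Code: 137
```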
If you see a different container listed than the ones covered by this KB, search for the KB that covers that container.
This issue has been observed on ATP version 4.2.0, in particular in scenarios where workloads communicate in a mesh fashion on a high number of ports.
These containers process data and maintain an in-memory state that in some cases can grow to exceed the limits set for the container. The main factors contributing to the size of the state are:
If you believe the environment has enough resources to allow higher limits for the detectors, we suggest increasing the limits for the containers. The memory pressure on the nodes can be inspected in the NSX UI under System → NSX Application Platform. The page provides a detailed summary of the available memory on each node.
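If the platform exposes resource metrics, a similar per-node summary can be approximated from the CLI (a sketch assuming napp-k and a metrics server running in the cluster):

```
# Show current memory and CPU usage per node
napp-k top nodes
```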
To increase the limits for the containers, patch the llanta-detectors statefulset resource via the following Kubernetes command:
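The general form looks like this (a sketch assuming napp-k; kubectl patch applies a strategic merge patch by default, which merges containers by name):

```
# <PATCH> is a strategic merge patch (JSON or YAML) setting
# spec.template.spec.containers[].resources for the affected containers
napp-k patch statefulset llanta-detectors -p '<PATCH>'
```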
For example, to set the memory limit to 10Gi for the llanta-job-* containers, execute:
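A minimal sketch of such a patch (the container names are the two reported above; the requests are set together with the limits, matching the NOTE below, and the 10Gi value should be adjusted to your sizing):

```
napp-k patch statefulset llanta-detectors -p '
spec:
  template:
    spec:
      containers:
      - name: llanta-job-netflow-beaconing
        resources:
          requests:
            memory: 10Gi
          limits:
            memory: 10Gi
      - name: llanta-job-time-series
        resources:
          requests:
            memory: 10Gi
          limits:
            memory: 10Gi
'
```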
Delete the llanta-detectors-0 pod and let Kubernetes restart it with the new resource values:
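For example (assuming napp-k):

```
napp-k delete pod llanta-detectors-0
```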
NOTE: The patch command above also aligns the requests for the containers to the limits. This ensures that the cluster has enough capacity to run the pod with the updated limits.
Finally, wait a few minutes and verify the llanta-detectors-0 pod is running:
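For example (assuming napp-k; the ready count and age are illustrative):

```
napp-k get pods llanta-detectors-0

# Illustrative output once all containers are up:
# llanta-detectors-0   3/3   Running   0   5m
```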
NOTE: If the pod fails to be scheduled, it can be examined via the following command:
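For example (assuming napp-k):

```
napp-k describe pod llanta-detectors-0
```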
Failed scheduling of the pod is indicated in the Events section of the output:
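An illustrative snippet (the message wording is the standard Kubernetes scheduler event; node counts will differ in your environment):

```
# Events:
#   Type     Reason            Age   From               Message
#   ----     ------            ----  ----               -------
#   Warning  FailedScheduling  10s   default-scheduler  0/4 nodes are available: 4 Insufficient memory.
```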
If the cluster does not have enough capacity to accommodate the new limits, the possible solutions are:
- Run the patch command above with lower values for the limits.

If increasing the limits is not an option in the particular environment (i.e., the memory pressure on the nodes is already high), we suggest disabling the detectors that the two job containers are handling.
Detectors can be disabled via the NSX Manager UI under Security → Suspicious Traffic → Detector Definitions, or Threat Detection & Response → Settings → NTA Detectors Definitions. The detectors that can be disabled to release memory pressure on the llanta-job-* containers are:
If you are following Solution 1 to solve this issue, you should also apply the steps in the related KB, NSX Intelligence is Down due to Out Of Memory issues in NTA POD `llanta-detectors-0` - `llanta-service` container, or that issue will likely be hit later.