DLP Endpoint : java.lang.OutOfMemoryError: Java heap space(Endpoint Server)

Article ID: 242291

Products

Data Loss Prevention Endpoint Prevent

Issue/Introduction

Endpoint Servers are maxing out memory in the Aggregator component.

Error:

Message:  Stack array is empty. The following exception does not have a proper stack trace.
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
 at com.symantec.dlp.communications.common.activitylogging.ConnectionLogger.getThrottler(ConnectionLogger.java:553)
 at com.symantec.dlp.communications.common.activitylogging.ConnectionLogger.shouldSuppressHSL(ConnectionLogger.java:506)
 at com.symantec.dlp.communications.common.activitylogging.ConnectionLogger.writeToLogFileIfNeeded(ConnectionLogger.java:473)
 at com.symantec.dlp.communications.common.activitylogging.ConnectionLogger.writeToLogs(ConnectionLogger.java:459)
 at com.symantec.dlp.communications.common.activitylogging.ConnectionLogger.onReplicatorException(ConnectionLogger.java:1161)
 at com.symantec.dlp.communications.common.activitylogging.AsynchronousConnectionLogger$ReplicatorExceptionTask.run(AsynchronousConnectionLogger.java:2414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space

Or
java.lang.OutOfMemoryError: GC overhead limit exceeded

Environment

All 

Cause

There can be several causes:
1. Policy complexity (typically applicable to 15.8 and earlier)
2. Immature policy editing practices
3. Bad load balancer configuration
4. Bad agent communication layer settings

Resolution

Policy Complexity (15.8 and earlier)

Review the FileReader logs and look for:

com.vontu.policy.loader.execution.ExecutionMatrixGenerator sizeInRows

Consider tuning policies that consist of more than 10,000 rows.

The article at https://knowledge.broadcom.com/external/article/174430/high-memory-or-cpu-usage-of-the-dlp-agen.html has important tips on how to avoid policies with too many rows in 15.8 and earlier.
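The exact log location and line format vary by version, so the following is a minimal sketch, assuming the FileReader logs are readable on the detection server and that the row count appears after 'sizeInRows' in the log line. It simply flags any policy execution matrix larger than the 10,000-row guideline above.

# Minimal sketch: scan FileReader logs for ExecutionMatrixGenerator row counts.
# The log directory and the "sizeInRows" value format are assumptions; adjust
# them to match the FileReader log lines on your detection server.
import re
from pathlib import Path

LOG_DIR = Path(r"C:\ProgramData\Symantec\DataLossPrevention\DetectionServer\logs")  # assumed location
ROW_THRESHOLD = 10_000  # policies above this size are candidates for tuning

pattern = re.compile(r"ExecutionMatrixGenerator.*sizeInRows\D*(\d+)")

for log_file in sorted(LOG_DIR.glob("FileReader*.log")):
    for line in log_file.read_text(errors="ignore").splitlines():
        match = pattern.search(line)
        if match and int(match.group(1)) > ROW_THRESHOLD:
            print(f"{log_file.name}: {line.strip()}")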

 

Policy Authoring Best Practices

In addition to policy complexity, it is a good idea to have controls in place that limit the quantity and frequency of policy updates. While policy distribution is more efficient in 16.0 and later, it is important to know that agents start receiving new policies the moment you click Save on a policy in a policy group applied to Endpoint Servers. Because of this, when a dozen policy changes are made back to back, the distribution process restarts over and over, which is expensive and taxing on both agent and server CPU and memory.

  • Consider making policy edits during change control windows.
  • If many policy updates are needed at the same time, consider stopping the DetectionServerControllerService on Enforce while the changes are made; this causes the policy changes to be treated as a single transaction (see the sketch after this list).
  • Carefully consider the impact of increasing policy and/or data identifier maximum match counts.
  • Carefully consider the impact of retaining the original message attachment for endpoint agents.
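As a hedged illustration of the single-transaction approach above, the sketch below stops the detection server controller service on a Windows Enforce host, waits while the policy edits are made, then restarts it. The service name is a placeholder assumption; confirm the exact Windows service name for your DLP release before using anything like this.

# Hypothetical sketch of batching policy edits on a Windows Enforce host: stop the
# detection server controller service, make all policy edits in the console, then
# restart the service so the changes go out as one transaction.
import subprocess

SERVICE_NAME = "DetectionServerControllerService"  # placeholder; confirm the real service name for your release

def set_service_state(action: str) -> None:
    # Runs "sc stop" or "sc start" and prints whatever the command returns.
    result = subprocess.run(["sc", action, SERVICE_NAME], capture_output=True, text=True)
    print(result.stdout or result.stderr)

set_service_state("stop")
input("Make all policy edits in the Enforce console, then press Enter to restart the service...")
set_service_state("start")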

 

Load Balancer Configuration

DLP agents that connect through a load balancer to the Endpoint Server(s) need that load balancer configured for source IP persistence. For reference, see:
About using load balancers in an endpoint deployment

Neglecting to use IP persistence (also called IP stickiness) can cause Endpoint Servers to go many hours without seeing particular agents, triggering the Endpoint Server to report to Enforce that the agent is not reporting. This is based on the 'Configuring Agent Connection Status "Not Reporting" after' setting. This leads to several things happening:

  • If the agent has been connected to other server(s) for the not-reporting interval (18 hours by default), the Endpoint Server will attempt to report the agent as 'not reporting'. This is ignored if Enforce has received a newer update, but it is still wasted effort for this task.
  • The Endpoint Server will 'forget' about this agent, removing the agent details from its internal cache. This creates Garbage Collection work (the 'GC' in 'GC overhead').
  • The next time the server does see this 'not reporting' agent, it no longer has common data such as OS and agent version information and must request this data again from the agent, making agent communications more expensive.

These things all make communicating with agents more expensive and thus consume more resources; all of this is avoided by using IP persistence on the load balancer.

Load balancers are also often tasked with performing health checks on the endpoint server. As agent connections to the endpoint server are not persistent, it is not necessary to have health check frequency measured in milliseconds. Having health checks kick off dozens of times a second can have a negative impact on Endpoint Server stability and performance. 
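For illustration only, the sketch below probes the Endpoint Server at a moderate cadence. The host name and port are assumptions, and in practice the health check interval is configured on the load balancer itself rather than scripted; the point is simply that seconds-scale checks are sufficient.

# Illustrative probe only: a TCP health check at a seconds-scale interval. Real
# deployments configure this on the load balancer; the host and port are assumptions.
import socket
import time

ENDPOINT_SERVER = ("endpoint-server.example.com", 8000)  # assumed agent listener address
CHECK_INTERVAL_SECONDS = 30  # seconds between probes, not dozens of probes per second

while True:
    try:
        with socket.create_connection(ENDPOINT_SERVER, timeout=5):
            print("healthy")
    except OSError as exc:
        print(f"unhealthy: {exc}")
    time.sleep(CHECK_INTERVAL_SECONDS)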


Agent Communication Layer Settings

In general, advanced agent settings beginning with 'CommLayer' or 'ServerCommunicator' should be left at their default values. These settings are often interdependent, and changing one without properly adjusting adjacent settings can negatively impact agent and server communication.

Common mistakes made in relation to agent communication layer settings:

  • ServerCommunicator.CONNECT_POLLING_INTERVAL_SECONDS.int set to 10-30 seconds for thousands of agents. Fast polling intervals should be limited to test deployments of fewer than 10 agents.
  • 'CommLayer.NO_TRAFFIC_TIMEOUT_SECONDS.int' set to '0'. This setting MUST be higher than EndpointCommunications.HEARTBEAT_INTERVAL_IN_SECONDS.int; failing to do so causes the agent to disconnect ungracefully and immediately try to reconnect to the Endpoint Server, which does not help if the server is already overwhelmed. If set to '0', agents disconnect the moment they check this timer, which can cause long-term server memory problems and even long delays in incident/status reporting (see the validation sketch after this list).
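The sketch below is a hypothetical sanity check for the constraints described in this list. The example values and the assumed fleet size are illustrations, not defaults; substitute the values actually configured for your agents.

# Hypothetical sanity check for the settings above. The values and fleet size are
# examples only; read the real values from the agent configuration in Enforce.
settings = {
    "ServerCommunicator.CONNECT_POLLING_INTERVAL_SECONDS.int": 900,
    "CommLayer.NO_TRAFFIC_TIMEOUT_SECONDS.int": 3600,
    "EndpointCommunications.HEARTBEAT_INTERVAL_IN_SECONDS.int": 300,
}
agent_count = 5000  # assumed fleet size

polling = settings["ServerCommunicator.CONNECT_POLLING_INTERVAL_SECONDS.int"]
no_traffic = settings["CommLayer.NO_TRAFFIC_TIMEOUT_SECONDS.int"]
heartbeat = settings["EndpointCommunications.HEARTBEAT_INTERVAL_IN_SECONDS.int"]

# Fast polling is only reasonable for a handful of test agents.
if polling < 60 and agent_count > 10:
    print("WARNING: polling interval is too aggressive for this many agents")

# A zero or too-small no-traffic timeout makes agents drop and immediately reconnect.
if no_traffic == 0 or no_traffic <= heartbeat:
    print("WARNING: NO_TRAFFIC_TIMEOUT_SECONDS must exceed HEARTBEAT_INTERVAL_IN_SECONDS")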

 

When All Else Fails

If all else fails and the Aggregator#.log files still show
java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.Exception: java.lang.OutOfMemoryError: Java heap space

it may simply be time to increase the memory available to the Endpoint Server component.

Within the Advanced Server Settings:

For 16.0.x and earlier
Find BoxMonitor.EndpointServerMemory
Increase the value of the -Xmx setting to a size appropriate for the available physical memory on each Endpoint Detection Server. 


For 16.1 and later 
Find UDS.Detector.MaxMemory
Increase the value to a size appropriate for the available physical memory on each Endpoint Detector. 
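As a hedged illustration of the 16.0.x and earlier case, the snippet below shows how the -Xmx portion of a BoxMonitor.EndpointServerMemory-style value might be updated. The current value shown is an example, not the shipped default.

# Hedged example for 16.0.x and earlier: replace the -Xmx portion of a
# BoxMonitor.EndpointServerMemory-style value. The current value is illustrative,
# not the shipped default; size the new maximum to the physical memory available.
import re

current_value = "-Xms1G -Xmx2G"   # example only; read the real value from Advanced Server Settings
new_heap_max = "-Xmx4G"           # appropriate for the memory available on the server

updated_value = re.sub(r"-Xmx\S+", new_heap_max, current_value)
print(updated_value)              # prints "-Xms1G -Xmx4G"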

 

Continue to monitor the Aggregator logs to confirm the service stays running and is stable.

Additional Information