My Enterprise Manager crashes when a Top N filter is applied to a dashboard that reports SQL or Struts metrics. Heap utilization keeps increasing until the EM crashes with out-of-memory errors. For example, displaying the top 10 Struts or SQL metrics over a 1-day time range shows high heap utilization, and with a time range of more than 2 days the EM crashes and the EM logs report the messages below:
[WARN] [master clock] [Manager.Clock] Timeslice processing delayed due to system activity. Combining data from timeslices 96397082 to 96397084
[WARN] [master clock] [Manager] EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods.
[WARN] [Async MDQ 1] [Manager] Timed out adding to outgoing message queue. Limit of 6000 reached. Terminating connection: Node=Workstation_24, Address=172.xx.x.xxx/172.xx.x.xxx:4244, Type=socket
[ERROR] [Harvest Engine Pooled Worker] [Manager.Agent] java.lang.OutOfMemoryError: GC overhead limit exceeded
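To check quickly whether an EM log contains the same symptoms, a small script can count the messages shown above. This is a minimal sketch, not a supported tool: the function name `count_symptoms` and the symptom labels are made up for illustration, and the matching strings are copied from the log excerpts above.

```python
from collections import Counter

# Substrings copied from the log messages quoted above.
# The dictionary keys are illustrative labels, not product terminology.
SYMPTOMS = {
    "timeslice_delayed": "Timeslice processing delayed due to system activity",
    "load_exceeds_capacity": "EM load exceeds hardware capacity",
    "outgoing_queue_timeout": "Timed out adding to outgoing message queue",
    "out_of_memory": "java.lang.OutOfMemoryError",
}


def count_symptoms(lines):
    """Count how many log lines contain each known symptom message."""
    counts = Counter()
    for line in lines:
        for label, needle in SYMPTOMS.items():
            if needle in line:
                counts[label] += 1
    return counts


# Sample input: shortened versions of the log lines from this article.
SAMPLE = [
    "[WARN] [master clock] [Manager.Clock] Timeslice processing delayed due to system activity.",
    "[WARN] [master clock] [Manager] EM load exceeds hardware capacity.",
    "[WARN] [Async MDQ 1] [Manager] Timed out adding to outgoing message queue. Limit of 6000 reached.",
    "[ERROR] [Harvest Engine Pooled Worker] [Manager.Agent] java.lang.OutOfMemoryError: GC overhead limit exceeded",
]

sample_counts = count_symptoms(SAMPLE)
for label, n in sample_counts.items():
    print(f"{label}: {n}")
```

To scan a real log, pass an open file object instead of `SAMPLE`; the path to the EM log depends on your installation.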
The most likely cause is that the query returns too many data points. Top N is a way to qualify a graph on an Introscope dashboard so that only the Top N metrics (where you pick the N) are displayed. Processing Top N graphs consumes significant Enterprise Manager resources. For example, you can set up a metric group that queries the average response time for 100,000 struts in your system. On a dashboard, you have a graph displaying the five slowest struts. The Enterprise Manager has to subscribe to and process the data for all 100,000 struts to determine the five slowest.
Top N graph calculators query a large number of metrics, but return only a small number of metrics to the client. For this reason, Top N graph calculators do not benefit from CA APM large query optimizations. Use Top N graphs sparingly. Whenever a Top N request is made, all the data is provided in real time, which puts a large resource demand on your Introscope system.
Note that in a cluster, Agent, Trace, Agent-to-EM, and EM-to-EM communications, as well as WebView and Workstation queries, all add load to the cluster. Running additional queries outside of WebView or Workstation, such as CLW and JDBC queries, can therefore impact EM performance as well. Consider running these outside queries during off-hours, and limit queries that read and return large sets of data.
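The cost of a Top N graph can be illustrated with a small sketch. It is not how the Enterprise Manager is implemented; it only shows that picking the five slowest entries still requires reading every metric in the group, and gives a rough estimate of how the number of data points grows with the time range. The metric names, the random values, and the 15-second period used in the estimate are assumptions for illustration.

```python
import heapq
import random


def top_n_slowest(avg_response_ms, n):
    """Return the n (name, value) pairs with the highest average response time.

    Every entry is examined once, so the work grows with the number of
    metrics in the group, not with n.
    """
    return heapq.nlargest(n, avg_response_ms.items(), key=lambda kv: kv[1])


def data_points(metric_count, range_seconds, period_seconds):
    """Rough count of data points read for a query over the whole group."""
    return metric_count * (range_seconds // period_seconds)


# Hypothetical metric group: 100,000 struts with made-up response times.
random.seed(0)
metrics = {f"Struts|action_{i}": random.uniform(1, 500) for i in range(100_000)}

slowest = top_n_slowest(metrics, 5)   # only 5 results are returned...
print(len(metrics), "metrics scanned,", len(slowest), "returned")

# ...but the query still covers every metric for the whole time range.
# Assumed period: 15 seconds per data point.
one_day = data_points(100_000, 86_400, 15)        # 576,000,000
two_days = data_points(100_000, 2 * 86_400, 15)   # 1,152,000,000
print(one_day, two_days)
```

Under these assumptions, doubling the time range doubles the data read while the graph still shows only five lines, which is consistent with the heap growth described above.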
All APM releases and environments.
If Top N and CLW queries create ongoing problems in your environment, you can set the properties mentioned in step 2 to limit query resource consumption. By default, these properties are set to 0 (which means that there is no clamp). Please do the following:
1. Make sure the EM is configured and tuned by following the TEC DOC "APM Cluster Performance Health Check":
2. Update the following properties within the EnterpriseManager.properties file as below:
3. Restart the Enterprise Manager
NOTE: If you run additional outside queries as noted above and find that your cluster performs slowly while they execute, you may want to lower the limits set in the above properties. Any time you change those properties, you need to restart the EM.