APM Health Tip - How to improve the performance of APM

Products

CA Application Performance Management Agent (APM / Wily / Introscope) INTROSCOPE

Issue/Introduction

So how are we going to improve the performance of APM? The first question to ask is why APM is performing slowly. Answering it will help us improve the performance of APM. Several underlying factors can cause APM to behave slowly. Let’s look at the most common ones that could affect your cluster’s performance.

Environment

All Application Performance Management releases.

Resolution

As a first step for new APM initiative or upgrade.

OK. This is great, and I have followed Broadcom's recommendations, but I am still facing performance issues. So, how do I resolve them? Before flagging it as a product issue, let us look at the environment to understand what is going wrong.

We need to analyze APM behavior to understand what is wrong in our environment. Let’s follow a structured approach to get better performance from APM.

Has the agent/metric capacity breached its threshold?
Are harvest cycles running for a longer time?
Are there any issues with SmartStor I/O operations?
Are calculators or metric grouping regexes generating lots of metrics?

There are easy ways to find answers to these questions. The place to look is “Custom Metric Host (*Virtual*)|Custom Metric Process (*Virtual*)|Custom Metric Agent (*Virtual*) (SuperDomain)|Enterprise Manager” in the Workstation. If APM is running in a clustered environment, look at this data on all EMs. Once you are there, take a screenshot of the overview with the time range set to 6 hours, and another with the time range set to 30 days. The two screenshots let you compare the growth rate over 6 hours with the growth rate over 30 days.

To analyze the APM capacity, look at the following metrics:

Number of Agents
Number of Metrics
Number of Historical Metrics
Number of Metrics Handled

Usually, a single EM collector in a cluster can handle up to 400K metrics and up to 400 connected agents. This applies only when appropriate hardware is used as per Broadcom recommendations (for example, 8 GB RAM with an 8-core processor). For example, 10 agents at 40K live metrics each will reach the metric capacity of the collector (10 × 40K = 400K), and so will 400 agents at 1K live metrics each. Based on this, or on your current setup, check whether the agent capacity has breached its threshold. If it has, the best advice is to add a new collector, provided your agents are already optimized to send only the required metrics and SmartStor data can’t be dropped.
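The capacity arithmetic above can be sketched in a few lines of Python. This is only an illustration: the two limits are the rule-of-thumb figures quoted in the text, and the function names (`collector_load`, `is_at_or_over_capacity`) are hypothetical helpers, not part of APM.

```python
# Rule-of-thumb limits quoted in the text for one collector on recommended hardware.
MAX_METRICS_PER_COLLECTOR = 400_000
MAX_AGENTS_PER_COLLECTOR = 400


def collector_load(agent_count, avg_metrics_per_agent):
    """Return total live metrics and the fraction of each limit in use (1.0 = at the limit)."""
    total_metrics = agent_count * avg_metrics_per_agent
    return {
        "total_metrics": total_metrics,
        "agent_utilisation": agent_count / MAX_AGENTS_PER_COLLECTOR,
        "metric_utilisation": total_metrics / MAX_METRICS_PER_COLLECTOR,
    }


def is_at_or_over_capacity(agent_count, avg_metrics_per_agent):
    """True when either the agent limit or the metric limit is reached."""
    load = collector_load(agent_count, avg_metrics_per_agent)
    return load["agent_utilisation"] >= 1.0 or load["metric_utilisation"] >= 1.0


if __name__ == "__main__":
    # The two examples from the text, plus a collector running at half load.
    for agents, per_agent in [(10, 40_000), (400, 1_000), (200, 1_000)]:
        print(agents, per_agent, collector_load(agents, per_agent),
              is_at_or_over_capacity(agents, per_agent))
```

Feed in the values read from “Number of Agents” and “Number of Metrics” for each collector to see how close it is to either limit.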

Now the number of agents is within the limit, but the metric count is growing suddenly. A sudden spike in “Number of Metrics” indicates that some agent is sending too many metrics, that is, a metric explosion. Another easy way to confirm whether an individual agent is sending too many metrics is to look at the APM Status console. It displays the introscope.enterprisemanager.agent.metrics.limit clamp per EM, which indicates that some agent has crossed its metric limit. There are also situations where an agent consistently sends too many metrics and this is noticed only when a metric explosion occurs.

Once you see the clamp, you need to identify which agent is causing the metric explosion. The “Metrics by agent” metric grouping under the Supportability management module gives a quick view of the agent responsible. You can also pull an ‘Agent Summary report’ to see the metric count for all agents. With these reports, the problematic agents are identified.
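If the “Number of Metrics” values are exported as a simple series, a sudden spike can be flagged automatically. The sketch below is a generic heuristic, not something APM provides: the function name `find_spikes`, the window size, and the 1.5x factor are all assumptions chosen for illustration.

```python
from statistics import median


def find_spikes(samples, window=6, factor=1.5):
    """Return indexes where a sample exceeds `factor` times the median
    of the `window` samples immediately before it."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Hypothetical metric counts: steady at about 100K, then a jump.
    counts = [100_000] * 6 + [105_000, 260_000, 255_000]
    print(find_spikes(counts))
```

Using a median baseline keeps one earlier outlier from hiding a later spike. Slow, consistent growth will not trip this check, which is why the 30-day view is still worth comparing by eye.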

What to do next?

Go to the agent node and click the Metric Count tab. It shows the individual metric counts on a pie chart. The bottom of the Metric Count tab also shows the percentage/count of metrics for the individual agent. From there, you can drill down further into individual metric counts. Once you find the source, start fine-tuning the agent to send only the required metrics. The most common causes are too many SQL metrics, excessive URLs, and too many JMX metrics; the corresponding solutions are SQL normalization, URL grouping, and JMX metric reduction.

OK. Let’s look at the historical metric count under “Number of Historical Metrics”. A spike (sudden or consistent) indicates that historical metrics are growing and will eventually breach the threshold. The APM Status console confirms whether the historical metric threshold has been breached: the introscope.enterprisemanager.metrics.historical.limit clamp shows the historical limit per EM. The solution is to trim excessive metrics from SmartStor to leave room for the live metric flow. There are SmartStor cleanup procedures to achieve this.

There are certain cases where live/historical metrics are within the threshold, but there are still performance problems. In such situations, look at ‘Number of Metrics Handled’. This shows the metrics processed by calculators, virtual agents, metric grouping regex, and so on. To find a calculator that generates a huge number of metrics, introduce the calculators to the EM one by one to isolate the problematic one, then fine-tune it. The same applies to management modules: introduce them one by one to find the problematic module. Optimize calculators and metric grouping regexes using proper regex techniques.
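To see why regex quality matters, compare how many metric names a broad expression matches against a narrow one. The sketch below uses Python's `re` module purely for illustration; the EM's regex engine may differ in syntax, and the metric names and the helper `count_matches` are hypothetical.

```python
import re

# Hypothetical metric paths, for illustration only.
METRIC_NAMES = [
    "Frontends|Apps|Shop|URLs|/cart:Average Response Time (ms)",
    "Frontends|Apps|Shop|URLs|/cart:Responses Per Interval",
    "Frontends|Apps|Shop|URLs|/checkout:Average Response Time (ms)",
    "Backends|ShopDB|SQL|Query|SELECT 1:Average Response Time (ms)",
    "JMX|java.lang|Memory:HeapMemoryUsage",
]

# A catch-all expression versus one anchored to a specific path and metric.
BROAD = r".*"
NARROW = r"Frontends\|Apps\|Shop\|URLs\|[^|:]+:Average Response Time \(ms\)"


def count_matches(pattern, names):
    """Count the names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return sum(1 for name in names if compiled.fullmatch(name))


if __name__ == "__main__":
    print("broad :", count_matches(BROAD, METRIC_NAMES))
    print("narrow:", count_matches(NARROW, METRIC_NAMES))
```

The literal `|` and parentheses in metric paths are escaped because they are regex metacharacters. Every extra metric an expression matches is one more metric a calculator or metric grouping has to handle on each cycle.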

Now you should be able to determine APM capacity levels and any threshold breach. But how do we analyze the performance of APM? In most cases, assuming all Broadcom recommendations are followed, the factors above are the reasons for performance issues. To confirm this, look at the following metrics.

Tasks:Harvest Duration (ms)
Tasks:Smartstor Duration (ms)
GC Heap:GC Duration (ms)

The harvest duration metric spikes when there is a sudden flow of metrics to a collector EM, which then takes more time to process them. On the MOM, the spike is due to alert processing, calculator harvest time, etc. Harvest duration should stay below 3.5 seconds. Anything above 3.5 seconds is a good indication of an EM performance issue.

The SmartStor duration metric spikes when a sudden flow of metrics to a collector EM increases the time needed to insert data into the SmartStor DB. In a cluster, a sudden increase in the agents assigned to an EM (when load balanced) can also raise SmartStor duration, because those agents suddenly send a large number of metrics.

Also, when SmartStor already holds a huge number of historical metrics (and huge metadata as well), SmartStor I/O operations are affected, meaning query duration spikes when querying data from SmartStor. SmartStor duration should also be below 3.5 seconds. Anything greater than 3.5 seconds is a good indication of EM capacity issues, which in turn affect performance.

GC duration is another indicator of performance issues: high harvest or SmartStor duration can cause GC to run slowly. Heap size configuration issues also affect GC duration. GC duration should always be lower than both harvest and SmartStor duration.
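The three checks described above can be combined into one small routine. This is a minimal sketch: the 3.5-second limit and the GC comparison come from the text, while the function name `assess_em` and the message wording are hypothetical.

```python
# 3.5 seconds, the threshold quoted in the text, expressed in milliseconds.
DURATION_LIMIT_MS = 3500


def assess_em(harvest_ms, smartstor_ms, gc_ms):
    """Return a list of findings for one EM; an empty list means all checks pass."""
    findings = []
    if harvest_ms > DURATION_LIMIT_MS:
        findings.append("Harvest duration above 3.5 s")
    if smartstor_ms > DURATION_LIMIT_MS:
        findings.append("SmartStor duration above 3.5 s")
    # GC duration should be lower than both harvest and SmartStor duration.
    if gc_ms >= min(harvest_ms, smartstor_ms):
        findings.append("GC duration not below harvest and SmartStor duration")
    return findings


if __name__ == "__main__":
    print(assess_em(1200, 900, 150))
    print(assess_em(4200, 900, 150))
    print(assess_em(1200, 900, 1000))
```

Run it against the values of Tasks:Harvest Duration (ms), Tasks:Smartstor Duration (ms), and GC Heap:GC Duration (ms) for each EM in the cluster.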

Now we have analyzed APM performance and capacity using the Workstation.

There are situations where things get out of hand and collectors keep disconnecting from the MOM. In those situations, performance data can’t be collected using the Workstation. Don’t worry. We have perflog.txt to analyze capacity and performance issues. Use the perflog.txt procedure to filter the required data. Once you have followed those guidelines, you can use an Excel pivot table to draw graphs/charts from the values. These give you all the metrics and graphs mentioned above.
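If you prefer a script to a spreadsheet, the pivot step can be approximated in Python. The sketch below does not parse perflog.txt itself. It assumes the data has already been filtered into a CSV whose column names (`timestamp`, `harvest_ms`, `smartstor_ms`) are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up sample rows in an assumed, pre-filtered CSV layout.
SAMPLE_CSV = """timestamp,harvest_ms,smartstor_ms
2024-01-01T10:00:15,800,600
2024-01-01T10:30:15,4100,900
2024-01-01T11:00:15,1200,3900
2024-01-01T11:30:15,1000,700
"""


def hourly_max(csv_text, column):
    """Group rows by hour and keep the maximum value of `column` per hour."""
    buckets = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        hour = datetime.fromisoformat(row["timestamp"]).strftime("%Y-%m-%d %H:00")
        buckets[hour] = max(buckets[hour], int(row[column]))
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    for column in ("harvest_ms", "smartstor_ms"):
        print(column, hourly_max(SAMPLE_CSV, column))
```

Taking the hourly maximum rather than the average keeps short spikes above 3.5 seconds visible in the resulting chart.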

Hurrah! You now know what is happening in your environment, and hopefully you have fine-tuned agent metrics, calculators, and metric grouping regex, and cleaned up SmartStor.

But let’s also look at the EM performance tips to correct configuration issues beyond those mentioned above.