APM 10.x Troubleshooting guide and Best Practices

Article ID: 93176

Products

DX Application Performance Management, CA Application Performance Management (APM / Wily / Introscope), CA Application Performance Management Agent (APM / Wily / Introscope), APM

Issue/Introduction

The following is a high-level list of techniques and suggestions for troubleshooting common Introscope EM 10.x performance and configuration issues.

A) Common Messages
B) Checklist and Best Practices
C) What diagnostic files should I gather for CA Support?

Environment

Valid for APM 10.7 and 10.8 on-premise

Resolution

 

IMPORTANT NOTE:   

If you notice a performance degradation after applying any of the recommendations below (for example, MOM-Collector disconnections, OutOfMemory errors, inability to connect to the Workstation, or empty dashboards), undo the changes and contact the Broadcom Support team for advice.

 

A) Common Messages

Source: EM_HOME/logs/IntroscopeEnterpriseManager.log

 

1. MOM / Collector clamps being reached

"The EM has too many historical metrics reporting from Agents and will stop accepting new metrics from Agents.  Current count"


Recommendation:
Increase introscope.enterprisemanager.metrics.historical.limit (default value in 10.7 SP3 is 30000000) in

EM_HOME/config/apm-events-thresholds-config.xml
Make sure to apply the change in all the EMs (MOM and Collectors).

There is no need to restart the EMs.

 

"The EM has too many live metrics reporting from Agents  and will stop accepting new metrics from Agents."

Recommendation:
Increase introscope.enterprisemanager.metrics.live.limit (default value 500000) in

EM_HOME/config/apm-events-thresholds-config.xml
Make sure to apply the change in all the EMs (MOM and Collectors).

There is no need to restart the EMs.

 

"Collector <collector-name>@<port> reported Clamp hit for MaxAgentConnections limit."
"Reporting Clamp hit for MaxAgentConnections limit to MOM"

 

Recommendation:
Increase introscope.enterprisemanager.agent.connection.limit (default value 400) in

EM_HOME/config/apm-events-thresholds-config.xml
Make sure to apply the change in all the EMs (MOM and Collectors).

There is no need to restart the EMs.

 

"The Agent <your.agent> is exceeding the per-agent metric clamp "
 
Recommendation:
Increase introscope.enterprisemanager.agent.metrics.limit (default value 50000) in

EM_HOME/config/apm-events-thresholds-config.xml
Make sure to apply the change in all the EMs (MOM and Collectors).

There is no need to restart the EMs.
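
Each of the clamp messages above corresponds to one property in apm-events-thresholds-config.xml. As a rough aid, the Python sketch below scans log lines for those message fragments and lists the clamp properties worth reviewing. The function name find_clamp_hits and the shortened message fragments are illustrative assumptions, not part of the product; adjust the fragments if the wording in your log differs.

```python
# Illustrative helper (not shipped with APM): maps fragments of the clamp
# messages quoted in this article to the clamp property it recommends raising.
CLAMP_HINTS = {
    "too many historical metrics": "introscope.enterprisemanager.metrics.historical.limit",
    "too many live metrics": "introscope.enterprisemanager.metrics.live.limit",
    "Clamp hit for MaxAgentConnections": "introscope.enterprisemanager.agent.connection.limit",
    "exceeding the per-agent metric clamp": "introscope.enterprisemanager.agent.metrics.limit",
}


def find_clamp_hits(lines):
    """Return the sorted clamp properties whose message appears in `lines`."""
    hits = set()
    for line in lines:
        for fragment, prop in CLAMP_HINTS.items():
            if fragment in line:
                hits.add(prop)
    return sorted(hits)
```

To use it, pass any iterable of lines, for example an open handle on EM_HOME/logs/IntroscopeEnterpriseManager.log, and review the returned property names on every EM in the cluster.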

 


2. Tune Client Message Queues

"Timed out adding to outgoing message queue. Limit of <#> reached."

Recommendation:

- Open EM-HOME/config/IntroscopeEnterpriseManager.properties
- Set transport.outgoingMessageQueueSize=8000
If it is already 8000, increase the value by 2000 and adjust further as required.

- Add the following two properties in all the EMs (MOM and Collectors):
transport.override.isengard.high.concurrency.pool.max.size=10
transport.override.isengard.high.concurrency.pool.min.size=10

A restart of the EMs is required for the changes to take effect.

NOTE: Increasing the outgoing message queue size gives you a bigger buffer. Increasing the thread pool size gives you more worker threads to send outgoing messages. These adjustments are required when sending messages, usually between the Collectors and the MOM, becomes a performance bottleneck.

3. Operating system issues

"java.io.IOException: Too many open files"
 
Recommendation (Unix only):
Make sure the maximum number of open files is at least 4096 on both the MOM and the Collectors. You can check the current setting with "ulimit -a".
You can increase the current setting, for example:

ulimit -n 16384
or
ulimit -n unlimited
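
As a complement to "ulimit -a", the following Python sketch reads the open-files limit of the current process and compares it with the 4096 minimum mentioned above. It is a minimal illustration that uses the standard Unix-only resource module; the function name open_files_limit_ok is an assumption made for this example. The value it reports applies to the user and shell that run the script, so run it as the user that starts the EM.

```python
import resource  # Unix-only standard library module

MIN_OPEN_FILES = 4096  # minimum recommended in this article


def open_files_limit_ok(soft_limit, minimum=MIN_OPEN_FILES):
    """Return True if soft_limit meets the minimum (unlimited always passes)."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= minimum


# Read the limits inherited by this process from the calling shell/user.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
if open_files_limit_ok(soft):
    print("OK")
else:
    print("Below %d: raise the limit with ulimit -n" % MIN_OPEN_FILES)
```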



"java.io.IOException: No space left on device"
 
Recommendation:
Increase disk space as soon as possible to prevent corruption of the EM databases (SmartStor, traces, heuristic).
 


4. Traces database:


"Uncaught Exception in Enterprise Manager:  In thread Lucene Merge Thread #1 and the message is org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException"

OR

"[ERROR] [main] [Manager] The EM failed to start. Unable to open the Perst backing store" 

Recommendations:
This message indicates that the Traces database or its index is corrupted.

a) Reindex the Traces DB

- Stop the EM
- Delete the <EM-HOME>/traces/index folder
- Start the EM. The EM re-indexes the database on startup. If the Traces DB itself is corrupted, you need to start the EM with a new Traces DB as described below.

b) Start the EM with a new Traces DB

- Stop the EM
- Delete the <EM-HOME>/traces folder
- Start the EM


NOTE: The default Traces DB location is <EM-HOME>/traces. If you are unsure, open config/IntroscopeEnterpriseManager.properties and check the introscope.enterprisemanager.transactionevents.storage.dir property.
 


5. Cluster capacity or configuration issue:


"Outgoing message queue is not moving"
"Outgoing message queue is moving slowly"
"EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods."

 
Recommendations:
Find out whether the messages started after a recent change in load or in the MOM and Collector configuration:

a) Check if the number of agents and/or live metrics has increased.
You can find this out by reviewing EM_HOME/logs/perflog.txt on the Collectors (Performance.Agent.NumberOfAgents and Performance.Agent.NumberOfMetrics columns). For more information, refer to
APM 10.x - How to read the Perflog.txt

 


b) Check if any of the following clamps in EM_HOME/config/apm-events-thresholds-config.xml has recently been increased:
introscope.enterprisemanager.agent.metrics.limit (default value 50000)
introscope.enterprisemanager.agent.connection.limit (default value 400)
introscope.enterprisemanager.metrics.historical.limit (default value in 10.7 SP3 is 30000000)
 
If so, restore the previous values. These are hot properties, so there is no need to restart the Introscope EM.

c) Check each of the best practices covered in the next section.
 


6. Lack of memory

 

"java.lang.OutOfMemoryError"
 
Recommendations:
- If you find the OutOfMemoryError in EM-HOME/logs/IntroscopeEnterpriseManager.log:

Open Introscope_Enterprise_Manager.lax and update the lax.nl.java.option.additional property.
Increase the EM maximum heap size (-Xmx) by 2 GB.

- If you find the error in WEBVIEW-HOME/logs/IntroscopeWebview.log:

Open Introscope_Webview.lax and update the lax.nl.java.option.additional property.
Increase the maximum heap size (-Xmx) by 2 GB.

 

"[ERROR] [Trace Insertion] [Manager] Uncaught Exception in Enterprise Manager: In thread 'Trace Insertion' and the message is: java.lang.StackOverflowError
java.lang.StackOverflowError
        at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:298)"

Recommendation:
Open Introscope_Enterprise_Manager.lax and update the lax.nl.java.option.additional property.
Increase -Xss. The default value is 512k; you can set -Xss2048k.


7. Too many traces

 

"The Enterprise Manager cannot keep up with incoming event data"
 
Recommendation:
Reduce the incoming trace rate; see point 11 of the "Checklist and Best Practices" section below.
 


8. Too many alerts

 

"Processing of alerts is overloaded, ignoring <xxx> new alerts!"
 
The above message indicates that a large number of alerts are being created automatically and propagated to AppMap.
Symptoms: overhead in memory, GC, harvest duration, and disk space due to the extra alert state changes.
 
Recommendation:
 
a) Try reducing the UVB metric clamp introscope.apmserver.uvb.clamp from the default of 50000 to 1000. This reduces the metric handling for Differential Analysis alerts.

Apply this change in the MOM only.
 
b) Adjust the Differential Analysis (DA) configuration to reduce or stop the action notifications by:
- Excluding the specific frontend applications
- Excluding the specific frontend applications and then creating a separate differential control element specifically for that frontend (which allows fine-tuning of notifications)
- Adding actions to the danger list only, which is a simple way to reduce the number of notifications

c) Increase the automatic trace triggering threshold (default introscope.enterprisemanager.baseline.tracetrigger.variance.intensity.trigger=15).

Add the hidden property: introscope.enterprisemanager.baseline.tracetrigger.variance.intensity.trigger=30

This reduces the number of traces automatically generated by DA.


9. Too many traces 

"TransactionTrace arrival buffer full, discarding trace(s)"
 
Team Center uses transaction trace data as the source for the Team Center Map by default. This causes the MOM to retrieve transaction trace data from the Agents, which occasionally slows down the Collectors.
 
Recommendation:
Open EM_HOME/config/IntroscopeEnterpriseManager.properties and
locate introscope.enterprisemanager.transactiontrace.arrivalbuffer.capacity=
The default is 2500. You can try increasing the value to 5000. This increases memory usage, so you might need to increase the EM heap size.
 


10. Network Slowness affecting MOM to Collectors communication

"Collector clock is too far skewed from MOM. Collector clock is skewed from MOM clock by XXXX ms. The maximum allowed skew is 3,000 ms. Please change the system clock on the collector EM."

"The Collector ..... is responding slower than 10000ms and may be hung "

 

From documentation:

"Whenever possible, place a MOM and its Collectors in the same data center; preferably in the same subnet. Workstation responsiveness is adversely affected when the Collector-MOM connection crosses through a firewall or any kind of router. If latency is too high, the MOM disconnects from Collectors. If the MOM and Collector are across a router or, worse yet, a packet-sniffing firewall protection router, response time can slow dramatically. The MOM disconnects from any Collector that has any of these conditions:

a) Appears unresponsive through the network for more than 60 seconds (see information about the ping time threshold below).

b) The Collector system clock appears skewed more than 3 seconds from the MOM clock."

 

Recommendations:

a) Make sure the MOM and collectors are located in the same subnet

b) Collector system clocks must be within 3 seconds of the MOM clock. Ensure that the MOM and Collectors synchronize their system clocks with a time server, such as an NTP server; otherwise the EMs will disconnect.

 


11. EM overloaded

 

"Internal cache is corrupt. Cannot determine class type for Object 1. A prior class deserialization error may have corrupted the cache. "

This message indicates a capacity issue. Review the next section, "Checklist and Best Practices".


12. Baseline database


"failure to add data to baseline"

OR

[ERROR] [com.ca.apm.baseline.em.Baseline] 

org.garret.perst.StorageError: Object access violation: java.io.StreamCorruptedException: invalid type code: 50
    at org.garret.perst.impl.StorageImpl.loadStub(StorageImpl.java:2831)


 
This message indicates that the Baseline database in the EM or Collector is corrupted

Recommendation: 

- Stop the Introscope EM or collector
- Delete the variance.db
- Start the EM again

If you are unable to locate the file

- open the IntroscopeEnterpriseManager.properties
- locate the introscope.enterprisemanager.baseline.database property, for example:

introscope.enterprisemanager.baseline.database=/introscope/data/variance.db
 


13. Too many alert status changes


"[Manager.AppMap] Clearing statusChanges due to exceeded size"

The "Clearing statusChanges" message indicates that there are too many alert status changes to propagate to ATC, or that processing is too slow to keep up with incoming changes.

Suggestions: 
a) For testing purposes, temporarily disable the AppMap alert mapping in the EM (empty teamcenter-status-mapping.properties; take a backup first). If the problem no longer occurs, narrow down which alert "Propagation" settings are causing the issue.
b) Turn off the propagation of the Variance Intensity to Team Center: uncheck the "propagate to team center" check box in the response time variance alert in the default management module.

 

14. EM not visible in Workstation

A Collector starts successfully; however, it is not visible from the Workstation. In the MOM log you find the following message:

"[WARN] [PO Async Executor] [Manager.Cluster] Ignoring duplicate collector xyz@5001"

Suggestions: 
a) Check for any possible firewall issue, if possible try to disable it temporarily

b) Switch off "Control Minder"

c) Review your local security policies, as they might need to be reconfigured.

 


B) Checklist and Best Practices

Below is a list of common configuration problems that affect the performance of the Introscope EM or cluster:

 

1. Heap size memory

Set the initial heap size (-Xms) equal to the maximum heap size (-Xmx). Because no heap expansion or contraction then occurs, this can result in significant performance gains in some situations.

- In a Unix setup, update EM_HOME/Introscope_Enterprise_Manager.lax:

lax.nl.java.option.additional

- In a Windows setup, update EM_HOME/bin/EMService.conf:

wrapper.java.initmemory=

wrapper.java.maxmemory=

 

2. Management Modules

Management Modules should only be deployed in the MOM. Make sure the Collectors do not start with any Management Modules, to avoid unnecessary extra load.

- Stop the collector(s) only

- Rename EM_HOME/config/modules as modules_collector_backup

- Create an empty "modules" directory

- Start the collector

 

3. If you are using JVM 1.8, ensure that G1 GC is in use.

 - Open the EM_HOME/bin/EMService.conf (windows) or EM_HOME/Introscope_Enterprise_Manager.lax (other platforms)

- Remove the following java arguments if they are present : -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC

- Add : -XX:+UseG1GC and -XX:MaxGCPauseMillis=200

 

For example:

- EM_HOME/Introscope_Enterprise_Manager.lax:

lax.nl.java.option.additional=-Xms20480m -Xmx20480m -Djava.awt.headless=true -Dmail.mime.charset=UTF-8 -Dorg.owasp.esapi.resources=./config/esapi -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xss512k

 

- EM_HOME/bin/EMService.conf:

wrapper.java.additional.7=-XX:+UseG1GC

wrapper.java.additional.8=-XX:MaxGCPauseMillis=200
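
The flag change described in this step can be sketched as a small string transformation. The Python function below takes the value of a JVM options line, removes the two CMS flags, and appends the G1 flags if they are missing. It is a minimal illustration only: the function name switch_to_g1 is an assumption, and the code splits on whitespace, so it does not handle options that contain quoted spaces. Review the result before pasting it into the .lax file.

```python
# Flags named in this article: the CMS pair to remove, the G1 pair to add.
CMS_FLAGS = {"-XX:+UseConcMarkSweepGC", "-XX:+UseParNewGC"}


def switch_to_g1(java_options):
    """Return java_options without the CMS flags and with the G1 flags set."""
    kept = [opt for opt in java_options.split() if opt not in CMS_FLAGS]
    if "-XX:+UseG1GC" not in kept:
        kept.append("-XX:+UseG1GC")
    # Keep an existing pause target rather than adding a second one.
    if not any(opt.startswith("-XX:MaxGCPauseMillis=") for opt in kept):
        kept.append("-XX:MaxGCPauseMillis=200")
    return " ".join(kept)


old = "-Xms20480m -Xmx20480m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Xss512k"
print(switch_to_g1(old))
```

Running the function a second time on its own output leaves the options unchanged, so it is safe to apply to a file that has already been converted.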

 

4. Configure the EM correctly to run as a nohup process (Unix only)

When running the EM on Unix, you need to perform this step manually.

Open EM_HOME/Introscope_Enterprise_Manager.lax, locate the lax.stdin.redirect property, and update it as follows:

 

Replace: lax.stdin.redirect=console

with: lax.stdin.redirect=

From Run the Enterprise Manager in nohup Mode on UNIX: "Note: Only run the Enterprise Manager in nohup mode after you configure it. Otherwise, the Enterprise Manager might not start, or can start and consume excessive system resources."

 

5. Disable DEBUG logging 

By default, logging is set to INFO.

If you have enabled DEBUG logging, make sure to disable it to avoid any impact on disk I/O.

 

6. OutOfMemory and slowness due to "heavy historical queries"

You can verify this condition by opening the Metric Browser and expanding the following branch (for each Collector only, not the MOM):

Custom Metric Host (virtual) | Custom Metric Process (virtual) | Custom Metric Agent (virtual)(collector_host@port)(SuperDomain) | Enterprise manager | Internal | Query

  - Data Points Retrieved From Disk Per Interval    

  - Data Points Returned Per Interval

 

If introscope.enterprisemanager.query.datapointlimit is greater than 100000, set the hidden property introscope.enterprisemanager.query.datapointlimit=100000 in IntroscopeEnterpriseManager.properties.

This property defines the limit for Enterprise Manager query data point retrieval: the maximum number of metric data points that a Collector or standalone Enterprise Manager can retrieve from SmartStor for a particular query.

This property limits the impact of a query on disk I/O. The default value is 0 (unlimited).

 

If introscope.enterprisemanager.query.returneddatapointlimit is greater than 100000, set the hidden property introscope.enterprisemanager.query.returneddatapointlimit=100000 in IntroscopeEnterpriseManager.properties.

This property defines the limit for Enterprise Manager query data point return: the maximum number of metric data points that a Collector or standalone Enterprise Manager can return or retrieve for a particular query.

This property limits the impact of a query on memory. The default value is 0 (unlimited).

 

As a best practice, add the following two hidden properties to IntroscopeEnterpriseManager.properties on all the Collectors (or the standalone EM):

introscope.enterprisemanager.query.datapointlimit=100000

introscope.enterprisemanager.query.returneddatapointlimit=100000

 

These properties are hot deployed, so you can adjust them without restarting EM.
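
To check several Collectors quickly, the rule described in this step (a value that is unset, 0, or above 100000 should be set to 100000) can be expressed as a short Python sketch. The function names parse_properties and query_limits_to_fix are assumptions made for this example, and the parser only handles simple key=value lines, not the full Java properties syntax such as line continuations.

```python
RECOMMENDED_LIMIT = 100000  # value recommended in this article
QUERY_LIMIT_PROPS = (
    "introscope.enterprisemanager.query.datapointlimit",
    "introscope.enterprisemanager.query.returneddatapointlimit",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def query_limits_to_fix(text, limit=RECOMMENDED_LIMIT):
    """Return the query-limit properties that are unset, 0 or above `limit`."""
    props = parse_properties(text)
    to_fix = []
    for name in QUERY_LIMIT_PROPS:
        value = props.get(name, "0")  # an unset property means 0 (unlimited)
        if not value.isdigit() or int(value) == 0 or int(value) > limit:
            to_fix.append(name)
    return to_fix
```

Pass the contents of a Collector's IntroscopeEnterpriseManager.properties file as text; an empty result means both limits are already set to 100000 or lower.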

 

7. Perform TCP tuning: increase the Isengard socket receive buffer size to improve communication performance between the MOM and Collectors

By default, the socket receive buffer size is 32K. Increase the buffer size to 32MB by adding the following hidden property to EM-HOME/config/IntroscopeEnterpriseManager.properties on every EM.

introscope.enterprisemanager.sockets.receivebuffersize=32768

Apply this change in all the EMs (MOM and Collectors). A restart is required.



8. Apply latest HOTFIXES 

List of Hotfixes in 10.7

List of Hotfixes in 10.8


9. Huge APM database, reduce topology and its related data 

Check the size of the tables:

SELECT relname,relfilenode,relpages,round(relpages*8.0/(1024*1024), 2) as disk_space_in_GB FROM pg_class ORDER BY relpages DESC;
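
The size column in this query is simple arithmetic: each page counted in relpages is one PostgreSQL block, and the factor 8 assumes the default 8 KB block size. The Python sketch below reproduces that conversion; the function name pages_to_gb is an assumption made for this example. Note that relpages is an estimate that PostgreSQL refreshes during VACUUM and ANALYZE, so treat the result as approximate.

```python
def pages_to_gb(relpages, block_size_kb=8):
    """Convert a pg_class.relpages count to gigabytes.

    block_size_kb=8 assumes PostgreSQL's default 8 KB block size.
    """
    return relpages * block_size_kb / (1024 * 1024)


# 131072 pages * 8 KB = 1048576 KB, which is exactly 1 GB.
print(pages_to_gb(131072))
```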

1) Reduce the ATC data retention:

Open MOM_HOME/config/IntroscopeEnterpriseManager.properties on the MOM(s) and ETC (if applicable).

Reduce the following properties:

a) vertex/attribute/edge data retention

introscope.apm.data.preserving.time="45 DAYS"

b) states data retention (appmap_states_* tables)

introscope.apm.alert.preserving.time="45 DAYS"

Restart the EM

2) Manually prune the tables so that the reduction takes place immediately:

vacuum full analyze appmap_edges;

vacuum full analyze appmap_vertices;

vacuum full analyze appmap_attribs;

vacuum full analyze appmap_states_<timestamp>;

vacuum full analyze at_stories;

vacuum full analyze at_evidences;

 

10. You must provide a dedicated disk I/O path for SmartStor

Make sure the SmartStor database is on a dedicated disk controller and that introscope.enterprisemanager.smartstor.dedicatedcontroller=true is set, which allows the EM to take full advantage of it. Otherwise, Collector metric capacity can decrease by up to 50 percent.

 

From Techdocs - Set the SmartStor Dedicated Controller Property
 
"The dedicated controller property is set to false by default. You must provide a dedicated disk I/O path for SmartStor to set this property to true; it cannot be set to true when there is only a single disk for each Collector. When the dedicated controller property is set to false, the metric capacity can decrease up to 50 percent."

 

In a SAN storage environment, each SmartStor should map to a unique logical unit number (LUN) that represents a dedicated physical disk. With this configuration only, it is safe to set introscope.enterprisemanager.smartstor.dedicatedcontroller=true.

If you are using a virtual environment, refer to:

Techdocs - VMWare Requirements and Recommendations

 

11. Too many incoming traces: 

A large number of transaction traces on your system will impact EM and Collector performance.

 

Recommendation:

Check EM_HOME/logs/perflog.txt. If "Performance.Transactions.Num.Traces" is higher than 1 million or is increasing rapidly, try to reduce or limit the number of traces:

 a) Reduce data retention by half (default value is 14 days)

Open EM_HOME/config/IntroscopeEnterpriseManager.properties and set introscope.enterprisemanager.transactionevents.storage.max.data.age=7

b) Clamp the transaction traces sent by Agents to the Collectors

Open EM_HOME/config/apm-events-thresholds-config.xml

Reduce introscope.enterprisemanager.agent.trace.limit from the default value of 1000 to 50.

This clamp limits the number of transaction events per Agent that the Enterprise Manager processes per interval.

c) Try to isolate the issue by finding out which Agent(s) are generating the high number of traces:

Open the Metric Browser and expand the branch: Custom Metric Agent | Agents | <host> | <process> | <agentname>:Transaction Tracing Events Per Interval

 

12. Unable to connect to Webview/Workstation or connectivity is very slow

The problem could be due to Outgoing Delivery threads getting stuck on NIO writes. Try disabling Java NIO.

Open EM_HOME/config/IntroscopeEnterpriseManager.properties and add the following hidden property:

transport.enable.nio=false

You need to restart the Enterprise Manager(s).

Apply this change in all the EMs (MOM and Collectors).

 

NOTE: Disabling NIO reverts to the classic socket operations and polling architecture. There is no loss of functionality.

The main difference between Java IO and Java NIO is that IO is stream oriented, with no caching, while NIO is buffer oriented and uses caching to read data, which gives additional flexibility. Along with that flexibility, NIO can add the overhead of verification before data processing and the risk of overwriting buffered data. Once the data is read, how it is handled is the same in both cases. Hence, switching between IO and NIO should not make any difference other than these known JVM-side issues.

 

 

13. Check if the cluster is unbalanced

If you see a discrepancy in metric counts across the Collectors (for example, some Collectors have 200K metrics and others 20K), the cluster is unbalanced.

Keep in mind that the MOM load balancing mechanism only acts when a server is overloaded, not when it is underloaded.

 

Suggestions:

 a) Reduce introscope.enterprisemanager.loadbalancing.threshold from 20000 to 10000 so that Collector load is more even across the cluster.

There is no need to restart the EM, as it is a hot property, but you have to wait 10 minutes (introscope.enterprisemanager.loadbalancing.interval=600).

 b) Update loadbalancing.xml to explicitly allocate the agents to the appropriate Collectors.
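
One simple way to quantify the discrepancy described in this step is the ratio between the busiest and the least busy Collector. The Python sketch below computes it from per-Collector metric counts that you gather yourself (for example, from each Collector's perflog.txt). The function name imbalance_ratio and the collector names are assumptions for this example, and the article does not define a numeric cut-off, so any threshold you apply to the result is your own choice.

```python
def imbalance_ratio(metrics_by_collector):
    """Return max/min metric count across collectors (1.0 means perfectly even)."""
    counts = list(metrics_by_collector.values())
    if not counts:
        raise ValueError("no collectors given")
    low, high = min(counts), max(counts)
    if low == 0:
        # An idle collector next to a loaded one is treated as infinitely uneven.
        return float("inf") if high > 0 else 1.0
    return high / low


# The 200K versus 20K example from this step gives a ratio of 10.
print(imbalance_ratio({"collector1@5001": 200000, "collector2@5001": 20000}))
```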

 

 

14. Missing agent metrics  

Check if EM or Agent metric clamps have been reached.

To check the EM clamps, open the Metric Browser and expand the branch

Custom Metric Host (virtual) | Custom Metric Process (virtual) | Custom Metric Agent (virtual)(collector_host@port)(SuperDomain) | Enterprise manager | Connections

Look at the values for:

  - "EM Historical Metric Clamped"

  - "EM Live Metric Clamped"

 

The above metrics should all be 0.


To check the Agent clamp, expand the branch

Custom Metric Host (virtual) | Custom Metric Process (virtual) | Custom Metric Agent (virtual)(collector_host@port)(SuperDomain) | Agents | Host | Process | <AgentName>

Look at the value of the "is Clamped" metric. It should be 0.

 

Recommendation:

- Open EM_HOME/config/apm-events-thresholds-config.xml

- Increase the following clamps as needed:

introscope.enterprisemanager.metrics.historical.limit

introscope.enterprisemanager.metrics.live.limit

 

These are hot properties, so there is no need to restart the EM.

 

15. Clear the OSGI cache

APM - How to clear the Introscope OSGI cache?

16. Start the MOM and collector in the right order 

It is always recommended to start the Collectors before the MOM to avoid overloading the MOM. However, there is also a risk of overloading the Collectors that were started first if the Agents were not restarted.

As a best practice, start all the Collectors at the same time and then start the MOM.

 

C) What to collect if the problem persists?

 

Collect the following information from all the Introscope Enterprise Manager instances (MOM and collectors) and contact Broadcom Support.

1. EM_HOME/logs

2. EM_HOME/config

3. EM_HOME/install/*.log

4. If the EM hangs, collect a series of 10 thread dumps at 10-second intervals while the problem is occurring.

On Unix run: kill -3 <EM-PID>

On Windows: <jdk/jre root>\bin\jstack <PID> > jstack.txt
       Find a JDK/JRE on the local system where jstack.exe and jmap.exe exist. The output file is created in the current directory.
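
Collecting ten dumps by hand at exact intervals is error prone, so the loop can be scripted. The Python sketch below calls jstack repeatedly and writes each dump to its own file. It is a minimal illustration that assumes jstack is on the PATH (or that you pass its full path); the function name collect_thread_dumps, the output file naming, and the run and sleep parameters are assumptions made for this example.

```python
import os
import subprocess
import time


def collect_thread_dumps(pid, count=10, interval_s=10, out_dir=".",
                         jstack="jstack", run=subprocess.run, sleep=time.sleep):
    """Write `count` jstack dumps for `pid`, pausing `interval_s` between them.

    `jstack` may be a full path such as <jdk root>/bin/jstack. `run` and
    `sleep` are injectable only so the loop can be exercised without a JVM.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(1, count + 1):
        path = os.path.join(out_dir, "jstack_%s_%02d.txt" % (pid, i))
        with open(path, "w") as out:
            # Equivalent to: jstack <PID> > jstack_<PID>_NN.txt
            run([jstack, str(pid)], stdout=out, check=True)
        paths.append(path)
        if i < count:
            sleep(interval_s)  # no pause is needed after the last dump
    return paths
```

For example, collect_thread_dumps(12345, out_dir="dumps") produces jstack_12345_01.txt through jstack_12345_10.txt in the dumps directory, which you can attach to the support case together with the EM logs.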


5. If the EM or Webview runs out of memory, configure the EM and/or Webview to generate a heap dump.

Open Introscope_Enterprise_Manager.lax or Introscope_WebView.lax and append the following to lax.nl.java.option.additional: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./logs/


NOTE: (Optional) You can collect a heap dump on demand using the JDK jmap tool. From the Java directory, run:  jmap -dump:format=b,file=heapdump <process id>

            Find a JDK/JRE on the local system where jstack.exe and jmap.exe exist. The output dump file is created in the current directory.


6. If you are using PostgreSQL, collect:

- POSTGRES_HOME\data\pg_log\* 

- The result of this query:  SELECT relname,relfilenode,relpages,round(relpages*8.0/(1024*1024), 2) as disk_space_in_GB FROM pg_class ORDER BY relpages DESC;
