The Perflog can reveal much about the health of the Collectors in a cluster. This article discusses the most easily diagnosed issues. For a complete analysis, CA Services should be engaged for a comprehensive health check.
Perform these four steps for an initial health check.
Step 1: transport.outgoingMessageQueueSize
Step 2: Max heap size
Step 3: Prepare to Analyze Perflog.txt
Step 4: Analyze Perflog.xlsx
Performance metrics are written to the Perflog at 15-second intervals. To see a summary of all the values in any column, click the Filter button for that column and scroll through the contents of the Filter window.
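If you prefer to summarize a column outside of Excel, a short script can do the same job. This is a minimal sketch that assumes the Perflog has been exported to CSV with the same column layout as the spreadsheet; the sample rows and column positions are illustrative, not real Perflog data.

```python
import csv
import io

# Illustrative sample only: timestamp, total JVM memory, free JVM memory.
SAMPLE = """\
timestamp,total_mem,free_mem
10:00:00,4096,900
10:00:15,4096,850
10:00:30,4096,40
"""

def column_summary(csv_text, col_index):
    """Return (min, max) of a numeric column, skipping the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    values = [float(r[col_index]) for r in rows[1:]]
    return min(values), max(values)

print(column_summary(SAMPLE, 2))  # summarize the free-memory column
```

The (min, max) pair gives the same quick range check as scrolling through the Filter window.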
Column B reports the total memory available to the JVM. If initial heap (-Xms) and max heap (-Xmx) are equal, this number should not change much over time since the maximum heap will be allocated immediately at startup instead of being acquired by the JVM as needed.
Column C reports the amount of free JVM memory available in each interval. Look for intervals where free memory on a Collector drops to a two-digit value or less. If you see this, increase the heap size available to the JVM, adding memory to the server if necessary. If sufficient JVM memory is already allocated, continue by investigating the remaining columns. It is unusual to see this problem on a MOM.
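The free-memory check above can be automated. This hedged sketch flags intervals where free JVM memory (Column C) drops below a threshold; the column position, the sample rows, and the units are assumptions about your exported Perflog and should be adjusted to match it.

```python
# Column C is assumed to be at 0-based index 2 in the exported rows.
def low_memory_intervals(rows, free_col=2, threshold=100):
    """Yield (row_number, value) for intervals where free memory < threshold."""
    for i, row in enumerate(rows, start=1):
        value = float(row[free_col])
        if value < threshold:
            yield i, value

rows = [
    ["10:00:00", "4096", "900"],
    ["10:00:15", "4096", "40"],   # two-digit free memory: would trigger the check
]
print(list(low_memory_intervals(rows)))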
Column F reports the Harvest Duration. This is the amount of time the Collector takes to aggregate 15-second interval metrics in preparation for writing them to the Smartstor database. If Harvest Duration frequently exceeds 3000ms (3 seconds), this is a sign that the Collector is struggling to aggregate the incoming interval metrics. The Collector is overloaded.
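Counting how often Harvest Duration breaches the 3-second threshold turns "frequently exceeds" into a concrete number. A sketch, assuming Column F sits at 0-based index 5 in the exported rows (an assumption about the layout, as is the sample data):

```python
HARVEST_COL = 5       # Column F, 0-based; adjust for your export
THRESHOLD_MS = 3000   # 3 seconds, per the guidance above

def slow_harvests(rows, col=HARVEST_COL, threshold=THRESHOLD_MS):
    """Return the number of intervals whose harvest duration exceeds threshold."""
    return sum(1 for row in rows if float(row[col]) > threshold)

rows = [
    ["t1", "x", "x", "x", "x", "1200"],  # healthy interval
    ["t2", "x", "x", "x", "x", "4500"],  # over the 3000ms threshold
]
print(slow_harvests(rows))
```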
Column G reports Smartstor Duration. This is the amount of time the Collector takes to write harvested data to disk. Values of 5000ms (5 seconds) or more should be addressed. CA recommends storing Smartstor data on a separate disk attached to a dedicated controller. Check the location of the Smartstor /data directory to ensure it is not on the same disk as the Enterprise Manager itself, and when Smartstor data is on a separate, dedicated disk, verify in IntroscopeEnterpriseManager.properties that:

introscope.enterprisemanager.smartstor.dedicatedcontroller=true
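Verifying the dedicated-controller flag can also be scripted. This is a hedged sketch using simple key=value parsing; real Java properties files additionally allow ':' separators and escape sequences, so treat this as a quick check rather than a full parser. The sample snippet is illustrative.

```python
def property_value(text, key):
    """Return the value for key in simple key=value properties text, else None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        k, _, v = line.partition("=")
        if k.strip() == key:
            return v.strip()
    return None

sample = """
# EM settings (illustrative snippet)
introscope.enterprisemanager.smartstor.dedicatedcontroller=true
"""
print(property_value(sample,
      "introscope.enterprisemanager.smartstor.dedicatedcontroller"))
```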
Column I (Agent number of metrics) and Column L (Agent metric data rate) should always have equal or very close values. Scroll through the spreadsheet to compare these two columns. The Agent metric data rate reports how many metrics were processed in an interval. If the metric data rate (Column L) is consistently much lower than the number of metrics coming in (Column I), it is a clear indication that the Collector cannot keep up with the agent metrics it is receiving. The Collector is overloaded.
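The Column I versus Column L comparison can be sketched as follows. The 0-based column indices (8 and 11) and the 10% tolerance are illustrative assumptions; tune both to your exported layout and your own definition of "much lower".

```python
METRICS_COL, RATE_COL = 8, 11   # Columns I and L, 0-based (assumed layout)
TOLERANCE = 0.10                # flag intervals lagging by more than 10%

def lagging_intervals(rows):
    """Return row numbers where the data rate trails the metric count by >10%."""
    flagged = []
    for i, row in enumerate(rows, start=1):
        incoming = float(row[METRICS_COL])
        processed = float(row[RATE_COL])
        if incoming > 0 and (incoming - processed) / incoming > TOLERANCE:
            flagged.append(i)
    return flagged

rows = [
    ["t"] * 8 + ["100000", "x", "x", "99500"],  # healthy: rate tracks count
    ["t"] * 8 + ["100000", "x", "x", "70000"],  # Collector falling behind
]
print(lagging_intervals(rows))
```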
Column J reports the number of Agents connected to this Collector. The maximum number of Agents allowed per Collector is 400. Note this value in each Perflog for each Collector in the cluster. If the number of Agents is unbalanced across the cluster, such that some Agent Collectors (but not TIM Collectors) are supporting more Agents than others, then look for load balancing issues in [EM_HOME]/config/loadbalancing.xml.
By default, all agents should be configured to point to the MOM. The MOM will assign Agents to Collectors automatically and enforce load balancing across the cluster at 15 minute intervals.
A cluster consists of one MOM and a maximum of 10 Collectors of all types, including TIM Collectors. Adding more than 10 Collectors to a cluster can negatively impact the performance of the MOM.
Column X reports Performance Transactions Number of Traces. This is the number of traces arriving in any 15 second interval from all Agents reporting to this Collector. The maximum allowed for any one Collector is 500,000.
If the number of incoming traces exceeds this limit, consider disabling socket, file, and network I/O traces on all agents to reduce the load. To find out which tracer types report the most traces, disable each type one at a time, then re-examine the Perflog for improvement.
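A quick check of trace volume against the 500,000-per-interval ceiling might look like this. The 0-based index for Column X and the sample rows are assumptions about the exported layout.

```python
TRACES_COL = 23        # Column X, 0-based (assumed layout)
MAX_TRACES = 500_000   # per-Collector ceiling per 15-second interval

def over_trace_limit(rows, col=TRACES_COL, limit=MAX_TRACES):
    """Return True if any interval's trace count exceeds the limit."""
    return any(float(row[col]) > limit for row in rows)

row_ok = ["0"] * 23 + ["120000"]    # within the ceiling
row_hot = ["0"] * 23 + ["650000"]   # over the ceiling
print(over_trace_limit([row_ok, row_hot]))
```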
To disable traces, check to see which PBL file you are using in [AGENT_HOME]/wily/core/config/IntroscopeAgent.profile by checking the directives property:
introscope.autoprobe.directivesFile=websphere-typical.pbl,hotdeploy
Here, we are using websphere-typical.pbl.
Checking in websphere-typical.pbl, we see that toggles-typical.pbd is called.
Edit toggles-typical.pbd and comment out the TurnOn directives for socket, file, and network I/O traces as shown:
#######################
# Network Configuration
# ================
#TurnOn: SocketTracing
# NOTE: Only one of SocketTracing and ManagedSocketTracing should be 'on'. ManagedSocketTracing is provided to
# enable pre 9.0 socket tracing.
#TurnOn: ManagedSocketTracing
#TurnOn: UDPTracing
#######################
# File System Configuration
# ================
# TurnOn: FileSystemTracing
#######################
# NIO Socket Tracer Group
# ================
#TurnOn: NIOSocketTracing
#TurnOn: NIOSocketSummaryTracing
#TurnOn: NIOSelectorTracing
A restart of the monitored application will be required.