How to improve overall Automation Engine performance

Article ID: 88220

Updated On: 04-18-2024

Products

CA Automic Workload Automation - Automation Engine
CA Automic One Automation

Issue/Introduction

One or more of the following symptoms can be observed:

A performance degradation when working with the system
Much slower than usual UI response times
Timeout error message when trying to log on to the UI
Delayed job processing
A high load on the database server machine

This article covers the most common and most effective measures to improve or restore database performance and thereby resolve the issues listed above.

If none of the following measures improves the situation, please contact Support and have the logs and traces listed at the end of this document available.

Resolution

General information

At its core, the system is designed so that every WP/CP processes incoming tasks serially. A server process therefore waits indefinitely for the database to respond to each statement it issues. As a result, overall system performance depends heavily on the ability of the database to respond to every statement in a timely fashion. Put simply: most customers start to see an impact on AE system performance, at the latest, when the average response time of SQL transactions exceeds 50 milliseconds.
For this reason, the server processes log every database (DB) statement that takes more than one second on the DB side to the logfile (see Investigation for details).

Investigation

How to identify persistent underlying database performance issues?

Open a server log file and search for all lines that contain "U00003524" or "U00003525" ("U0003524" or "U0003525" prior to V11.2). This isolates all DB statements that took more than 1 second to complete. Typically there is one at the beginning of each logfile for the initial database 'open', such as:

20170216/141721.352 - U00003524 UCUDB: ===> Time critical DB call! OPC: 'OPEN' time: '4:555.410.883'

Excluding the first one, a high number of such lines indicates an issue with database response times. More than one per hour should already be considered excessive!

Performance degradation can be caused by any one of many possible issues. The most common ones are listed below.
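The search described above can be sketched in a few lines of Python. This is a minimal illustration, not an Automic tool: the function name slow_calls_per_hour is hypothetical, the line layout is assumed to match the sample line shown above, and ignoring every 'OPEN' call is only an approximation of "excluding the first one".

```python
import re
from collections import Counter

# Message numbers named in this article (8 digits since V11.2, 7 digits before).
SLOW_CALL_IDS = {"U00003524", "U00003525", "U0003524", "U0003525"}

# Assumed layout, modelled on the sample line: YYYYMMDD/HHMMSS.mmm - <message number> ...
HEAD_RE = re.compile(r"^(\d{8})/(\d{2})\d{4}\.\d{3} - (U\d+) ")
OPC_RE = re.compile(r"OPC: '([^']+)'")


def slow_calls_per_hour(lines, ignore_opcodes=("OPEN",)):
    """Count time-critical DB calls per (date, hour), skipping ignored opcodes."""
    counts = Counter()
    for line in lines:
        head = HEAD_RE.match(line)
        if not head or head.group(3) not in SLOW_CALL_IDS:
            continue  # not a time-critical DB call, or an unexpected layout
        opc = OPC_RE.search(line)
        if opc and opc.group(1) in ignore_opcodes:
            continue  # approximation: treat 'OPEN' as the expected initial call
        counts[(head.group(1), head.group(2))] += 1
    return counts


def excessive_hours(counts, limit=1):
    """Return the (date, hour) buckets with more than `limit` slow calls."""
    return sorted(bucket for bucket, n in counts.items() if n > limit)


sample = [
    "20170216/141721.352 - U00003524 UCUDB: ===> Time critical DB call! OPC: 'OPEN' time: '4:555.410.883'",
    "20170216/150102.100 - U00003524 UCUDB: ===> Time critical DB call! OPC: 'CMIT' time: '2:294.465.288'",
    "20170216/150530.000 - U00003524 UCUDB: ===> Time critical DB call! OPC: 'CMIT' time: '1:100.000.000'",
]
counts = slow_calls_per_hour(sample)
print(excessive_hours(counts))
```

To use it on a real file, pass the open file object instead of the sample list, for example slow_calls_per_hour(open(path, errors="replace")), where path points to a server log file.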

Possible Issues and Solutions

Resource consumption by database instances on the same hardware

As recommended in the Administration Handbook, the database instance hosting the AE system should not be limited in any way with regard to resources. This includes avoiding parallel databases on the same instance or even on the same hardware.

Running on an old version of Automic Automation Engine

As performance improvements are brought to the product with both major releases and Service Packs, it is worth considering an update to the latest Service Pack.

Index fragmentation or broken indices

The more jobs run per unit of time on an AE system, the faster the indices within the database become fragmented. A regular index rebuild helps to keep the indices fast. How often the rebuild should be scheduled depends on the usage and load of the AE system. Once a day is a good starting point for most customers.
Please refer to the corresponding chapter of the product documentation for more details, such as statistics versus indices.
We recommend discussing this matter with your DB-Admin. You might want to schedule the index rebuild at a time of low activity to avoid degraded performance while the rebuild is performed. Please be aware that a certain edition of the database system is required to perform the index rebuild online. In case the required database license is not available, an index might be taken offline during rebuild which causes poor performance on the corresponding table.

For Microsoft SQL Server, the following Microsoft article describes procedures for checking the fragmentation growth of indexes: http://msdn.microsoft.com/en-us/library/ms189858.aspx

Many customers report a positive experience with a regular index rebuild on all tables, immediately after the Automic REORG-Run.

Dead connection detection (Oracle)

Oracle contains specific settings that need to be set in a certain way when working with Automic.

If dead connection detection is not set, or is set incorrectly, a hung transaction may occur and go undetected for hours. This situation usually leads to a production outage.

More information about Oracle-specific settings can be found in the corresponding chapter of the documentation. Further details can be found in the AE DB Preparation documentation. Although the whitepaper does not explicitly cover Oracle 12g or Automation Engine V12, the basic principles still apply.

Deadlock detection

Given the nature of distributed task processing, database deadlocks can occur from time to time, even though Automic continuously improves the architecture. Typically the database is capable of recognizing and resolving such situations automatically. However, it is important to make sure that the database resolves deadlocks as quickly as possible.
Please talk to your DB Admin to set the corresponding parameter for the DBMS that is used by the Automation Engine.

For Oracle: set _lm_dd_interval to a value <= 10 (refer to article: Preparing the AE Database - Oracle (Oracle RAC) for details)
For MS SQL: make sure READ_COMMITTED_SNAPSHOT is set to ON (refer to article: Activation of versioning (READ_COMMITTED_SNAPSHOT => ON) on MS SQL Server for details)
For IBM DB2: make sure to set the deadlock detection interval as low as possible (refer to KB System Slowness and DEADLOCKS: DB/2 for details)

Oracle RAC

When using Oracle RAC, we recommend using only one node for communication. Using additional nodes can cause performance issues. Please see article: Preparing the AE Database - Oracle (Oracle RAC) for more information.

Connection interruptions due to unresponsive components or an unstable network

As part of the high availability concept, the Automic Automation Engine components exchange heartbeat and keep-alive signals at a configurable frequency. In case of network interruptions the system executes a series of measures to ensure continued availability and processing.
For details on this topic please see: What is "Keep Alive" and how does it work?

An important technical detail to keep in mind: All network communications for all components of an Automic Automation Engine system are performed via the network stack or API provided by the operating system which the particular component is running on. Error messages within the AE log files match the responses the component gets from the OS.

To see whether any network issues occurred, search the server or agent logfiles for the message numbers listed below.

Examples of network stack related log messages:
Message Number since V11.2	Message Number prior to V11.2	Category	Example
U00003413	U0003413	network stack error message	U00003413 Socket call 'bind' returned error code '10048'. or U00003413 Socket call 'recv' returned error code '10053'.
U00003487	U0003487	application error due to previous network error	U00003487 ListenSocket with port number '10100' could not be created.
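As an illustration of such a search, the following Python sketch pulls the socket call and error code out of U00003413 lines. The function name and the hint table are assumptions added for this example. The two hints are the standard Windows Sockets meanings of codes 10048 and 10053; because the codes come from the operating system, the same numbers mean something else (or nothing) on other platforms.

```python
import re

# Matches both forms of the message number: U00003413 (since V11.2) and U0003413.
SOCKET_ERR_RE = re.compile(
    r"U0{3,4}3413 Socket call '(?P<call>[^']+)' returned error code '(?P<code>\d+)'"
)

# Windows Sockets error codes for the two examples in the table above.
# Assumption: the component runs on Windows; extend the table for other codes.
WINSOCK_HINTS = {
    10048: "WSAEADDRINUSE: address already in use",
    10053: "WSAECONNABORTED: software caused connection abort",
}


def parse_socket_errors(lines):
    """Return (socket call, error code, hint) for every U00003413 line found."""
    found = []
    for line in lines:
        match = SOCKET_ERR_RE.search(line)
        if match:
            code = int(match.group("code"))
            hint = WINSOCK_HINTS.get(code, "unknown: look up in the OS documentation")
            found.append((match.group("call"), code, hint))
    return found


sample = [
    "U00003413 Socket call 'bind' returned error code '10048'.",
    "U00003413 Socket call 'recv' returned error code '10053'.",
    "U00003487 ListenSocket with port number '10100' could not be created.",
]
errors = parse_socket_errors(sample)
for call, code, hint in errors:
    print(f"{call}: {code} ({hint})")
```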

Another important aspect is the following: Network issues are not necessarily caused by active network components like routers or switches. It is more likely that an issue is caused by the network stack on the source or destination machine. Even if both communication partners reside on the same host machine, TCP/IP is utilized and might cause issues in certain scenarios.

Performance degradation on storage sub-system

Irrespective of the amount of DBMS processing done in RAM, ultimately data needs to be written to the filesystem to make it persistent. This is typically triggered by issuing a 'commit' command. Like all other DB statements the timing for such commands is monitored by the Automation Engine server processes.
The average value for commits ('CMIT' in AE logs) should not exceed 5 milliseconds. On high-performance systems with >1 million executions per day, the requirement would be <1 ms!

As with other time-critical DB responses, a line is written to the corresponding log file whenever a CMIT takes more than 1 second:

U00003524 UCUDB: ===> Time critical DB call! OPC: 'CMIT' time: '2:294.465.288'

If long commit calls are being made to the database, an administrator must examine the system to find out why this is happening.
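To compare logged durations against these thresholds, the time field has to be converted to a number first. The sketch below assumes the field reads as seconds, followed by a colon and three dot-separated groups for milliseconds, microseconds and nanoseconds. That reading is inferred from the sample lines in this article and is not confirmed by the product documentation; the function name is hypothetical.

```python
def parse_db_call_time(value):
    """Convert a time field such as '2:294.465.288' to seconds (float).

    Assumed layout (inferred from the samples, not documented):
    <seconds>:<milliseconds>.<microseconds>.<nanoseconds>
    """
    seconds, fraction = value.split(":")
    digits = fraction.replace(".", "")
    return int(seconds) + int(digits) / 10 ** len(digits)


commit_seconds = parse_db_call_time("2:294.465.288")
# Ratio against the 5 ms average recommended above for commits.
print(f"{commit_seconds:.3f} s, {commit_seconds / 0.005:.0f}x the 5 ms guideline")
```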

Large amount of data in database

In short, the amount of data within a database should not be an issue. In fact, some customers are running high-performance environments with >1TB of data. Every modern DBMS should be able to handle such volumes without problems.

However, the amount of data that needs to be 'reorganized' (removed from the database using the Automic Reorg-Utilities) influences the time it takes for one run of the housekeeping jobs to execute. While housekeeping is active, you might notice a degradation in overall system performance.
To minimize this, the run can be scheduled more frequently or the ILM feature can be used instead: Maintaining Data Records. When a full run of the Automic Reorg-Utilities takes more than 6 hours, increasing the frequency or switching to ILM should be considered.

Database Integrity Checks

In some cases, running Database Integrity Checks during a period of high activity has been noted to decrease performance. Check whether the DBA is performing any external maintenance activities at the time of the performance degradation.

DB Performance check in 21.0.9+ and 24.0+

In 21.0.9 and above (including 24.0.0 and above), there is a performance check that runs hourly that cannot be disabled. Reviewing the information included with this check could help in analyzing performance. Please see the following for more information: Information on Client 0 System has performance issues: DB performance

In case none of the above leads to an improved situation

Please create a trace of the WPs with flag database set to 2 (database=2) and open a case for "Performance Analysis" with Support. Be sure to gather all the logs and traces from all WPs on all nodes of the system. Ideally the trace is created during a time of performance degradation and the sum of the sizes of all trace files from all WP instances does not exceed 3GB.
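Before uploading, the combined size of the collected trace files can be checked with a short script. This is a sketch under assumptions: the directory layout and file name pattern are hypothetical and depend on your installation, and 3GB is interpreted here as 3 * 1024^3 bytes.

```python
from pathlib import Path

# Assumption: "3GB" is read as 3 GiB; use 3 * 1000**3 for the decimal reading.
TRACE_LIMIT_BYTES = 3 * 1024**3


def total_trace_size(trace_dir, pattern="*"):
    """Sum the sizes of all regular files in trace_dir that match pattern."""
    return sum(p.stat().st_size for p in Path(trace_dir).glob(pattern) if p.is_file())


def within_limit(total_bytes, limit=TRACE_LIMIT_BYTES):
    """True if the combined trace size does not exceed the limit."""
    return total_bytes <= limit


# Example call; the directory and pattern are placeholders for your own paths:
# size = total_trace_size("/path/to/collected/traces", pattern="*.txt")
# print(size, within_limit(size))
```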

Note: Be sure to share the impact and urgency of these performance issues with us, so that we can prioritize accordingly.
