At its core, the system is designed so that every WP/CP processes incoming tasks serially. A server process therefore waits indefinitely for the database to respond to each statement it issues. Overall system performance consequently depends heavily on the database's ability to respond to every statement in a timely fashion. More simply: most customers start to notice an impact on AE system performance, at the latest, when the average response time of SQL transactions exceeds 50 milliseconds.
With this in mind, it makes sense that the server processes write every database (DB) statement that takes more than one second on the DB side to the log file (see Investigation for details).
How to identify persistent underlying database performance issues?
Open a server log file and search for all lines that contain "U00003524" or "U00003525" ("U0003524" or "U0003525" prior to V11.2). This isolates all DB statements that took more than 1 second to complete. Typically there is one such line at the beginning of each log file for the initial database 'open', such as:
20170216/141721.352 - U00003524 UCUDB: ===> Time critical DB call! OPC: 'OPEN' time: '4:555.410.883'
Excluding the first one, a high number of such lines indicates an issue with database response times. More than one per hour should already be considered as excessive!
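The search described above can be automated. The following is a minimal sketch, not an official Automic tool: it assumes that every log line starts with a timestamp in the form shown in the example (YYYYMMDD/HHMMSS.mmm) and that only the first 'OPEN' call should be ignored. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: lines start with "YYYYMMDD/HHMMSS.mmm - ", as in the example above.
# U0{3,4} accepts both U00003524 (since V11.2) and U0003524 (prior to V11.2).
HEADER = re.compile(
    r"^(?P<date>\d{8})/(?P<hour>\d{2})\d{4}\.\d{3} - U0{3,4}352[45]\b"
)
OPC = re.compile(r"OPC: '([^']*)'")


def slow_calls_per_hour(lines, skip_initial_open=True):
    """Count time-critical DB calls per (date, hour) bucket."""
    counts = Counter()
    open_pending = skip_initial_open
    for line in lines:
        head = HEADER.match(line)
        if head is None:
            continue
        opc = OPC.search(line)
        # Ignore the first OPEN call, which is expected at log start.
        if open_pending and opc and opc.group(1) == "OPEN":
            open_pending = False
            continue
        counts[(head.group("date"), head.group("hour"))] += 1
    return counts


def excessive_hours(counts, limit=1):
    """Return the buckets with more than `limit` slow calls, sorted."""
    return sorted(bucket for bucket, n in counts.items() if n > limit)
```

Passing the lines of a server log file to `slow_calls_per_hour` and the result to `excessive_hours` lists the hours that exceed the "more than one per hour" guideline.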
Performance degradation can have many possible causes. The most common ones are listed below.
Possible Issues and Solutions
- Resource consumption by database instances on the same hardware
As recommended in the Administration Handbook, the database instance hosting the AE system should not be limited in any way with regard to resources. This includes avoiding parallel databases on the same instance or even on the same hardware.
- Running on an old version of Automic Automation Engine
As performance improvements are brought to the product with both major releases and Service Packs, it is worth considering an update to the latest Service Pack.
- Index fragmentation or broken indices
The more jobs run per unit of time on an AE system, the faster the indices within the database become fragmented. A regular index rebuild helps to keep the indices fast. How often the rebuild should be scheduled depends on the usage and load of the AE system. Once a day seems to be a good starting point for most of our customers.
Please refer to the corresponding chapter within the product documentation for more details, like statistics vs. indices.
We recommend discussing this matter with your DB Admin. You might want to schedule the index rebuild at a time of low activity to avoid degraded performance while the rebuild is performed. Please be aware that a certain edition of the database system is required to perform the index rebuild online. If the required database license is not available, an index might be taken offline during the rebuild, which causes poor performance on the corresponding table.
For Microsoft SQL Server, the following article describes procedures, suggested by Microsoft, for checking the fragmentation growth of indexes: http://msdn.microsoft.com/en-us/library/ms189858.aspx
Many customers report a positive experience with a regular index rebuild on all tables, immediately after the Automic REORG-Run.
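For Microsoft SQL Server, the fragmentation check could be sketched as follows. The query uses the `sys.dm_db_index_physical_stats` dynamic management function. The thresholds (reorganize above 5 %, rebuild above 30 %) are the rule of thumb commonly quoted from Microsoft's guidance and are an assumption here, not an Automic recommendation; the function names and the sample rows in the usage note are hypothetical. Agree on the actual values with your DB Admin.

```python
# Query to be run by the DBA's tool of choice against the AE database.
FRAGMENTATION_QUERY = """
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.index_id > 0
ORDER BY ips.avg_fragmentation_in_percent DESC;
"""


def recommend_action(avg_fragmentation_percent):
    """Map a fragmentation percentage to a maintenance action.

    Assumed thresholds: > 30 % rebuild, > 5 % reorganize, otherwise nothing.
    """
    if avg_fragmentation_percent > 30:
        return "REBUILD"
    if avg_fragmentation_percent > 5:
        return "REORGANIZE"
    return "NONE"


def maintenance_plan(rows):
    """Keep only the (table, index, action) entries that need maintenance."""
    plan = []
    for table_name, index_name, percent in rows:
        action = recommend_action(percent)
        if action != "NONE":
            plan.append((table_name, index_name, action))
    return plan
```

For example, `maintenance_plan([("TABLE_A", "IX_A", 45.0), ("TABLE_B", "IX_B", 2.0)])` keeps only the first entry and marks it for a rebuild.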
- Dead connection detection (Oracle)
Oracle contains specific settings that need to be set in a certain way when working with Automic.
If the dead connection detection is not set, or set the wrong way, a hung transaction may occur that will go undetected for hours. This situation usually leads to a production outage.
Find out more information about Oracle specific settings within the corresponding chapter of the documentation. Furthermore, details can be found in the AE DB Preparation documentation. Although the whitepaper does not explicitly cover Oracle 12g or Automation Engine V12, the basic principles still apply.
Given the distributed nature of task processing, database deadlocks can occur from time to time, even though Automic continuously improves the architecture. Typically the database is capable of recognizing and resolving such situations automatically. However, it is important to make sure that the database resolves deadlocks as quickly as possible.
Please talk to your DB Admin to set the corresponding parameter for the DBMS that is used by the Automation Engine.
- Connection interruptions due to unresponsive components or an unstable network
As part of the high availability concept, the Automic Automation Engine components exchange heartbeat and keep-alive signals at a configurable frequency. In case of network interruptions, the system executes a series of measures to ensure continued availability and processing.
For details on this topic please see: What is "Keep Alive" and how does it work?
An important technical detail to keep in mind: All network communications for all components of an Automic Automation Engine system are performed via the network stack or API provided by the operating system which the particular component is running on. Error messages within the AE log files match the responses the component gets from the OS.
To check whether any network issues occurred, search the server or agent log files for the message numbers listed below.
Another important aspect is the following: Network issues are not necessarily caused by active network components like routers or switches. It is more likely that an issue is caused by the network stack on the source or destination machine. Even if both communication partners reside on the same host machine, TCP/IP is utilized and might cause issues in certain scenarios.
examples of network stack related log messages
Message Number since V11.2
Message Number prior to V11.2
network stack error message
U00003413 Socket call 'bind' returned error code '10048'.
U00003413 Socket call 'recv' returned error code '10053'.
application error due to previous network error
U00003487 ListenSocket with port number '10100' could not be created.
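A search for these messages can be scripted. The sketch below is an illustration under assumptions: it accepts message numbers with three or four zeros after the "U" (assuming the pre-V11.2 numbering follows the same pattern as U0003524 above), and it translates only the two Winsock error codes that appear in the examples. The function name is hypothetical.

```python
import re

# Windows Sockets error codes seen in the examples above:
# 10048 = address already in use, 10053 = software caused connection abort.
WINSOCK_CODES = {
    10048: "WSAEADDRINUSE",
    10053: "WSAECONNABORTED",
}

# Assumption: prior to V11.2 the numbers carry one zero fewer (U0003413).
SOCKET_ERROR = re.compile(
    r"U0{3,4}3413 Socket call '(?P<call>[^']+)' returned error code '(?P<code>\d+)'"
)
LISTEN_ERROR = re.compile(
    r"U0{3,4}3487 ListenSocket with port number '(?P<port>\d+)'"
)


def network_findings(lines):
    """Collect network stack related messages from log lines."""
    findings = []
    for line in lines:
        m = SOCKET_ERROR.search(line)
        if m:
            code = int(m.group("code"))
            name = WINSOCK_CODES.get(code, "unknown")
            findings.append(("socket", m.group("call"), code, name))
            continue
        m = LISTEN_ERROR.search(line)
        if m:
            findings.append(("listen", int(m.group("port"))))
    return findings
```

Because the AE logs repeat what the operating system reports, the error code itself is usually the best starting point for further analysis with the OS or network team.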
- Performance degradation on storage sub-system
Irrespective of the amount of DBMS processing done in RAM, ultimately data needs to be written to the filesystem to make it persistent. This is typically triggered by issuing a 'commit' command. Like all other DB statements the timing for such commands is monitored by the Automation Engine server processes.
The average value for commits ('CMIT' in AE logs) should not exceed 5 milliseconds. On high-performance systems with more than 1 million executions per day, the requirement is less than 1 ms.
As with other time-critical DB responses, a line is written to the corresponding log file whenever a CMIT takes more than 1 second:
U00003524 UCUDB: ===> Time critical DB call! OPC: 'CMIT' time: '2:294.465.288'
If long commit calls are being made to the database, an admin must examine the system to determine why this is happening.
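To compare such calls, the duration field can be converted into a number. The sketch below assumes that the field is laid out as seconds:milliseconds.microseconds.nanoseconds; this reading is consistent with the 1-second logging threshold but is not confirmed by the product documentation. The function names are hypothetical.

```python
import re

# Assumed layout of the time field: 'S:mmm.uuu.nnn'
CALL = re.compile(
    r"OPC: '(?P<opc>[^']+)' time: "
    r"'(?P<s>\d+):(?P<ms>\d{3})\.(?P<us>\d{3})\.(?P<ns>\d{3})'"
)


def duration_ns(match):
    """Convert a matched time field to integer nanoseconds."""
    s, ms, us, ns = (int(match.group(k)) for k in ("s", "ms", "us", "ns"))
    return ((s * 1000 + ms) * 1000 + us) * 1000 + ns


def slowest_by_opc(lines):
    """Return the longest logged duration (in ns) per operation code."""
    worst = {}
    for line in lines:
        m = CALL.search(line)
        if m:
            opc = m.group("opc")
            worst[opc] = max(worst.get(opc, 0), duration_ns(m))
    return worst
```

Note that only calls longer than 1 second are logged, so this shows the worst cases per operation; it cannot be used to compute the average commit time.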
- Large amount of data in database
In short, the amount of data within a database should not be an issue. In fact, some customers are running high-performance environments with >1TB of data. Every modern DBMS should be able to handle such volumes without problems.
However, the amount of data that needs to be 'reorganized' (removed from the database using the Automic Reorg-Utilities) influences the time it takes for one run of the housekeeping jobs to execute. While housekeeping is active, you might notice a degradation in overall system performance.
To minimize this, the run can be scheduled more frequently or the ILM feature can be used instead: Maintaining Data Records. When a full run of the Automic Reorg-Utilities takes more than 6 hours, increasing the frequency or switching to ILM should be considered.
- Database Integrity Checks
In some cases, running Database Integrity Checks during a period of high activity has been noted to decrease performance. Check whether the DBA is performing any external maintenance activities at the time of the performance degradation.
In case none of the above leads to an improved situation
Please create a trace of the WPs with flag database set to 2 (database=2) and open a case for "Performance Analysis" with Support. Be sure to gather all the logs and traces from all WPs on all nodes of the system. Ideally the trace is created during a time of performance degradation and the sum of the sizes of all trace files from all WP instances does not exceed 3GB.
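Before uploading, the combined size of the trace files can be checked with a few lines of code. This is a minimal sketch: the directory and file name pattern are placeholders to be supplied by the caller, since trace file names depend on the installation, and the 3GB limit is read as 3 GiB, which is an assumption.

```python
from pathlib import Path

# Assumption: "3GB" is interpreted as 3 * 1024**3 bytes.
MAX_TOTAL_BYTES = 3 * 1024**3


def collect_sizes(directory, pattern="*"):
    """Map file name to size in bytes for files matching `pattern`."""
    return {
        p.name: p.stat().st_size
        for p in Path(directory).glob(pattern)
        if p.is_file()
    }


def bundle_report(sizes, limit=MAX_TOTAL_BYTES):
    """Return (total bytes, True if the total is within the limit)."""
    total = sum(sizes.values())
    return total, total <= limit
```

Running `collect_sizes` once per node and merging the resulting dictionaries before calling `bundle_report` gives the total across all WP instances.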
Note: Be sure to share the impact and urgency of these performance issues with us, so that we can prioritize accordingly.