General information
At its core, the system is designed so that every WP/CP (work process/communication process) processes incoming tasks serially. A server process therefore waits indefinitely for the database to respond to each statement it issues. Consequently, overall system performance depends heavily on the ability of the database to respond to every database statement in a timely fashion. Put simply: most customers start to experience an impact on AE system performance at the latest when the average response time of SQL transactions exceeds 50 milliseconds.
This is why the server processes log every database (DB) statement that takes more than one second on the DB side to the logfile (see Investigation below for details).
Investigation
How to identify persistent underlying database performance issues?
Open a server log file and search for all lines that contain "U00003524" or "U00003525" (respectively "U0003524" or "U0003525" prior to V11.2). This isolates all DB statements that took more than 1 second to complete. Typically there's one at the beginning of each logfile for the initial database 'open', such as:
20170216/141721.352 - U00003524 UCUDB: ===> Time critical DB call! OPC: 'OPEN' time: '4:555.410.883'
Excluding the first one, a high number of such lines indicates an issue with database response times. More than one per hour should already be considered excessive!
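For a quick overview, the following is a minimal sketch (assuming a hypothetical logfile path; the timestamp and message formats are taken from the example above) that counts time-critical DB calls per hour, so the one-per-hour guideline can be checked:

import re
from collections import Counter

# Assumption: path to one AE server logfile - adjust to your environment
LOG_FILE = "/var/automic/logs/WPsrv_log_001.txt"

# Lines start with a timestamp like "20170216/141721.352"; U00003524/U00003525
# (U0003524/U0003525 prior to V11.2) flag DB calls that took longer than one second
PATTERN = re.compile(r"^(\d{8})/(\d{2})\d{4}\.\d{3} - U0{3,4}35(24|25)")

per_hour = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = PATTERN.match(line)
        if match:
            per_hour[f"{match.group(1)} {match.group(2)}:00"] += 1

# Remember to discount the initial database 'OPEN' call mentioned above
for hour, count in sorted(per_hour.items()):
    print(f"{hour}  {count} time-critical DB call(s)")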
Performance degradation can be caused by any one of many possible issues. The most common ones are listed below.
Possible Issues and Solutions
- Resource consumption by database instances on the same hardware
As recommended in the Administration Handbook, the database instance hosting the AE system should not be limited in any way with regard to resources; this includes avoiding parallel databases on the same instance or even on the same hardware.
- Running on an old version of Automic Automation Engine
As performance improvements are brought to the product with both major releases and Service Packs, it is worth considering an update to the latest Service Pack.
- Index fragmentation or broken indices
The more jobs an AE system runs per unit of time, the faster the indices within the database become fragmented. A regular index rebuild helps to keep the indices fast. How often the rebuild should be scheduled depends on the usage and load of the AE system. Once a day seems to be a good starting point for most of our customers.
Please refer to the corresponding chapter of the product documentation for more details, such as statistics versus indices.
We recommend discussing this matter with your DB admin. You might want to schedule the index rebuild at a time of low activity to avoid degraded performance while the rebuild is performed. Please be aware that a certain edition of the database system is required to perform the index rebuild online. If the required database license is not available, an index might be taken offline during the rebuild, which causes poor performance on the corresponding table.
For Microsoft SQL Server, the following Microsoft article contains procedures for checking the fragmentation growth of indexes (a small query sketch follows below): http://msdn.microsoft.com/en-us/library/ms189858.aspx
Many customers report a positive experience with a regular index rebuild on all tables, immediately after the Automic REORG-Run.
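To illustrate the kind of check your DBA might run, here is a minimal sketch, assuming a SQL Server backend, the Python pyodbc module, and the same (hypothetical) DSN and credentials as in the sqlDriverConnect= examples further down. It lists indexes whose fragmentation exceeds 30%, a commonly used rebuild threshold; the native tooling from the Microsoft article above remains the authoritative reference.

import pyodbc

# Assumption: reuse the ODBC data source configured for the AE; replace UID/PWD with real values
conn = pyodbc.connect("DSN=AUTOMIC;UID=automic;PWD=???")

# sys.dm_db_index_physical_stats reports per-index fragmentation (see the Microsoft article above)
sql = """
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30  -- 30% is a common threshold for a rebuild
ORDER BY ips.avg_fragmentation_in_percent DESC
"""

for table_name, index_name, fragmentation in conn.cursor().execute(sql):
    print(f"{table_name}.{index_name}: {fragmentation:.1f}% fragmented")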
- Dead connection detection (Oracle)
Oracle provides specific settings that must be configured correctly when working with Automic.
If dead connection detection is not configured, or is configured incorrectly, a hung transaction may go undetected for hours. This situation usually leads to a production outage.
More information about Oracle-specific settings can be found in the corresponding chapter of the documentation as well as in the AE DB Preparation whitepaper. Although the whitepaper does not explicitly cover Oracle 12c or Automation Engine V12, the basic principles still apply.
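For illustration only (please verify the recommended value against the AE DB Preparation documentation for your database release): dead connection detection is typically enabled on the database server via the SQLNET.EXPIRE_TIME parameter in sqlnet.ora, specified in minutes, for example:
SQLNET.EXPIRE_TIME = 1
This instructs Oracle to periodically probe client connections and clean up sessions whose clients no longer respond.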
- Database deadlocks
Given the distributed nature of task processing, database deadlocks might occur from time to time, even though Automic continuously improves the architecture to avoid them. Typically the database is capable of recognizing and resolving such situations automatically. However, it is important to make sure that the database resolves deadlocks as quickly as possible.
Please talk to your DB Admin to set the corresponding parameter for the DBMS that is used by the Automation Engine.
- Connection interruptions due to unresponsive components or an unstable network
As part of the high availability concept, the Automic Automation Engine components exchange heartbeat and keep-alive signals at a configurable frequency. In case of network interruptions, the system executes a series of measures to ensure continued availability and processing.
For details on this topic please see: What is "Keep Alive" and how does it work?
An important technical detail to keep in mind: all network communication of all components of an Automic Automation Engine system is performed via the network stack or API provided by the operating system on which the particular component is running. Error messages within the AE log files match the responses the component gets from the OS.
To see whether any network issues occurred, search the server or agent logfiles for the message numbers listed below (a small search sketch follows after the table):
Examples of network stack related log messages:

| Message Number since V11.2 | Message Number prior to V11.2 | Category | Example |
| --- | --- | --- | --- |
| U00003413 | U0003413 | Network stack error message | U00003413 Socket call 'bind' returned error code '10048'. or U00003413 Socket call 'recv' returned error code '10053'. |
| U00003487 | U0003487 | Application error due to previous network error | U00003487 ListenSocket with port number '10100' could not be created. |
Another important aspect: network issues are not necessarily caused by active network components such as routers or switches. It is more likely that an issue is caused by the network stack on the source or destination machine. Even if both communication partners reside on the same host machine, TCP/IP is still used and might cause issues in certain scenarios.
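As a minimal sketch (assuming a hypothetical directory for the server and agent logfiles), the following scans all logfiles for the network-related message numbers from the table above:

import glob

# Assumption: adjust the pattern to where your AE server and agent logfiles actually reside
LOG_PATTERN = "/var/automic/logs/*.txt"

# Message numbers from the table above, including the shorter pre-V11.2 variants
NETWORK_MESSAGES = ("U00003413", "U0003413", "U00003487", "U0003487")

for path in glob.glob(LOG_PATTERN):
    with open(path, encoding="utf-8", errors="replace") as log:
        for line_number, line in enumerate(log, start=1):
            if any(msg in line for msg in NETWORK_MESSAGES):
                print(f"{path}:{line_number}: {line.rstrip()}")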
- Performance degradation on storage sub-system
Irrespective of the amount of DBMS processing done in RAM, data ultimately needs to be written to the filesystem to make it persistent. This is typically triggered by issuing a 'commit' command. Like all other DB statements, the timing of such commands is monitored by the Automation Engine server processes.
The average value for commits ('CMIT' in AE logs) should not exceed 5 milliseconds. On high-performance systems with more than 1 million executions per day, the average should stay below 1 millisecond!
As with other time-critical DB responses, a line is written to the corresponding log file whenever a CMIT takes more than 1 second:
U00003524 UCUDB: ===> Time critical DB call! OPC: 'CMIT' time: '2:294.465.288'
If long commit calls are being made to the database, an administrator must look at the system to determine why this is happening.
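To see at a glance which operations are slow, a minimal sketch (assuming a hypothetical logfile path and the message layout shown in the examples above) could tally the time-critical DB calls by their OPC field; a high CMIT count points at the storage sub-system rather than at individual slow statements:

import re
from collections import Counter

# Assumption: path to one AE server (WP) logfile - adjust to your environment
LOG_FILE = "/var/automic/logs/WPsrv_log_001.txt"

# Matches the OPC field of U00003524/U00003525 (U0003524/U0003525 prior to V11.2) messages
OPC_PATTERN = re.compile(r"U0{3,4}35(24|25) UCUDB: ===> Time critical DB call! OPC: '(\w+)'")

counts = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = OPC_PATTERN.search(line)
        if match:
            counts[match.group(2)] += 1

for opc, count in counts.most_common():
    print(f"{opc}: {count} time-critical call(s)")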
- Large amount of data in database
In short, the amount of data within the database should not be an issue in itself. In fact, some customers run high-performance environments with more than 1 TB of data. Every modern DBMS should be able to handle such volumes without problems.
However, the amount of data that needs to be 'reorganized' (removed from the database using the Automic Reorg-Utilities) influences the time it takes for one run of the housekeeping jobs to execute. While housekeeping is active, you might notice a degradation in overall system performance.
To minimize this, the run can be scheduled more frequently or the ILM feature can be used instead: Maintaining Data Records. When a full run of the Automic Reorg-Utilities takes more than 6 hours, increasing the frequency or switching to ILM should be considered.
- Database Integrity Checks
In some cases, running database integrity checks during a period of high activity has been noted to decrease performance. Check whether any external maintenance activities are being performed by the DBA at the time of the performance degradation.
- DB Performance check in 21.0.9+ and 24.0+
- SQL Server as a backend
- Mars Connection
Check that Mars_Connection=Yes is present in the sqlDriverConnect= setting in ucsrv.ini (located in the Automation Engine /bin directory) and that the first character of the ODBCVAR value is an "N", not an "S". It should not look like this:
sqlDriverConnect=ODBCVAR=SNNNNNRN,DSN=AUTOMIC;UID=automic;PWD=???;Mars_Connection=Yes
but instead should be:
sqlDriverConnect=ODBCVAR=NNNNNNRN,DSN=AUTOMIC;UID=automic;PWD=???;Mars_Connection=Yes
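As a quick self-check, here is a minimal sketch (assuming a hypothetical path to ucsrv.ini) that reads the sqlDriverConnect= line and warns if the ODBCVAR value starts with "S" or if Mars_Connection=Yes is missing:

import re

# Assumption: adjust to the actual location of ucsrv.ini in your Automation Engine /bin directory
INI_FILE = "/opt/automic/AutomationEngine/bin/ucsrv.ini"

with open(INI_FILE, encoding="utf-8", errors="replace") as ini:
    for line in ini:
        if line.strip().lower().startswith("sqldriverconnect="):
            value = line.split("=", 1)[1]
            odbcvar = re.search(r"ODBCVAR=([A-Z]+)", value)
            if odbcvar and odbcvar.group(1).startswith("S"):
                print("WARNING: ODBCVAR starts with 'S'; it should start with 'N' (see the examples above)")
            if "mars_connection=yes" not in value.lower():
                print("WARNING: Mars_Connection=Yes is missing from sqlDriverConnect")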
- Check your ODBC driver version
Look in your WP logs for a line like:
U00003535 DB INFO: 'DBMS-NAME = Microsoft SQL Server'
Below it will be a line with a driver version:
U00003535 DB INFO: 'DRIVER-VER = 18.03.0002'
This is the ODBC driver version; it should be as up to date as possible for your database version. Your database vendor should have a full list of ODBC driver versions that are compatible with your database version. If you are not on the latest, work with your DBA to ensure the latest driver is installed and used by the ODBC data source referenced under DSN= in your sqlDriverConnect= setting.
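To cross-check which ODBC drivers are registered on the machine running the work processes, a minimal sketch (assuming the Python pyodbc module is available there) could be:

import pyodbc

# Lists the ODBC drivers registered on this machine;
# compare the names/versions against the DRIVER-VER line in the WP log
for driver_name in pyodbc.drivers():
    print(driver_name)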
- Check your JDBC driver version
Look in your WP logs for a line like:
U00003535 DB INFO: 'DBMS-NAME = AUTOMIC_DATABASE_NAME'
Note: this will show an actual database name, not simply "Microsoft SQL Server".
Below it will be a line with a driver version:
U00003535 DB INFO: 'DRIVER-VER = 9.4.0.0'
This is the version of the JDBC driver located in the Automation Engine /bin/lib directory; it should be as up to date as possible for your database version. Your database vendor should have a full list of JDBC driver versions that are compatible with your database version and your Java version. If you are not on the latest, work with your DBA to ensure the latest driver is installed and used in the engine's /bin/lib directory.
- DB Connection test on 24.3 and higher
Beginning with version 24.3.0, a connection test tool is included in the Automation Engine /bin directory. It performs the same test that runs hourly in 21.0.9 and above, and at startup of all WPs in all versions. To run the test tool from the command line, change to the Automation Engine /bin directory and use the command:
ucdbconntest -?
which will show all options. To run a connection test against the same database the Automation Engine uses, you can simply run the command:
ucdbconntest -T2
This will create two files in the Automation Engine /temp directory, a log and a trace, which contain information about the connection test and can be used to review connection times as discussed in:
Performance problems and an explanation of U00003533
Information on Client 0 warnings DB performance or minimum requirements
Note: This is not a diagnostic tool and will not provide a solution to performance issues. It is a troubleshooting tool that shows how long a connection test takes (in general, it should be under 3 seconds) and should be used together with database-side logging that a DBA analyzes to see where slowdowns occur.
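If you also want a rough wall-clock timing of the test itself (in addition to the details written to the /temp log and trace), a minimal sketch could wrap the call; it assumes it is executed from the Automation Engine /bin directory on a Unix-like system:

import subprocess
import time

# Assumption: run from the Automation Engine /bin directory (use ucdbconntest.exe on Windows)
start = time.monotonic()
result = subprocess.run(["./ucdbconntest", "-T2"])
elapsed = time.monotonic() - start

print(f"ucdbconntest exited with code {result.returncode} after {elapsed:.1f} s")
if elapsed > 3:
    print("Connection test took longer than the ~3 second guideline - involve your DBA")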
In case none of the above leads to an improvement
Please create a trace of the WPs with the database trace flag set to 2 (database=2) and open a case for "Performance Analysis" with Support. Be sure to gather all the logs and traces from all WPs on all nodes of the system. Ideally, the trace is created during a time of performance degradation, and the combined size of all trace files from all WP instances does not exceed 3 GB.
Note: Be sure to share the impact and urgency of these performance issues with us, so that we can prioritize accordingly.