Root cause investigation for Automation Engine outage / freeze / unavailability


Article ID: 88388


Updated On:

Products

CA Automic Workload Automation - Automation Engine

Issue/Introduction

Root cause investigation for Automation Engine outage / freeze / unavailability

This article helps determine the root cause of Automation Engine outages that appear as a freeze or unavailability of the Engine, or at least narrow down the possible root causes. It also describes how to prepare the information that is required in case a ticket with product support is opened.
 

Symptoms and characteristics

The following symptoms or characteristics are mandatory for Automation Engine outages covered in this article:
  1. Server processes (CP, WP, JWP) are up and running on operating system level.
    That means no process crashed or terminated unexpectedly on the application server(s).
  2. Some or all server processes no longer write messages to their log files, or they write the same message, or block of messages, over and over again.
  3. Processing of tasks stopped completely or almost completely.
    That means no more jobs, workflows, etc. are activated, generated or continued.
 
The following symptoms or characteristics may occur as well; however, they are not mandatory for situations covered by this article:
  • Users are unable to log on to the system.
    They receive an error message, a timeout or even no response from the Engine.
  • Agents are disconnecting and may reconnect with or without success.
  • High CPU usage of one (or more) server processes (CP, WP, JWP).
  • High memory consumption of one (or more) server processes (CP, WP, JWP).
 

Investigation steps

Evaluate if the Server processes (CP, WP, JWP) are still up and running on operating system level.

This corresponds to symptom 1 described above.
  • In case the Service Manager is used, the Service Manager Dialog makes it easy to check this quickly. Its log file is also a good place to check what happened to the processes in the past.
    (!) The Service Manager log files are required for a support ticket.
  • Otherwise, use any in-house mechanism that exists for this purpose. Of course, the operating system commands or tools can be used as well.
    For instance, the “Task Manager” and “Event Viewer” on Windows platforms or the “ps” command on UNIX/Linux, etc.
    (!) For a support ticket, describe how this was carried out and provide the results.
Note: In case of a multi-node setup (Server processes running on two or more application servers), the above needs to be done on every node, of course!
 

Check if the Server processes (CP, WP, JWP) are still issuing messages into their log files.

This corresponds to symptom 2 described above.
Check if some or all server processes have stopped logging at a specific timestamp, or if the same message or block of messages is logged over and over again.
This can be done easily with any file viewer or editor. Within Automic, the preferred tool for this is the so-called “RSView” tool, which can also be found in the Automation Engine image delivery in the folder “Tools\no_supp”.
(!) The log files of all Server processes (CP, WP, JWP) from all nodes are required for a support ticket.
For V12 upgrades, ensure that the user starting the processes has an environment that references Java 8 *first* (before Java 7, if installed), and that Java 7 is not being utilized. If this is not the case, the JWP/JCP processes may not start or continue running. 
Note: In case Server processes are distributed across two or more application servers, this has to be done on each of them!
 

Verify if the processing of tasks has stopped completely or almost completely

This corresponds to symptom 3 described above.
If logging on to the system via the User Interface is still possible, check in the Process Monitoring perspective (Activity Window) whether the processing of tasks has stopped completely or almost completely.
If symptom 2 was already detected, it is very likely that processing is malfunctioning.
 

Determine if there are any locks on the database

If at least the three “mandatory” symptoms could be confirmed, the outage is often caused by a lock in the database. Therefore it is very important to find out which kind of lock exists and which database session is the top locker / holds the lock.
 
At this stage of the investigation the database administrator should be contacted! The DBA knows how to determine this quickly and properly.
 
However, here is an example of how this can be done on Oracle databases.
The example is based on this reference: https://www.oraclerecipes.com/monitoring/find-blocking-sessions
Basically, all three recipes produce the same result in different formats:
-- http://www.oraclerecipes.com/monitoring/find-blocking-sessions/
-- Recipe #1 - find blocking sessions with v$session
--
SELECT
  s.blocking_session, s.sid, s.serial#, s.seconds_in_wait
FROM
  v$session s
WHERE
  blocking_session IS NOT NULL;
--
-- Recipe #2 - find blocking sessions using v$lock
--
SELECT
  l1.sid || ' is blocking ' || l2.sid blocking_sessions
FROM
  v$lock l1, v$lock l2
WHERE
  l1.block = 1 AND
  l2.request > 0 AND
  l1.id1 = l2.id1 AND
  l1.id2 = l2.id2;
--
-- Recipe #3 - blocking sessions with all available information
--
SELECT
  s1.username || '@' || s1.machine
  || ' ( SID=' || s1.sid || ' )  is blocking '
  || s2.username || '@' || s2.machine || ' ( SID=' || s2.sid || ' ) ' AS blocking_status
  FROM v$lock l1, v$session s1, v$lock l2, v$session s2
  WHERE s1.sid=l1.sid AND s2.sid=l2.sid
  AND l1.BLOCK=1 AND l2.request > 0
  AND l1.id1 = l2.id1
  AND l1.id2 = l2.id2;
 
This query can be used to map the SID to the operating system process ID:
SELECT process, machine, s.osuser, s.program, sid
FROM
  v$process p, v$session s
WHERE
  p.addr = s.paddr
ORDER BY
  sid;
 

On Microsoft SQL Server databases there is the so-called “Activity – All Blocking Transactions” report.
This is based on this reference: https://support.microsoft.com/en-au/help/224453/inf-understanding-and-resolving-sql-server-blocking-problems
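Alternatively, the current blocking chain can be queried directly. The following is a minimal sketch (not taken from the referenced article) that lists the blocked sessions together with the session blocking them, using the sys.dm_exec_requests view:
--
-- Find sessions that are currently blocked and their blocker
--
SELECT
  r.session_id          AS blocked_session,
  r.blocking_session_id AS blocking_session,
  r.wait_type,
  r.wait_time,
  r.wait_resource
FROM
  sys.dm_exec_requests r
WHERE
  r.blocking_session_id <> 0;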

Note: This investigation might only be possible while the outage situation persists / the DB lock is still held.
 

Check the process which is related to the blocking session

Once the process blocking the system (= holding the top blocking / locking session) has been identified, it is useful to find out what is going on with that process before killing the session or the process!
It is most likely a server process (CP, WP, JWP); however, it can be another application too.
 
At this stage of the investigation the server administrator responsible for the server where the process runs should be contacted! The admin knows how to determine this quickly and properly.
 
However, here are some examples of how this can be done:
  • On Windows servers, tools like “Task Manager”, “Resource Monitor” or “Event Viewer” can be used to look up the CPU usage, memory consumption, file I/O, network traffic or any system error messages of the process. Furthermore, it is also possible to “debug” the process.
  • On UNIX / Linux servers, similar tools might be available; at least commands like “top” or “truss” can be used.
(!) Everything determined in this investigation stage is very helpful information when a support ticket is opened. It can help to find the real root cause and hopefully also a permanent solution.
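In addition to the operating system view, it can help to see what the blocking database session is currently executing. The following Oracle query is a minimal sketch for this; :blocking_sid is a placeholder for the SID of the blocker determined with the queries above (if the session is currently idle, the SQL text may be empty):
--
-- Show what the blocking session is currently executing (placeholder: :blocking_sid)
--
SELECT
  s.sid, s.serial#, s.status, s.osuser, s.machine, s.program, q.sql_text
FROM
  v$session s
  LEFT JOIN v$sql q
    ON q.sql_id = s.sql_id
   AND q.child_number = s.sql_child_number
WHERE
  s.sid = :blocking_sid;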

 

Environment


 

Resolution

Resolution of the incident


Once all data for a detailed root cause investigation has been backed up, the first attempts to restore the service can be made.
Which action this should be depends on the result of the investigation steps above.
Possible actions are: stopping the process that holds the lock; killing the blocking session; restarting the database; restarting the application; restarting the application server; etc.
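For example, if the decision is to kill a blocking session on an Oracle database, it can be terminated using the SID and serial# determined during the investigation. This is only an illustrative sketch; coordinate such actions with the DBA and replace the placeholders with the actual values:
--
-- Kill the blocking session (placeholders: sid, serial#)
--
ALTER SYSTEM KILL SESSION 'sid,serial#' IMMEDIATE;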
 

Further debugging in case server processes caused the lock


In case an Automation Engine Server process (CP, WP, JWP) caused the lock, it might be necessary to create a trace of the Server process. This trace needs to be started just before the situation occurs.
That means it is most likely necessary that the issue can be reproduced on purpose, that it occurs at a known frequency, or that it persists.
In case the issue occurs only randomly, it will be hard to generate the required traces due to the unpredictability. However, without these trace files a root cause analysis is not feasible.
The initial trace level should be TCP/IP=2 and database=4.

(!) The Automation Engine log and trace files are required to open a support ticket.