Error Message:
Not able to log into the Automation Engine (AE) during the switch to Daylight Saving Time (DST). No jobs are running.
The primary worker process (WP) is not running. The cause is unknown, but it is believed to be hardware or software related.
Troubleshooting Method:
- Determine what the process is doing (CPU/RAM usage) and what is happening within the database while the issue is occurring.
- Check to see if the WP is running. Can another WP become Primary Worker Process (PWP) on this system?
- If another WP is unable to become PWP before reaching the keep alive timeout, check whether the IP socket is still in use.
- Check all systems after the keep alive timeout (i.e. 10 minutes) to see whether any other WP is able to become PWP on any system.
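The socket check in the list above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `port_in_use` is hypothetical, and the host and port passed to it are placeholders that would have to be taken from your own AE configuration. A successful connection only shows that some process is still listening on that port, not which process it is.

```python
import socket


def port_in_use(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: host and port are assumptions and must come
    from the actual AE server configuration.
    """
    try:
        # create_connection raises OSError (refused, timeout, unreachable)
        # when nothing is accepting connections on the target.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_in_use("ae-host.example", 2270)` (both values are placeholders) before and after the keep alive timeout shows whether the listener of the old PWP has gone away.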
If another WP is able to become PWP, a deadlock may occur, depending on the timing and the time change.
In this example, WP#4 is able to become the PWP on a system after the keep alive timeout.
The new role of the WP#4 process is to act as the new PWP and execute the same transactions as the previous PWP (i.e. WP#1 on the first system).
The WP log file shows clearly that this process was in normal TIMER processing (like the original WP#1). The process must therefore lock rows within the database (DB) to keep the data consistent. Because WP#1 was originally in this transaction, the new PWP (WP#4) may run into a lock held by WP#1. If using MS SQL Server, this lock would have been visible in the MS SQL SERVER BLOCKING SESSION report.
Once WP#1 was killed, the DB transaction was rolled back and the locks it held were released. Only then could the new PWP (WP#4) continue processing. Because of the DB lock from the WP#1 process and the time change happening at the same time, WP#4 may end.
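When MS SQL Server is in use, a lock chain of this kind (one session waiting on a lock held by another) can also be looked for directly. The following Python sketch is an illustration and not part of the original analysis. The query reads the `sys.dm_exec_requests` view, which exposes `session_id` and `blocking_session_id`. The helper name `head_blockers` and the session IDs in the usage note are hypothetical, and running the query (with whatever DB client is available) is left out.

```python
# Query for currently blocked requests on MS SQL Server.
# Run it with any SQL client; executing it is outside this sketch.
BLOCKING_QUERY = """
SELECT session_id, blocking_session_id, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0
"""


def head_blockers(rows):
    """Return session ids that block others but are not blocked themselves.

    rows: iterable of (session_id, blocking_session_id) pairs, as
    returned by the first two columns of BLOCKING_QUERY.
    """
    rows = list(rows)  # allow generators; the data is scanned twice
    blocked = {sid for sid, _ in rows}
    # Ignore 0/None and the negative special values SQL Server can report.
    blockers = {bid for _, bid in rows if bid and bid > 0}
    return sorted(blockers - blocked)
```

For example, with rows `[(61, 52), (70, 61)]` (made-up session IDs), `head_blockers` returns `[52]`: session 52 is at the head of the chain, which is the role WP#1 plays in the scenario described here.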
In the above example, we cannot determine why WP#1 stopped, because no detailed information is available from the first outage (the failure of the PWP, WP#1). We can only guess that the time change was a contributing factor; this cannot be proven without detailed information.
To determine why the original PWP (WP#1) stopped and why the new PWP (WP#4) ended during the time change, the information listed below would need to be gathered and analyzed.
Information required for analysis:
- Check CPU usage of CP/WP process - Outcome
- Check RAM usage of CP/WP process - Outcome
- Check log files of CP/WP process - Outcome
- Check overall CPU usage on machines involved - Outcome
- Check DB server for CPU usage - Outcome
- Check DB server for RAM usage - Outcome
- Check DB for Blocking sessions - Outcome
- Check DB vendor related MUST settings, for example:
  - MS SQL: SNAPSHOT ISOLATION / LOCK ESCALATION
  - TEMPDB must use as many files as there are CPUs available on the DB server
  - ORACLE: check REDO file size and how often a switch occurs
  - ORACLE: check SGA settings (min. is 16GB)
  - DB2 LUW: check all settings from our documentation (MUST be set)
  - If DEADLOCKs occur in DB2, do not panic; just run RUNSTATS once on the involved tables
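The CPU and RAM items in the checklist are only useful if the values are captured while the issue is occurring. The following Python sketch shows one possible way to do that on a Linux or UNIX host by parsing the output of `ps -eo pid,pcpu,rss,comm`. The function names are hypothetical, and the process names `ucsrvwp` and `ucsrvcp` used in the usage note are assumptions that must be adjusted to the actual binary names on your system.

```python
import subprocess


def parse_ps(text, names):
    """Parse `ps -eo pid,pcpu,rss,comm` output into a list of samples.

    Only processes whose command contains one of `names` are kept.
    RSS is reported by ps in KiB and converted to MiB here.
    """
    samples = []
    for line in text.splitlines()[1:]:  # first line is the header
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        pid, pcpu, rss_kb, comm = parts
        comm = comm.strip()
        if any(name in comm for name in names):
            samples.append({
                "pid": int(pid),
                "cpu_percent": float(pcpu),
                "rss_mb": int(rss_kb) / 1024,
                "command": comm,
            })
    return samples


def sample_processes(names):
    """Take one sample from the local machine (requires the ps command)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, names)
```

Calling `sample_processes(("ucsrvwp", "ucsrvcp"))` at a fixed interval around the time change, and writing each result with a timestamp, gives values that can be recorded as the Outcome of the CPU and RAM checks.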