Job running when not scheduled - unexpected RUNNING event when job is already complete

Article ID: 243834

Products

CA Workload Automation AE

Issue/Introduction

A job received an extra RUNNING event long after the job had gone to SUCCESS.

How can this happen?

Environment

Release : 11.3.6

Component : CA Workload Automation AE (AutoSys)

Resolution

Typically, a past run got stuck and a user force-started the job to get things going again.
Later the agent was restarted and then sent back the old status from the run that had
gotten stuck. To review and confirm this, Support needs the raw events from the database
(ujo_proc_event) for the job in question, along with the agent logs and the event_demon logs.

If the events are still in the database, this can be confirmed by running
a query that lists all the events for that job, ordered by event_time_gmt
(the time the event was to be processed) and including the run_num.
The unexpected RUNNING event is expected to carry an older run_num than the events around it.

select 
eoid||','||event||','||event_time_gmt||','||evt_num||','||
job_ver||','||joid||','||mach_name||','||ntry||','||orig_evt_num||','||
over_num||','||que_status||','||
to_char(que_status_stamp,'mm/dd/yyyy hh24:mi:ss')||','||run_num||','||status
from aedbadmin.ujo_proc_event where joid = <joid_for_job_in_question> 
order by event_time_gmt;
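
For a job with many events, spotting the out-of-place run_num by eye can be tedious. The following is a minimal Python sketch, not part of the product, that reads the comma-separated rows produced by the query above (assumed to have been saved as plain text, one row per line, in the query's order) and reports rows whose run_num is lower than one already seen. The function name `find_stale_events` is hypothetical, and the sketch assumes that none of the column values contain a comma.

```python
import csv
from typing import Iterable, List

# Column order follows the select list of the query above.
COLUMNS = [
    "eoid", "event", "event_time_gmt", "evt_num", "job_ver", "joid",
    "mach_name", "ntry", "orig_evt_num", "over_num", "que_status",
    "que_status_stamp", "run_num", "status",
]


def find_stale_events(lines: Iterable[str]) -> List[dict]:
    """Return rows whose run_num is lower than a run_num seen earlier.

    Assumption: the lines are already ordered by event_time_gmt,
    as the ORDER BY clause of the query produces them.
    """
    stale = []
    highest_run = None
    for fields in csv.reader(lines):
        if len(fields) != len(COLUMNS):
            continue  # skip headers, blank lines, and malformed rows
        row = dict(zip(COLUMNS, (f.strip() for f in fields)))
        try:
            run_num = int(row["run_num"])
        except ValueError:
            continue  # run_num is not numeric; ignore the row
        if highest_run is not None and run_num < highest_run:
            stale.append(row)  # event belongs to an older run
        else:
            highest_run = run_num
    return stale
```

Passing an open file handle of the saved query output to `find_stale_events` returns the candidate rows as dictionaries keyed by column name. Any row it reports still needs to be checked against the logs as described below.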

If the events have already been archived, check the
$AUTOUSER/archive/archived_events.$AUTOSERV files that were created after the day in question and grep them for the joid.

The run_nums can then be confirmed from the archived events.

If we can confirm that the run_num is old, the next step is to track down when that run
originally occurred. Check the older event_demon logs for the <joid>.<run_num>.
Once we have the right log file, we can see what occurred.
Did the job get stuck in STARTING? Was there a problem communicating with the agent?
Did someone manually change the status of the job or force-start the stuck job (most likely the case)?
Was it on a day when DR testing was being done?
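
The log search described above can be sketched in Python as follows. This is an illustrative helper only: the function name `find_run_in_logs` and its parameters are hypothetical, and the caller supplies the directory and file name pattern of the saved event_demon logs. The match requires the <joid>.<run_num> token not to be surrounded by other digits, so that, for example, 42.7 does not also match 142.7 or 42.70.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def find_run_in_logs(log_dir: str, pattern: str, joid: int,
                     run_num: int) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, line) for lines mentioning joid.run_num.

    log_dir and pattern are supplied by the caller; nothing here assumes
    a particular log location or naming scheme.
    """
    # Reject matches that are part of a longer number on either side.
    token = re.compile(rf"(?<!\d){joid}\.{run_num}(?!\d)")
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if token.search(line):
                    yield path.name, lineno, line.rstrip("\n")
```

Each hit gives the file and line where the older run appears, which is the starting point for answering the questions above.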

The last part is typically the most difficult to determine: why did the agent
not send the event originally? Answering that requires the event_demon log and the
agent logs, including the agent's transmitter logs, from the day when that specific
<joid>.<older run_num> occurred.
In most cases the agent logs have long since been overwritten, so we end up running out of
evidence to review.