The issue is typically this: a past run got stuck, a user force-started the job to get
things going again, and then the agent was later restarted and sent back the older status
from the run that got stuck. Support needs the raw events from the database (ujo_proc_event)
for the job in question, plus the agent logs and event_demon logs, to review and confirm.
If the events are still in the database, we can confirm this by running
a query that lists all the events for that job, ordered by event_time_gmt
(the time the event was to be processed) and including the run_num.
We expect to see an older run_num on the unexpected RUNNING event.

select *
from aedbadmin.ujo_proc_event
where joid = <joid_for_job_in_question>
order by event_time_gmt;
If the events have already been archived, check the
$AUTOUSER/archive/archived_events.$AUTOSERV files created after the day in question
and grep them for the joid. We can try to confirm the run_nums that way.
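A minimal shell sketch of that search, using demo data in /tmp so it is self-contained; the joid (1001), file name, and archived-event line format here are illustrative assumptions, not the real archive layout, so adjust the grep pattern to match your actual files:

```shell
# Demo data standing in for $AUTOUSER/archive/archived_events.$AUTOSERV files;
# the real archived-event line format may differ -- adjust the pattern to match.
mkdir -p /tmp/archive_demo
cat > /tmp/archive_demo/archived_events.ACE <<'EOF'
1001 CHANGE_STATUS RUNNING run_num=41
2002 CHANGE_STATUS SUCCESS run_num=7
1001 CHANGE_STATUS SUCCESS run_num=41
EOF

# Grep every archive file for the joid in question (1001 here), then check
# the run_num on any unexpected RUNNING event.
grep -H '^1001 ' /tmp/archive_demo/archived_events.*
```

Against the real archive files you would point the same grep at $AUTOUSER/archive and search for the actual joid.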
If we can confirm the run_num is/was old, we then track down when that run
originally occurred. You would need to search the older event_demon logs for the
<joid>.<run_num> pair. Once we have that file we can see what occurred.
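Locating that log file can be sketched as follows, again with self-contained demo data; the path, file names, and log-line format are assumptions for illustration (saved event_demon logs often live under $AUTOUSER/out, but verify in your environment):

```shell
# Demo files standing in for saved event_demon logs; both the directory
# layout and the log-line format here are assumptions for illustration.
mkdir -p /tmp/edemon_demo
echo 'CHANGE_STATUS RUNNING job: sample_job 1001.41' > /tmp/edemon_demo/event_demon.ACE.0701
echo 'STARTJOB job: other_job 2002.7' > /tmp/edemon_demo/event_demon.ACE.0702

# List which saved log mentions the <joid>.<run_num> pair (1001.41 here);
# that is the log to review for what happened on the original run.
grep -l '1001\.41' /tmp/edemon_demo/event_demon.*
```

The dot is escaped in the pattern so 1001.41 does not also match, say, 1001941.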
Did the job get stuck in STARTING? Was there a problem communicating with the agent?
Did someone manually toggle the status of the job or force-start the stuck job (the most likely case)?
Was it a day when some DR testing was being done, etc.?
The last part is typically the hardest to determine: why did the agent
not send the event originally? For that we would need the event_demon log and
the agent logs, including its transmitter logs, from the day that specific
<joid>.<older run_num> run occurred.
Most of the time the agent logs have long since been overwritten, so we end up
running out of evidence to review.