We have jobs failing with a "Lost control" message. What does it mean?
jr TEST_JOB -d -r -1
Job Name Last Start Last End ST Run/Ntry Pri/Xit
________________________________________________________________ ____________________ ____________________ __ ________ _______
TEST_JOB 08/11/2016 19:04:54 08/11/2016 20:40:56 SU 8280804/1 0
Status/[Event] Time Ntry ES ProcessTime Machine
-------------- --------------------- -- -- --------------------- ----------------------------------------
STARTING 08/11/2016 19:04:54 1 PD 08/11/2016 19:04:54 WA_AGENT
RUNNING 08/11/2016 19:04:54 1 PD 08/11/2016 19:04:55 WA_AGENT
<Executing at WA_AGENT>
SUCCESS 08/11/2016 20:40:56 1 PD 08/11/2016 20:40:57 WA_AGENT
RUNNING 08/12/2016 09:42:01 1 PD 08/12/2016 09:42:02
FAILURE 08/12/2016 09:42:01 1 PD 08/12/2016 09:42:03
<Lost control>
[*** ALARM ***]
JOBFAILURE 08/12/2016 09:42:03 1 PD 08/12/2016 09:42:04 WA_AGENT
A 'Lost control' message means that the agent can no longer find the process ID (PID) of the spawned process that was running the job. This usually happens when the agent is restarted and the job's process exited while the agent was stopped.
When a job starts, the agent creates or updates an odb file in its "database" directory; this file contains the job status. If the agent is stopped or crashes, these files are checked when the agent restarts. This allows the agent to re-synchronize with the scheduler and send the correct status. By default, the agent does a warm start and performs this check.
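The restart check described above can be illustrated with a minimal Python sketch. This is not the agent's actual implementation: the record layout (`job`, `pid`), the status labels, and the function names are all hypothetical, and the liveness test assumes a POSIX system. It only shows the general idea of comparing persisted PIDs against the processes that still exist.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def reconcile(records, is_alive=pid_is_alive):
    """Classify hypothetical persisted job records after a restart.

    records: iterable of dicts with the assumed keys "job" and "pid".
    Returns a dict mapping each job name to "STILL_RUNNING" or "LOST_CONTROL".
    """
    report = {}
    for rec in records:
        if is_alive(rec["pid"]):
            report[rec["job"]] = "STILL_RUNNING"
        else:
            report[rec["job"]] = "LOST_CONTROL"
    return report
```

A PID check of this kind cannot tell how a vanished process ended, and PIDs can be reused by unrelated processes, which is one reason a job whose process disappeared during an outage can only be reported as lost.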
This agent behavior can be customized by editing agentparm.txt:
The oscomponent.noguardianprocess parameter tells the agent to return the status of the jobs that were either active or inactive when the agent was recycled. If this parameter is set to false, the agent does not return this status. If the agent does a cold start (persistence.coldstart=true), all the information in the database is deleted, no messages are sent, and the oscomponent.noguardianprocess parameter is ignored.
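As an illustration, the two parameters described above might appear in agentparm.txt as follows. The values shown are examples only, not recommendations, and the `#` comment lines are an assumption about the file's comment syntax; check the defaults and syntax for your agent version before changing anything.

```
# Illustrative agentparm.txt fragment (values are examples)

# Return the status of jobs that were active or inactive when the agent was recycled
oscomponent.noguardianprocess=true

# Warm start: keep the persisted job information across restarts.
# With persistence.coldstart=true the stored information is deleted
# and oscomponent.noguardianprocess is ignored.
persistence.coldstart=false
```

Changes to agentparm.txt normally require an agent restart to take effect.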