Frequent job terminations and KILLJOB behavior in event_demon logs

Article ID: 409270

Updated On:

Products

Autosys Workload Automation

Issue/Introduction

A job is frequently being killed, and the only apparent reason is that its box has the n(boxjob) condition; no term_run_time is specified. The job only has a box and job_terminator set.

While looking into the frequent job terminations, we noticed a strange behavior in AutoSys. This job, along with other jobs, shows a "connected for KILLJOB <jobname>" entry in the event_demon log files, going as far back as 2023 (the oldest logs available).

Is this "KILLJOB..." message normal? We need help determining why this job is terminated so often, and understanding why KILLJOB is tagged with a job name in the agent connection message.

[09/03/2025 21:03:03]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxa 132.9819.1]
[09/03/2025 21:03:11]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxb 132.9820.1]
[09/03/2025 21:03:18]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxc 132.9822.1]
[09/03/2025 21:03:26]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxd 132.9824.1]

Cause

When a self-looping box is terminated (e.g., manually killed or reaches a TERMINATED state due to internal conditions), the scheduler issues KILLJOB events to clean up any outstanding jobs or instances from that specific run. However, if the box is configured to restart immediately upon SUCCESS or TERMINATED status, it can restart before all of these KILLJOB events have been processed. When the scheduler then processes the delayed KILLJOB events, they can inadvertently target and terminate jobs in the new, already-started iteration of the box, producing a repeating cycle of unexpected terminations.
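To see how often each job is affected, the CAUAJM_I_10082 lines can be tallied per job name. The following Python sketch is only an illustration: the regular expression is inferred from the four sample lines in the Issue section, the date is assumed to be in MM/DD/YYYY order, and the function names (parse_killjob_lines, count_killjobs) are hypothetical helpers, not part of AutoSys. Log layouts in other releases may differ.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the sample lines in this article (an assumption,
# not a documented log format).
KILLJOB_RE = re.compile(
    r"\[(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]\s+"
    r"CAUAJM_I_10082\s+"
    r"\[(?P<machine>\S+) connected for KILLJOB (?P<job>\S+) (?P<run>\S+)\]"
)


def parse_killjob_lines(lines):
    """Yield (timestamp, machine, job, run_id) for each KILLJOB connection line."""
    for line in lines:
        match = KILLJOB_RE.search(line)
        if match:
            # Assumes MM/DD/YYYY ordering in the log timestamp.
            ts = datetime.strptime(match.group("ts"), "%m/%d/%Y %H:%M:%S")
            yield ts, match.group("machine"), match.group("job"), match.group("run")


def count_killjobs(lines):
    """Return a Counter mapping job name to number of KILLJOB connection lines."""
    return Counter(job for _, _, job, _ in parse_killjob_lines(lines))


# The sample lines from the Issue section, used here as test input.
SAMPLE = [
    "[09/03/2025 21:03:03]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxa 132.9819.1]",
    "[09/03/2025 21:03:11]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxb 132.9820.1]",
    "[09/03/2025 21:03:18]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxc 132.9822.1]",
    "[09/03/2025 21:03:26]      CAUAJM_I_10082 [<MACHINE_NAME> connected for KILLJOB xxxxxxxxxxxxxxxd 132.9824.1]",
]

if __name__ == "__main__":
    for job, count in count_killjobs(SAMPLE).most_common():
        print(job, count)
```

To run it against a real log, pass an open file object instead of SAMPLE, for example count_killjobs(open(path)), where path points to the event_demon log for the instance.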

 

Resolution

To prevent this, introduce a controlled delay into the box definition. This ensures the scheduler has sufficient time to process all pending KILLJOB events from a previous run before the box's next iteration begins.

Example Box Definition with Delay:

insert_job: 60sec_job
job_type: cmd
condition: t(outer_box)
command: sleep 60
machine: localhost
description: This job runs for 60 seconds when outer_box terminates.

insert_job: outer_box
job_type: box
condition: s(outer_box) | d(60sec_job,0)
description: This box self-loops on success, or waits for a 60-second delay after termination.

 

Explanation:

  1. 60sec_job: This acts as a delay mechanism. It is configured to run for 60 seconds only when outer_box transitions to a TERMINATED state (t(outer_box)).
  2. outer_box condition:
    • s(outer_box): Allows the box to self-loop immediately if it completes successfully.
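    • d(60sec_job,0): Covers the case where outer_box is terminated. d() is the "done" condition, which is satisfied when 60sec_job completes, regardless of its exit status. The 0 (look-back value) is crucial here; it means a completion of 60sec_job counts only if it occurred after the last run of outer_box, so outer_box waits for the next completion of 60sec_job rather than reacting to an earlier one.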
  3. How it works: If outer_box is terminated, 60sec_job is triggered. outer_box will then pause and wait for 60sec_job to complete its 60-second sleep. This introduced delay provides ample time for the scheduler to process all outstanding KILLJOB events from the previous run of outer_box before the box begins its next iteration, thereby preventing premature termination.
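The timing argument can be illustrated with a toy timeline in Python. This is a simplified model under stated assumptions, not a description of scheduler internals: the offsets at which KILLJOB events are processed are hypothetical, and stale_kills is an illustrative helper name.

```python
def stale_kills(kill_processed_at, next_run_start):
    """Return the KILLJOB processing times that fall at or after the next run's start.

    All times are seconds measured from the moment the box was terminated.
    """
    return [t for t in kill_processed_at if t >= next_run_start]


# Hypothetical offsets (seconds after termination) at which the scheduler
# gets around to processing each leftover KILLJOB event.
KILL_OFFSETS = [0, 8, 15, 23]

if __name__ == "__main__":
    # Box restarts almost immediately: later KILLJOB events hit the new run.
    print("restart after 1s :", stale_kills(KILL_OFFSETS, 1))
    # Box restart gated by the 60-second delay job: nothing is left over.
    print("restart after 60s:", stale_kills(KILL_OFFSETS, 60))
```

In this model, a near-immediate restart leaves three KILLJOB events to be processed after the new iteration has begun, while a 60-second delay leaves none. The appropriate delay for a real instance depends on how long the scheduler actually takes to drain its event queue.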