Jobs on virtual machine pools experience several-minute evaluation latency

Products

AutoSys Workload Automation

Issue/Introduction

You experience a latency of several minutes when starting several jobs assigned to a virtual machine pool.

ERROR MESSAGE: "No errors during this time."

SYMPTOMS:

Box runs start on time
Jobs delay evaluation for 7 to 13 minutes instead of the normal 30 seconds
Jobs continue processing normally after the delayed evaluation

CONTEXT: Occurs during job evaluation on virtual machine pools configured with max_load settings

IMPACT: Service level agreements miss targets due to delayed job starts

Environment

AutoSys 12.x, 24.x

Resolution

The perceived delay is not due to a flaw in the scheduler's design but a limitation in the implementation of virtual machines and job loads.
To evaluate whether a job can start on a component machine of the virtual machine, the scheduler needs to know the current load on that machine (the sum of the loads of all jobs currently in the STARTING or RUNNING status).
Internally, the scheduler therefore blocks every other job start evaluation that depends on the same virtual machine until the load is calculated, the job evaluation is completed, and the job is placed into STARTING, which effectively increases the load on the component machine.
As a result, job start conditions cannot be evaluated in parallel for jobs that share the same virtual machine.
If the scheduler must evaluate thousands of job starts to the same virtual machine, the job start evaluation requests will remain queued so that they are evaluated on a first-come, first-served basis.
This means the thousandth job start evaluation request in the virtual machine queue is evaluated only after the 999 requests ahead of it have been evaluated.
The more component machines the virtual machine contains, the longer the scheduler may take to evaluate an individual job start request, especially if the machine method used to select a machine is based on CPU.
In that case, the scheduler must contact every agent in the virtual machine to collect its CPU usage before it can choose a machine.
For example, with 10 agents in the virtual machine, the scheduler makes at least 10 round-trips per evaluation; with 1000 queued requests that is at least 10,000 round-trips, which is how the last requests in a queue of thousands can remain queued for minutes.
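The arithmetic above can be sketched as a rough estimate. This is only an illustration, not the scheduler's actual logic: the function name `estimated_wait_seconds` and the 50 ms round-trip time are assumptions chosen for the example, and the model assumes fully serialized evaluations in which each one contacts every agent exactly once.

```python
def estimated_wait_seconds(queue_position, agents_in_pool, round_trip_seconds):
    """Rough time until the request at queue_position has been evaluated.

    Assumes evaluations are serialized (one at a time) and that each
    evaluation performs one round-trip to every agent in the pool.
    """
    per_evaluation = agents_in_pool * round_trip_seconds
    return queue_position * per_evaluation


# 1000th queued request, 10 agents, assumed 50 ms per agent round-trip
wait = estimated_wait_seconds(1000, 10, 0.05)
print(f"{wait / 60:.1f} minutes")  # prints "8.3 minutes"
```

Under these assumed numbers the last request waits roughly eight minutes. Halving the number of agents contacted per evaluation, or removing the agent round-trips altogether, shortens the estimate proportionally.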

A couple of recommendations:

1) Avoid starting a large number of jobs at the same time.

2) Update the virtual machine definition to specify a machine_method of roundrobin, which is the fastest machine selection criterion.

3) Split up virtual machines with tens of component machines into much smaller virtual machines and spread the jobs evenly across them.

4) Elevate the job priority to 1 to skip ahead of queued priority 2 jobs.

Additional Information

The difference between job_load and roundrobin_jobload machine methods is the following:

For job_load, the scheduler will query the job loads of all machines, then choose a machine at random from the list of machines with the least load.

For roundrobin_jobload, the scheduler will query the job loads of all machines, then choose the next machine with ANY available load beginning after the previous machine assignment.

If all the jobs are assigned the same number of load units, then neither method is expected to choose machines more equitably than the other.
If the jobs vary in their load units, then job_load will be fairer than roundrobin_jobload, since the latter only considers the next machine with available load from where it left off.
In that case, the machine chosen may not be the machine with the least amount of load.

Suppose a virtual machine has 5 component machines, each with a max load of 40 units, and receives a burst of 6 job starts. If all the jobs are assigned 20 units:

Eval 1 under job_load → [0, 20, 0, 0, 0] - select 1 at random from 5 machines having 0 load
Eval 2 under job_load → [0, 20, 0, 20, 0] - select 1 at random from 4 machines having 0 load
Eval 3 under job_load → [20, 20, 0, 20, 0] - select 1 at random from 3 machines having 0 load
Eval 4 under job_load → [20, 20, 0, 20, 20] - select 1 at random from 2 machines having 0 load
Eval 5 under job_load → [20, 20, 20, 20, 20] - select the only machine having 0 load
Eval 6 under job_load → [20, 20, 20, 40, 20] - select 1 at random from 5 machines having 20 load

Eval 1 under rr_jobload → [20, 0, 0, 0, 0] - select 1st machine
Eval 2 under rr_jobload → [20, 20, 0, 0, 0] - select 2nd machine
Eval 3 under rr_jobload → [20, 20, 20, 0, 0] - select 3rd machine
Eval 4 under rr_jobload → [20, 20, 20, 20, 0] - select 4th machine
Eval 5 under rr_jobload → [20, 20, 20, 20, 20] - select 5th machine
Eval 6 under rr_jobload → [40, 20, 20, 20, 20] - select 1st machine

If the 6 jobs are assigned 40, 30, 20, 10, 5, 1 units respectively:

Eval 1 under job_load → [0, 0, 0, 40, 0] - select 1 at random from 5 machines having 0 load
Eval 2 under job_load → [0, 30, 0, 40, 0] - select 1 at random from 4 machines having 0 load
Eval 3 under job_load → [0, 30, 0, 40, 20] - select 1 at random from 3 machines having 0 load
Eval 4 under job_load → [0, 30, 10, 40, 20] - select 1 at random from 2 machines having 0 load
Eval 5 under job_load → [5, 30, 10, 40, 20] - select the only machine having 0 load
Eval 6 under job_load → [6, 30, 10, 40, 20] - select the machine having the least load (5 units)

Eval 1 under rr_jobload → [40, 0, 0, 0, 0] - select 1st machine
Eval 2 under rr_jobload → [40, 30, 0, 0, 0] - select 2nd machine
Eval 3 under rr_jobload → [40, 30, 20, 0, 0] - select 3rd machine
Eval 4 under rr_jobload → [40, 30, 20, 10, 0] - select 4th machine
Eval 5 under rr_jobload → [40, 30, 20, 10, 5] - select 5th machine
Eval 6 under rr_jobload → [40, 31, 20, 10, 5] - select the next machine with available load after cycling over (the 2nd machine, since the 1st is at its max of 40)
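
The two selection rules traced above can be reproduced with a small simulation. This is a minimal sketch based only on the descriptions in this article, not the scheduler's implementation: the function names (`pick_job_load`, `pick_rr_jobload`, `simulate`) are hypothetical, and it assumes that a machine is eligible only if the job's load units still fit under the machine's max load.

```python
import random


def pick_job_load(loads, max_load, job_units, rng):
    """job_load: random choice among the least-loaded machines that fit the job."""
    candidates = [i for i, load in enumerate(loads) if load + job_units <= max_load]
    if not candidates:
        return None
    least = min(loads[i] for i in candidates)
    return rng.choice([i for i in candidates if loads[i] == least])


def pick_rr_jobload(loads, max_load, job_units, last_index):
    """roundrobin_jobload: next machine after last_index that fits the job."""
    n = len(loads)
    for step in range(1, n + 1):
        i = (last_index + step) % n  # wrap around after the last machine
        if loads[i] + job_units <= max_load:
            return i
    return None


def simulate(method, job_units_list, machines=5, max_load=40, seed=0):
    """Return the list of per-machine loads after each job start evaluation."""
    loads = [0] * machines
    rng = random.Random(seed)
    last = -1  # so the first round-robin pick is machine index 0
    history = []
    for units in job_units_list:
        if method == "job_load":
            i = pick_job_load(loads, max_load, units, rng)
        else:
            i = pick_rr_jobload(loads, max_load, units, last)
        if i is None:
            break  # no machine fits; a real scheduler would keep the job queued
        loads[i] += units
        last = i
        history.append(list(loads))
    return history


for loads in simulate("roundrobin_jobload", [40, 30, 20, 10, 5, 1]):
    print(loads)
```

The round-robin run ends at [40, 31, 20, 10, 5], matching the last trace above. The job_load runs pick machines at random, so the order of the loads varies with the seed, but the sorted final loads for the mixed-unit burst are always [6, 10, 20, 30, 40].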