More jobs are executing than the DQM limit allows



Article ID: 91859



Products

CA Automic Dollar Universe

Issue/Introduction

The DQM job limit is used to limit the number of jobs executing in parallel.
In some cases, the number of jobs executing in parallel exceeds the configured limit, for example 2 executions when the limit is 1.

Another symptom is that the counters displayed for JobExe and JobPend are wrong (for example, JobExe displays -1).

In that case, in UVC > Monitoring > Batch Queue Status, the "Pending" column displays 1 instead of 0 when no jobs are queued.

IMPORTANT: When no jobs are executing in the queue, the "Running" column shows -1 (minus one).
On the command line, the command "uxlstque" or "uxshwque queue=<queue_name> full" likewise shows -1 in the JOBEXE column.

Example:
On the command line, uxshwque shows the issue (JobExe counter at -1):

user@server:/automic/DUAS/node/bin#./uxshwque queue=SYS_BATCH
Queue SYS_BATCH
      JobLim 150
      JobQue 0 , JobExe -1 , JobHld 0 , JobPend 1

user@server:/automic/DUAS/node/bin#./uxlstque queue=SYS_BATCH

QUEUE NAME                      TYPE STA  JOBLIM  JOBQUE  JOBEXE  JOBHLD JOBPEND
--------------------------------------------------------------------------------
SYS_BATCH                       PHYS ON      150       0      -1       0       1
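
A check like the one above can be automated by parsing the uxlstque listing. The following is a minimal sketch, assuming the output keeps the column layout shown in this example and that queue names contain no spaces; the function names are hypothetical and not part of Dollar Universe.

```python
SAMPLE = """\
QUEUE NAME                      TYPE STA  JOBLIM  JOBQUE  JOBEXE  JOBHLD JOBPEND
--------------------------------------------------------------------------------
SYS_BATCH                       PHYS ON      150       0      -1       0       1
"""

COUNTERS = ("joblim", "jobque", "jobexe", "jobhld", "jobpend")


def parse_uxlstque(output):
    """Parse a uxlstque listing into one dict per queue row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # A data row has name, type, status and five integer counters.
        if len(fields) != 8:
            continue
        try:
            values = [int(field) for field in fields[3:]]
        except ValueError:
            continue  # not a data row
        row = {"name": fields[0], "type": fields[1], "status": fields[2]}
        row.update(zip(COUNTERS, values))
        rows.append(row)
    return rows


def negative_jobexe_queues(output):
    """Return the names of queues whose JOBEXE counter is below zero."""
    return [row["name"] for row in parse_uxlstque(output) if row["jobexe"] < 0]


if __name__ == "__main__":
    print(negative_jobexe_queues(SAMPLE))
```

Any queue returned by negative_jobexe_queues shows the -1 symptom described in this article.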

In UVC - Batch Queues Status, the Pending counter displays 1, while UVC - Queued Jobs displays no pending jobs.

Another symptom is that the JobPend counter displays a wrong (too high) value.

A further symptom is the following error appearing continuously (every 30 seconds) in universe.log after a server crash:
|ERROR|X|DQM|pid=p.t| u_dqm_vrf_job             | kill(PID_JOB, 0) - EPERM

Here, PID_JOB is the process ID recorded in the file /duas_folder/data/u_jobfile.dta at position 298.
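
The error text suggests that the job's recorded PID is being checked with signal 0. On POSIX systems, signal 0 delivers nothing and only tests whether the process exists and whether the caller may signal it; EPERM means a process with that PID exists but the caller is not permitted to signal it. The sketch below reproduces that check in Python for illustration only. The helper name probe_pid is hypothetical and not part of Dollar Universe, and the reading of what DQM does internally is an assumption based on the log line.

```python
import os


def probe_pid(pid):
    """Classify a PID with signal 0 (POSIX existence and permission check).

    Hypothetical helper for illustration, not a Dollar Universe API.
    """
    try:
        os.kill(pid, 0)  # signal 0: nothing is delivered, only checks run
    except ProcessLookupError:
        return "gone"        # ESRCH: no process has this PID
    except PermissionError:
        return "not-owned"   # EPERM: PID exists, caller may not signal it
    return "running"         # PID exists and the caller may signal it


if __name__ == "__main__":
    # The current process can always signal itself.
    print(probe_pid(os.getpid()))
```

A result of "not-owned" for a PID stored in u_jobfile.dta would correspond to the EPERM case reported in the log.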

 

Environment

Dollar Universe node using DQM Job Limits.

Cause

The DQM queue counters become wrong (executing is -1, or pending is wrongly high) if job terminations and job submissions occur at the same time. This problem has been fixed in the product.

Resolution

Workaround

If any of the symptoms or messages described above appear, reinitialize the affected DQM queue(s).

1. After loading the Dollar Universe environment, execute the following command on the command line:
uxresetque queue=<queue_name> 

2. To prevent the issue from happening again, increase the "DQM send cycle" value on the node where the logical queue resides:
Node Settings - DQM Settings:

DQM send cycle: change from 120 to 864000

Save and restart the node for the modification to take effect.
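
When several queues are affected, step 1 can be scripted. The sketch below builds one uxresetque command per queue and only executes them when dry_run is disabled. It assumes the Dollar Universe environment is already loaded so that uxresetque is on the PATH; the function names are hypothetical, and the only command syntax used is the one shown in step 1.

```python
import subprocess


def reset_command(queue_name):
    """Build the argv for resetting one queue (syntax taken from step 1)."""
    if not queue_name or any(ch.isspace() for ch in queue_name):
        raise ValueError(f"invalid queue name: {queue_name!r}")
    return ["uxresetque", f"queue={queue_name}"]


def reset_queues(queue_names, dry_run=True):
    """Return the reset commands; run them only when dry_run is False.

    Assumption: the Dollar Universe environment is loaded in the calling
    shell, so that uxresetque can be found on the PATH.
    """
    commands = [reset_command(name) for name in queue_names]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # raises if a reset fails
    return commands


if __name__ == "__main__":
    # Dry run: show what would be executed without touching any queue.
    for command in reset_queues(["SYS_BATCH"]):
        print(" ".join(command))
```

Keeping the dry run as the default makes it possible to review the list of queues before any counter is reinitialized.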

Solution

The issue should no longer occur under normal circumstances if the node is on version 6.10.01 or later. However, the workaround must still be applied if any of the symptoms described in this article appear, or if the errors described here appear in universe.log.

Additional Information

Since version 6.10.01, the following kind of error is written to the log to report the "negative counter" problem. It is nothing to worry about:

 |ERROR|X|DQM|pid=p.t| u_dqm_upd_numbers_que     | u_dqm_end_job: [ENTRY:XXXX] Nbexe=(-1) is negative for queue: NAME_OF_QUEUE

Or, after a system crash, this error may appear every 30 seconds in the log:

|ERROR|X|DQM|pid=p.t| u_dqm_vrf_job             | kill(PID_JOB, 0) - EPERM

If this occurs, stop the Launcher when no jobs are running and reset the impacted queue as explained in the Resolution.
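
To find out which queues and PIDs are concerned, the two messages can be searched for in universe.log. The sketch below is one possible way to do this, assuming the log lines have exactly the shape quoted in this article; the regular expressions and function name are illustrative, and the log path must be supplied by the reader.

```python
import re

# Patterns derived from the two messages quoted in this article.
NEGATIVE_COUNTER = re.compile(r"Nbexe=\((-?\d+)\) is negative for queue: (\S+)")
EPERM_CHECK = re.compile(r"u_dqm_vrf_job\s*\|\s*kill\((\w+), 0\) - EPERM")


def scan_dqm_errors(lines):
    """Collect queue names and PIDs mentioned in DQM counter errors."""
    queues = set()
    pids = set()
    for line in lines:
        match = NEGATIVE_COUNTER.search(line)
        if match:
            queues.add(match.group(2))
        match = EPERM_CHECK.search(line)
        if match:
            pids.add(match.group(1))
    return {"queues": sorted(queues), "pids": sorted(pids)}


if __name__ == "__main__":
    sample = [
        " |ERROR|X|DQM|pid=p.t| u_dqm_upd_numbers_que     | u_dqm_end_job: "
        "[ENTRY:XXXX] Nbexe=(-1) is negative for queue: NAME_OF_QUEUE",
        "|ERROR|X|DQM|pid=p.t| u_dqm_vrf_job             | kill(PID_JOB, 0) - EPERM",
    ]
    print(scan_dqm_errors(sample))
```

For a real log, pass an open file object instead of the sample list, since the function accepts any iterable of lines.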