We noticed that recurring jobs were triggered multiple times. Job names are as follows:
XXXX.SLEEPING_MAAS_XXXXXX
Release: 12.3.5
There was an outage in Automic between 12:15 and 12:29 on 20221111.
The system then recovered on its own, and the same job was triggered 3 times within the same minute.
Trace 24 - WPsrv_log_026_03.txt
Trace 25 - WPsrv_log_028_03.txt
Trace 48 - WPsrv_log_053_03.txt
Trace 54 - WPsrv_log_059_03.txt
Trace 65 - WPsrv_log_071_03.txt
54 - 20221111/124015.331 - U00007000 'XXXX.SLEEPING_MAAS_XXXXXX' activated with RunID '1003151452'.
25 - 20221111/124028.462 - U00007000 'XXXX.SLEEPING_MAAS_XXXXXX' activated with RunID '1003175099'.
48 - 20221111/124029.344 - U00007000 'XXXX.SLEEPING_MAAS_XXXXXX' activated with RunID '1003169156'.
24 - 20221111/124850.085 - U00011002 Job 'XXXX.SLEEPING_XXXXXX' (RunID '1003151452') on Host 'YYYYY' ended normally.
24 - 20221111/124858.394 - U00011002 Job 'XXXX.SLEEPING_XXXXXX' (RunID '1003169156') on Host 'YYYYY' ended normally.
65 - 20221111/124858.449 - U00011002 Job 'XXXX.SLEEPING_XXXXXX' (RunID '1003175099') on Host 'YYYYY' ended normally.
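The duplicate activations can be spotted by grouping the U00007000 messages by job name and minute. The following is a minimal sketch, assuming the log lines have the layout shown in the excerpt above; the function name `duplicate_activations` and the regular expression are illustrative and not part of any Automic tooling.

```python
import re
from collections import defaultdict

# Matches activation messages in the layout shown in the trace excerpt.
ACTIVATION = re.compile(
    r"(?P<date>\d{8})/(?P<time>\d{6})\.\d{3} - U00007000 "
    r"'(?P<job>[^']+)' activated with RunID '(?P<run_id>\d+)'"
)

def duplicate_activations(lines):
    """Return {(job, 'YYYYMMDD/HHMM'): [run_ids]} for jobs activated
    more than once within the same minute."""
    seen = defaultdict(list)
    for line in lines:
        m = ACTIVATION.search(line)
        if m:
            minute = f"{m['date']}/{m['time'][:4]}"  # truncate HHMMSS to HHMM
            seen[(m["job"], minute)].append(m["run_id"])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}

# The three activation lines from the traces above.
sample = [
    "54 - 20221111/124015.331 - U00007000 'XXXX.SLEEPING_MAAS_XXXXXX' activated with RunID '1003151452'.",
    "25 - 20221111/124028.462 - U00007000 'XXXX.SLEEPING_MAAS_XXXXXX' activated with RunID '1003175099'.",
    "48 - 20221111/124029.344 - U00007000 'XXXX.SLEEPING_MAAS_XXXXXX' activated with RunID '1003169156'.",
]

duplicates = duplicate_activations(sample)
for (job, minute), run_ids in duplicates.items():
    print(job, minute, run_ids)
```

For the sample lines, all three RunIDs fall into the minute 20221111/1240, which matches the behavior described above.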
This error was caused by system instability and slowdown. In a normally running system, this cannot happen.
The delayed processing of the messages in the MQWP queues (two and a half minutes) caused the duplicate executions. This will be fixed in a future release.
The job design also contributed to the problem and can be adjusted in one of the following ways.
Scenario 1:
One way to limit this is to set maxparallel=1 at the job level, so that the job cannot run in parallel.
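The effect of such a limit can be illustrated with a toy model. This is only a sketch of the idea, not of how Automic implements it: the class `ParallelLimit` and its methods are hypothetical, and what happens to a refused activation (for example, waiting or aborting) depends on the job's configuration and is not modeled here.

```python
class ParallelLimit:
    """Toy model of a per-job limit on concurrently active runs."""

    def __init__(self, max_parallel=1):
        self.max_parallel = max_parallel
        self.active = set()

    def activate(self, run_id):
        """Accept the run only if the limit is not yet reached."""
        if len(self.active) >= self.max_parallel:
            return False
        self.active.add(run_id)
        return True

    def finish(self, run_id):
        """Mark a run as ended, freeing a slot."""
        self.active.discard(run_id)

# Replay the three activations from the traces with a limit of 1.
limit = ParallelLimit(max_parallel=1)
results = [limit.activate(r) for r in ("1003151452", "1003175099", "1003169156")]
print(results)  # only the first activation is accepted

# Once the first run has ended, the next activation is accepted again.
limit.finish("1003151452")
accepted_after_finish = limit.activate("1003175099")
```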
Scenario 2:
Change the period execution from 'intervals of every XXX mins'
to
'after the previous execution ends plus XXX'
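The difference between the two period settings can be sketched with a small timing model. The numbers are assumptions chosen for illustration: the 5-minute value stands in for the unspecified XXX, and the 8-minute run duration is only roughly in line with the run times visible in the traces. The function names are hypothetical.

```python
def interval_starts(first_start, interval, count):
    """Start times when runs are triggered at fixed intervals,
    regardless of how long each run takes."""
    return [first_start + i * interval for i in range(count)]

def after_end_starts(first_start, gap, durations):
    """Start times when each run is triggered only after the
    previous run has ended, plus a fixed gap."""
    starts = []
    t = first_start
    for d in durations:
        starts.append(t)
        t = t + d + gap
    return starts

def overlapping_runs(starts, durations):
    """Count runs that start before the preceding run has ended."""
    return sum(
        1
        for i in range(1, len(starts))
        if starts[i] < starts[i - 1] + durations[i - 1]
    )

durations = [8, 8, 8]  # minutes per run (assumed)
fixed = interval_starts(0, 5, len(durations))   # every 5 minutes (assumed)
chained = after_end_starts(0, 5, durations)     # previous end plus 5 minutes

print(fixed, overlapping_runs(fixed, durations))
print(chained, overlapping_runs(chained, durations))
```

In this model, the fixed-interval schedule starts runs while earlier ones are still active, whereas the schedule based on the previous end time cannot produce overlapping runs by construction.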