
When a node is restarted, one or more launches go to aborted status even though the script keeps running


Article ID: 201499


Updated On:


CA Automic Dollar Universe


When a node is restarted, DQM can lose the reference to the PIDs of some of the running jobs.

The affected jobs then end as aborted in Dollar Universe, although they are still running on the system.

The following messages appear in universe.log for each occurrence (no trace level needs to be activated). The timestamp matches the end time of the job in the Console 'Job Run' panel.


| 2020-10-08 19:49:09 |ERROR|X|DQM|pid=21135.2474| owls_dqm_job_end          | u_dqm_end_job returns 3

| 2020-10-09 00:34:24 |ERROR|X|DQM|pid=21135.6323| owls_dqm_job_end          | u_dqm_end_job returns 3

| 2020-10-09 00:37:32 |ERROR|X|DQM|pid=21135.6336| owls_dqm_job_end          | u_dqm_end_job returns 3
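To list every occurrence quickly, the log can be scanned for the message above. The following is a minimal sketch in Python, not a vendor tool: the function name `find_lost_job_errors` is hypothetical, and the field positions are assumed from the three sample lines shown in this article.

```python
from datetime import datetime

# Message text taken from the sample log lines in this article.
MARKER = "u_dqm_end_job returns 3"


def find_lost_job_errors(lines):
    """Return (timestamp, pid_field) for each DQM ERROR line containing MARKER.

    Assumed layout (from the samples above):
    | timestamp |level|X|component|pid=...| function | message
    """
    hits = []
    for line in lines:
        if MARKER not in line:
            continue
        # The line starts with "|", so the first split field is empty.
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 8 or fields[2] != "ERROR" or fields[4] != "DQM":
            continue
        try:
            when = datetime.strptime(fields[1], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines whose timestamp does not parse
        hits.append((when, fields[5]))
    return hits


# The three sample lines from this article.
SAMPLE = [
    "| 2020-10-08 19:49:09 |ERROR|X|DQM|pid=21135.2474| owls_dqm_job_end          | u_dqm_end_job returns 3",
    "| 2020-10-09 00:34:24 |ERROR|X|DQM|pid=21135.6323| owls_dqm_job_end          | u_dqm_end_job returns 3",
    "| 2020-10-09 00:37:32 |ERROR|X|DQM|pid=21135.6336| owls_dqm_job_end          | u_dqm_end_job returns 3",
]

for when, pid_field in find_lost_job_errors(SAMPLE):
    print(when.isoformat(sep=" "), pid_field)
```

To scan a real log, pass an open file object for universe.log instead of `SAMPLE`; the location of that file depends on the installation. The returned timestamps can then be compared with the job end times shown in the 'Job Run' panel.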



Dollar Universe 6.9


After a Dollar Universe node restart, a job that was still running could end as aborted without having actually ended.
This was caused by a system error when checking a child process during a fork procedure. To avoid this issue, the job status check is now performed again on the next DQM check cycle.
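The idea of deferring the decision to the next check cycle can be illustrated as follows. This is only a conceptual sketch in Python and not DQM's actual code: the function `run_check_cycle`, the `probe` callback, and the state names are all hypothetical.

```python
def run_check_cycle(jobs, probe):
    """Run one status-check cycle over the tracked jobs.

    jobs  : dict mapping a job id to its state ("running" or "ended").
    probe : callable(job_id) returning True if the process is alive,
            False if it has ended; it raises OSError on a system error.
    """
    for job_id, state in jobs.items():
        if state != "running":
            continue
        try:
            alive = probe(job_id)
        except OSError:
            # A system error says nothing about the job itself:
            # leave the state unchanged and check again next cycle.
            continue
        if not alive:
            jobs[job_id] = "ended"


# Example: the first probe fails with a system error, the second succeeds.
attempts = {"count": 0}


def flaky_probe(job_id):
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise OSError("transient error while checking child process")
    return True


jobs = {"job-1": "running"}
run_check_cycle(jobs, flaky_probe)  # error: state is left untouched
run_check_cycle(jobs, flaky_probe)  # retry: process is found alive
print(jobs)
```

The point of the design is that a failed check is treated as "unknown" rather than as "ended", so a transient error during a restart does not change the job's state.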



This bug is corrected in version 6.10.11. Upgrade to this version or a later one to fix the problem.