When a node is restarted, one or more launches go to Aborted even though the script keeps running

Article ID: 201499

Products

CA Automic Dollar Universe

Issue/Introduction

When a node is restarted, DQM can lose the reference to the PIDs of some of the running jobs.

The affected jobs then end as Aborted in Dollar Universe, although they are still running on the system.

The following messages appear in universe.log for each occurrence, without any trace level being activated. The timestamp matches the end time of the job shown in the Console 'Job Run' panel.

=====================================================================

| 2020-10-08 19:49:09 |ERROR|X|DQM|pid=21135.2474| owls_dqm_job_end          | u_dqm_end_job returns 3

| 2020-10-09 00:34:24 |ERROR|X|DQM|pid=21135.6323| owls_dqm_job_end          | u_dqm_end_job returns 3

| 2020-10-09 00:37:32 |ERROR|X|DQM|pid=21135.6336| owls_dqm_job_end          | u_dqm_end_job returns 3

=====================================================================
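
To locate affected jobs quickly, the log can be scanned for this message. The following Python sketch is only an illustration: the field layout is inferred from the three sample lines above, and the function name find_lost_job_errors is hypothetical, not part of Dollar Universe.

```python
import re

# Pattern for the DQM error line quoted in this article. The field layout is
# an assumption inferred from the sample lines, not a documented log format.
DQM_ABORT_RE = re.compile(
    r"\|\s*(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*\|ERROR\|"
    r".*?\|DQM\|pid=(?P<pid>[\d.]+)\|\s*owls_dqm_job_end\s*\|\s*"
    r"u_dqm_end_job returns 3"
)


def find_lost_job_errors(lines):
    """Return (timestamp, pid field) tuples for each matching log line."""
    hits = []
    for line in lines:
        match = DQM_ABORT_RE.search(line)
        if match:
            hits.append((match.group("timestamp"), match.group("pid")))
    return hits
```

The function accepts any iterable of lines, so an open file handle on universe.log can be passed directly. The location of that file depends on the installation and is not assumed here. Each returned timestamp can then be compared with the end times shown in the 'Job Run' panel.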

Cause

After a Dollar Universe node restart, a job that was still running could be set to Aborted in the job launching loop without having actually ended.
This was caused by a system error when checking a child process during a fork procedure. To avoid this issue, the job status check is now performed again in the next cycle of the DQM check.
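
The general idea of re-checking instead of aborting can be sketched as follows. This is not the Dollar Universe implementation. It is a minimal POSIX illustration in Python, and the names probe_process and decide_action are hypothetical.

```python
import os


def probe_process(pid):
    """Return 'running', 'ended', or 'unknown' for a PID (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return "ended"    # no such process
    except PermissionError:
        return "running"  # process exists but belongs to another user
    except OSError:
        return "unknown"  # other system error: no conclusion can be drawn
    return "running"


def decide_action(probe_result):
    """Map a probe result to an action for the current check cycle."""
    if probe_result == "ended":
        return "close_job"
    if probe_result == "running":
        return "keep_job"
    # An inconclusive check must not mark the job as aborted;
    # the check is simply repeated in the next cycle.
    return "recheck_next_cycle"
```

The point of the sketch is the last branch: an error while checking the process is treated as "no information" rather than as proof that the job has ended.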

 

Environment

Dollar Universe 6.9

Resolution

This bug is corrected in version 6.10.11. Upgrade to this version or a later one to fix the problem.
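
When auditing several nodes, the version comparison can be scripted. The sketch below assumes the version is available as a purely numeric dotted string such as "6.9" or "6.10.11". How that string is obtained is outside the scope of this article, and the helper name has_fix is hypothetical.

```python
def has_fix(version, fixed_in=(6, 10, 11)):
    """Return True if a dotted numeric version is at or above the fixed one."""
    parts = tuple(int(part) for part in version.split("."))
    # Tuples compare element by element, so (6, 9) sorts before (6, 10, 11).
    return parts >= fixed_in
```

Comparing integer tuples avoids the string-comparison trap in which "6.9" would sort after "6.10.11".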