The uxdqmsrv process, running on the logical queue node where jobs are submitted, shows a severe memory leak: it loops indefinitely when started with customer DQM data files that were corrupted after a filesystem-full incident.
As a result, the process eventually consumes all system memory and crashes. This happens every time the DQM process is started with the corrupted data files (u_jobfile.dta, u_prmfile.dta, u_quefile.dta).
This can be checked by watching uxdqmsrv in top: its memory usage grows until it has allocated all system memory, within a few minutes:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29259 univa 20 0 2683m 464m 2032 S 97.8 16.1 0:13.85 uxdqmsrv
...
29259 univa 20 0 2875m 663m 2032 S 98.1 23.0 0:19.79 uxdqmsrv
...
29259 univa 20 0 3579m 1.4g 2032 S 100.3 48.9 0:42.31 uxdqmsrv
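The growth shown in these samples can be made explicit by converting the RES column to a single unit. The following is a minimal Python sketch, not part of the product: the helper names `parse_top_mem` and `res_samples` are hypothetical, and it assumes top's default scaling (a bare number is KiB, an `m` suffix is MiB, a `g` suffix is GiB) and the 12-column layout shown above.

```python
def parse_top_mem(value: str) -> float:
    """Convert a top memory field (e.g. '464m', '1.4g', '2032') to MiB.

    Assumption: a bare number is KiB, 'm' is MiB, 'g' is GiB, 't' is TiB.
    """
    units = {"k": 1 / 1024, "m": 1.0, "g": 1024.0, "t": 1024.0 * 1024.0}
    suffix = value[-1].lower()
    if suffix in units:
        return float(value[:-1]) * units[suffix]
    return float(value) / 1024  # bare number: KiB


def res_samples(top_lines, command="uxdqmsrv"):
    """Return (cpu_time, res_mib) pairs for lines whose COMMAND matches."""
    samples = []
    for line in top_lines:
        fields = line.split()
        # Expected columns:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) == 12 and fields[11] == command:
            samples.append((fields[10], parse_top_mem(fields[5])))
    return samples


# The three samples from the top output above.
TOP_OUTPUT = """\
29259 univa 20 0 2683m 464m 2032 S 97.8 16.1 0:13.85 uxdqmsrv
29259 univa 20 0 2875m 663m 2032 S 98.1 23.0 0:19.79 uxdqmsrv
29259 univa 20 0 3579m 1.4g 2032 S 100.3 48.9 0:42.31 uxdqmsrv
""".splitlines()

samples = res_samples(TOP_OUTPUT)
for cpu_time, res_mib in samples:
    print(f"{cpu_time:>8}  RES = {res_mib:8.1f} MiB")
```

On a live system the same functions could be fed lines captured from top in batch mode; resident memory that keeps increasing between samples while CPU stays near 100% matches the symptom described here.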
Release : 6.x
Component : DOLLAR UNIVERSE
Defect
$UNI_DIR_EXEC/uxresetque queue=*
Update to a fix version listed below or a newer version if available.
Fix version(s):
Component: Dollar Universe
Dollar Universe 6.10.101 - Available
Correction details:
Fixed the algorithm for this rare case, and added deletion of the invalid job records as well as error messages in universe.log.
Example of the new messages in universe.log when this problem is detected:
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Error reading queue [JT_QP] on the local node [3]
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Queue [JT_QP] may not exist locally, check your queues definitions
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Deleting invalid job entry [000003 JT_QL 9000000000I000000019 2022032118364500 100 100 PNO U000000026 XI000000019U000006341000 ]
|INFO |X|DQM|pid=p.t| delete_incorrect_records_ | Job in queue status [P]. Only updating general + pending counters
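To check whether a fixed version has detected and cleaned up corrupted entries, universe.log can be scanned for these messages. The following is a minimal Python sketch under stated assumptions: the function names `parse_log_line` and `summarize_dqm_cleanup` are hypothetical, the pipe-separated layout is taken from the example lines above, and the exact message wording in other releases may differ.

```python
import re

# Patterns taken from the example messages above; the wording in other
# releases is an assumption.
QUEUE_ERROR_RE = re.compile(r"Error reading queue \[([^\]]+)\]")
DELETED_ENTRY_PREFIX = "Deleting invalid job entry"


def parse_log_line(line):
    """Split a universe.log line into fields, or return None if it does not match.

    Assumes the pipe-separated layout of the example:
    |LEVEL|X|COMPONENT|pid=...| function | message
    """
    parts = line.rstrip("\n").split("|", 6)
    if len(parts) != 7:
        return None
    _, level, _, component, _, function, message = parts
    return {
        "level": level.strip(),
        "component": component.strip(),
        "function": function.strip(),
        "message": message.strip(),
    }


def summarize_dqm_cleanup(lines):
    """Collect the queues with read errors and count the deleted job entries."""
    queues, deleted = set(), 0
    for line in lines:
        entry = parse_log_line(line)
        if entry is None or entry["component"] != "DQM":
            continue
        match = QUEUE_ERROR_RE.search(entry["message"])
        if match:
            queues.add(match.group(1))
        if entry["message"].startswith(DELETED_ENTRY_PREFIX):
            deleted += 1
    return {"queues": queues, "deleted": deleted}


# The four example lines from this article.
SAMPLE_LOG = [
    "|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Error reading queue [JT_QP] on the local node [3]",
    "|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Queue [JT_QP] may not exist locally, check your queues definitions",
    "|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Deleting invalid job entry [000003 JT_QL 9000000000I000000019 2022032118364500 100 100 PNO U000000026 XI000000019U000006341000 ]",
    "|INFO |X|DQM|pid=p.t| delete_incorrect_records_ | Job in queue status [P]. Only updating general + pending counters",
]

summary = summarize_dqm_cleanup(SAMPLE_LOG)
print(summary)
```

In practice the lines would be read from the universe.log file of the affected node rather than from an inline list; the path to that file depends on the installation and is not assumed here.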