The uxdqmsrv process, running on the node of the logical queue where jobs are submitted, shows a severe memory leak: it loops indefinitely when it reads customer DQM data files that were corrupted after a filesystem-full condition.
As a result, the process eventually consumes all system memory and crashes. This happens every time the DQM process is started with the corrupted data files (u_jobfile.dta, u_prmfile.dta, u_quefile.dta).
This can be checked by monitoring uxdqmsrv with top; the process allocates all system memory within a few minutes:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29259 univa 20 0 2683m 464m 2032 S 97.8 16.1 0:13.85 uxdqmsrv
29259 univa 20 0 2875m 663m 2032 S 98.1 23.0 0:19.79 uxdqmsrv
29259 univa 20 0 3579m 1.4g 2032 S 100.3 48.9 0:42.31 uxdqmsrv
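The growth shown above can also be checked from captured top output. The following is a minimal Python sketch, not a product tool: it assumes the default top column layout shown in the header line, and the helper names (to_mib, res_samples, grows_steadily) are hypothetical. It converts the RES column to MiB and reports whether every sample is larger than the previous one.

```python
import re

# Unit suffixes used by top for memory columns; a bare number is taken as KiB.
_UNIT_TO_MIB = {"": 1 / 1024, "k": 1 / 1024, "m": 1.0, "g": 1024.0, "t": 1024.0 * 1024}


def to_mib(field):
    """Convert a top memory field such as '464m' or '1.4g' to MiB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmgt]?)", field.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized memory field: {field!r}")
    return float(match.group(1)) * _UNIT_TO_MIB[match.group(2)]


def res_samples(top_lines, command="uxdqmsrv"):
    """Return the RES values (MiB) of every line whose COMMAND column matches."""
    samples = []
    for line in top_lines:
        cols = line.split()
        # Assumed layout: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(cols) >= 12 and cols[11] == command:
            samples.append(to_mib(cols[5]))
    return samples


def grows_steadily(samples):
    """True when there are at least two samples and each exceeds the previous one."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


# The three samples quoted in this article.
SAMPLE = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "29259 univa 20 0 2683m 464m 2032 S 97.8 16.1 0:13.85 uxdqmsrv",
    "29259 univa 20 0 2875m 663m 2032 S 98.1 23.0 0:19.79 uxdqmsrv",
    "29259 univa 20 0 3579m 1.4g 2032 S 100.3 48.9 0:42.31 uxdqmsrv",
]

if __name__ == "__main__":
    values = res_samples(SAMPLE)
    print("RES (MiB):", values)
    print("steady growth:", grows_steadily(values))
```

Steady growth across a handful of samples is only an indication, not proof, of the leak; the check is meant to be combined with the symptoms described above.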
Release : 6.x
Component : DOLLAR UNIVERSE
Update to a fix version listed below or a newer version if available.
Component: Dollar Universe
Dollar Universe 6.10.101 - Planned Release Second Half May 2022
The algorithm was fixed for this rare case. The process now deletes the invalid job records and writes error messages to universe.log.
Example of the new messages in universe.log when this problem is detected:
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Error reading queue [JT_QP] on the local node 
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Queue [JT_QP] may not exist locally, check your queues definitions
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Deleting invalid job entry [000003 JT_QL 9000000000I000000019 2022032118364500 100 100 PNO U000000026 XI000000019U000006341000 ]
|INFO |X|DQM|pid=p.t| delete_incorrect_records_ | Job in queue status [P]. Only updating general + pending counters
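To check whether a node hit this condition after the fix is applied, universe.log can be scanned for the messages above. The following Python sketch is illustrative only: the pipe-separated layout is inferred solely from the sample lines in this article, the function names are hypothetical, and the content of the job entry is returned as-is without interpreting its fields.

```python
import re

_DELETED = re.compile(r"Deleting invalid job entry \[(.*)\]")


def parse_dqm_line(line):
    """Split one pipe-separated log line into its fields, or return None.

    Assumed layout (taken from the sample messages only):
    |LEVEL|X|COMPONENT|pid=...| function | message
    """
    parts = [p.strip() for p in line.strip().split("|")]
    # The leading '|' produces an empty first element.
    if len(parts) < 7 or parts[0] != "":
        return None
    return {
        "level": parts[1],
        "component": parts[3],
        "function": parts[5],
        # Re-join in case the message itself contains a '|'.
        "message": "|".join(parts[6:]),
    }


def invalid_job_entries(lines):
    """Return the raw job entries reported as deleted by DQM error messages."""
    entries = []
    for line in lines:
        record = parse_dqm_line(line)
        if record is None or record["component"] != "DQM":
            continue
        match = _DELETED.search(record["message"])
        if match:
            entries.append(match.group(1).strip())
    return entries


# The sample messages quoted in this article.
SAMPLE_LOG = [
    "|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Error reading queue [JT_QP] on the local node ",
    "|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Queue [JT_QP] may not exist locally, check your queues definitions",
    "|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique | Deleting invalid job entry [000003 JT_QL 9000000000I000000019 2022032118364500 100 100 PNO U000000026 XI000000019U000006341000 ]",
    "|INFO |X|DQM|pid=p.t| delete_incorrect_records_ | Job in queue status [P]. Only updating general + pending counters",
]

if __name__ == "__main__":
    for entry in invalid_job_entries(SAMPLE_LOG):
        print("deleted job entry:", entry)
```

In practice the lines would be read from the universe.log file of the affected node; the file location depends on the installation and is not assumed here.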