DQM: After a data corruption it uses all memory until it crashes

Article ID: 239172

Products

CA Automic Dollar Universe

Issue/Introduction

The uxdqmsrv process on the Logical queue node where jobs are submitted shows a severe memory leak: it loops indefinitely when it reads customer DQM data files that were corrupted after a filesystem-full condition.
As a result, the process eventually consumes all system memory and crashes. This happens every time DQM is started with the corrupted data files (u_jobfile.dta, u_prmfile.dta, u_quefile.dta).

This can be checked with top: the memory used by uxdqmsrv grows until it takes ALL system memory within a few minutes:

  PID USER   PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
29259 univa  20  0 2683m 464m 2032 S  97.8 16.1  0:13.85 uxdqmsrv
...
29259 univa  20  0 2875m 663m 2032 S  98.1 23.0  0:19.79 uxdqmsrv
...
29259 univa  20  0 3579m 1.4g 2032 S 100.3 48.9  0:42.31 uxdqmsrv
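
To make the growth easier to follow, the RES column can be pulled out of saved top output and converted to MiB. The following is a minimal sketch, not part of the product: the function name uxdqmsrv_res_mib is hypothetical, and it assumes the top column layout shown above (RES in field 6, COMMAND in field 12).

```shell
#!/bin/sh
# Sketch: read `top` batch output on stdin and print "PID RES_in_MiB"
# for every uxdqmsrv line. Assumes RES is field 6 and COMMAND is field 12,
# as in the sample output above (hypothetical helper, not a product tool).
uxdqmsrv_res_mib() {
    awk '$12 == "uxdqmsrv" {
        v = $6
        unit = substr(v, length(v), 1)
        if (unit == "g")      mib = substr(v, 1, length(v) - 1) * 1024
        else if (unit == "m") mib = substr(v, 1, length(v) - 1) + 0
        else                  mib = v / 1024   # plain number: top reports KiB
        printf "%s %.0f\n", $1, mib
    }'
}
```

For example, `top -b -n 1 | uxdqmsrv_res_mib` prints one line per running uxdqmsrv process. Repeating it every few seconds shows whether the resident size keeps climbing.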

Cause

Defect

Environment

Release : 6.x

Component : DOLLAR UNIVERSE

Resolution

Workaround:

  1. Reinitialize the DQM queues with the command:
    $UNI_DIR_EXEC/uxresetque queue=*

  2. Alternatively, rename the DQM files in the data folder (u_quefile.*, u_prmfile.* and u_jobfile.*) and start DQM.
    You must then recreate ALL the queues as they were defined in the u_quefile.* files.
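
The renaming in option 2 could be scripted along these lines. This is only a sketch under stated assumptions: the function name set_aside_dqm_files and the suffix are hypothetical, the data folder path must be supplied by the caller because its location depends on the installation, and DQM is assumed to be stopped before the files are touched.

```shell
#!/bin/sh
# Sketch: rename the DQM data files so DQM can start without them.
# $1 = data folder (caller-supplied, location is installation-specific)
# $2 = optional suffix; defaults to "corrupt.<timestamp>" (hypothetical naming)
set_aside_dqm_files() {
    data_dir=$1
    suffix=${2:-corrupt.$(date +%Y%m%d%H%M%S)}
    for f in "$data_dir"/u_quefile.* "$data_dir"/u_prmfile.* "$data_dir"/u_jobfile.*; do
        [ -e "$f" ] || continue              # unmatched glob stays literal; skip it
        case $f in *".$suffix") continue ;; esac   # already renamed with this suffix
        mv -- "$f" "$f.$suffix" && echo "renamed: $f -> $f.$suffix"
    done
}
```

Keeping the renamed u_quefile.* files is useful, since the queues have to be recreated as they were defined there.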

 

Solution:

Update to a fix version listed below or a newer version if available.

Fix version(s): 
Component: Dollar Universe
Dollar Universe 6.10.101 - Planned Release Second Half May 2022

Additional Information

Correction details:

The algorithm was fixed for this rare case. DQM now deletes the unwanted job records and logs error messages in universe.log.

Example of the new messages in universe.log when this problem is detected:

|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique   | Error reading queue [JT_QP] on the local node [3]
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique   | Queue [JT_QP] may not exist locally, check your queues definitions
|ERROR|X|DQM|pid=p.t| u_dqm_trt_que_generique   | Deleting invalid job entry [000003                                                                   JT_QL                           9000000000I000000019                2022032118364500                100   100   PNO U000000026  XI000000019U000006341000                                                                                        ]
|INFO |X|DQM|pid=p.t| delete_incorrect_records_ | Job in queue status [P]. Only updating general + pending counters
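
After upgrading, a log can be checked for these messages with a small script. The sketch below is illustrative only: the function names are hypothetical, the log path is supplied by the caller, and the patterns are copied from the sample messages above, so they may need adjusting if the wording differs in a given release.

```shell
#!/bin/sh
# Sketch: find the DQM corruption-handling messages in a universe.log file.
# $1 = path to the log file (caller-supplied). Patterns come from the
# sample messages above; helper names are hypothetical.
dqm_corruption_messages() {
    grep -E '\|DQM\|.*(Error reading queue|may not exist locally|Deleting invalid job entry)' "$1"
}

# Print how many invalid job entries DQM reported deleting.
count_deleted_job_entries() {
    grep -c 'Deleting invalid job entry' "$1"
}
```

A non-zero count from count_deleted_job_entries suggests that DQM found and removed corrupted job records at startup.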