DUAS: DQM stops submitting jobs with "Memory error"

Article ID: 206190

Updated On:

Products

CA Automic Dollar Universe

Issue/Introduction

A Dollar Universe node that uses Generic Batch Queues to submit jobs suddenly stops submitting jobs and logs errors such as the following:

|ERROR|X|DQM|pid=p.t1| u_dqm_trt_que             | u_dqm_trt_que_generique returns 101 [ZQ_DPXXLOG01  GD0009990000050000000000000000050000000000000000000000001001000N]
|ERROR|X|DQM|pid=p.t2| o_dqm_is_user_allowed     | Error [-1] while trying to get user id and group id for system user dollar_universe_user.
|ERROR|X|DQM|pid=p.t3| u_dqm_cli_thread_trt      | Memory error (32774)
|ERROR|X|IO |pid=p2.t1| owls_connect_auth         | k_connect_auth_timeout(DUAS_NODE_NAME/DQM) returns error [-1]
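
To confirm that a node is hitting this symptom, the log can be searched for the DQM "Memory error" line shown above. The following is a minimal sketch: the function name count_dqm_memory_errors is hypothetical, and the log file path is passed as an argument because its location depends on the installation.

```shell
# Hypothetical helper (not part of Dollar Universe): count the ERROR lines
# written by the DQM component that contain "Memory error".
# $1 = path to the node log file (assumption: location varies per installation).
count_dqm_memory_errors() {
    log_file="$1"
    # In a basic regular expression a bare "|" is a literal pipe character,
    # so this matches lines such as: |ERROR|X|DQM|pid=...| ... | Memory error (32774)
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c '|ERROR|.*|DQM|.*Memory error' "$log_file"
}
```

A non-zero count, together with a halt in job submission, suggests the situation described in this article.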

 

Environment

Release : 6.10

Component : DOLLAR UNIVERSE

OS: Unix/Linux

Cause

The DQM process seems to have reached a process memory limit and can no longer create threads to submit jobs.

This can be checked with ps -elf | grep ux or a similar command.

In the output below, the SZ value of the DQM process uxdqmsrv (517250) is far higher than expected, and far higher than that of the uxioserv process (21516):

F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S duniv 4311 17175 0 40 20 ? 21516 ? 14:31:06 ? 81:39 ./uxioserv COMPANY X NODE
0 S duniv 4518 17175 0 41 20 ? 517250 ? 14:31:06 ? 27:53 ./uxdqmsrv COMPANY X NODE

F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S duniv 5545 17175 0 40 20 ? 12919 ? 14:31:10 ? 6:59 ./uxcdjsrv FIMPRO X BEX0454Z5A
0 S duniv 4311 17175 0 40 20 ? 21516 ? 14:31:06 ? 81:39 ./uxioserv FIMPRO X BEX0454Z5A
0 S duniv 4518 17175 0 41 20 ? 517250 ? 14:31:06 ? 27:53 ./uxdqmsrv FIMPRO X BEX0454Z5A
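
The comparison above can be made easier to read by extracting only the SZ column for the Dollar Universe processes. This is a minimal sketch that assumes the column layout shown in the output above (SZ in field 10, command in field 15); the function name list_ux_sizes is hypothetical, and on systems where STIME contains a space the field numbers would need adjusting.

```shell
# Hypothetical helper (not part of Dollar Universe): read `ps -elf` output on
# stdin and print "SZ process-name" for each process whose command starts
# with "ux", sorted with the largest SZ first.
# Assumption: SZ is field 10 and CMD starts at field 15, as in the sample above.
list_ux_sizes() {
    awk '$15 ~ /ux[a-z]+$/ {
            n = split($15, parts, "/")   # strip a leading "./" or directory
            print $10, parts[n]
         }' | sort -rn
}
```

It would be used as ps -elf | list_ux_sizes. A uxdqmsrv value that is many times larger than the uxioserv value matches the pattern described in this article.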

Resolution

Since the issue cannot be reproduced at will, the following measures should be taken to prevent it from occurring again:

1. Reinitialize the DQM batch queues when no jobs are running:

./uxresetque queue=*

2. Then, reduce the DQM stack size from 256 to 128 to decrease the DQM memory allocation:

./unisetvar U_DQM_THREAD_STACK_SIZE 128


3. Restart the node and monitor the DQM memory usage.
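
The monitoring in step 3 could be automated with a periodic check of the uxdqmsrv SZ value. The following is only a sketch: the function names, the 300-second interval, and the threshold of 200000 are illustrative assumptions, not values recommended by the vendor, and the field positions assume the ps -elf layout shown in the Cause section.

```shell
# Hypothetical check (not part of Dollar Universe): read `ps -elf` output on
# stdin and return 0 when a uxdqmsrv line has an SZ value above the threshold
# given as $1, otherwise return 1.
# Assumption: SZ is field 10 and CMD starts at field 15.
dqm_over_threshold() {
    threshold="$1"
    awk -v limit="$threshold" '
        $15 ~ /uxdqmsrv$/ && $10 + 0 > limit + 0 { over = 1 }
        END { exit (over ? 0 : 1) }
    '
}

# Hypothetical watcher: every 300 seconds, print a timestamped warning when
# the threshold is exceeded. The threshold value is an arbitrary example.
watch_dqm() {
    while sleep 300; do
        if ps -elf | dqm_over_threshold 200000; then
            echo "$(date) uxdqmsrv SZ above threshold"
        fi
    done
}
```

A suitable threshold depends on the node; one option is to record the SZ value shortly after the restart and alert when it grows well beyond that baseline.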

 

Should the issue occur again, please open a new case with Technical Support referencing this Knowledge Article.