DUAS fails to submit jobs with fork failed errno(11) and pthread_create returns 1

Article ID: 238623

Products

CA Automic Dollar Universe

Issue/Introduction

Several Linux nodes fail to display the Job Runs, with the error:

The server is unreachable. Reason:NIO CDJ Connection problem

Or

The server is unreachable. Reason: NIO CDJ Connection problem - read header

The universe.log contains thread-related errors such as the following, but the system administrators could not find the root cause:

|ERROR|X|DQM|pid=p.t| u_dqm_lanc_batch| fork failed errno (11) Resource temporarily unavailable
|ERROR|X|DQM|pid=p.t| u_dqm_lanc_batch| fork failed errno (11) Resource temporarily unavailable
|ERROR|X|CDJ|pid=p.t| o_spawn_thread  | Thread error: pthread_create returns 1 (errno 11: Resource temporarily unavailable)
|ERROR|X|CDJ|pid=p.t| u_create_thread | Thread error: o_spawn_thread_set_stack returns 1
|INFO |X|CDJ|pid=p.t| u_cdj_main      | Creation of thread failed => we do not process the request and disconnects the client on socket 6
|ERROR|X|CDJ|pid=p.t| u_cdj_trt_req   | new client authentication failed: Errno syserror 9: Bad file descriptor


Do you have any idea what could be the cause and solution?

Cause

The system's maximum thread limit had been reached, which prevented processes from creating new threads or child processes (for example, DQM forking to submit a new job, or CDJ creating a thread for a new monitoring request coming from UVC).

Environment

Release : 6.x

Component : DOLLAR UNIVERSE

OS: Linux only 

Resolution

Workaround:

Kill the processes that use the most threads, as one of the running processes most likely has a thread leak.

Thread usage can be checked with the command ps -aefT, which prints one line per thread.
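
Because ps -aefT prints one line per thread, a leaking process can be hard to spot on a busy node. As a rough alternative, the following Python sketch ranks processes by thread count using the Threads field of /proc/<pid>/status. It is not part of Dollar Universe; the function names parse_status and top_thread_users are hypothetical, and it assumes a Linux /proc filesystem.

```python
import os


def parse_status(text):
    """Return (name, thread_count) from the text of a /proc/<pid>/status file."""
    name, threads = "?", 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "Threads":
            threads = int(value.strip())
    return name, threads


def top_thread_users(proc_root="/proc", limit=10):
    """List (threads, pid, name) for the processes with the most threads."""
    rows = []
    if not os.path.isdir(proc_root):
        return rows  # not a Linux system, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are process IDs
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path, errors="replace") as fh:
                name, threads = parse_status(fh.read())
        except OSError:
            continue  # the process exited or is not readable
        rows.append((threads, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    # Print the ten processes holding the most threads, largest first.
    for threads, pid, name in top_thread_users():
        print(f"{threads:6d}  {pid:7d}  {name}")
```

A process whose thread count keeps growing between two runs of this script is a likely candidate for the thread leak.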

Solution:

Increase the system's maximum thread/process limit if necessary.
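
Before raising anything, it helps to know which limit is actually in effect. On Linux, fork and pthread_create can return errno 11 (EAGAIN) when the per-user process limit (RLIMIT_NPROC), kernel.threads-max, or kernel.pid_max is reached. The sketch below only reports these values and does not change them. The helper names are hypothetical. RLIMIT_NPROC is reported for the account that runs the script, so it is assumed here that the script is run as the user that owns the Dollar Universe processes.

```python
import resource


def read_int(path):
    """Return the integer stored in a /proc/sys file, or None if unavailable."""
    try:
        with open(path) as fh:
            return int(fh.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def describe_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Gather the kernel-wide and per-user limits on threads and processes."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "kernel.threads-max": read_int("/proc/sys/kernel/threads-max"),
        "kernel.pid_max": read_int("/proc/sys/kernel/pid_max"),
        "RLIMIT_NPROC soft": describe_limit(soft),
        "RLIMIT_NPROC hard": describe_limit(hard),
    }


if __name__ == "__main__":
    for label, value in collect_limits().items():
        print(f"{label:20s} {value}")
```

Comparing these values with the total number of threads on the node shows whether a limit is genuinely too low or whether a leaking process is exhausting an otherwise reasonable limit.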

For instructions for SUSE Linux, please check this page