Several Linux nodes fail to display the Job Runs, with the error:
The server is unreachable. Reason:NIO CDJ Connection problem
Or
The server is unreachable. Reason: NIO CDJ Connection problem - read header
In universe.log, thread-related errors such as the following appear, but the system administrators could not find the root cause:
|ERROR|X|DQM|pid=p.t| u_dqm_lanc_batch| fork failed errno (11) Resource temporarily unavailable
|ERROR|X|DQM|pid=p.t| u_dqm_lanc_batch| fork failed errno (11) Resource temporarily unavailable
|ERROR|X|CDJ|pid=p.t| o_spawn_thread | Thread error: pthread_create returns 1 (errno 11: Resource temporarily unavailable)
|ERROR|X|CDJ|pid=p.t| u_create_thread | Thread error: o_spawn_thread_set_stack returns 1
|INFO |X|CDJ|pid=p.t| u_cdj_main | Creation of thread failed => we do not process the request and disconnects the client on socket 6
|ERROR|X|CDJ|pid=p.t| u_cdj_trt_req | new client authentication failed: Errno syserror 9: Bad file descriptor
Do you have any idea what could be the cause and solution?
Release : 6.x
Component : DOLLAR UNIVERSE
OS: Linux only
The system's maximum thread limit had been reached, which prevented processes from creating new threads or child processes (for example, DQM forking to submit a new job, or CDJ creating a thread for a new monitoring request coming from UVC).
Kill the processes that use the most threads; a thread leak in one of the running processes is the most likely cause.
Per-process thread usage can be checked with the command ps -aefT
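As a complement to ps -aefT, the following is a minimal sketch for ranking processes by thread count. It assumes a Linux /proc filesystem and reads the "Threads:" field of /proc/<pid>/status; the function name top_thread_users is hypothetical and not part of Dollar Universe.

```shell
#!/bin/sh
# Rank processes by thread count using /proc/<pid>/status (Linux only).
# top_thread_users is a hypothetical helper name used for illustration.
top_thread_users() {
    count="${1:-10}"
    for status in /proc/[0-9]*/status; do
        # A process may exit between the glob and the read; ignore errors.
        # Only the first word of the process name is kept.
        awk '/^Name:/ {name=$2} /^Pid:/ {pid=$2} /^Threads:/ {thr=$2}
             END {if (thr != "") print thr, pid, name}' "$status" 2>/dev/null
    done | sort -rn | head -n "$count"
}

# Columns printed: thread count, PID, process name.
top_thread_users 10
```

A process whose thread count keeps growing between two runs is a plausible candidate for the leak.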
Increase the system maximum thread/processes limit if necessary.
For instructions for SUSE Linux, please check this page
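Before raising any limit, it can help to compare the current values with actual usage. The sketch below assumes a Linux /proc filesystem; the function name show_thread_limits is hypothetical. Which limit was actually hit on a given node is not stated here, so all three common ones are shown.

```shell
#!/bin/sh
# Print the main thread/process limits next to the current usage (Linux only).
# show_thread_limits is a hypothetical helper name used for illustration.
show_thread_limits() {
    threads_max=$(cat /proc/sys/kernel/threads-max)
    pid_max=$(cat /proc/sys/kernel/pid_max)
    # Fourth field of /proc/loadavg is "runnable/total" scheduling entities.
    total_threads=$(awk '{split($4, a, "/"); print a[2]}' /proc/loadavg)
    # Soft per-user limit on processes/threads for the current session.
    user_limit=$(awk '/^Max processes/ {print $3}' /proc/self/limits)

    echo "kernel.threads-max    : $threads_max"
    echo "kernel.pid_max        : $pid_max"
    echo "threads in use        : $total_threads"
    echo "max processes (user)  : $user_limit"
}

show_thread_limits

# Possible adjustments (run as root; values are placeholders to adapt):
#   sysctl -w kernel.threads-max=<new value>
#   sysctl -w kernel.pid_max=<new value>
#   nproc entries in /etc/security/limits.conf for the per-user limit
```

Run the check as the user that owns the Dollar Universe processes, since the per-user limit depends on the session.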