In a DQM setup where a logical queue on Node A points to a physical queue on Node B, jobs are submitted on Node A to the logical queue.
Randomly, one of the logical queues (a different one every night) "freezes" and stops submitting jobs to the physical queue it points to.
All jobs submitted to this queue around that time remain in Pending status and are not sent to the remote physical queue node.
Other queues around the same time continue working fine.
Example of an occurrence:
The only errors that appear in universe.log are the following:
a) On the logical queue node:
| 2022-02-23 01:11:28 |ERROR|X|DQM|pid=11978.140195785111296| u_dqm_cli_thread_trt | new client authentication failed:
b) On the physical queue node:
| 2022-02-23 01:11:28 |ERROR|X|DQM|pid=5701708.17220| k_handshakeAuthent | u_req_serv to /[logicalqueuenode] in error [-2]
| 2022-02-23 01:11:30 |ERROR|X|DQM|pid=5701708.17220| k_connect_auth | Request authentication to /[logicalqueuenode] in error [-1] (check parameter timeout for UVMS connexion )
| 2022-02-23 01:11:30 |ERROR|X|DQM|pid=5701708.17220| owls_connect_auth | k_connect_auth_timeout(logicalqueuenode/DQM) returns error
| 2022-02-23 01:11:30 |ERROR|X|DQM|pid=5701708.17220| o_callsrv_connect_r | Connection error 0 [Comlayer error]
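To check whether a node logged the same errors, the universe.log entries can be filtered for DQM errors. The following is a minimal Python sketch, not a product tool. It assumes every entry sits on its own line and follows the layout of the samples above (timestamp, level, flag, component, pid, function, message). The names LOG_LINE and dqm_errors are hypothetical and were chosen for this illustration.

```python
import re

# Layout assumed from the sample entries in this article:
# | timestamp |LEVEL|X|COMPONENT|pid=...| function | message
LOG_LINE = re.compile(
    r"^\|\s*(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*"
    r"\|(?P<level>\w+)\|(?P<flag>\w+)\|(?P<component>\w+)"
    r"\|pid=(?P<pid>[\d.]+)\|\s*(?P<function>\S+)\s*\|\s*(?P<message>.*)$"
)


def dqm_errors(lines):
    """Return the parsed ERROR entries logged by the DQM component."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        entry = match.groupdict()
        if entry["level"] == "ERROR" and entry["component"] == "DQM":
            found.append(entry)
    return found


# Two of the entries quoted in this article, used as sample input.
sample = [
    "| 2022-02-23 01:11:28 |ERROR|X|DQM|pid=11978.140195785111296| "
    "u_dqm_cli_thread_trt | new client authentication failed:",
    "| 2022-02-23 01:11:30 |ERROR|X|DQM|pid=5701708.17220| "
    "k_connect_auth | Request authentication to /[logicalqueuenode] "
    "in error [-1] (check parameter timeout for UVMS connexion )",
]

for entry in dqm_errors(sample):
    print(entry["timestamp"], entry["function"], entry["message"])
```

To run it against a real log, pass the opened universe.log file object to dqm_errors instead of the sample list. Matching the timestamps found on the logical queue node with those on the physical queue node helps confirm that both sides recorded the same authentication failure.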
Release : 6.x
Component : DOLLAR UNIVERSE
Context: Jobs submitted to a Logical Queue that points to a remote Physical Queue defined in a different node.
To unblock the situation, launch a new job on the impacted queue. All the Pending jobs are resubmitted automatically as soon as this is done.
Engineering is currently working on this problem, and the planned fix delivery method will be communicated as soon as possible.