The problem occurs when a Job submitted on the local node is in status 'Event Wait', waiting for a Job submitted on a remote node to complete.
When the conditioning Job completes on the remote node and the local node is restarted, the expected Event is not created on the local node, and the conditioned Job remains in status 'Event Wait' indefinitely.
Scenario to reproduce the problem:
On Node A, with log level 0,EXCAGENT, universe.log contains messages similar to the following:
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=13709.139677143123712| o_module_exc_cycle | Sleeping until 20361231235900 (time out set to 31536000 sec)
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=13709.139677143123712| o_module_exc_cycle | Working... scanning CL
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=13709.139677143123712| o_module_exc_cycle_emissi | Exchange message send successfully to LNXNODE61041
On Node B, with log level 0,EXCAGENT, universe.log contains messages similar to the following:
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139811562178304| owls_exchanger_create | Exchange order: type EVENEXEC lot 53 oper EM node LNXNODE61041
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle | Working... scanning CL
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle_recept | RE Cycle of exchanger successful
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle | Sleeping until 20361231235900 (time out set to 31536000 sec)
| 2021-05-28 17:40:22 |ERROR|X|BVS|pid=15377.139776857708288| o_connect_auth | k_connect_auth_timeout returns error [200]
| 2021-05-28 17:40:22 |ERROR|X|BVS|pid=15377.139776857708288| u_bvs_sync_trt_view_idle | At least one node is down, idle view (ARR.JC)-local-X not processed
| 2021-05-28 17:43:27 |ERROR|X|BVS|pid=15377.139776857708288| o_connect_auth | k_connect_auth_timeout returns error [200]
| 2021-05-28 17:43:27 |ERROR|X|BVS|pid=15377.139776857708288| u_bvs_sync_trt_view_idle | At least one node is down, idle view (ARR.JC)-local-X not processed
| 2021-05-28 17:44:30 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle | Working... scanning CL
| 2021-05-28 17:44:30 |ERROR|X|IO |pid=14716.139813382510336| owls_api_exchanger_create | Error 200 connecting to node LNX892074 port 12500
| 2021-05-28 17:44:30 |ERROR|X|IO |pid=14716.139813382510336| o_module_exc_cycle_emissi | Cannot send exchanger message to remote node LNX892074
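To check whether a node is affected, the log can be scanned for the exchanger send failure shown above. The following Python sketch is only an illustration, not a vendor tool: it assumes the pipe-delimited layout seen in the excerpts, and the names `LogEntry`, `parse_line` and `failed_exchanger_sends` are hypothetical helpers introduced here.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    component: str
    function: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one universe.log line, assuming the layout seen in the excerpts:
    | timestamp |LEVEL|X|COMPONENT|pid=...| function | message
    Returns None for lines that do not match that layout."""
    # Split at most 7 times so any '|' inside the message itself is preserved.
    parts = line.strip().split("|", 7)
    if len(parts) != 8 or parts[0] != "":
        return None
    _, ts, level, _unnamed, comp, _pid, func, msg = (p.strip() for p in parts)
    return LogEntry(ts, level, comp, func, msg)


# Message text taken verbatim from the Node B excerpt.
SEND_FAILURE = re.compile(r"Cannot send exchanger message to remote node (\S+)")


def failed_exchanger_sends(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, remote_node) for each failed exchanger send."""
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level != "ERROR":
            continue
        match = SEND_FAILURE.search(entry.message)
        if match:
            yield entry.timestamp, match.group(1)
```

A typical use would be to open the node's universe.log and iterate over `failed_exchanger_sends(log_file)`. Each result names a remote node for which an exchange message could not be delivered, which points to Jobs on that node that may be stuck in 'Event Wait'.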
Component: Dollar Universe
Version: 6.x and 7.x
This is a defect.
It is fixed in the following versions:
Dollar Universe 6.10.111 -- expected to be available at the end of June 2023
Dollar Universe 7.0.21 -- expected to be available by November 2023
Possible workarounds:
1. Update the Event for Uproc B on Node B
2. On Node A, create the Event for Uproc B on Node B
3. Run Uproc A again