Job Events for a remote Job are not created when a Node is stopped

Article ID: 252915

Products

CA Automic Dollar Universe

Issue/Introduction

The problem happens when a Job submitted on the local node is in status 'Event Wait', waiting for a Job submitted on a remote node to complete.

If the conditioning Job completes on the remote node while the local node is stopped, the expected Event is not created on the local node after it is restarted, and the conditioned Job remains in status 'Event Wait' indefinitely.

Scenario to reproduce the problem:

  • Uproc A on Node A is conditioned on Uproc B on the remote Node B. Uproc A stays in Event Wait.
  • Node A is stopped.
  • Uproc B on Node B runs and completes.
  • Node A is started again.
  • Uproc A remains in Event Wait.

On Node A, messages similar to the following can be found in universe.log with log level 0,EXCAGENT:

| 2021-05-28 17:40:00 |TRACE|X|IO |pid=13709.139677143123712| o_module_exc_cycle    | Sleeping until 20361231235900 (time out set to 31536000 sec)
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=13709.139677143123712| o_module_exc_cycle    | Working... scanning CL
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=13709.139677143123712| o_module_exc_cycle_emissi | Exchange message send successfully to LNXNODE61041

On Node B, messages similar to the following can be found in universe.log with log level 0,EXCAGENT:

| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139811562178304| owls_exchanger_create   | Exchange order: type EVENEXEC lot 53 oper EM node LNXNODE61041
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle    | Working... scanning CL
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle_recept | RE Cycle of exchanger successful
| 2021-05-28 17:40:00 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle    | Sleeping until 20361231235900 (time out set to 31536000 sec)
| 2021-05-28 17:40:22 |ERROR|X|BVS|pid=15377.139776857708288| o_connect_auth      | k_connect_auth_timeout returns error [200]
| 2021-05-28 17:40:22 |ERROR|X|BVS|pid=15377.139776857708288| u_bvs_sync_trt_view_idle | At least one node is down, idle view (ARR.JC)-local-X not processed
| 2021-05-28 17:43:27 |ERROR|X|BVS|pid=15377.139776857708288| o_connect_auth      | k_connect_auth_timeout returns error [200]
| 2021-05-28 17:43:27 |ERROR|X|BVS|pid=15377.139776857708288| u_bvs_sync_trt_view_idle | At least one node is down, idle view (ARR.JC)-local-X not processed
| 2021-05-28 17:44:30 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle    | Working... scanning CL
| 2021-05-28 17:44:30 |ERROR|X|IO |pid=14716.139813382510336| owls_api_exchanger_create | Error 200 connecting to node LNX892074 port 12500
| 2021-05-28 17:44:30 |ERROR|X|IO |pid=14716.139813382510336| o_module_exc_cycle_emissi | Cannot send exchanger message to remote node LNX892074
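To check whether a universe.log contains the failed exchanger sends shown above, the log can be filtered with a small script. The following is only a minimal sketch: the line layout is assumed from the excerpts in this article, and the names `LINE_RE`, `SEND_FAILED_RE` and `failed_exchanger_sends` are hypothetical helpers, not part of Dollar Universe.

```python
import re

# Line layout assumed from the excerpts above:
# | timestamp |LEVEL|X|MODULE|pid=...| function | message
LINE_RE = re.compile(
    r"^\|\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*\|"
    r"(?P<level>\w+)\s*\|[^|]*\|(?P<module>[^|]*)\|pid=(?P<pid>[^|]*)\|"
    r"\s*(?P<func>\S+)\s*\|\s*(?P<msg>.*)$"
)
# Message text copied from the Node B excerpt.
SEND_FAILED_RE = re.compile(r"Cannot send exchanger message to remote node (\S+)")


def failed_exchanger_sends(lines):
    """Return (timestamp, node) for each failed exchanger send in the log lines."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERROR":
            continue  # skip TRACE lines and anything not in the assumed layout
        failed = SEND_FAILED_RE.search(m.group("msg"))
        if failed:
            hits.append((m.group("ts"), failed.group(1)))
    return hits


sample = [
    "| 2021-05-28 17:44:30 |TRACE|X|IO |pid=14716.139813382510336| o_module_exc_cycle    | Working... scanning CL",
    "| 2021-05-28 17:44:30 |ERROR|X|IO |pid=14716.139813382510336| owls_api_exchanger_create | Error 200 connecting to node LNX892074 port 12500",
    "| 2021-05-28 17:44:30 |ERROR|X|IO |pid=14716.139813382510336| o_module_exc_cycle_emissi | Cannot send exchanger message to remote node LNX892074",
]
print(failed_exchanger_sends(sample))  # [('2021-05-28 17:44:30', 'LNX892074')]
```

A hit for a node that was stopped at that timestamp suggests that Events destined for that node may be affected by this issue.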

Environment

Component: Dollar Universe

Version: 6.x and 7.x

Cause

This is a defect in Dollar Universe.

Resolution

This bug is fixed in the following versions:

Dollar Universe 6.10.111 -- expected to be available at the end of June 2023

Dollar Universe 7.0.21 -- expected to be available by November 2023
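To decide whether a given installation already contains the fix, its version can be compared with the fix level of the same major line. The sketch below assumes that the version is available as a plain dotted numeric string such as `6.10.101`; how that string is obtained from a node is not covered here, and the helper names `FIXED_IN`, `parse_version` and `contains_fix` are hypothetical.

```python
# Fix levels taken from the Resolution section, keyed by major version.
FIXED_IN = {6: (6, 10, 111), 7: (7, 0, 21)}


def parse_version(text):
    """Turn a dotted version string such as '6.10.111' into (6, 10, 111)."""
    return tuple(int(part) for part in text.strip().split("."))


def contains_fix(version_text):
    """Return True if the version is at or above the fix level of its major line."""
    version = parse_version(version_text)
    fixed = FIXED_IN.get(version[0])
    if fixed is None:
        # Only the 6.x and 7.x lines are named in this article.
        raise ValueError(f"no fix level known for major version {version[0]}")
    return version >= fixed


print(contains_fix("6.10.101"))  # False
print(contains_fix("7.0.21"))    # True
```

Comparing tuples of integers avoids the pitfall of string comparison, where "6.10.9" would sort after "6.10.111".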

Additional Information

Possible Workaround:

1. Updating the Event for Uproc B on Node B
2. Creating on Node A the Event for Uproc B of Node B
3. Running Uproc A again