DUAS: Jobs abort after server reboot - "The related object could not be found"


Article ID: 213995


Products

CA Automic Dollar Universe

Issue/Introduction

After a system reboot of a Windows cluster where Dollar Universe is installed in cluster mode, all Uprocs go from Pending to Aborted, with the error "could not submit" displayed in the Actions column. Attempting to retrieve the job log returns the error message "The related object could not be found".

The universe.log shows that DQM failed to start after the system reboot:

| 2021-04-10 16:14:34 |ERROR|X|DQM|pid=12144.12148| u_ouv_serv |failed in u_listen: (WinSock): Address already in use

...

| 2021-04-10 16:15:02 |ERROR|X|IO |pid=10396.10532| u_io_launch_engines |Unable to check DQM start before starting LAN in area X

...
| 2021-04-12 12:08:27 |ERROR|X|IO |pid=10396.12176| owls_connect_auth | k_connect_auth_timeout(SERVER-VIP/DQM) returns error [200]
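To spot these entries quickly in a large universe.log, a small filter along the following lines may help. This is a minimal sketch that assumes every log line has the same pipe-delimited shape as the excerpts above. The field names (`timestamp`, `level`, `area`, `engine`, `pid`, `function`, `message`) and the function names are labels chosen for this illustration, not documented Dollar Universe terms.

```python
import re

# Assumed line shape, inferred only from the excerpts above:
# | <date time> |<LEVEL>|<area>|<engine>|pid=<pid>| <function> |<message>
LINE_RE = re.compile(
    r"^\|\s*(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*"
    r"\|(?P<level>[^|]*)\|(?P<area>[^|]*)\|(?P<engine>[^|]*)"
    r"\|pid=(?P<pid>[^|]*)\|(?P<function>[^|]*)\|(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of stripped fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


def find_dqm_start_errors(lines):
    """Collect ERROR entries logged by DQM or mentioning DQM in the message."""
    hits = []
    for line in lines:
        record = parse_line(line)
        if record is None or record["level"] != "ERROR":
            continue
        if record["engine"] == "DQM" or "DQM" in record["message"]:
            hits.append(record)
    return hits
```

Passing an open file object for universe.log as `lines` returns one record per matching entry. Lines that do not fit the assumed shape are skipped.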

Environment

Release : 6.x

Component : DOLLAR UNIVERSE

Environment: Windows Cluster

Cause

Jobs aborted because the DQM engine was stopped.

The root cause is an incorrect cluster configuration:

The Dollar Universe IO service and its related engines (DQM, CDJ, BVS, EEP, GSI) must be properly stopped on the first member of the cluster before the failover attempts to start them on the other member.

When the node stops correctly, the folder data\exp\local is emptied. In this case, it was not.
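The folder check described above can be scripted. The sketch below only lists whatever remains in a directory. The function name is hypothetical, and the full path to data\exp\local depends on the installation, so it is passed in as an argument rather than assumed.

```python
from pathlib import Path


def leftover_entries(local_dir):
    """Return the sorted names of entries still present in local_dir.

    An empty list is consistent with a clean stop of the node, as described
    above. The caller supplies the installation-specific path.
    """
    path = Path(local_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"not a directory: {path}")
    return sorted(entry.name for entry in path.iterdir())
```

Running this against data\exp\local after stopping the IO service should return an empty list if the node stopped cleanly.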

Resolution

Make sure that a dedicated virtual hostname and virtual IP address are assigned to the Dollar Universe resources (not the ones used for Windows Failover or SQL Server).

Then edit the Dollar Universe resource definition in the Windows Failover tool, adding dependencies on the virtual hostname, the virtual IP address, and the shared folder to the service IO_X.

There is no need to add the EEP/JEE services to the Dollar Universe cluster definition, as they are started and stopped automatically with the IO service.

Once this is done, stop the IO service and verify that all related engines have stopped before starting the node again.

In this case, killing the leftover DQM / CDJ / BVS / GSI / EEP processes on the other server fixed the issue. Restarting the associated Windows server achieves the same result.
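One way to verify that no engine processes are left on a node is to filter the output of the Windows `tasklist /FO CSV /NH` command. The sketch below parses that CSV text and reports image names containing any of the supplied fragments. The function name is hypothetical, and the fragments are placeholders: the actual executable names of the Dollar Universe engines are not given in this article and should be taken from the installation.

```python
import csv
import io


def find_leftover_engines(tasklist_csv, name_fragments):
    """Return (image_name, pid) pairs whose image name contains a fragment.

    tasklist_csv is the text printed by `tasklist /FO CSV /NH`, where the
    first two columns are the image name and the PID.
    name_fragments holds placeholder substrings supplied by the caller.
    """
    fragments = [fragment.lower() for fragment in name_fragments]
    hits = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Skip blank lines and any header row (its PID column is not numeric).
        if len(row) < 2 or not row[1].isdigit():
            continue
        image, pid = row[0], int(row[1])
        if any(fragment in image.lower() for fragment in fragments):
            hits.append((image, pid))
    return hits
```

On the cluster node, the input text could be captured with `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`. An empty result after stopping the IO service suggests that no matching engine process is still running.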
