After a system reboot of a Windows cluster where Dollar Universe is installed in cluster mode, all Uprocs go from Pending to Aborted, with the error "could not submit" displayed in the Actions column. Trying to retrieve the job log returns the error message "The related object could not be found".
The universe.log shows that DQM failed to start after the system reboot.
The jobs were aborted because the DQM engine was stopped.
The root cause is an incorrect cluster configuration:
The Dollar Universe IO service and its related engines (DQM, CDJ, BVS, EEP, GSI) must be properly stopped on the first member of the cluster before a start (failover) is attempted on the other member of the cluster.
When the node stops correctly, the folder data\exp\local is emptied. The following log lines show that it was not:
|INFO |X|IO |pid=p.t| o_Auth_ILocal_OWLS_Del_Ke | WARNING: local keys directory [D:\AUTOMIC\DUAS\COMPANY_NODE\data\exp\local] is not empty.
|INFO |X|IO |pid=p.t| o_Auth_ILocal_OWLS_Del_Ke | WARNING: deleting local keys before START.
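The warning above can be searched for automatically after each failover. The following is a minimal sketch, not a Dollar Universe tool: the function name `find_leftover_key_dirs` is hypothetical, and the regular expression assumes the warning text always has the form shown in the log excerpt above.

```python
import re

# Assumption: the IO service always words the warning as in the excerpt above.
_NOT_EMPTY = re.compile(r"local keys directory \[(?P<path>[^\]]+)\] is not empty")


def find_leftover_key_dirs(log_lines):
    """Return the local keys directories reported as not empty.

    Paths are returned in order of first appearance, without duplicates.
    """
    found = []
    for line in log_lines:
        match = _NOT_EMPTY.search(line)
        if match and match.group("path") not in found:
            found.append(match.group("path"))
    return found
```

An open file object for universe.log can be passed directly as `log_lines`. A non-empty result suggests that the previous stop of the node was not clean.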
Release : 6.x
Component : DOLLAR UNIVERSE
Environment: Windows Cluster
Make sure that a dedicated virtual hostname and virtual IP address are assigned to the Dollar Universe resources (not the same ones used for Windows Failover or SQL Server).
Then edit the Dollar Universe resource definition in the Windows Failover tool and add the virtual hostname, the virtual IP address, and the shared folder as dependencies of the IO_X service.
There is no need to add the EEP/JEE services to the Dollar Universe cluster definition, as they are automatically started or stopped when the IO service is started or stopped.
Once this is done, stop the IO service and validate that all related engines are properly stopped before attempting to start the node again.
In this case, killing the leftover DQM / CDJ / BVS / GSI / EEP processes on the other server fixed the issue. Restarting that Windows server achieves the same result.
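Whether any engine processes are still running after the IO service has stopped can be checked from the output of the standard Windows command `tasklist /FO CSV /NH`. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the engine image names are not given in this article, so they must be supplied by the caller (for example, as they appear in Task Manager on the node).

```python
import csv
import subprocess


def read_tasklist():
    """Return the raw CSV output of `tasklist /FO CSV /NH` (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def leftover_engines(tasklist_csv, image_names):
    """Return (image name, PID) pairs for processes whose image name is listed.

    image_names is supplied by the caller; the comparison ignores case.
    """
    wanted = {name.lower() for name in image_names}
    leftovers = []
    for row in csv.reader(tasklist_csv.splitlines()):
        # Rows are: image name, PID, session name, session number, memory usage.
        if len(row) >= 2 and row[0].lower() in wanted:
            leftovers.append((row[0], int(row[1])))
    return leftovers
```

On the node that should be idle, `leftover_engines(read_tasklist(), names)` returning an empty list indicates that none of the listed engines is still running. Any entries returned are candidates for the manual cleanup described above.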