Failover cluster node problem

Article ID: 223283

Products

CA Automic Dollar Universe

Issue/Introduction

Problem:

Failover from Node 1 to Node 2 works, but when failing back from Node 2 to Node 1, some server engines such as BVS, CDJ, DQM, EEP or GSI fail to start.

The following errors appear in the Node 1 logs:

| 2021-09-03 17:43:23 |ERROR|X|IO |pid=9620.24004| u_call_srvio_ver          | Error connecting to BVS X: (WinSock): Connection reset by peer
| 2021-09-03 17:43:23 |WARN |X|IO |pid=9620.24004| u_close_socket            | shutdown returns -1/10038 (socket 1488)
| 2021-09-03 17:43:23 |ERROR|X|IO |pid=9620.24004| u_call_srvio_ver          | Error connecting to CDJ X: (WinSock): Connection reset by peer
| 2021-09-03 17:43:23 |WARN |X|IO |pid=9620.24004| u_close_socket            | shutdown returns -1/10038 (socket 1460)
| 2021-09-03 17:43:23 |ERROR|X|IO |pid=9620.24004| u_call_srvio_ver          | Error connecting to DQM X: (WinSock): Connection reset by peer
| 2021-09-03 17:43:23 |WARN |X|IO |pid=9620.24004| u_close_socket            | shutdown returns -1/10038 (socket 856)
| 2021-09-03 17:43:23 |ERROR|X|IO |pid=9620.24004| u_call_srvio_ver          | Error connecting to EEP X: (WinSock): Connection reset by peer
| 2021-09-03 17:43:23 |WARN |X|IO |pid=9620.24004| u_close_socket            | shutdown returns -1/10038 (socket 1548)
| 2021-09-03 17:43:23 |ERROR|X|IO |pid=9620.24004| u_call_srvio_ver          | Error connecting to GSI X: (WinSock): Connection reset by peer
| 2021-09-03 17:43:23 |WARN |X|IO |pid=9620.24004| u_close_socket            | shutdown returns -1/10038 (socket 916)
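
To see at a glance which engines are affected, the ERROR lines in an excerpt like the one above can be tallied per engine. The Python sketch below is illustrative only: the function name failed_engines and the regular expression are assumptions derived from the log lines shown here, not part of Dollar Universe.

```python
import re
from collections import Counter

# Matches the "Error connecting to <ENGINE> X" message seen in the excerpt.
# The pattern is an assumption based on these sample lines only.
CONNECT_ERROR = re.compile(r"\|ERROR\|.*Error connecting to (\w+) ")


def failed_engines(log_text):
    """Count connection errors per engine name found in the given log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CONNECT_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = """\
| 2021-09-03 17:43:23 |ERROR|X|IO |pid=9620.24004| u_call_srvio_ver          | Error connecting to BVS X: (WinSock): Connection reset by peer
| 2021-09-03 17:43:23 |WARN |X|IO |pid=9620.24004| u_close_socket            | shutdown returns -1/10038 (socket 1488)
| 2021-09-03 17:43:23 |ERROR|X|IO |pid=9620.24004| u_call_srvio_ver          | Error connecting to CDJ X: (WinSock): Connection reset by peer
"""

# Print the engine names that reported at least one connection error.
print(sorted(failed_engines(sample)))
```

WARN lines are ignored on purpose, since only the ERROR lines name the engine that could not be reached.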

The server engines (BVS, CDJ, DQM, EEP, GSI) could not be started because leftover processes from the previous run were still present on Node 1. These processes are visible in Windows Task Manager.
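
As an alternative to looking through Task Manager by hand, the output of the Windows tasklist command can be filtered for suspect processes. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, and the name fragments to search for must be supplied by you, because the exact executable names of the engines are not given in this article.

```python
import csv
import io
import subprocess


def leftover_processes(tasklist_csv, name_fragments):
    """Return (image name, PID) pairs whose image name contains any fragment."""
    wanted = [fragment.lower() for fragment in name_fragments]
    hits = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        image, pid = row[0], row[1]
        if any(fragment in image.lower() for fragment in wanted):
            hits.append((image, int(pid)))
    return hits


def query_tasklist():
    """Run `tasklist /FO CSV /NH` and return its output (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On Node 1 this could be called as `leftover_processes(query_tasklist(), ["dqm", "cdj"])`, where the fragments are placeholders. Short fragments can match unrelated programs, so review the matches before killing anything.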

On Node 1, the folder data\exp\local was not emptied, which indicates that the node was not stopped correctly:

|INFO |X|IO |pid=p.t| o_Auth_ILocal_OWLS_Del_Ke | WARNING: local keys directory [D:\AUTOMIC\DUAS\COMPANY_NODE\data\exp\local] is not empty.
|INFO |X|IO |pid=p.t| o_Auth_ILocal_OWLS_Del_Ke | WARNING: deleting local keys before START.
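
The same condition can be checked before a start is attempted by listing what remains in the local keys directory. This is a small Python sketch, not a Dollar Universe tool: the function name leftover_local_keys is hypothetical, and the directory to pass in is the one reported in your own log message.

```python
from pathlib import Path


def leftover_local_keys(local_dir):
    """Return the sorted names of entries still present in the directory."""
    path = Path(local_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"not a directory: {path}")
    return sorted(entry.name for entry in path.iterdir())
```

For the node in this article the call would be `leftover_local_keys(r"D:\AUTOMIC\DUAS\COMPANY_NODE\data\exp\local")`. An empty list suggests the previous stop cleaned up as expected, while any remaining entries match the warning shown above.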

Cause

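The Windows Failover Cluster role for Dollar Universe is not properly configured. As a result, the node on Node 1 was not stopped cleanly during failover and its engine processes were left running.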

Environment

Release : 6.x

Component : DOLLAR UNIVERSE

The Dollar Universe node is installed in cluster mode on Windows and added as a role in the Windows Failover Cluster tool.

 

Resolution

Steps to check:

- Make sure that a dedicated virtual hostname and virtual IP address are assigned to the Dollar Universe resources (not the ones used for the Windows Failover Cluster itself or for SQL Server).

- Edit the Dollar Universe resource definition in the Windows Failover tool so that the IO_X service depends on the virtual hostname, the virtual IP address and the shared folder.

- There is no need to add the EEP/JEE services to the Dollar Universe cluster definition, as they are started and stopped automatically together with the IO service.

Afterwards:

- Once this is done, stop the IO service and verify that all related engines have stopped before attempting to start the node again.

- In this case, killing the leftover DQM / CDJ / BVS / GSI / EEP processes on the other server fixed the issue. Restarting the associated Windows server achieves the same result.

Additional Information

Try registering the node where the issue occurs with the unims command, where XXXX is the dedicated hostname or virtual IP address (using the IP address is the safest option, as it is unique):

unims -update -host XXXX
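
If this registration is scripted, a small wrapper can build the argument list and flag when a hostname is passed instead of an IP address. The Python sketch below is a hedged example: the function name unims_update_command is hypothetical, it uses only the -update and -host options shown above, and it assumes unims is reachable on the PATH of a session where the Dollar Universe environment is loaded.

```python
import ipaddress


def unims_update_command(host):
    """Build the argument list for `unims -update -host <host>`.

    Returns (argv, is_ip) so the caller can warn when a hostname is used
    rather than an IP address.
    """
    host = host.strip()
    if not host:
        raise ValueError("host must not be empty")
    try:
        ipaddress.ip_address(host)
        is_ip = True
    except ValueError:
        is_ip = False
    return ["unims", "-update", "-host", host], is_ip
```

The returned list can be handed to `subprocess.run(argv, check=True)` on the node. The function only builds the command and never runs it, so it can be reviewed before execution.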
