Windows Agent in a cluster doesn't reconnect to the second Automation Engine when the first is down
Article ID: 84433
Updated On:
Products
CA Automic Workload Automation - Automation Engine
Issue/Introduction
Error Message: U02000042 Connection aborted. Error code '10053', error description: 'An established connection was aborted by the software in your host machine.'.
In an Active-Active Automation Engine (AE) cluster, agents are expected to switch from one Engine to the other when an Engine fails (OS issue, network issue, shutdown of the virtual machine, ...). When an Engine is shut down gracefully, agents automatically connect to the other Communication Processes (CPs) as expected.
But when an Engine goes down abruptly (for example, the virtual machine is powered off), the connected agents do not automatically fail over to the CPs running on the other active AE.
This affects all Windows agents in a clustered environment.
Investigation
Connect Agents to a given AE CP in an Active-Active cluster
Take the node offline, either by a sudden power-off or by a network disconnection
Agents connected to the now-offline instance are unaware of the failover
They take 15 to 18 minutes to notice; after that they simply show as offline
As soon as the AE Server is taken offline, it is no longer possible to communicate with the given Agent.
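While reproducing the steps above, it can help to confirm from the agent host that the CPs of the surviving Engine are actually accepting TCP connections, which rules out a plain network problem. The following is a minimal Python sketch, not an Automic tool. The function names are illustrative, and the host names and port used in the usage note are placeholders that must be replaced with the CP addresses of your own cluster.

```python
import socket
from typing import Dict, List, Tuple


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def report(cps: List[Tuple[str, int]], timeout: float = 3.0) -> Dict[str, bool]:
    """Probe each (host, port) pair and return a map 'host:port' -> reachable."""
    return {f"{host}:{port}": is_reachable(host, port, timeout) for host, port in cps}
```

For example, report([("ae-node-1.example", 2217), ("ae-node-2.example", 2217)]) returns one entry per CP. If the second node's CPs are reachable while the agent still shows as disconnected, the problem lies in the agent's failover behavior rather than in the network.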
Cause
Cause type: Defect
Root Cause: The Agent does not use the KEEP_ALIVE variable correctly, so it does not reconnect directly to the second Automation Engine.
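To illustrate why a keep-alive interval matters here: when a peer disappears without closing the connection, the only way for a client to notice is that expected traffic stops arriving. The Python sketch below shows that general principle only. It is not the Agent's actual implementation, and the class name, the tolerance of two missed intervals, and the endpoint strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverMonitor:
    """Tracks keep-alive traffic and rotates to the next endpoint on silence."""

    endpoints: List[str]        # hypothetical CP addresses, e.g. "ae-node-1:2217"
    keep_alive_seconds: float   # expected interval between keep-alive messages
    missed_allowed: int = 2     # assumed tolerance: missed intervals before failover
    last_seen: float = 0.0      # timestamp of the last message from the peer
    current: int = 0            # index of the endpoint currently in use

    def record_message(self, now: float) -> None:
        """Call whenever any message arrives from the current endpoint."""
        self.last_seen = now

    def is_peer_silent(self, now: float) -> bool:
        """True once the peer has been quiet longer than the tolerated window."""
        return (now - self.last_seen) > self.keep_alive_seconds * self.missed_allowed

    def fail_over(self, now: float) -> str:
        """Switch to the next endpoint in the list and restart the silence timer."""
        self.current = (self.current + 1) % len(self.endpoints)
        self.last_seen = now
        return self.endpoints[self.current]
```

With a 60-second interval and two tolerated misses, such a monitor would switch endpoints after roughly two minutes of silence, instead of waiting for the operating system to give up on the dead connection.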
Environment
OS: All Windows
Resolution
Update to a fix version listed below or a newer version if available.
Fix Status: In Progress
Fix Version(s):
Automation Engine 12.2.0 - Planned release date: 2018-06-19
Automation Engine 12.1.1 - Available
Automation Engine 12.0.4 - Available
Additional Information
Workaround: Completely restart the Windows Agent.