Windows Agent in cluster doesn't retrieve the second Automation Engine when the first is down
Article ID: 84433
CA Automic Workload Automation - Automation Engine
AUTOMIC WORKLOAD AUTOMATION
Error Message: U02000042 Connection aborted. Error code '10053', error description: 'An established connection was aborted by the software in your host machine.'.
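To check whether an agent has hit this condition, its log can be searched for the message above. The following is a minimal sketch that matches only the message text quoted in this article; the function name `find_connection_aborts` and the idea of passing the log in as a list of lines are illustrative assumptions, not part of any Automic tooling. Error code 10053 is the Windows Sockets code WSAECONNABORTED.

```python
import re

# Matches the U02000042 message text quoted in this article and captures
# the numeric socket error code (10053 in the reported case).
ABORT_PATTERN = re.compile(
    r"U02000042 Connection aborted\. Error code '(?P<code>\d+)'"
)


def find_connection_aborts(lines):
    """Return (line_number, error_code) for each U02000042 message found.

    `lines` is any iterable of log lines; how the log is obtained is left
    to the caller (hypothetical helper, not an Automic API).
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = ABORT_PATTERN.search(line)
        if match:
            hits.append((number, int(match.group("code"))))
    return hits
```

The result can be filtered for code 10053 to separate this case from other aborted connections.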
This issue affects an Active-Active Automation Engine (AE) cluster in which agents are expected to switch from one Engine to the other when their current Engine fails (OS issue, network issue, shutdown of the virtual machine, ...). When an Engine is shut down gracefully, agents appear to connect automatically to the Communication Processes (CPs) of the other Engine, as expected.
However, when an Engine goes down abruptly (for example, its virtual machine is powered off), the connected agents do not automatically fail over to the CPs running on the other active AE.
This happens for all Windows agents in a clustered environment.
Connect Agents to a given AE CP in an Active-Active cluster
Take the node offline, either by a sudden power-off or by a network disconnection
Agents connected to the now-offline instance are unaware of the failover
They take 15 to 18 minutes to notice the lost connection; after that, they simply show as offline
As soon as the AE server is taken offline, it is no longer possible to communicate with the affected agents.
Cause type: Defect
Root Cause: The agent does not use the KEEP_ALIVE variable correctly, so it does not reconnect directly to the second Automation Engine.
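For illustration only, the expected behavior (notice a dead connection, then move on to another CP) can be sketched with plain TCP sockets. This is not the agent's implementation and does not use the Automic KEEP_ALIVE variable: it substitutes the operating system's generic `SO_KEEPALIVE` socket option, and the function name `connect_with_failover` and the endpoint list are hypothetical.

```python
import socket


def connect_with_failover(endpoints, timeout=3.0):
    """Try each (host, port) in order and return (socket, endpoint) for the
    first one that accepts a connection.

    Hypothetical helper: a generic TCP sketch, not the Automic agent logic.
    """
    last_error = None
    for host, port in endpoints:
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # Endpoint unreachable or refused: remember why, try the next one.
            last_error = exc
            continue
        # Ask the OS to probe the connection while it is idle, so a peer that
        # vanished without closing the socket is eventually reported as an error.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        return sock, (host, port)
    raise ConnectionError(f"no endpoint reachable: {last_error}")
```

A caller would invoke the function again with the full endpoint list whenever a send or receive on the returned socket fails. How quickly a silent peer is detected depends on the operating system's keepalive timers, which this sketch leaves at their defaults.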
OS: All Windows
Update to a fix version listed below or a newer version if available.
Fix Status: In Progress
Fix Version(s):
Automation Engine 12.2.0 - Planned release date: 2018-06-19
Automation Engine 12.1.1 - Available
Automation Engine 12.0.4 - Available
Workaround: Completely restart the Windows Agent.