Windows Agent in cluster doesn't retrieve the second Automation Engine when the first is down


Article ID: 84433


Updated On:

Products

CA Automic Workload Automation - Automation Engine
AUTOMIC WORKLOAD AUTOMATION

Issue/Introduction

Error Message:
U02000042 Connection aborted. Error code '10053', error description: 'An established connection was aborted by the software in your host machine.'.

In an Active-Active Automation Engine (AE) cluster, agents are expected to switch from one Engine to the other when their Engine fails (OS issue, network issue, shutdown of the virtual machine, ...). When an Engine is shut down gracefully, agents automatically connect to the Communication Processes (CPs) of the other Engine, as expected.

However, when an Engine goes down abruptly (for example, its virtual machine is powered off), the connected agents do not automatically fail over to the CPs running on the other active AE.

This affects all Windows agents in a clustered environment.

Investigation
  • Connect agents to a given AE CP in an Active-Active cluster.
  • Take the node offline, either by a sudden power-off or by a network disconnection.
  • Agents connected to the now-offline instance are unaware of the failover.
  • They take 15 to 18 minutes to notice; after that, they simply show as offline.
  • As soon as the AE server is taken offline, it is no longer possible to communicate with the affected agents.
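
While reproducing the steps above, it can help to confirm from the agent host that the CP on the surviving node still accepts connections, so that an agent staying offline can be attributed to the agent rather than to the network. The following is a minimal sketch using only the Python standard library. The function name, host names and port number are placeholders chosen for illustration and are not taken from this article; substitute the values of your own environment.

```python
import socket


def first_reachable(endpoints, timeout=3.0):
    """Return the first (host, port) that accepts a TCP connection, or None.

    `endpoints` is a list of (host, port) tuples. This only checks that the
    port accepts a TCP connection; it does not speak the AE protocol.
    """
    for host, port in endpoints:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue
    return None


# Example call (hypothetical host names and port, replace with your own):
# first_reachable([("ae-node1.example.com", 2217),
#                  ("ae-node2.example.com", 2217)])
```

If the probe returns the endpoint of the second node while the agent still shows as offline, the behavior matches the issue described in this article.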

Cause

Cause type:
Defect
Root Cause: The Agent does not use the KEEP_ALIVE variable correctly, so it does not reconnect directly to the second Automation Engine.
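
To make the role of a keep-alive interval concrete, the sketch below shows the general idea of keep-alive based dead-peer detection followed by a switch to the next endpoint. It is an illustration only and not the Agent's actual implementation: the function names, the `missed_allowed` threshold and the example values are assumptions introduced here.

```python
def peer_presumed_dead(last_reply_at, now, keep_alive_s, missed_allowed=2):
    """Return True once more than `missed_allowed` keep-alive intervals
    have passed without a reply from the peer.

    All times are in seconds. `missed_allowed` is a hypothetical tuning
    knob, not an Automic setting.
    """
    return (now - last_reply_at) > keep_alive_s * missed_allowed


def next_endpoint(endpoints, current):
    """Return the endpoint that follows `current`, wrapping around the list."""
    i = endpoints.index(current)
    return endpoints[(i + 1) % len(endpoints)]


# Illustration with made-up values: a 60 s interval and 2 tolerated misses
# mean the peer is presumed dead once more than 120 s pass without a reply.
cps = [("ae-node1", 2217), ("ae-node2", 2217)]  # hypothetical endpoints
if peer_presumed_dead(last_reply_at=0, now=121, keep_alive_s=60):
    target = next_endpoint(cps, cps[0])  # switch to ("ae-node2", 2217)
```

With detection of this kind, the time to notice a dead peer is bounded by a small multiple of the keep-alive interval, in contrast to the 15 to 18 minutes observed in the investigation.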

Environment

OS: All Windows

Resolution

Update to a fix version listed below or a newer version if available.

Fix Status: In Progress

Fix Version(s):
Automation Engine 12.2.0 - Planned release date: 2018-06-19
Automation Engine 12.1.1 - Available
Automation Engine 12.0.4 - Available

Additional Information

Workaround:
Restart the Windows Agent completely.