Agent status not detected by the Automation Engine

Products

CA Automic Workload Automation - Automation Engine

Issue/Introduction

In rare cases the connection status of the Agent is not updated in the Automation Engine.

This can lead to a situation where Jobs are waiting in status 'Start Initiated' where 'Waiting for Host' would be expected.

The errors in the Agent log associated with the issue are:

U02002036 Could not receive anything from partner '*SERVER'. Error code '110(Connection timed out),S(11(ID=3))'.
or
U02002036 Could not receive anything from partner '*SERVER'. Error code '104(Connection reset by peer),S(11(ID=3))'.
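
The numeric codes in these messages are operating system errno values. As a minimal sketch, assuming the agent runs on Linux (where 104 is ECONNRESET and 110 is ETIMEDOUT), the codes can be looked up with the Python standard library. The helper name `describe_errno` is hypothetical and not part of any Automic tooling.

```python
import errno
import os


def describe_errno(code: int) -> tuple[str, str]:
    """Return (symbolic name, OS message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'ETIMEDOUT'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the message text the OS associates with the code.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 and 110 are the codes quoted in the agent log (Linux numbering).
    for code in (104, 110):
        print(code, *describe_errno(code))
```

The numbering is platform specific, so the lookup is only meaningful when run on the same operating system as the agent.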

Environment

Release : 12.3

Component : AUTOMATION ENGINE

Cause

In most situations the connection between the Agent and the AE is terminated by either side and detected as such.

However, if an element in the middle, like a router, fails, this is not always detected by the Agent or the AE.

Resolution

STEP 1

First check the setting of KEEP_ALIVE in the UC_HOSTCHAR_* variable applicable to the agent. Set the value between 150 and 300 for the specific agent to test whether this resolves the issue.

(If there is no specific UC_HOSTCHAR_* for the affected agent, create one, because a low value creates overhead when it applies to all agents.) See this KB:

https://knowledge.broadcom.com/external/article?articleId=88825

STEP 2

If Step 1 doesn't resolve the situation, try the following:

To accommodate the connection of 10,000+ (or even 100,000+) agents to the Automation Engine, the connection handling was moved from our application (agent / AE) to the TCP stack.

The parameters that control connection handling at the level of the TCP stack are the *tcp_keepalive* parameters:

*tcp_keepalive_time (idle time until the first probe; the default value on most Linux distributions is 7200 seconds)
*tcp_keepalive_intvl (probe interval)
*tcp_keepalive_probes (number of probes)

(You can display these with the command 'sysctl -a | grep tcp_keepalive'.)
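
The same values can also be read programmatically. The following is a minimal sketch, assuming a Linux host where the parameters are exposed under /proc/sys/net/ipv4; the function name `read_keepalive_settings` is hypothetical and chosen only for illustration.

```python
from pathlib import Path

# The three kernel parameters named above.
KEEPALIVE_PARAMS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive_settings(base: Path = Path("/proc/sys/net/ipv4")) -> dict[str, int]:
    """Read the current keepalive parameters as integers from the given directory."""
    # Each parameter is a small text file containing a single integer.
    return {name: int((base / name).read_text().strip()) for name in KEEPALIVE_PARAMS}
```

Calling `read_keepalive_settings()` on the agent host and on the AE host shows whether both sides still use the distribution defaults.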

Reducing *tcp_keepalive_time to 300, 600 or 900 seconds normally allows a disconnection to be detected after roughly 5, 10 or 15 minutes, as required. The probe phase (*tcp_keepalive_intvl multiplied by *tcp_keepalive_probes) adds to that time.
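
How the three parameters combine can be sketched with simple arithmetic. This is an approximation for an idle connection, assuming the common Linux defaults of 75 seconds for the probe interval and 9 probes; the function name `keepalive_detection_seconds` is hypothetical.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int = 75, probes: int = 9) -> int:
    """Approximate time for an idle, dead connection to be declared broken.

    The first probe is sent after time_s seconds of idleness; the connection
    is dropped once `probes` probes, sent intvl_s seconds apart, go unanswered.
    """
    return time_s + intvl_s * probes


if __name__ == "__main__":
    # Compare the default idle time with the reduced values suggested above.
    for idle in (7200, 900, 600, 300):
        print(idle, keepalive_detection_seconds(idle))
```

With the defaults (7200, 75, 9) this gives 7875 seconds, a little over two hours. Reducing only tcp_keepalive_time to 300 gives 975 seconds, so the interval and probe count may also need lowering if detection close to 5 minutes is required.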

STEP 3

In the case of 'isolation' an additional parameter is often required.

(Isolation means that a network element 'in the middle' fails and neither the agent nor the engine is notified that the connection has been cut. The OS eventually reports this with error codes 104 and 110, as indicated above.)

The parameter that can help to detect these situations is:

*tcp_retries2

This parameter is a bit difficult to control because it doesn't behave 'linearly'. It does, however, offer the possibility to detect the 'disconnected' state of the connection within roughly 5 minutes with a value between 5 and 10.

The values to choose are explained here: https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html
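
The non-linear behaviour comes from exponential backoff: each retransmission waits twice as long as the previous one, up to a cap. The following is a minimal sketch of that model, assuming the Linux constants of 200 ms for the minimum retransmission timeout and 120 s for the maximum. Real connections derive their timeout from the measured round-trip time, so actual values will differ; the function name `hypothetical_timeout_ms` is an illustration only.

```python
def hypothetical_timeout_ms(retries2: int, rto_min_ms: int = 200, rto_max_ms: int = 120_000) -> int:
    """Model the time before the kernel gives up on unacknowledged data.

    Sums retries2 + 1 retransmission timeouts, doubling each time and
    capping at rto_max_ms (assumed constants, see the text above).
    """
    total = 0
    rto = rto_min_ms
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2, rto_max_ms)  # exponential backoff with a ceiling
    return total


if __name__ == "__main__":
    # Print the modelled timeout in seconds for a range of tcp_retries2 values.
    for n in range(5, 16):
        print(n, hypothetical_timeout_ms(n) / 1000)
```

Under these assumptions a value of 5 corresponds to about 12.6 seconds, 10 to about 324.6 seconds, and the default of 15 to about 924.6 seconds, which is why small changes at the upper end of the range have a much larger effect than at the lower end.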

Some more information about tcp_keepalive and tcp_retries2 is available here: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/