In rare cases the connection status of the Agent is not updated in the Automation Engine.
This can lead to a situation where Jobs are waiting in status 'Start Initiated' where 'Waiting for Host' would be expected.
The errors in the Agent log associated with the issue are:
U02002036 Could not receive anything from partner '*SERVER'. Error code '110(Connection timed out),S(11(ID=3))'.
U02002036 Could not receive anything from partner '*SERVER'. Error code '104(Connection timed out),S(11(ID=3))'.
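To check whether an Agent log contains these messages, a small script can scan it for message number U02002036 and extract the partner name and OS error code. The following is only a minimal sketch: the function name find_receive_errors is hypothetical, and the pattern is based solely on the two sample lines above, so other variants of the message may not match.

```python
import re

# Pattern derived from the two sample log lines above (an assumption:
# other variants of the message may be formatted differently).
RECEIVE_ERROR = re.compile(
    r"U02002036 Could not receive anything from partner '(?P<partner>[^']*)'\. "
    r"Error code '(?P<errno>\d+)\((?P<text>[^)]*)\)"
)


def find_receive_errors(lines):
    """Return (partner, errno, text) for every U02002036 line found."""
    hits = []
    for line in lines:
        match = RECEIVE_ERROR.search(line)
        if match:
            hits.append(
                (match.group("partner"), int(match.group("errno")), match.group("text"))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "U02002036 Could not receive anything from partner '*SERVER'. "
        "Error code '110(Connection timed out),S(11(ID=3))'.",
        "some unrelated log line",
    ]
    for partner, errno, text in find_receive_errors(sample):
        print(partner, errno, text)
```

To scan a real log, pass an open file object instead of the sample list; the location of the Agent log depends on the installation.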
In most situations the connection between the Agent and the AE is terminated by one of the two sides and detected as such.
However, if an element in the middle, like a router, fails, this is not always detected by the Agent or the AE.
Release : 12.3
Component : AUTOMATION ENGINE
First, check the KEEP_ALIVE setting in the UC_HOSTCHAR_* that applies to the agent. Set the value to between 150 and 300 for the specific agent to test whether this resolves the issue.
(If there is no specific UC_HOSTCHAR_* for the affected agent, create one, because a low value creates overhead when it applies to all agents.) See this KB:
If Step 1 doesn't resolve the situation try the following:
To accommodate the connection of 10,000+ (or even 100,000+) agents to the Automation Engine, the connection handling was moved from our application (agent / AE) to the TCP stack.
The parameters that control connection handling at the level of the TCP stack are the tcp_keepalive* parameters:
tcp_keepalive_time (time until first probe)
tcp_keepalive_intvl (probe interval)
tcp_keepalive_probes (number of probes)
Normally, these will help to detect disconnection after 5, 10 or 15 minutes as required.
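How the three parameters combine can be shown with a short calculation. The sketch below assumes a Linux host, where the values are exposed under /proc/sys/net/ipv4, and uses the common approximation that a dead peer is detected after tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds of idle time. The function names and the example values 300 / 60 / 5 are illustrative, not recommendations from this article.

```python
from pathlib import Path

# Linux-specific location of the keepalive settings (assumption: Linux host).
PROC_DIR = Path("/proc/sys/net/ipv4")
NAMES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Idle time until first probe plus one interval per unanswered probe."""
    return time_s + intvl_s * probes


def read_keepalive_settings(proc_dir=PROC_DIR):
    """Read the current values from /proc; raises OSError if unavailable."""
    return {name: int((proc_dir / name).read_text()) for name in NAMES}


if __name__ == "__main__":
    # Linux defaults: 7200 s, 75 s, 9 probes
    print(keepalive_detection_seconds(7200, 75, 9))
    # Illustrative tuned values giving roughly 10 minutes
    print(keepalive_detection_seconds(300, 60, 5))
    try:
        current = read_keepalive_settings()
        print(current, keepalive_detection_seconds(*(current[n] for n in NAMES)))
    except OSError:
        print("keepalive settings not readable on this system")
```

With the Linux defaults the detection time is more than two hours, which is why these values usually need to be lowered to reach the 5 to 15 minute range.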
In the case of 'isolation' an additional parameter is often required.
(Isolation means that a network element 'in the middle' fails and neither agent nor engine is alerted that the connection is cut. The OS eventually reports this with error codes 104 and 110, as shown above.)
The parameter that can help to detect these situations is:
tcp_retries2
This parameter is somewhat difficult to control because it does not behave linearly. It does, however, make it possible to detect the 'disconnected' state of the connection in under 5 minutes with a value between 5 and 10.
The values to choose are explained here: https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html
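The non-linear behaviour comes from the exponential backoff of TCP retransmissions. As a rough sketch, assuming the Linux defaults of a 200 ms minimum and a 120 s maximum retransmission timeout (RTO), the hypothetical time before the connection is dropped can be estimated as below. The real RTO is derived from the measured round-trip time, so these figures are lower bounds; the function name is illustrative.

```python
RTO_MIN_S = 0.2    # assumed Linux minimum retransmission timeout
RTO_MAX_S = 120.0  # assumed Linux maximum retransmission timeout


def retries2_timeout_seconds(retries2, rto_min=RTO_MIN_S, rto_max=RTO_MAX_S):
    """Sum the RTO of the first send and each retransmission.

    The RTO doubles after every unacknowledged transmission and is
    capped at rto_max, which is what makes the result non-linear.
    """
    return sum(min(rto_min * 2 ** attempt, rto_max) for attempt in range(retries2 + 1))


if __name__ == "__main__":
    for value in (5, 8, 10, 15):
        seconds = retries2_timeout_seconds(value)
        print(f"tcp_retries2={value:2d} -> about {seconds:.1f} s ({seconds / 60:.1f} min)")
```

Under these assumptions the default value of 15 corresponds to roughly 15 minutes, while values below 10 stay under about 5 minutes.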
More information about tcp_keepalive and tcp_retries2 is available here: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/