What is "Keep Alive" and how does it work?
The so-called "Keep Alive" is a health-check mechanism that periodically checks whether the TCP/IP connection between the Automic Agent and the Automic Server is working.
The KEEP_ALIVE parameter is set in the UC_HOSTCHAR_DEFAULT (or the corresponding UC_HOSTCHAR_*) variable in client 0:
Time interval for the periodic Automic Automation Engine check; Allowed values: 60 and above; Default value: 600 seconds
The value that is defined here must not be less than 60 seconds. Otherwise, the default value is used.
The specified value must also result in complete minutes (such as 60, 120, 180). If you use a different value, it is rounded up to the next minute (for example, a value of 99 seconds results in 120 seconds).
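The fallback and rounding rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not Automic code; the function and constant names are made up for the example.

```python
DEFAULT_KEEP_ALIVE = 600  # default value in seconds, as described above
MINIMUM_KEEP_ALIVE = 60   # values below this fall back to the default


def effective_keep_alive(configured: int) -> int:
    """Return the interval in seconds that results from a configured value.

    Hypothetical helper: it only mirrors the documented rules.
    """
    if configured < MINIMUM_KEEP_ALIVE:
        return DEFAULT_KEEP_ALIVE
    # Round up to the next full minute using integer arithmetic.
    return ((configured + 59) // 60) * 60


for value in (30, 60, 99, 120, 600):
    print(value, "->", effective_keep_alive(value))
```

For example, a configured value of 99 gives 120, and a value of 30 gives the default of 600.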
The mechanism in detail
The mechanism works in the following way:
1. The PWP sends an EXQUERY message to the Agent and waits for an EXINFO message sent back by the Agent.
1.1. If the answer arrives at the PWP within the KEEP_ALIVE time, everything is fine. The PWP starts the next check after the KEEP_ALIVE time is over (→ back to 1.).
1.2. If the answer doesn't arrive within the KEEP_ALIVE time, the Server drops the connection to the Agent.
2. The Agent gets the KEEP_ALIVE parameter when it connects to the Server and adds 60 seconds to it. This is logged in the Agent's log (where &02 = KEEP_ALIVE + 60):
U2000017 The check interval for 'Server' has been set to '&02' seconds.
If the Agent gets no EXQUERY message within that time (KEEP_ALIVE + 60), it sends a SRVQUERY message to the Server. This happens only if, for any reason, the EXQUERY from the Server doesn't reach the Agent!
2.1. If the answer from the Server (PWP) arrives within the KEEP_ALIVE + 60 time, everything is fine. The Agent starts the next check, if necessary (→ back to 2.).
2.2. If the answer doesn't arrive within the KEEP_ALIVE + 60 time, the Agent drops the connection to the Server.
Once an Agent is disconnected (case 1.2. or 2.2.), it tries to reconnect at the reconnect interval until the reconnect succeeds.
So KEEP_ALIVE is a bi-directional health check for the Agent - Server connection, which guarantees a reconnect in case of any connection failure.
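The two checks can be modelled as a small decision sketch. This is a simplified illustration under stated assumptions, not the actual Server or Agent implementation: the function names are hypothetical, delays are given in seconds, and "within the time" is assumed to include the boundary value.

```python
from typing import Optional

AGENT_EXTRA_SECONDS = 60  # the Agent adds 60 seconds to KEEP_ALIVE


def agent_check_interval(keep_alive: int) -> int:
    """Interval the Agent logs in U2000017 (KEEP_ALIVE + 60)."""
    return keep_alive + AGENT_EXTRA_SECONDS


def server_side_check(exinfo_delay: Optional[int], keep_alive: int) -> str:
    """Case 1: the PWP sent EXQUERY and waits for EXINFO.

    exinfo_delay is the time until EXINFO arrived, or None if it never did.
    """
    if exinfo_delay is not None and exinfo_delay <= keep_alive:
        return "ok"  # case 1.1
    return "server drops connection"  # case 1.2


def agent_side_check(exquery_delay: Optional[int],
                     reply_delay: Optional[int],
                     keep_alive: int) -> str:
    """Case 2: the Agent waits for EXQUERY, else sends SRVQUERY."""
    interval = agent_check_interval(keep_alive)
    if exquery_delay is not None and exquery_delay <= interval:
        return "ok"  # EXQUERY arrived in time, no SRVQUERY needed
    # No EXQUERY in time: the Agent sends SRVQUERY and waits for the answer.
    if reply_delay is not None and reply_delay <= interval:
        return "ok"  # case 2.1
    return "agent drops connection"  # case 2.2


print(agent_check_interval(600))
print(server_side_check(None, 600))
print(agent_side_check(None, None, 600))
```

With the default KEEP_ALIVE of 600, the Agent-side interval in this sketch is 660 seconds, which corresponds to the &02 value in the U2000017 message.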
Note: Other parameters that influence the KEEP_ALIVE processing:
A) UC_SYSTEM_SETTINGS: SERVER_OPTIONS 9th digit
With this setting, the Agent is not disconnected if the time specified in KEEP_ALIVE is exceeded (→ case 1.2. above). A message is written to the Server logging and the monitoring period is extended for the time specified in KEEP_ALIVE.
B) UCSRV.INI: [CPMsgTypes], srvquery
Performance optimization if many (several thousand) Agents log on at the same time. Allowed values: "0" (default value) and "1"; (→ case 2.1. above)
"0" - The primary work process responds to the Agents' live messages.
"1" - The communication processes can process these specific messages and in doing so, they increase the Automic system performance.
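To see which process would answer the Agents' SRVQUERY messages for a given setting, the value can be read with a short Python sketch. The INI fragment below is an assumed example that contains only the section and key named above; a real UCSRV.INI holds many more sections and may not parse cleanly with Python's configparser. The helper name is hypothetical.

```python
import configparser

# Assumed sample fragment, not a complete UCSRV.INI.
SAMPLE_INI = """
[CPMsgTypes]
srvquery=1
"""


def srvquery_handler(ini_text: str) -> str:
    """Return which process answers SRVQUERY for the given INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # A missing section or key is treated as the default value "0".
    value = parser.get("CPMsgTypes", "srvquery", fallback="0").strip()
    if value == "1":
        return "communication process (CP)"
    return "primary work process (PWP)"


print(srvquery_handler(SAMPLE_INI))
```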
From my many years of Automic experience I would recommend the following settings:
I. KEEP_ALIVE: Should be default, which is 600.
In really well performing environments (Automic Server, Automic Database, Automic Agents and Network) it can also be set lower, but never lower than 300!
A setting lower than 600 should only be used for individual Agents that need the highest availability!
II. UC_SYSTEM_SETTINGS: SERVER_OPTIONS 9th digit:
Should not be used. It creates more new problems than it fixes.
III. UCSRV.INI: [CPMsgTypes], srvquery:
Should be set to 1. It makes more sense for a CP to answer a connection query than for the PWP to do so. This is good for PWP performance, especially in high PWP load situations.
Note: KEEP_ALIVE is independent of job submission and any other processing (file transfers, events, etc.) of the Agent.