Agent still shows as running after CP and agent end at the same time (srvquery=1)

Products

CA Automic Workload Automation - Automation Engine

Issue/Introduction

When an agent and the CP it is connected to stop at the same time under the following conditions, downstream issues occur:

srvquery is set to 1 in ucsrv.ini
the agent and the CP it is connected to end unexpectedly at the same time (smgr service is killed, agent and CP end abnormally, network issues, etc.)
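
To check whether the first condition applies on a given system, the value can be read from the INI file. The following is a minimal sketch, not an Automic utility: the function name `read_srvquery` and the sample text are made up for illustration, and it assumes plain `key=value` lines with `;` starting a comment. It scans every section, because the section that holds the key is not stated in this article.

```python
from typing import Optional


def read_srvquery(ini_text: str) -> Optional[int]:
    """Return the srvquery value found in the INI text, or None if absent."""
    for raw in ini_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and [section] headers.
        if not line or line.startswith((";", "#", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "srvquery":
            # Drop a trailing comment (assumed to start with ';').
            value = value.split(";", 1)[0].strip()
            return int(value) if value.isdigit() else None
    return None


# Hypothetical excerpt; a real ucsrv.ini contains many more keys.
sample = "; illustrative excerpt only\nsrvquery=1\n"
print(read_srvquery(sample))
```

For a real file, the text would be read from wherever ucsrv.ini lives on the server and passed to the same function.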

Downstream issues:

system believes the agent is still up and running
agentgroups do not resolve correctly as they still attempt to reach that agent
write_process and other agent tasks/processes do not work as expected

Environment

Release : 12.3

Component :

Resolution

This behavior is as designed in 12.3. Setting srvquery=1 in the ucsrv.ini file makes the CPs, instead of the WPs, keep track of whether agents are alive. This allows for faster reconnections and less load on the PWP, but there are some drawbacks. There are three scenarios for agents and CPs going down:

When an agent goes down and the CP is still running, the CP reaches out to the agent, sees it's no longer available and sends this information to the automation engine.
When a CP goes down and the agents are still running, they reach out to the CP, see that they cannot reach it, then reach out to the other CPs in their CP_LIST and are able to reconnect.
When the agent and CP both go down within a very short time of each other, the CP does not have a chance to see that the agent went down, and the agent does not reach out to a different CP to show that it is going down. In this case, there is no signal to the automation engine that the agent stopped.

When srvquery is set to 0, the PWP reaches out to the agents regularly to see if they are still running. If they are not, it records this in the database and the system sees that the agent is not running.
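
The scenarios and the two srvquery modes described above can be summarized in a small toy model. This is only a sketch of the reasoning in this article, not of the Automation Engine's actual implementation; the names `Outage` and `engine_learns_agent_stopped` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    agent_down: bool  # the agent process has ended
    cp_down: bool     # the CP this agent was connected to ended at the same time


def engine_learns_agent_stopped(outage: Outage, srvquery: int) -> bool:
    """Toy model: is the automation engine told that the agent stopped?"""
    if srvquery not in (0, 1):
        raise ValueError("this model only covers srvquery=0 and srvquery=1")
    if not outage.agent_down:
        # Scenario 2: the agent is alive and reconnects through CP_LIST,
        # so there is no stopped agent to report.
        return False
    if srvquery == 0:
        # The PWP polls agents itself, independent of any CP.
        return True
    # srvquery=1: only the agent's own CP can notice that it is gone.
    # Scenario 1 (CP still up) is reported; scenario 3 (CP also down) is not.
    return not outage.cp_down


scenarios = {
    "agent only": Outage(agent_down=True, cp_down=False),
    "CP only": Outage(agent_down=False, cp_down=True),
    "agent and CP together": Outage(agent_down=True, cp_down=True),
}
for name, outage in scenarios.items():
    for mode in (0, 1):
        result = engine_learns_agent_stopped(outage, mode)
        print(f"srvquery={mode}, {name}: engine notified = {result}")
```

The only combination in which an agent is down and the engine is not notified is srvquery=1 with the agent and its CP ending together, which is the case this article describes.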

In version 21.0, the same situation remains possible with non-TLS agents, which connect to a traditional CP. TLS-enabled agents (most RA agents, UNIX agents, Windows agents) now connect to a JCP, which is a Java process rather than the traditional non-Java CP. This behavior was not reproducible with TLS-enabled agents on version 21.0.

The long-term solution is to upgrade to 21.0. For version 12.3, there are two recommendations:

When bringing down systems or servers for maintenance, if agents must be brought down, stop them first and confirm in the AWI that they are shown as stopped. Then stop the CPs.
Use srvquery=0 if at all possible when 12.3 agents are connected to the system. srvquery=1 is usually used when there are thousands of agents and their simultaneous disconnects and reconnects cause problems, or when Support recommended srvquery=1 for a disconnect issue. If there are fewer than a thousand agents in the system (this number can vary from system to system) and mass disconnects and reconnects of agents do not usually cause problems, we recommend srvquery=0 rather than srvquery=1.
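
The first recommendation is an ordering rule, and it could be sketched as follows for sites that script their maintenance shutdowns. The three callables (`stop_agent`, `is_agent_stopped`, `stop_cp`) are hypothetical placeholders for whatever a site uses to stop processes and to check the agent status that the AWI shows; they are not Automic APIs, and the timeout values are arbitrary.

```python
import time
from typing import Callable, Iterable


def shutdown_in_order(
    agents: Iterable[str],
    cps: Iterable[str],
    stop_agent: Callable[[str], None],
    is_agent_stopped: Callable[[str], bool],
    stop_cp: Callable[[str], None],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> None:
    """Stop agents, wait until all are reported as stopped, then stop CPs."""
    agents = list(agents)
    for agent in agents:
        stop_agent(agent)

    # Wait until the system reflects every agent as stopped.
    deadline = time.monotonic() + timeout_s
    pending = set(agents)
    while pending:
        pending = {a for a in pending if not is_agent_stopped(a)}
        if not pending:
            break
        if time.monotonic() >= deadline:
            # Do not touch the CPs while agents still appear to be running.
            raise TimeoutError(f"agents still shown as running: {sorted(pending)}")
        time.sleep(poll_s)

    # Only now is it safe, per the recommendation, to stop the CPs.
    for cp in cps:
        stop_cp(cp)
```

Raising instead of continuing on timeout is a deliberate choice in this sketch: stopping a CP while one of its agents is not yet reflected as stopped would recreate the situation described in this article.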