Detailed Description and Symptoms
Investigation
The two settings to look at are in the ucsrv.ini file as described below:
*) PrimaryMode=
The first change is to set the following in the ucsrv.ini file on all Automation Engine components:
PrimaryMode=1
This causes the primary work process (PWP) to focus only on PWP tasks rather than also working on regular work process (WP) tasks.
If PrimaryMode is set to 0, the PWP does regular WP work in addition to its own specific tasks. A regular WP may occasionally make a database call that, for a variety of reasons, takes more than 10 minutes. If the PWP is the process handling that call, it does no other work during that time and the entire system can go down. Setting PrimaryMode to 1 greatly reduces the chance of this happening.
More information on this setting can be found in the documentation under Administrator Manual, Configuration, Structure of the Configuration Files, Automic Servers.
*) srvquery=
The second important setting to have in the ucsrv.ini file on all Automation Engine components is:
srvquery=1
This setting causes the keepalive sent to the agent to be handled by the communication processes (CPs) rather than the WPs, relieving some of the load on the WPs.
There are some drawbacks to using srvquery=1. The "Last check" column in the agent overview is no longer updated. In addition, if an agent and the CP it is connected to go down at the same moment, there is a chance that the system will not recognize the agent as stopped, because no CP is available to report this to the Automation Engine server. This has been observed in 12.x, but has not yet been observed in 21.0.
More information on this setting can be found in the documentation under Administrator Manual, Configuration, Structure of the Configuration Files, Automic Servers.
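To confirm that both settings are in place on each Automation Engine component, the ucsrv.ini file can be checked with a small script. The following is only a minimal sketch in Python and is not part of the product: the function name check_ucsrv_settings and the sample file content (including its section name) are hypothetical. It assumes that lines starting with ";" are comments, that setting names can be matched case-insensitively, and it ignores which section a setting appears in.

```python
# Recommended values taken from the article above.
RECOMMENDED = {"primarymode": "1", "srvquery": "1"}


def check_ucsrv_settings(ini_text):
    """Return {setting: (found_value_or_None, matches_recommendation)}."""
    found = {}
    for raw in ini_text.splitlines():
        line = raw.strip()
        # Assumption: ';' starts a comment line and '[...]' is a section header.
        if not line or line.startswith(";") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key = key.strip().lower()  # assumption: names are case-insensitive
        if key in RECOMMENDED:
            found[key] = value.strip()
    return {
        name: (found.get(name), found.get(name) == wanted)
        for name, wanted in RECOMMENDED.items()
    }


if __name__ == "__main__":
    # Hypothetical excerpt; in practice, read the real ucsrv.ini instead.
    sample = "[EXAMPLE]\n; sample only\nPrimaryMode=0\nsrvquery=1\n"
    for name, (value, ok) in check_ucsrv_settings(sample).items():
        print(f"{name}: value={value!r} {'OK' if ok else 'CHECK'}")
```

A setting that is missing or commented out is reported with a value of None, so it is flagged for review in the same way as a setting with a different value.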