CP crashes with "Connection to agent already exists" while the Windows OS is being updated

Article ID: 253773

Products

CA Automic Workload Automation - Automation Engine

Issue/Introduction

Several CP processes crash and generate a core dump (the Automation Engine runs on Linux) when Windows Agents log off and on repeatedly within a short period that coincides with a Windows operating system update.

Snippet of the messages that appear in the associated CP log:

20220923/000236.998 - U00003366 Connection to agent 'AGENT1' already exists (old connection '*CP006#00000711', new connection '*CP003#00001441').
20220923/000237.002 - U00003407 Client connection '1441(67)' from 'x.x.x.x:49617' has logged off from the Server.
20220923/000237.002 - U00003397 Agent 'AGENT1' logged off (client connection='1441').
20220923/000237.007 - U00003406 Client connection '1442(68)' from 'x.x.x.x:49618' has logged on to the Server.
20220923/000237.011 - U00003412 Agent 'AGENT1' logged on (Client connection='1442').
20220923/000237.019 - U00003366 Connection to agent 'AGENT1' already exists (old connection '*CP006#00000711', new connection '*CP003#00001442').
20220923/000237.024 - U00003407 Client connection '1442(67)' from 'x.x.x.x:49618' has logged off from the Server.
20220923/000237.024 - U00003397 Agent 'AGENT1' logged off (client connection='1442').
20220923/000237.029 - U00003406 Client connection '1443(68)' from 'x.x.x.x:49619' has logged on to the Server.
20220923/000237.034 - U00003412 Agent 'AGENT1' logged on (Client connection='1443').
20220923/000237.179 - U00003406 Client connection '1444(69)' from 'y.y.y.y:60395' has logged on to the Server.
20220923/000237.184 - U00003412 Agent 'AGENT2' logged on (Client connection='1444').
20220923/000237.191 - U00003366 Connection to agent 'AGENT2' already exists (old connection '*CP003#00000966', new connection '*CP003#00001444').
20220923/000237.191 - U00003365 Checking if agent 'AGENT2' responds to connection '*CP003#00000966' .
20220923/000240.976 - U00003407 Client connection '1444(68)' from 'y.y.y.y:60395' has logged off from the Server.
20220923/000240.976 - U00003397 Agent 'AGENT2' logged off (client connection='1444').
20220923/000242.035 - U00003407 Client connection '1066(67)' from '159.103.213.179:63274' has logged off from the Server.
20220923/000242.035 - U00003397 Agent 'SRP13090WN.JULIUSBAER.COM' logged off (client connection='1066').
20220923/000330.474 - U00003406 Client connection '1445(68)' from 'y.y.y.y:49917' has logged on to the Server.
20220923/000330.542 - U00003412 Agent 'AGENT2' logged on (Client connection='1445').
20220923/000330.611 - U00003366 Connection to agent 'AGENT2' already exists (old connection '*CP003#00000966', new connection '*CP003#00001445').

The CP process then terminates abnormally and generates a core file, without writing an ending message to the log.

The core analysis gives the following:

Core was generated by `/opt/uc4/server/bin/ucsrvcp /opt/uc4/server/bin/ucsrv.ini -svc8872'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f1a650ca4fb in __flockfile (stream=0x402e) at ../nptl/sysdeps/pthread/flockfile.c:28
28 _IO_lock_lock (*stream->_lock);
(gdb) bt full
#0 0x00007f1a650ca4fb in __flockfile (stream=0x402e) at ../nptl/sysdeps/pthread/flockfile.c:28
        __self = 0x7ffdcd4f1730
        __self = 0x7ffdcd4f1730
#1 0x00007f1a6293fb12 in ?? ()
No symbol table info available.
#2 0x0000000000000000 in ?? ()
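
To see which Agents produce the U00003366 pattern shown in the log snippet above, the CP log can be scanned and the messages counted per Agent. The following is a minimal sketch, not a product tool: the function name `count_duplicate_connections` is hypothetical, and the regular expression assumes the message layout is exactly the one shown in the snippet.

```python
import re
from collections import Counter

# Matches the U00003366 message layout shown in the CP log snippet
# (assumption: other releases may format the message differently).
DUPLICATE_RE = re.compile(
    r"^(?P<ts>\d{8}/\d{6}\.\d{3}) - U00003366 Connection to agent "
    r"'(?P<agent>[^']+)' already exists "
    r"\(old connection '(?P<old>[^']+)', new connection '(?P<new>[^']+)'\)"
)


def count_duplicate_connections(lines):
    """Return a Counter mapping agent name -> number of U00003366 messages."""
    counts = Counter()
    for line in lines:
        match = DUPLICATE_RE.match(line.strip())
        if match:
            counts[match.group("agent")] += 1
    return counts
```

The function accepts any iterable of log lines, so an open file handle on the CP log can be passed in directly. Agents with a high count during a patching window are candidates for the KEEP_ALIVE workaround described under Resolution.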

Environment

Release : 12.3.x

Component: Automation Engine

Area impacted: CP process

Cause

The cause could not be fully determined, but it appears that the Windows Agent ends ungracefully during OS patching.

Due to a particular network configuration, the socket closure on the Agent side does not seem to reach the CP to which the Agent was connected. The CP (and the whole AE system) therefore still considers the Agent connected, and the CP process crashes when it tries to reuse a socket that is no longer valid.

Resolution

Workaround:

  1. To avoid the issue for a particular subset of Agents only:
    Reduce KEEP_ALIVE in UC_HOSTCHAR_* for the impacted Agents:
    https://docs.automic.com/documentation/webhelp/english/AA/12.3/DOCU/12.3/Automic%20Automation%20Guides/Content/AWA/Variables/UC_HOSTCHAR_DEFAULT.htm#link24
    -> The default is 600; set it to 60.
    The keep-alive is then sent every 60 seconds, so offline Agents are detected faster until the network configuration is corrected.
  2. Otherwise (this affects all Agents), to avoid this kind of double-connection check, lower the corresponding system setting, or set tcp_KeepAlive_Time to a lower value in the Automation Engine configuration file (ucsrv.ini):
    tcp_KeepAlive_Time=60

    With this setting, the operating system notifies the socket sooner whether the connection is still valid.
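
As a generic illustration of what an operating-system keep-alive does on a socket, the sketch below enables TCP keep-alive probing on a plain Python socket and, on Linux, shortens the idle time before the first probe to 60 seconds. This is not the Automation Engine's implementation; the helper name `enable_keepalive` is hypothetical, and the assumption that tcp_KeepAlive_Time corresponds to this idle time is not confirmed by this article.

```python
import socket


def enable_keepalive(sock, idle_seconds=60):
    """Turn on TCP keep-alive probing and, where supported, set the idle time."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is available on Linux; other platforms use different options,
    # so the idle time is only set when the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock
```

With a shorter idle time, the operating system probes a silent peer sooner, so a connection whose remote end vanished without closing the socket is reported as dead earlier.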

Solution:

Fix the network connectivity between the Windows Agents and the AE so that the socket-closing information reaches the CPs (AE) when the system hosting the Agents is shut down or patched.

This problem should NOT occur on version 21, where the Agents connect to a JCP process instead. No further corrections will be delivered for 12.3.x, as it was already at End of Maintenance when this problem was discovered.