Error in job log with Unix TLS agents: SESSION_ERROR TLS-handshake The socket was closed due to a timeout
The problem occurs with Automic Automation Kubernetes Edition (21.0.4+build.37) and Unix (Linux) agents 21.0.4+build.13.
The agents are deployed on premises and the AE JCPs run in a managed K8s cluster inside AWS. The two are connected over TLS, through an nginx ingress controller and the corporate firewall.
The agent connection is stable and no agent disconnections occur. However, intermittently, and mainly for longer-running jobs (1 minute or more), the following message is inserted into the job log and the job fails:
*****************************************************************************
ucxjlx6m version 21.0.4+build.13 changelist 1661882548
**** JOB 0001344298 (ProcID:0000013026) START AT 01.03.2023 / 08:42:44
**** UTC TIME 01.03.2023 / 07:42:44
**** TEXT=" Job started "
*****************************************************************************
-1 - wrong message type
20230301/084314.086 U0009909 TRACE: (wrong type error) 0x1077e80 01268
00000000 53455353 494F4E5F 4552524F 52000000 >SESSION_ERROR...<
00000010 00000000 00000000 00000000 00000000 >................<
00000020 2A414745 4E540000 00000000 00000000 >*AGENT..........<
00000030= 00000000 00000000 00000000 00000000 >................<
00000060 01000000 01000000 756E6B6E 6F776E00 >........unknown.<
00000070= 00000000 00000000 00000000 00000000 >................<
000000F0 B9851E00 2A414745 4E547C54 4C532D68 >....*AGENT|TLS-h<
00000100 616E6473 68616B65 2F312854 68652073 >andshake/1(The s<
00000110 6F636B65 74207761 7320636C 6F736564 >ocket was closed<
00000120 20647565 20746F20 61207469 6D656F75 > due to a timeou<
00000130 74290000 00000000 00000000 00000000 >t)..............<
00000140= 00000000 00000000 00000000 00000000 >................<
000004F0 00000000 >....<
-1 - timeout
-1 - timeout
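The readable part of the U0009909 trace is split across several dump rows, which makes it hard to search for. As a minimal sketch (not an Automic tool), the Python snippet below rebuilds the text from the hex words of the dump. The names DUMP_ROWS and extract_strings are hypothetical, the rows were copied by hand from the trace above with the offset column and the ASCII gutter removed, and the all-zero rows are left out.

```python
import re

# Hex words copied by hand from the U0009909 trace above.
# Offsets and the ">...<" ASCII gutter are removed; all-zero rows are skipped.
DUMP_ROWS = [
    "53455353 494F4E5F 4552524F 52000000",  # offset 00000000
    "2A414745 4E540000 00000000 00000000",  # offset 00000020
    "01000000 01000000 756E6B6E 6F776E00",  # offset 00000060
    "B9851E00 2A414745 4E547C54 4C532D68",  # offset 000000F0
    "616E6473 68616B65 2F312854 68652073",  # offset 00000100
    "6F636B65 74207761 7320636C 6F736564",  # offset 00000110
    "20647565 20746F20 61207469 6D656F75",  # offset 00000120
    "74290000 00000000 00000000 00000000",  # offset 00000130
]


def extract_strings(rows, min_length=4):
    """Return the printable ASCII runs found in the given hex dump rows."""
    # Drop the spaces between the 4-byte words and convert to raw bytes.
    raw = bytes.fromhex("".join("".join(row.split()) for row in rows))
    # Keep runs of printable ASCII, similar to the Unix `strings` utility.
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_length, raw)
    return [run.decode("ascii") for run in runs]


if __name__ == "__main__":
    for text in extract_strings(DUMP_ROWS):
        print(text)
```

The last string recovered is the full error text, which reads *AGENT|TLS-handshake/1(The socket was closed due to a timeout).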
The issue does not depend on what the job itself does: it can be reproduced with a simple Bash job containing just "sleep 1200". The error message is most likely inserted while the job messenger is running.
The job log contains just the following:
20230301/084244.011 - U02000005 Job 'JOBS.WLA.TESTCASE' with RunID '1344298' is to be started.
20230301/084244.040 - U02000003 Job 'JOBS.WLA.TESTCASE' started with RunID '1344298'.
20230301/084344.013 - U02000015 Periodical job test started.
...
20230301/090130.009 - U02000009 Job 'JOBS.WLA.TESTCASE' with RunID '1344298' ended with return code '15'.
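To relate the trace entry to the job's lifetime, the timestamps can be compared directly. The following is a small Python sketch using only values quoted in this article. The helper names are hypothetical, and it assumes the trace timestamp in the job report and the job log timestamps use the same local time zone.

```python
from datetime import datetime

# Timestamp layout used in the log lines: YYYYMMDD/HHMMSS.mmm
STAMP_FORMAT = "%Y%m%d/%H%M%S.%f"

# Values copied from the job log and the U0009909 trace line.
JOB_STARTED = "20230301/084244.040"  # U02000003, job started
TRACE_ERROR = "20230301/084314.086"  # U0009909, wrong type error
JOB_ENDED = "20230301/090130.009"    # U02000009, ended with return code 15


def parse_stamp(stamp):
    """Parse one log timestamp into a datetime object."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def seconds_between(first, second):
    """Return the number of seconds from `first` to `second`."""
    return (parse_stamp(second) - parse_stamp(first)).total_seconds()


if __name__ == "__main__":
    print("start -> trace error:", seconds_between(JOB_STARTED, TRACE_ERROR))
    print("start -> job end:    ", seconds_between(JOB_STARTED, JOB_ENDED))
```

For this run the trace entry was written about 30 seconds after the job started, and the job ended after roughly 1126 seconds.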
Release : 21.0.4
The Unix job messenger disconnection issue was fixed in 21.0.5 HF1.