Jobs get stuck in STARTING, statuses do not change, errors CAUAJM_W_00068 / CAUAJM_W_00060 seen in scheduler or application server logs
search cancel

Jobs get stuck in STARTING, statuses do not change, errors CAUAJM_W_00068 / CAUAJM_W_00060 seen in scheduler or application server logs

book

Article ID: 194075

calendar_today

Updated On:

Products

CA Workload Automation AE - Business Agents (AutoSys) CA Workload Automation AE - System Agent (AutoSys) CA Workload Automation AE - Scheduler (AutoSys) Workload Automation Agent Autosys Workload Automation

Issue/Introduction

AutoSys agent is marked offline by scheduler.  Jobs get stuck in STARTING,  statuses do not change.  Below errors are seen in scheduler or application server logs.

==========================

[06/11/2020 19:41:01]      CAUAJM_W_00060 Timed out waiting for incoming data at socket [70] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[0:N/A]
[06/11/2020 21:05:27]      CAUAJM_W_00060 Timed out waiting for incoming data at socket [5] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23797:WA_AGENT]
[06/11/2020 21:05:27]      CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [71] after 4 seconds. Discarding socket... [<IP_MASKED>] 

========================

[06/11/2020 22:02:55]      CAUAJM_I_40245 EVENT: ALARM            ALARM: AUTO_PING        JOB: <job_name> MACHINE:  <hostname>.<example.>.<com>
[06/11/2020 22:02:55]      <COMM_ERR_14 Agent on machine [<hostname>.<example.>.<com>] has not acknowledged this job request. Please investigate the status of this job.>
[06/11/2020 22:03:13]      CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [79] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT        WA_AGENT <hostname>.<example.>.<com>]
[06/11/2020 22:03:17]      CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [79] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT        WA_AGENT <hostname>.<example.>.<com>]

[06/11/2020 22:03:30]      CAUAJM_I_40245 EVENT: CHANGE_STATUS    STATUS: RUNNING         JOB: <job_name> MACHINE: <hostname>.<example.>.<com>
[06/11/2020 22:03:30]      CAUAJM_I_40245 EVENT: MACH_ONLINE      MACHINE: <hostname>.<example.>.<com>
[06/11/2020 22:03:30]      <The agent on machine <hostname>.<example.>.<com> is responsive. Automatically placing the machine online.>
[06/11/2020 22:03:30]      CAUAJM_I_40120 Completed 0 job start(s) for online machine<hostname>.<example.>.<com>.
[06/11/2020 22:10:11]      CAUAJM_I_40245 EVENT: CHANGE_STATUS    STATUS: STARTING        JOB: <<job_name> > MACHINE: <hostname>.<example.>.<com>
[06/11/2020 22:10:20]      CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [91] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT        WA_AGENT<hostname>.<example.>.<com>]
[06/11/2020 22:10:32]      CAUAJM_E_10229 Communication attempt with the CA WAAE Agent has failed! [<hostname>.<example.>.<com>:7520]
[06/11/2020 22:10:32]      CAUAJM_W_40290 Machine <hostname> is in question. Placing machine in the unqualified state.
[06/11/2020 22:10:36]      CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [90] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT        WA_AGENT<hostname>.<example.>.<com>]
[06/11/2020 22:10:41]      CAUAJM_I_40245 EVENT: ALARM            ALARM: STARTJOBFAIL     JOB: <<job_name> > MACHINE: <hostname>.<example.>.<com>
[06/11/2020 22:10:41]      <COMM_ERR_5 Communication attempt with Agent on machine [<hostname>.<example.>.<com>] has failed.>
[06/11/2020 22:10:44]      CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [90] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT        WA_AGENT<hostname>.<example.>.<com>]
[06/11/2020 22:10:48]      CAUAJM_E_10229 Communication attempt with the CA WAAE Agent has failed! [hostname:7520]
[06/11/2020 22:10:48]      CAUAJM_I_40245 EVENT: ALARM            ALARM: STARTJOBFAIL     JOB: <<job_name> > MACHINE: <hostname>.<example.>.<com>
[06/11/2020 22:10:48]      <COMM_ERR_5 Communication attempt with Agent on machine [<hostname>.<example.>.<com>] has failed.>
[06/11/2020 22:10:50]      CAUAJM_I_40245 EVENT: CHANGE_STATUS    STATUS: RESTART         JOB: <<job_name> > MACHINE: <hostname>.<example.>.<com>
[06/11/2020 22:10:52]      CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [68] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT        WA_AGENT<hostname>.<example.>.<com>]


[06/11/2020 22:12:27]      CAUAJM_W_40253 Machine <<hostname>.<example.>.<com>> is not responding.  Taking offline.
[06/11/2020 22:12:28]      CAUAJM_I_40245 EVENT: ALARM            ALARM: MACHINE_UNAVAILABLE MACHINE: <hostname>.<example.>.<com>
[06/11/2020 22:12:28]      <Machine <<hostname>.<example.>.<com>> did not respond.>

 



Environment

Workload Automation AE (AutoSys)

Resolution

There are 3 types of error messages in the log that pertain to network performance:

  • COMM_ERR_5 - Communication attempt with Agent failed:

    [06/11/2020 19:26:20] CAUAJM_E_10229 Communication attempt with the CA WAAE Agent has failed! [<hostname>.<example.>.<com>:7520]
    [06/11/2020 19:26:20] CAUAJM_W_40290 Machine is in question. Placing machine in the unqualified state.
    [06/11/2020 19:26:20] CAUAJM_I_40245 EVENT: ALARM ALARM: STARTJOBFAIL JOB: <hostname>.<example.>.<com>_<jobname> MACHINE:  <hostname>.<example.>.<com>

    This message indicates that the scheduler is unable to establish a connection with the agent after successfully resolving the agent's node name to an IP address. Functionally, the scheduler requests a TCP/IP socket connection from the operating system but the operating system is unable to complete the TCP/IP-level connection handshake with the TCP/IP network protocol stack on the agent machine. The handshake could have failed for any number of reasons: data is lost while being routed over the network, high resource on agent machine caused it to delay to receive the TCP/IP connection request or delay to acknowledge connection request, etc.


  • CAUAJM_W_00068 Timed out waiting for acknowledgement at socket :

    This error occurs after the scheduler successfully established a TCP/IP connection and sends AutoSys protocol network data to the agent. It waits for the agent to send an acknowledgement protocol message but the scheduler never receives it. This error does not always immediately result in a failure. The scheduler sends up to 4 times before failing the job.

    Here is an example where the scheduler times out 3 times but succeeds on the final attempt:

    [06/11/2020 17:20:44] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: STARTING JOB: <jobname> MACHINE: <agenthostname>
    ...
    [06/11/2020 17:20:48] CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [68] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT WA_AGENT <hostname>.<example.>.<com>]
    [06/11/2020 17:20:52] CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [68] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT WA_AGENT <hostname>.<example.>.<com>]
    [06/11/2020 17:20:56] CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [68] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT WA_AGENT <hostname>.<example.>.<com>]
    [06/11/2020 17:20:56] CAUAJM_I_10082 [ <hostname>.<example.>.<com> connected for <jobname> 12345.67890.1]

    Here is an example where 4 timeout failures results in a STARTJOBFAIL alarm:

    [06/11/2020 22:10:20] CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [91] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT WA_AGENT <hostname>.<example.>.<com>]
    [06/11/2020 22:10:24] CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [91] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT WA_AGENT <hostname>.<example.>.<com>]
    [06/11/2020 22:10:28] CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [91] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT WA_AGENT <hostname>.<example.>.<com>]
    [06/11/2020 22:10:32] CAUAJM_W_00068 Timed out waiting for acknowledgement at socket [91] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[23946:WA_AGENT WA_AGENT <hostname>.<example.>.<com>]
    [06/11/2020 22:10:41] CAUAJM_I_40245 EVENT: ALARM ALARM: STARTJOBFAIL JOB: <jobname> MACHINE: <hostname>.<example.>.<com>


  • CAUAJM_W_00060 Timed out waiting for incoming data at socket

    The scheduler displays this error when the agent successfully connects to the scheduler's agent listener port but does not send any data:

    [06/11/2020 17:00:40] CAUAJM_W_00060 Timed out waiting for incoming data at socket [5] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[0:N/A]
    [06/11/2020 17:00:44] CAUAJM_W_00060 Timed out waiting for incoming data at socket [5] after 4 seconds. Discarding socket... [<IP_MASKED>] qid:src[0:N/A]

    While the scheduler is timing out, its agent listener is occupied making it unable to receive job status updates from any agent.

    When the scheduler becomes so occupied that it cannot accept connections from the agent, the scheduler can internally queue at the OS level incoming socket connections with agent network data that is waiting to be acknowledged by the scheduler. If the agent does not receive any acknowledgments from the scheduler after 10 seconds, it will close the TCP/IP socket connection. If the TCP/IP close socket request for an internally queued socket is lost, the scheduler will think the connection is active and time out resulting in more CAUAJM_W_00060 errors.