Processes stuck in running and waiting state

Products

CA Process Automation Base

Issue/Introduction

All of our workflows are in a running or waiting state. Even test processes with a single simple operator (for example, a Run Script operator that runs hostname) sit spinning at that one operator.

Environment

Release : 4.3.04+

Cause

ITPAM was experiencing a performance problem caused by a buildup of rows in the ActiveMQ_Msgs tables. To help confirm this, we used the following two queries, one per node (two are needed because this is a clustered Domain Orchestrator environment):

SELECT COUNT(*), CONTAINER
FROM [PAM_RT].[dbo].[Node0ACTIVE_MSGS] GROUP BY CONTAINER;
SELECT COUNT(*), CONTAINER
FROM [PAM_RT].[dbo].[Node1ACTIVE_MSGS] GROUP BY CONTAINER;

Note:
The count for DLQ can be ignored. The requestqueue/responsequeue counts are the important containers to keep an eye out for. Request/Response queue counts greater than 100 are eligible for the noOfConsumers property key mentioned below.
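The rule in the note above can be sketched as a small Python helper that takes the (count, CONTAINER) rows returned by either query and reports only the request/response queues above the threshold. This is an illustrative sketch, not part of the product: the function name `flag_busy_queues` is hypothetical, and the exact text of the CONTAINER values is an assumption, so the match is a case-insensitive substring test.

```python
def flag_busy_queues(rows, threshold=100):
    """Return (container, count) pairs for request/response queues over threshold.

    rows: iterable of (count, container) tuples, in the column order of the
    queries above. DLQ rows and any other containers are ignored.
    """
    # Assumption: the CONTAINER value contains one of these substrings.
    watched = ("requestqueue", "responsequeue")
    flagged = []
    for count, container in rows:
        name = (container or "").lower()
        if any(w in name for w in watched) and count > threshold:
            flagged.append((container, count))
    return flagged
```

Running the check against both node tables at regular intervals shows whether the counts keep climbing or level off.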

Reasons why the ActiveMq Msgs table may show counts that continually increase include:

Scheduled jobs whose start rate is greater than their completion rate. This indicates that new jobs are being started before the previous ones complete.
Agents that are not able to communicate properly with the Orchestrators, while the Orchestrators continue to send targeted jobs to those agents.

The second scenario can be identified by messages in the Orchestrator's c2o.log. Example:

INFO [com.optinuity.c2o.transport.Resolver] [_autoTPRecovery] Transport properties [TransportID=agentNode, Hostname=<hostname>, IPAddress=<ip_address>, Port=7003, IsSecure=false] of node a999b6e3-8862-4d0e-811c-e91876d7e501 not reachable
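To see which agents are affected, and how often, the c2o.log can be scanned for lines of this form. The following Python sketch assumes the message layout matches the example above exactly; the function name `find_unreachable_agents` and the regular expression are illustrative, not part of ITPAM.

```python
import re

# Pattern derived from the example log line above (an assumption about the
# general format): captures hostname, IP address, port, and node ID.
UNREACHABLE = re.compile(
    r"Hostname=(?P<hostname>[^,\]]*),\s*IPAddress=(?P<ip>[^,\]]*),\s*Port=(?P<port>\d+)"
    r".*?\]\s+of node (?P<node>[0-9a-fA-F-]{36}) not reachable"
)

def find_unreachable_agents(lines):
    """Count 'not reachable' messages per (node, hostname, ip, port)."""
    counts = {}
    for line in lines:
        m = UNREACHABLE.search(line)
        if m:
            key = (m["node"], m["hostname"], m["ip"], int(m["port"]))
            counts[key] = counts.get(key, 0) + 1
    return counts

# Example usage (the log path is a placeholder):
# with open("c2o.log", encoding="utf-8", errors="replace") as f:
#     for agent, hits in find_unreachable_agents(f).items():
#         print(hits, agent)
```

Agents that appear repeatedly are the first candidates for the communication checks described in the Resolution section.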

Resolution

Depending on the cause of the bottleneck, make the necessary adjustments. Based on the reasons described in the "Cause" section, the solutions are as follows:

Scheduled jobs executing before previous scheduled jobs complete:

If you know that your scheduled jobs take longer to complete than the interval at which they are started, increase the interval between scheduled runs (that is, run them less frequently) so that new processes start only after previously scheduled processes complete.

Agents that are not able to communicate properly with the Orchestrators:

Address the communication problem between the agent and the orchestrator. Examples of problems and solutions:

Agent is stopped: start the agent.
Agent is using a network interface that cannot communicate with the orchestrator:
- Stop the agent, temporarily disable the interface that the ITPAM agent should not be using, start the agent, and then re-enable the disabled interface; or
- Configure the agent to bind to the IP address that can communicate with the orchestrator, using the following KB: Binding an agent to an ip address
Ports are blocked: work with your network team to ensure that the appropriate ports are open between the agent and the orchestrator. The following pages can help:
- Port Planning Prerequisites
- Ports Used by CA Process Automation
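Before involving the network team, a quick TCP check run from the orchestrator host can show whether the agent port is reachable at all. The Python sketch below is a generic connectivity test, not an ITPAM tool; the helper name `can_connect` is hypothetical, and port 7003 in the usage comment is taken from the example log line in the Cause section, so substitute the port your agent actually uses.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Example usage (hostname is a placeholder):
# print(can_connect("agent-host.example.com", 7003))
```

A successful connection only shows that the port is open in that direction from the machine where the script runs; it does not confirm that the agent itself is healthy.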