Performance issues Jobs take long to start as Server usage for OWP RWP and CP at 100%

Products

CA Automic Workload Automation - Automation Engine

Issue/Introduction

On a heavily used system (340K Job executions per day), it is suddenly noticed that the AE system has some performance issues during a particular timeframe (on this case between 8H30 and 10H15), that could be noticed because some Jobs within workflows having as Agent a Host Group with a "Maximum Paralell Tasks" set, would take several seconds(up to 30s) to start after the predecessor Job had completed.

The system is composed of 2 AE Virtual Machines (each with 2CPs and 18 WPs) and an Oracle remote database.

During this timeframe we could see:

- One CPs constantly at 100% usage in the AWI

- One WP (OWP Role) constantly at 100% usage .

- One WP (RWP Role) with a high usage.

The following kind of lines appear in the related busy process log:

20220616/095632.677 - U00003375 Server usage of the last minute '100%', the last 10 minutes '99%' and the last hour '99%'.
20220616/095632.677 - U00003343 Server 'AENAME#WP009' processes roles 'O'.

Example of usage taken during another occurrence less severe:

When looking at the Database section in Administration - Automation Engine Management, we could observe that MQOWP table seem to always have an important amount of records and during the problematic timeframe, the count increases:

Both the impacted CP and WP were on the same host.

We detected no issue on the system side, no CPU increase, so the performance issue was purely UC4 internal.

Where do we have to look to diagnose this issue as the CP and WP logs don't show any clear error messages?

Environment

Release : 12.3.x

Component : AUTOMATION ENGINE

Cause

Investigation:

In order to troubleshoot such cases, it is required the trace file of the problematic WP during that timeframe in order to check which AE Routines or DB transactions are long.

On this case, surprisingly, there were no messages U00003524 or U00003434 that usually point us in the direction of the impacted DB transaction or routine.

On the other hand, we could see that during a 5 minutes trace the OWP would process 2061 records from the table MQ1OWP, out of them 2002 were CHKPARA ( check for max. parallel routine), example of messages that could be found by looking for RCV in the WP associated *trc* log:

20220617/084912.867 - JPEXEC_R    RCV  CHKPARA  frm AESYSTEM#WP024     MQWP     MsgID: 0423296224 c-acv: 00000000

Support requested one of our DB experts from Engineering to take a look at the OWP trace, and these were his findings:

I looked into one CHKPARA (check for max. parallel tasks) routine, focusing on the five queries that are performed agains the database to check the time they took:

20220617/085416.588 - UCUDB32 SLCO RET 0000 HSTMT: 0x00000008013da0 VALUE: 0x00000008013da0 ALL:  0.00977 DB:  0.00965 ODBC:  0.00002 UDB:  0.00010
20220617/085416.615 - U00009909 TRACE: (DB-DATEN: MQOWP_Msg         )                                          0x353e3c0 000026
20220617/085416.616 - UCUDB32 READ RET 0000 HSTMT: 0x00000008013da0 VALUE: 0x00000000000001 ALL:  0.02765 DB:  0.02747 ODBC:  0.00000 UDB:  0.00018
20220617/085416.616 - JPEXEC_R              RCV  CHKPARA  frm CS_P123#WP035                             MQWP     MsgID: 0423298322 c-acv: 00000000
20220617/085416.628 - UCUDB32 SLUC RET 0000 HSTMT: 0x00000008013da0 VALUE: 0x00000000000032 ALL:  0.01130 DB:  0.01119 ODBC:  0.00001 UDB:  0.00010
20220617/085416.639 - UCUDB32 EXEC RET 0000 HSTMT: 0x00000008013da0 VALUE: 0x00000000000001 ALL:  0.00992 DB:  0.00985 ODBC:  0.00001 UDB:  0.00006
20220617/085416.652 - UCUDB32 CMIT RET 0000 HSTMT:            (nil) VALUE:            (nil) ALL:  0.01335 DB:  0.01331 ODBC:  0.00000 UDB:  0.00003

As we can see above, it took 27ms to fetch 26 bytes from the DB and then it took 13ms for commit (only 2 bytes) --> this is way too slow!
Customer must check with their infrastructure team as maybe they have a network issue or some other software is using all the available bandwidth during that time frame...

We had a similar situation from another customer that would have a similar situation and found out every time an application moves TB of data through the same network/switch they would have a similar slowness...

This could also apply to your case, please forward that analysis to your teams, maybe some backups or huge data transfers are occurring around the same timeframe and that would explain your performance issues.

Resolution

Root cause:

On this particular case, it went in the direction of what the DB expert had told.

A Network performance degradation was found between the AE server and the Oracle database during the timeframe when the performance issue with the AE occurred.

Looking deeply on the Network side it was found that one of the Network Interfaces of the ESX hosting the AE server was saturated, as expected it was the interface used by the AE server.

Solution:

The problematic AE VM was moved to another ESX and it was immediately seen a correct Response Time in the communication between the AE and the DB, and no more high processing of OWPs and WPs.