WP seen as inactive looping on EH record with bind: notfound EH_DESCRIPTION

Article ID: 246149

Products

CA Automic Workload Automation - Automation Engine

Issue/Introduction

After an upgrade to 12.3.8, some WPs suddenly appear as inactive in AWI, although they are still active at Service Manager level.

Running strace on these processes shows that they are still running, but they no longer write anything to their log.

The number of old records in the MQ*WP tables matches the number of affected processes. For example, there are 2 old rows in MQ2WP and 2 "hung" WPs.

The processes appear to be working on an old element from an MQ*WP table. As a result, they use 100% CPU and loop on an EH query that produces bind notfound errors such as the following:

20220707/074739.304 - U00009909 TRACE: (BINDPAR: EH_AH_Idnr    )
20220707/074739.304 -                    >1803832271<
20220707/074739.304 - SELECT * FROM EH WHERE EH_AH_Idnr = ?
20220707/074739.305 - bind: notfound EH_DESCRIPTION  
20220707/074739.305 - bind: notfound EH_AEVERSION   
20220707/074739.305 - bind: notfound EH_MQSET     
20220707/074739.305 - bind: notfound EH_COMPLRATE   
20220707/074739.305 - bind: notfound EH_PASSPRIO   
20220707/074739.305 - bind: notfound EH_MODCNT    
20220707/074739.305 - bind: notfound EH_AGENTSESSION 
20220707/074739.305 - bind: notfound EH_STORENAME   
20220707/074739.305 - bind: notfound EH_ACTIVATIONTIME
20220707/074739.305 - bind: notfound EH_ERTCALC    
20220707/074739.305 - bind: notfound EH_ERTSTATUS   
20220707/074739.305 - bind: notfound EH_DEPLDESCIDNR 
20220707/074739.305 - bind: notfound EH_MSGMEMID   
20220707/074739.305 - bind: notfound EH_MSGMEMIDLEN  
20220707/074739.306 - bind: notfound EH_WFNAME    
20220707/074739.306 - bind: notfound EH_AUTOQUIT 
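To see which task a hung WP is stuck on, the bound EH_AH_Idnr value can be extracted from a trace like the one above. The following Python sketch assumes the layout shown in this excerpt: a BINDPAR line naming EH_AH_Idnr, the bound value on the next line as >value<, the SELECT on EH, then one "bind: notfound" line per column. The function name failing_eh_lookups is hypothetical and not part of any Automic tooling.

```python
import re
from collections import Counter

# Timestamp prefix of each WP trace line, e.g. "20220707/074739.304 - ".
PREFIX = re.compile(r"^\d{8}/\d{6}\.\d{3} - ")
EH_QUERY = "SELECT * FROM EH WHERE EH_AH_Idnr = ?"


def failing_eh_lookups(trace_lines):
    """Count, per bound EH_AH_Idnr, the EH lookups that returned no row.

    Assumes the trace layout of the excerpt above. Lookups whose bind
    value was not traced are counted under the key "?".
    """
    failures = Counter()
    expect_value = False  # previous line announced the EH_AH_Idnr bind
    bound_idnr = None     # most recently bound EH_AH_Idnr value
    current = None        # idnr of the EH SELECT being examined
    counted = False       # this SELECT has already been recorded
    for raw in trace_lines:
        line = PREFIX.sub("", raw).strip()
        if expect_value:
            expect_value = False
            match = re.fullmatch(r">(\d+)<", line)
            if match:
                bound_idnr = match.group(1)
                continue
        if "BINDPAR: EH_AH_Idnr" in line:
            expect_value = True
        elif line.startswith("SELECT"):
            # Only the EH lookup is tracked; any other query ends it.
            current = (bound_idnr or "?") if line == EH_QUERY else None
            counted = False
        elif line.startswith("bind: notfound") and current is not None:
            # Several notfound lines follow one failed query; count it once.
            if not counted:
                failures[current] += 1
                counted = True
    return failures
```

Run against a trace file of a looping WP (for example failing_eh_lookups(open(path).readlines())), a single idnr with a steadily growing count points at the task behind the old MQ*WP record.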

When such a process is killed at system level, the problem moves to another WP, which picks up the same MQ*WP record and starts looping in turn.

Environment

Release: 12.3.8 and later

Component: AUTOMATION ENGINE

Cause

Defect

Root cause:

The WP loops because it cannot find the information it expects in an EH record. This can be observed by enabling the trace tcpip=2,db=4 on the WPs, killing the looping WP, and waiting until a new WP starts looping.

The trace file of this new WP shows the loop that the hung WPs are processing.

The query on the EH table keeps failing and is repeated continuously, producing traces such as the following:

20220513/064741.796 - SELECT * FROM EH WHERE EH_AH_Idnr = ?
20220513/064741.797 - bind: notfound EH_DESCRIPTION    
20220513/064741.797 - bind: notfound EH_AEVERSION      
20220513/064741.797 - bind: notfound EH_MQSET          
...   
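Because the same failing query repeats continuously, a trace file can be screened for this pattern by collecting the EH lookups that are immediately followed by a "bind: notfound" line and checking how densely they occur. The Python sketch below assumes the timestamp format seen in these traces (YYYYMMDD/HHMMSS.mmm). The function names and the thresholds in looks_like_loop (100 failed lookups within 60 seconds) are arbitrary assumptions for illustration, not values documented by Broadcom.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d/%H%M%S.%f"  # e.g. 20220513/064741.796
EH_QUERY = "SELECT * FROM EH WHERE EH_AH_Idnr = ?"


def failed_eh_queries(trace_lines):
    """Return the timestamps of EH lookups whose next traced line is a
    'bind: notfound' message, i.e. the query found no EH row."""
    failures = []
    pending = None  # timestamp of an EH SELECT awaiting its outcome
    for raw in trace_lines:
        stamp, sep, text = raw.partition(" - ")
        if not sep:
            continue  # not a timestamped trace line (e.g. "...")
        text = text.strip()
        if pending is not None and text.startswith("bind: notfound"):
            failures.append(pending)
        pending = None
        if text == EH_QUERY:
            pending = datetime.strptime(stamp.strip(), TS_FORMAT)
    return failures


def looks_like_loop(trace_lines, min_failures=100, window_seconds=60.0):
    """Heuristic: flag a WP as looping when at least min_failures failed
    EH lookups occur within window_seconds (thresholds are arbitrary)."""
    times = failed_eh_queries(trace_lines)
    if len(times) < min_failures:
        return False
    # Compare each failure with the one min_failures - 1 positions later.
    for start, end in zip(times, times[min_failures - 1:]):
        if (end - start).total_seconds() <= window_seconds:
            return True
    return False
```

A healthy WP may occasionally look up a missing EH row, so the sketch checks density rather than a single occurrence; the thresholds would need tuning against real traces.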

Resolution

Workaround:

TO BE DONE ONLY WITH AGREEMENT FROM TECHNICAL SUPPORT:

Delete the old record that the hung WP is processing from the MQ*WP table, then kill the WP's associated sessions at database level. This allows the WP to resume processing other MQ*WP records and become active again.

Solution:

Update to a fix version listed below or a newer version if available.

Fix version:
Component(s): Automation Engine

Automation.Engine 12.3.9HF1 - Available
Automation.Engine 21.0.4 - Available

Additional Information

Solution details: A problem has been fixed where executing :PSET/:RSET/:XC_VALUESET/:PUBLISH could drive the WP into an endless loop if the task is already deactivated.
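The fix description suggests the following failure mode, sketched here as a toy Python model for illustration only. It is not Automic's implementation; the queue, the message layout and process_queue are all hypothetical. The WP keeps taking the same old message because the task it refers to no longer has an EH row, so it never advances to the other records, which matches the symptoms above (100% CPU, repeated EH lookups, other MQ*WP records left unprocessed).

```python
from collections import deque


def process_queue(queue, eh_rows, drop_orphans, max_steps=1_000):
    """Toy model of a WP draining variable-update messages (hypothetical).

    queue: deque of {"ah_idnr": int, "values": dict} messages.
    eh_rows: dict mapping a task's AH_Idnr to its EH row (a dict).
    drop_orphans=False models the defect: a message whose task no longer
    has an EH row (task already deactivated) stays at the head of the
    queue and is retried endlessly, so later messages are never reached.
    drop_orphans=True models the fixed behaviour: the message is discarded.
    max_steps exists only so that this sketch terminates.
    Returns (steps_taken, messages_left).
    """
    steps = 0
    while queue and steps < max_steps:
        steps += 1
        msg = queue[0]
        row = eh_rows.get(msg["ah_idnr"])
        if row is None:
            # The EH lookup finds nothing, like the "bind: notfound" traces.
            if drop_orphans:
                queue.popleft()
            continue  # with drop_orphans=False the same message is retried
        row.update(msg["values"])  # apply the variable update
        queue.popleft()
    return steps, len(queue)


messages = [
    {"ah_idnr": 1803832271, "values": {"VAR": "x"}},  # task already deactivated
    {"ah_idnr": 1, "values": {"VAR": "y"}},           # task still active
]
print(process_queue(deque(messages), {1: {}}, drop_orphans=False))  # (1000, 2)
print(process_queue(deque(messages), {1: {}}, drop_orphans=True))   # (2, 0)
```

In the first call the model spins until its safety limit with both messages still queued; in the second, the orphaned message is dropped and the remaining one is processed.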