After an upgrade to 12.3.8, some WPs suddenly appear as Non Active in AWI, although they are shown as active at Service Manager level.
Running strace on the processes shows that they are still working, but they no longer write anything to their logs.
The number of old records in the MQ*WP tables matches the number of affected processes, e.g. 2 old lines in MQ2WP and 2 "hung" WPs.
The WPs appear to be processing an old element from an MQ*WP table. As a result they use 100% CPU and loop on an EH query that produces bind notfound errors like the following:
20220707/074739.304 - U00009909 TRACE: (BINDPAR: EH_AH_Idnr )
20220707/074739.304 - >1803832271<
20220707/074739.304 - SELECT * FROM EH WHERE EH_AH_Idnr = ?
20220707/074739.305 - bind: notfound EH_DESCRIPTION
20220707/074739.305 - bind: notfound EH_AEVERSION
20220707/074739.305 - bind: notfound EH_MQSET
20220707/074739.305 - bind: notfound EH_COMPLRATE
20220707/074739.305 - bind: notfound EH_PASSPRIO
20220707/074739.305 - bind: notfound EH_MODCNT
20220707/074739.305 - bind: notfound EH_AGENTSESSION
20220707/074739.305 - bind: notfound EH_STORENAME
20220707/074739.305 - bind: notfound EH_ACTIVATIONTIME
20220707/074739.305 - bind: notfound EH_ERTCALC
20220707/074739.305 - bind: notfound EH_ERTSTATUS
20220707/074739.305 - bind: notfound EH_DEPLDESCIDNR
20220707/074739.305 - bind: notfound EH_MSGMEMID
20220707/074739.305 - bind: notfound EH_MSGMEMIDLEN
20220707/074739.306 - bind: notfound EH_WFNAME
20220707/074739.306 - bind: notfound EH_AUTOQUIT
When the processes are killed at system level, the issue passes to another WP, which picks up the same MQ*WP record.
Release : 12.3.8 and later
Component : AUTOMATION ENGINE
Defect
Root cause:
The WP loops because it cannot find information from an EH record. To see this, enable the trace tcpip=2,db=4 on the WPs, kill the looping WP and wait until a new WP starts looping.
The trace file of this new WP shows the loop that the hanging WPs are processing.
The query on the EH table fails and is repeated continuously, producing traces like the following:
20220513/064741.796 - SELECT * FROM EH WHERE EH_AH_Idnr = ?
20220513/064741.797 - bind: notfound EH_DESCRIPTION
20220513/064741.797 - bind: notfound EH_AEVERSION
20220513/064741.797 - bind: notfound EH_MQSET
...
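To confirm the pattern in a large WP trace file without reading it by hand, the repeated query can be counted with a small script. The following is a minimal sketch, not an official tool: it assumes the trace lines have exactly the layout shown in the excerpts above (timestamp, " - ", message), and it assumes that the ">value<" line immediately preceding the SELECT is the bound EH_AH_Idnr. The function name summarize_eh_loop is hypothetical.

```python
import re
from collections import Counter

# Line layout assumed from the trace excerpts in this article:
# "YYYYMMDD/HHMMSS.mmm - <message>"
LINE_RE = re.compile(r"^\d{8}/\d{6}\.\d{3} - (.*)$")
VALUE_RE = re.compile(r"^>(.*)<$")  # bound parameter value, e.g. ">1803832271<"
EH_QUERY = "SELECT * FROM EH WHERE EH_AH_Idnr = ?"
NOTFOUND_PREFIX = "bind: notfound "


def summarize_eh_loop(lines):
    """Count EH lookups, the values bound to them, and bind-notfound columns."""
    queries = 0
    idnrs = Counter()     # bound value seen just before each EH query
    notfound = Counter()  # column names reported as "bind: notfound"
    last_value = None
    in_eh_query = False
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        msg = match.group(1).strip()
        if msg == EH_QUERY:
            queries += 1
            if last_value is not None:
                idnrs[last_value] += 1
            last_value = None
            in_eh_query = True
        elif in_eh_query and msg.startswith(NOTFOUND_PREFIX):
            notfound[msg[len(NOTFOUND_PREFIX):].strip()] += 1
        else:
            in_eh_query = False
            value = VALUE_RE.match(msg)
            if value:
                last_value = value.group(1)
    return queries, idnrs, notfound
```

Calling summarize_eh_loop on the lines of a trace file (for example on an open file object) returns the number of EH queries, a counter of bound values and a counter of columns reported as notfound. A single bound value with a very high count, together with the same set of notfound columns repeated once per query, matches the loop described above.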
TO BE DONE ONLY WITH AGREEMENT FROM TECHNICAL SUPPORT:
Delete the old record being processed by the hung WP from the MQ*WP table, then kill the WP's associated sessions at database level. This allows the WP to resume processing other MQ*WP records and become Active again.
Update to a fix version listed below or a newer version if available.
Fix version:
Component(s): Automation Engine
Automation.Engine 12.3.9HF1 - Available
Automation.Engine 21.0.4 - Available
Solution details: A problem has been fixed where executing :PSET/:RSET/:XC_VALUESET/:PUBLISH could drive the WP into an endless loop if the task is already deactivated.
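When many systems need to be checked, the comparison against the fix versions above can be scripted. The sketch below is illustrative only: it assumes version strings of the form "major.minor.patch" with an optional "HFn" suffix, as written in this article, and it only knows the two release lines listed here (12.3 and 21.0). For any other release line it returns None rather than guessing. The names FIXED_IN and has_fix are hypothetical.

```python
import re

# Fix versions quoted in this article, keyed by (major, minor) release line.
# Values are (patch, hotfix): 12.3.9HF1 -> (9, 1), 21.0.4 -> (4, 0).
FIXED_IN = {(12, 3): (9, 1), (21, 0): (4, 0)}

# Assumed version format: "12.3.9", "12.3.9HF1" or "12.3.9 HF1".
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\s*HF(\d+))?$", re.IGNORECASE)


def has_fix(version):
    """Return True/False for known release lines, None if the line is not listed."""
    match = VERSION_RE.match(version.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    hotfix = int(match.group(4) or 0)
    fixed = FIXED_IN.get((major, minor))
    if fixed is None:
        return None  # release line not covered by this article
    return (patch, hotfix) >= fixed
```

For example, has_fix("12.3.8") evaluates to False and has_fix("12.3.9HF1") to True. A None result means the article gives no fix version for that release line and the version should be checked with Technical Support.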