PWP freezes while a DB Node fails over for 15 minutes with ORA-03113: end-of-file on communication channel
search cancel

PWP freezes while a DB Node fails over for 15 minutes with ORA-03113: end-of-file on communication channel

book

Article ID: 263002

calendar_today

Updated On:

Products

CA Automic Workload Automation - Automation Engine

Issue/Introduction

After having migrated the Database Infrastructure to a different Veritas Cluster called IDG, when a node of the Database RAC is failed over to a different one, Automic processes (PWP and others) are unable to perform a new query as the previous one is never ended / aborted due to the DB failover. As a result, users can no longer use the AE and Jobs stop being launched during this period.

Only after around 15 minutes, the AE processes are notified that the communication with Oracle had been lost and then everything resumes working as normal.

The difference with the previous Dabase Cluster is that the IP of the Database is now given by a Global Traffic Manager (GTM), before it was a virtual IP given by the Veritas Cluster.

As a workaround, a stop of the PWP killing the process id allows to resume normal behavior as that way a new DB connection is attempted.

Is there a way to make faster the failover of the DB Node so that the AE processes are notified in a timely manner?

Environment

Release : 12.x and 21.x

Component: Automation Engine

Environment: Oracle Database RAC with a Veritas Cluster

Cause

Issue caused by the Oracle Database configuration, somehow in the new environment the failover of a member of the database does not cancel all ongoing transactions to the remote Automation Engine proceses (WPs and CPs) as it should.

This could be seen in the PWP traces with tcpip=2,db=4, as we can see below, it takes over 15 minutes for the select query to be cancelled due to ORA-03113, certainly by a network socket closure as the connection was against a DB node that failover:

20230123/151106.751 - SELECT * FROM (SELECT  * FROM MQ1PWP WHERE MQPWP_SchedTime <= sys_extract_utc(systimestamp(0)) ORDER BY MQPWP_Priority, MQPWP_SchedTime, MQPWP_PK) WHERE rownum <= 1
20230123/152635.073 - U00029108 UCUDB: SQL_ERROR    Database handles  DB-HENV: 2484870  DB-HDBC: 2522ea0
20230123/152635.073 - U00003591 UCUDB - DB error info: OPC: 'OCIStmtExecute' Return code: 'ERROR'
20230123/152635.073 - U00003592 UCUDB - Status: '' Native error: '3113' Msg: 'ORA-03113: end-of-file on communication channel
Process ID: 1891303
Session ID: 1884 Serial number: 25459'
20230123/152635.073 - U00003536 UCUDB: FATAL DATA BASE ERROR: Re-connection will be attempted in 10 seconds.
20230123/152635.073 - UCUDB32 CLST RET 0000 HSTMT: 0x000000024b3500
20230123/152635.073 - U00003537 UCUDB - RECONNECT: DB call 'OCITransRollback': Return code: '-1'.
20230123/152635.073 - U00003590 UCUDB - DB error: 'OCITransRollback', 'ERROR   ', '', 'ORA-03114: not connected to ORACLE'
20230123/152635.073 - U00003592 UCUDB - Status: '' Native error: '3114' Msg: 'ORA-03114: not connected to ORACLE'
20230123/152635.183 - U00003538 UCUDB: Re-connection to database successful. Processing will continue.

Resolution

In order to fix the issue, the parameter tcp_retries2 as suggested here.

It was set to 15 and that corrresponds to a dead timeout connection of 14 minutes, by reducing it to 8 that corresponds to 51 seconds, the automatic reconnection of the Automation Engine processes to the Database while a Node Failover is performed takes now less than a minute.