Scheduler or Application Server is stuck in Oracle database reconnection

Article ID: 110166

Products

CA Workload Automation AE - Scheduler (AutoSys) Autosys Workload Automation

Issue/Introduction

After a network disruption, the scheduler fails to reconnect to the Oracle database and stops processing events.

The as_server log file shows:
ORA-03126: network driver does not support non-blocking operations

The event_demon log file then shows:
[07/21/2018 15:31:30] CAUAJM_E_18412 The database client has been interrupted while query execution is in progress. 
[07/21/2018 15:31:30] CAUAJM_E_18400 An error has occurred while interfacing with ORACLE. 
[07/21/2018 15:31:30] CAUAJM_W_10900 The database monitoring system has detected a potential problem with the database. 
[07/21/2018 15:31:30] CAUAJM_E_18401 Function <doExecute> invoked from <nextRow> failed <951> 
[07/21/2018 15:31:30] CAUAJM_I_10901 The database monitoring system is beginning validation of database connections. 
[07/21/2018 15:32:00] ---------------------------------------- 
[07/21/2018 15:33:00] ---------------------------------------- 
[07/21/2018 15:33:00] CAUAJM_E_18416 Event Server: <XE> Failed Query: <update ujo_alamode set int_val = (select int_val from ujo_alamode where type='DB') where type = 'DB'> 
[07/21/2018 15:33:00] CAUAJM_E_18412 The database client has been interrupted while query execution is in progress. 
[07/21/2018 15:33:00] CAUAJM_E_18400 An error has occurred while interfacing with ORACLE. 
[07/21/2018 15:33:00] CAUAJM_E_18401 Function <doExecute> invoked from <execute> failed <864> 
[07/21/2018 15:33:00] CAUAJM_W_10631 Error with database <XE>. Checking connection. 
[07/21/2018 15:33:00] CAUAJM_W_10632 Attempting to reconnect to database <XE>. Attempt number [1]. 
[07/21/2018 15:34:00] ---------------------------------------- 
[07/21/2018 15:35:00] ---------------------------------------- 
[07/21/2018 15:36:00] ---------------------------------------- 
[07/21/2018 15:37:00] ---------------------------------------- 
[07/21/2018 15:38:00] ---------------------------------------- 
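
The pattern in the log above (a CAUAJM_W_10632 reconnect attempt followed only by timestamped separator lines) can be checked for mechanically. The following is a minimal shell sketch, not a CA-supplied tool: the function name check_reconnect_hang and the default threshold of three separator lines are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints HUNG if the last reconnect attempt in an
# event_demon log is followed only by separator lines, otherwise OK.
# $1 = log file, $2 = minimum number of separator lines (assumed default: 3)
check_reconnect_hang() {
    log_file=$1
    min_idle=${2:-3}
    awk -v min_idle="$min_idle" '
        # A reconnect attempt starts (or restarts) the watch.
        /CAUAJM_W_10632/ { pending = 1; idle = 0; next }
        # Timestamped separator line: the process is alive but logged nothing else.
        pending && /\] -+ *$/ { idle++; next }
        # Any other non-blank message means the scheduler moved on.
        pending && NF { pending = 0; idle = 0 }
        END { print ((pending && idle >= min_idle) ? "HUNG" : "OK") }
    ' "$log_file"
}
```

Called as check_reconnect_hang /path/to/event_demon.log, it only inspects the log text; it does not prove that the process is blocked inside the Oracle client.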

A normal shutdown with "unisrvcntr stop waae_sched.<instance>" cannot stop the process, because the default signal 15 (SIGTERM) has no effect on it.

The only way to recover is to kill the event_demon process with signal 9 (SIGKILL) and then restart the scheduler.
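
The manual recovery described above can be sketched as a small shell function. This is an illustration under stated assumptions, not a documented procedure: recover_scheduler, run_cmd, and the DRY_RUN variable are hypothetical names, the PID of the hung event_demon must be looked up by the operator (for example with ps), and "unisrvcntr start" is assumed to be the counterpart of the "unisrvcntr stop" command quoted above.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 the command is printed, not executed.
run_cmd() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = AutoSys instance name, $2 = PID of the hung event_demon process
recover_scheduler() {
    instance=$1
    pid=$2
    if [ -z "$instance" ] || [ -z "$pid" ]; then
        echo "usage: recover_scheduler <instance> <event_demon_pid>" >&2
        return 2
    fi
    # Signal 15 is ignored by the hung process, so send signal 9.
    run_cmd kill -9 "$pid" || return 1
    # Assumed counterpart of "unisrvcntr stop waae_sched.<instance>".
    run_cmd unisrvcntr start "waae_sched.$instance"
}
```

Running it first with DRY_RUN=1 shows the two commands that would be issued, which makes it easy to confirm the PID and instance name before anything is killed.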

The same behavior can occur with the Application Server.
 

Environment

CA WAAE (any release) on Unix/Linux

The error can occur with both the scheduler and the application server.

 

Cause


The root cause appears to be a hang inside the Oracle client library.
When the hang occurred, the database monitoring system was executing Oracle client library calls to disconnect from and reconnect to the database server.
The Oracle client library is known to block the calling process inside such a call for a long time, without returning control to it.

Oracle has a known issue in this area: Bug 19591551, "Unable to timeout or break OCI calls if client disconnects".
 

Resolution


A workaround for the known Oracle bug may help:
Source the $AUTOUSER/autosys.sh.<hostname> script at a UNIX shell prompt.
Then, on every machine that has a scheduler or application server installed, locate the $ORACLE_HOME/network/admin/sqlnet.ora file and add the following parameters:

SQLNET.RECV_TIMEOUT=120
SQLNET.OUTBOUND_CONNECT_TIMEOUT=120
SQLNET.INBOUND_CONNECT_TIMEOUT=120 

If faster recovery is required, you can reduce these values from 120 seconds to 10 seconds.

Then restart the schedulers and application servers so that the Oracle client picks up the new parameters.
To confirm which ORACLE_HOME the schedulers and application servers use, source the $AUTOUSER/autosys.sh.<hostname> script and check the value of ORACLE_HOME in that shell.
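
Because the parameters must be present on every scheduler and application server machine, a quick check of each sqlnet.ora file can catch a missed host. The sketch below is a hypothetical helper, not part of the product: check_sqlnet_timeouts is an assumed function name, and it only tests that each of the three parameters is assigned a numeric value.

```shell
#!/bin/sh
# Hypothetical helper: reports which of the three timeout parameters are set.
# $1 = sqlnet.ora path (defaults to the file under the current ORACLE_HOME)
check_sqlnet_timeouts() {
    sqlnet_file=${1:-$ORACLE_HOME/network/admin/sqlnet.ora}
    if [ ! -r "$sqlnet_file" ]; then
        echo "cannot read $sqlnet_file" >&2
        return 2
    fi
    missing=0
    for param in SQLNET.RECV_TIMEOUT \
                 SQLNET.OUTBOUND_CONNECT_TIMEOUT \
                 SQLNET.INBOUND_CONNECT_TIMEOUT; do
        # Match "PARAM = <number>" at the start of a line, so commented-out
        # entries are not counted. Matching is case-insensitive.
        if grep -Eiq "^[[:space:]]*${param}[[:space:]]*=[[:space:]]*[0-9]+" "$sqlnet_file"; then
            echo "$param: set"
        else
            echo "$param: MISSING"
            missing=1
        fi
    done
    return $missing
}
```

Run it in a shell where the $AUTOUSER/autosys.sh.<hostname> script has already been sourced, so that ORACLE_HOME points at the client the scheduler uses. A nonzero return status means at least one parameter is missing or the file could not be read.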