Reconnect of AutoSys services after DB RDS instance failover

Article ID: 208087

Updated On:

Products

CA Workload Automation AE

Issue/Introduction

We recently upgraded the AutoSys application server, and our production application runs in High Availability mode.
The database is Oracle on an Amazon RDS instance with Multi-AZ replication enabled.
When the RDS instance fails over from one AZ to another, the application does not automatically connect to the new RDS instance.
Re-establishing the connection requires restarting AutoSys.

During an RDS failover, the DNS name and the endpoint remain the same; only the IP address changes.
Is there a way for AutoSys to recover automatically without a restart?
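
To confirm that a failover produced the behavior described above (same endpoint name, different IP address), the endpoint can be resolved before and after the event. The following is a minimal Python sketch, not part of the product; the function names and the example hostname are hypothetical, and the resolver is injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Optional, Tuple


def resolve_ipv4(hostname: str) -> str:
    """Resolve a hostname to one IPv4 address using the system resolver."""
    return socket.gethostbyname(hostname)


def endpoint_ip_changed(
    hostname: str,
    previous_ip: Optional[str],
    resolver: Callable[[str], str] = resolve_ipv4,
) -> Tuple[bool, str]:
    """Return (changed, current_ip) for the given endpoint name.

    'changed' is True only when a previous IP is known and the name
    now resolves to a different address.
    """
    current_ip = resolver(hostname)
    changed = previous_ip is not None and current_ip != previous_ip
    return changed, current_ip
```

Calling `endpoint_ip_changed("<your RDS endpoint>", last_known_ip)` before and after a failover shows whether the address behind the unchanged endpoint name has moved.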

Environment

Release : 11.3.6
OS : Linux
DB : Oracle on Amazon RDS
Component : CA Workload Automation AE (AutoSys)

Resolution

A fix is available, but you must be running 11.3.6 SP8 Cumulative 1 before applying it.

  • If you are not running 11.3.6 SP8 CUM1, download and install solution SO12221.
  • Open a case with Support and request T-Fix T42B424.
    Note: Provide the event_demon and as_server logs so that Support can confirm that your issue is related.

Fix T42B424 Notes:
When the Oracle client hangs, the scheduler's database monitoring system cannot reconnect to the database while a transaction is in progress. The scheduler shows no symptoms of the hang because the database monitoring system does not write anything to the log.
                 
With this patch, a new configuration variable, WaitTriesForBlockedOracleClientBeforeAbort, must be added to the $AUTOUSER/config.$AUTOSERV file.
The default value is 3.
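
As an illustration, the entry in $AUTOUSER/config.$AUTOSERV would look like the line below. This is a sketch that assumes the Key=Value form used in these patch notes; its placement within the file is not specified here.

```
WaitTriesForBlockedOracleClientBeforeAbort=3
```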
                 
When WaitTriesForBlockedOracleClientBeforeAbort=3 and the database monitoring thread hangs, the scheduler writes the following message three times:
                 
CAUAJM_W_10643 The database monitoring system cannot reconnect to the database while a transaction is in progress. Waiting for the database transaction to complete.  Trying again...
                 
After the third attempt, the scheduler writes the following message, sends a notification, and aborts. The abort ensures that the shadow scheduler takes over.
                 
CAUAJM_E_10641 Trouble reconnecting to database, database monitoring system cannot continue any longer.  Exiting!
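
To check whether a scheduler log contains the two messages above, a small script can count them. The following is a minimal Python sketch, not a product utility; the function name is hypothetical, and it assumes only that each message code appears verbatim on its log line.

```python
import re
from collections import Counter
from typing import Iterable

RETRY_CODE = "CAUAJM_W_10643"  # reconnect blocked, scheduler retries
ABORT_CODE = "CAUAJM_E_10641"  # scheduler gives up and exits

_PATTERN = re.compile(r"\b(CAUAJM_W_10643|CAUAJM_E_10641)\b")


def summarize_reconnect_messages(lines: Iterable[str]) -> Counter:
    """Count occurrences of the retry and abort message codes."""
    counts = Counter()
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it, open the scheduler (event_demon) log as a text file and pass the file object to `summarize_reconnect_messages`. Retry warnings followed by one abort error match the sequence described above.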