Gen Transaction Enabler database connection failure handling
search cancel

Gen Transaction Enabler database connection failure handling

book

Article ID: 385324

calendar_today

Updated On:

Products

Gen Gen - Run Time Distributed

Issue/Introduction

Running Gen 8.6 Transaction Enabler (TE) aefad/aefuf on RHEL 8 Linux and connecting to an Oracle database.
 
When an established Oracle connections fails any Gen server load modules already in memory continue to try to use the failed connection, rather than dropping it and opening a new connection. A failed connection can happen for lots of reasons, such as

  • TCP idle timeout on firewall
  • SQL session idle timeout on Oracle DB
  • DB fails over
  • DB restart

Any broken Oracle connection is impacting a critical production business service being run by the TE.
When it is realised that there has been an Oracle problem even after the connection is restored human intervention is required to restart the TE because the existing loaded server load modules fail to execute when reused.
Is it possible for the loaded servers to handle the broken connection and create a new connection without failing perhaps through some reconfiguration or custom programming in user exit code?

Environment

Gen Transaction Enabler

Resolution


Scenario recreation

Using aefad/aefuf started on ports 3000 and 3001 respectively Support tested a broken Oracle connection scenario in house with the Gen sample model (1 server load module P900) by manually loading 10 copies of P900 using the aefc client command "/load P900" (Transaction Enabler > AEF Client)

a. Server status.
10 copies of P900 loaded
$ ps -fugen86
UID          PID    PPID  C STIME TTY          TIME CMD
gen86   1111507       1  0 Dec04 ?        00:00:03 aefad -i 3000 -a 200 -m 200 -p 13 -s 11 -t 31
gen86   1111510       1  0 Dec04 ?        00:00:00 aefuf -i 3001 -c 3000 -t 31
gen86   1222751 1111507  0 17:33 ?        00:00:00 P900 -t 31 000 000
gen86   1222754 1111507  0 17:33 ?        00:00:00 P900 -t 31 000 000
gen86   1222793 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222796 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222805 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222808 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222818 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222830 1111507  0 17:35 ?        00:00:00 P900 -t 31 000 000
gen86   1222841 1111507  0 17:35 ?        00:00:00 P900 -t 31 000 000
gen86   1222911 1111507  0 17:36 ?        00:00:00 P900 -t 31 000 000
gen86   1222844 1222241  0 17:35 pts/8    00:00:00 ps -fugen86

b. Client requests transaction to TE.
One of existing above load modules services the transaction successfully

c. Stop the database (shutdown immediate).
Client request then encounters expected error:
***
...         
TIRM038E:   ** FATAL DATABASE ERROR WAS ENCOUNTERED **                          
TIRM039E:                      DB LAST STATUS =      DB                         
--------------------------------------------------------------------------------
ORA-03113: end-of-file on communication channel                                 
Process ID: 31299                                                               
Sess                                                                            
ORA-03114: not connected to ORACLE                                              
...   
***    

d. Server status.
The number of load modules reduced to 9 because the one servicing the above client request had been ended by the aefad due to the failed DB connection.
***
gen86   1111507       1  0 Dec04 ?        00:00:03 aefad -i 3000 -a 200 -m 200 -p 13 -s 11 -t 31
gen86   1111510       1  0 Dec04 ?        00:00:00 aefuf -i 3001 -c 3000 -t 31
gen86   1222754 1111507  0 17:33 ?        00:00:00 P900 -t 31 000 000
gen86   1222793 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222796 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222805 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222808 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222818 1111507  0 17:34 ?        00:00:00 P900 -t 31 000 000
gen86   1222830 1111507  0 17:35 ?        00:00:00 P900 -t 31 000 000
gen86   1222841 1111507  0 17:35 ?        00:00:00 P900 -t 31 000 000
gen86   1222911 1111507  0 17:36 ?        00:00:00 P900 -t 31 000 000
gen86   1223160 1222241  0 17:42 pts/8    00:00:00 ps -fugen86
***

e. Restart database

f. Continue to use client and still receive error.
That is because each of the remaining 9 load modules connected to the DB when it started and that connection is now stale/unusable even though the DB has been restarted. So the failed client request process continues in the same manner until all 10 load module copies have ended. On request #11 a new P900 is loaded which successfully connects to the restarted DB and the client request is successful.

 

Summary of behaviour

For any loaded Gen server load modules whose initial connection was successful even after the Oracle issue is resolved the load module connection becomes stale/unusable.
Any client requests will continue to use those servers which then fail and are ended by the aefad.
That process will continue until all originally loaded servers have failed/ended. Later requests will then cause a new server to be loaded which successfully connects to the restarted DB and the client request is successful.
In summary, without any intervention, to complete the recovery requires all running server load modules (previously connected successfully to Oracle) to be reused at which point they will fail and are terminated.
There is no runtime user exit available to add code to handle a database connection failure for a load module after its connection is initially successful, so the behaviour cannot be changed.

 

Options to handle the behaviour

Instead of waiting for the load module to be used for it to be ended:

  • The aefad/aefuf can be restarted.

  • Alternatively the aefc client can be used to manually delete a certain number of loaded load modules ("/delete n load_module") irrespective of their active/inactive state. A playback script file can also be created to do this (UNIX and Linux Implementation Toolset > Testing and Running Applications). For example:
    The file delete_all_playback could contain the following lines and the first line could be repeated for all load module names used:
    /delete 200 SERVER1#enter#

    #PF3#
    The value of aefad parameter -a (maximum number of Gen application load modules that aefad will load into memory) e.g. 200 should be used for the delete. It does not matter if the number specified to be deleted is greater than the actual number loaded at the time.
    Then use this command to run the playback script:
    aefc -i aefad_port -p delete_all_playback -b -d 0
    That would clean up all current load modules without needing to restart the aefuf/aefad.
    In some ways the playback script is cleaner than the aefad/aefuf restart with less impact because aefad/aefuf processes remain running and it is only clearing out the loaded modules. However as client/server connections to the aefad/aefuf are non-persistent (client-server TCP sockets close after each transaction) that benefit is fairly small and only for the aefad/aefuf processes themselves & the respective TCP socket connections between them.

To minimise human intervention perhaps some automation script/process could be used to detect the loss of the database connection i.e. that script/process watches for a database failure and after the connection is restored the script restarts the aefad/aefuf or uses the playback script.

Additional Information

This behaviour applies to any target database and not just Oracle.