Running Gen 8.6 Transaction Enabler (TE) aefad/aefuf on RHEL 8 Linux and connecting to an Oracle database.
When an established Oracle connections fails any Gen server load modules already in memory continue to try to use the failed connection, rather than dropping it and opening a new connection. A failed connection can happen for lots of reasons, such as
Any broken Oracle connection is impacting a critical production business service being run by the TE.
When it is realised that there has been an Oracle problem even after the connection is restored human intervention is required to restart the TE because the existing loaded server load modules fail to execute when reused.
Is it possible for the loaded servers to handle the broken connection and create a new connection without failing perhaps through some reconfiguration or custom programming in user exit code?
Gen Transaction Enabler
Using aefad/aefuf started on ports 3000 and 3001 respectively Support tested a broken Oracle connection scenario in house with the Gen sample model (1 server load module P900) by manually loading 10 copies of P900 using the aefc client command "/load P900
" (Transaction Enabler > AEF Client)
a. Server status.
10 copies of P900 loaded$ ps -fugen86
UID PID PPID C STIME TTY TIME CMD
gen86 1111507 1 0 Dec04 ? 00:00:03 aefad -i 3000 -a 200 -m 200 -p 13 -s 11 -t 31
gen86 1111510 1 0 Dec04 ? 00:00:00 aefuf -i 3001 -c 3000 -t 31
gen86 1222751 1111507 0 17:33 ? 00:00:00 P900 -t 31 000 000
gen86 1222754 1111507 0 17:33 ? 00:00:00 P900 -t 31 000 000
gen86 1222793 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222796 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222805 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222808 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222818 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222830 1111507 0 17:35 ? 00:00:00 P900 -t 31 000 000
gen86 1222841 1111507 0 17:35 ? 00:00:00 P900 -t 31 000 000
gen86 1222911 1111507 0 17:36 ? 00:00:00 P900 -t 31 000 000
gen86 1222844 1222241 0 17:35 pts/8 00:00:00 ps -fugen86
b. Client requests transaction to TE.
One of existing above load modules services the transaction successfully
c. Stop the database (shutdown immediate).
Client request then encounters expected error:
***...
TIRM038E: ** FATAL DATABASE ERROR WAS ENCOUNTERED **
TIRM039E: DB LAST STATUS = DB
--------------------------------------------------------------------------------
ORA-03113: end-of-file on communication channel
Process ID: 31299
Sess
ORA-03114: not connected to ORACLE
...
***
d. Server status.
The number of load modules reduced to 9 because the one servicing the above client request had been ended by the aefad due to the failed DB connection.
***gen86 1111507 1 0 Dec04 ? 00:00:03 aefad -i 3000 -a 200 -m 200 -p 13 -s 11 -t 31
gen86 1111510 1 0 Dec04 ? 00:00:00 aefuf -i 3001 -c 3000 -t 31
gen86 1222754 1111507 0 17:33 ? 00:00:00 P900 -t 31 000 000
gen86 1222793 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222796 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222805 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222808 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222818 1111507 0 17:34 ? 00:00:00 P900 -t 31 000 000
gen86 1222830 1111507 0 17:35 ? 00:00:00 P900 -t 31 000 000
gen86 1222841 1111507 0 17:35 ? 00:00:00 P900 -t 31 000 000
gen86 1222911 1111507 0 17:36 ? 00:00:00 P900 -t 31 000 000
gen86 1223160 1222241 0 17:42 pts/8 00:00:00 ps -fugen86
***
e. Restart database
f. Continue to use client and still receive error.
That is because each of the remaining 9 load modules connected to the DB when it started and that connection is now stale/unusable even though the DB has been restarted. So the failed client request process continues in the same manner until all 10 load module copies have ended. On request #11 a new P900 is loaded which successfully connects to the restarted DB and the client request is successful.
For any loaded Gen server load modules whose initial connection was successful even after the Oracle issue is resolved the load module connection becomes stale/unusable.
Any client requests will continue to use those servers which then fail and are ended by the aefad.
That process will continue until all originally loaded servers have failed/ended. Later requests will then cause a new server to be loaded which successfully connects to the restarted DB and the client request is successful.
In summary, without any intervention, to complete the recovery requires all running server load modules (previously connected successfully to Oracle) to be reused at which point they will fail and are terminated.
There is no runtime user exit available to add code to handle a database connection failure for a load module after its connection is initially successful, so the behaviour cannot be changed.
Instead of waiting for the load module to be used for it to be ended:
/delete 200 SERVER1#enter#
#PF3#
aefc -i aefad_port -p delete_all_playback -b -d 0
To minimise human intervention perhaps some automation script/process could be used to detect the loss of the database connection i.e. that script/process watches for a database failure and after the connection is restored the script restarts the aefad/aefuf or uses the playback script.
This behaviour applies to any target database and not just Oracle.