Siteminder Policy Server Failover delay to Oracle Policy store (same applies for user Store)
Use case:
- Policy store / key Store (same DB Instance) running on Oracle 19c
- The Defined Data source for the Policy Store is using AWS load balancer as the host for the DB which takes care of sending the traffic to a primary DB and in case of a failover sends to the backup DB
- when primary DB is taken down, The AWS load balancer then failed over to the secondary DB
- From Siteminder side , it was observed that no failover or Failback messages reported in the smps log / trace for the Data sources (same data source in Failover mode)
- Policy server seems to recover after 5 min or so.
- some of the transactions were successful and others were not
Release : 12.8.x
########### Troubleshooting and Analysis #############
#### Testing Results with out of the box Driver settings
- Siteminder had the following Registry keys set
LoginTimeout= 0xf; REG_DWORD --> 15 seconds
QueryTimeout= 0x1e; REG_DWORD --> 30 seconds
ConnectionTimeout= 0x41; REG_DWORD --> 65 seconds
- From the Policy server trace and driver snoop logs (see end of the article on how to enable ODBC Driver snoop logs) , it was observed that the delay is happening on the already established handles (connections to Store prior of DB Failover) were 3 Retry Attempts on the Already established handle which took 1 min and 30 seconds each to report the -1 before giving up finally and creating a new handle.
below is the Isolated traces from policy server and ODBC snoop showing the issue. The ODBC Driver is not honoring the 30 seconds querytimeout passed by Siteminder
** From odbc trace , this was on SQLExecute where call started at 13:07:53 and did not return till 13:09 from Driver reported -1 which is exactly 1min and 30 seconds before Driver gave up
Line 120476: 2022-03-28 13:07:53.307364 ppid=29829:pid=29831:tid=140692793849600 ENTER SQLExecute
Line 120478: 2022-03-28 13:07:53.307364 ppid=29829:pid=29831:tid=140692793849600 HSTMT 0x007ff58c001690
Line 121812: 2022-03-28 13:09:23.382440 (Time elapsed in microseconds : 90075076) ppid=29829:pid=29831:tid=140692793849 EXIT SQLExecute with return code -1 (SQL_ERROR)
Line 121814: 2022-03-28 13:09:23.382440 (Time elapsed in microseconds : 90075076) ppid=29829:pid=29831:tid=140692793849 HSTMT 0x007ff58c001690
** Then we can see at [13:10:53.470] , the thread again reported a network issue
[03/28/2022][13:10:53.470][13:10:53][29831][140692793849600][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Error.]
[03/28/2022][13:10:53.470][13:10:53][29831][140692793849600][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Server BUSY or State with Error, so Retrying]
[03/28/2022][13:10:53.470][13:10:53][29831][140692793849600][CSmDbODBC.cpp:1555][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][][][][][][][][][][SQL Still Executing (Not expected if using syncronous connections.]
IF we look at the ODBC trace , we can see that actually at 13:09:23 after the first -1 above , we can see a retry apparently for the Execution which also took another 1 min 30 seconds to return - from driver at 13:10:53.470058
Line 121880: 2022-03-28 13:09:23.383173 ppid=29829:pid=29831:tid=140692793849600 ENTER SQLExecute
Line 121882: 2022-03-28 13:09:23.383173 ppid=29829:pid=29831:tid=140692793849600 HSTMT 0x007ff58c001690
Line 132508: 2022-03-28 13:10:53.470058 (Time elapsed in microseconds : 90086885) ppid=29829:pid=29831:tid=140692793849 EXIT SQLExecute with return code -1 (SQL_ERROR)
Line 132510: 2022-03-28 13:10:53.470058 (Time elapsed in microseconds : 90086885) ppid=29829:pid=29831:tid=140692793849 HSTMT 0x007ff58c001690
** And yet Again another pause in our Trace before we see another Network Error reported at [13:12:23.539 .
[03/28/2022][13:12:23.539][13:12:23][29831][140692793849600][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Error.]
[03/28/2022][13:12:23.539][13:12:23][29831][140692793849600][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Server BUSY or State with Error, so Retrying]
[03/28/2022][13:12:23.539][13:12:23][29831][140692793849600][CSmDbODBC.cpp:1555][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][][][][][][][][][][SQL Still Executing (Not expected if using syncronous connections.]
If you look at the ODBC trace , you can see that there was another SQLExecute that kicked in at 13:10:53 and one more time took 1min and 30 seconds to return the -1 from Driver side at [13:12:23.539]
Line 132576: 2022-03-28 13:10:53.470717 ppid=29829:pid=29831:tid=140692793849600 ENTER SQLExecute
Line 132578: 2022-03-28 13:10:53.470717 ppid=29829:pid=29831:tid=140692793849600 HSTMT 0x007ff58c001690
Line 163128: 2022-03-28 13:12:23.538768 (Time elapsed in microseconds : 90068051) ppid=29829:pid=29831:tid=140692793849 EXIT SQLExecute with return code -1 (SQL_ERROR)
** also another pause in our Trace which is exactly the same thing as above due to another sqlExecute that took 1min and 30 seconds before driver reported -1
[03/28/2022][13:13:53.631][13:13:53][29831][140692793849600][CSmDbODBC.cpp:1973][][][][][][][][][][][][][][][][][][][][][][LogMessage:INFO:[sm-log-00000] Execution time exceeded threshold. (CSmDbConnectionODBC::DoExecute, 360323, 5000, )]
Line 163196: 2022-03-28 13:12:23.554194 ppid=29829:pid=29831:tid=140692793849600 ENTER SQLExecute
Line 163198: 2022-03-28 13:12:23.554194 ppid=29829:pid=29831:tid=140692793849600 HSTMT 0x007ff58c001690
Line 189560: 2022-03-28 13:13:53.631082 (Time elapsed in microseconds : 90076888) ppid=29829:pid=29831:tid=140692793849 EXIT SQLExecute with return code -1 (SQL_ERROR)
** Over here , we can see Finally we were able to Inactivate the connection and then if you look at the same ODBC trace thread , we can see a new Connection being opened
[03/28/2022][13:14:08.633][13:14:08][29831][140692793849600][CSmDbODBC.cpp:210][CSmDbConnectionODBC::MapResult][][][][][][][][][][][][][][][][][][][][][Invalid handle '140692592531088'.]
[03/28/2022][13:14:08.633][13:14:08][29831][140692793849600][CDb.cpp:285][CSmRecordset::DoSelect][][][][][][][][][][][][][Exception][Internal Error: DB Transport Error. Error code -4017 Native code -1][][][][][][][Finish processing SQL statement.]
[03/28/2022][13:14:08.633][13:14:08][29831][140692793849600][CSmDbODBC.cpp:4367][CSmDbConnectionODBC::DoClose][][][][][][][][][][][][][][][][][][][][][Invalid Statement handle '140692592531088'.]
[03/28/2022][13:14:08.633][13:14:08][29831][140692793849600][CSmDbODBC.cpp:210][CSmDbConnectionODBC::MapResult][][][][][][][][][][][][][][][][][][][][][Invalid handle '140692592531088'.]
[03/28/2022][13:14:08.633][13:14:08][29831][140692793849600][CSmDbODBC.cpp:3397][CSmDbConnectionODBC::AssignConnection][][][][][][][][][][][][][][][][][][][][][Invalid Statement handle '140692592531088'.]
[03/28/2022][13:14:08.633][13:14:08][29831][140692793849600][CSmDbConnection.cpp:241][CSmDbConnection::ReleaseReference][][][][][][][][][][][][][][][][][][][][][Release connection reference. ReferenceCount = '7']
[03/28/2022][13:14:08.633][13:14:08][29831][140692793849600][CSmDbConnection.cpp:200][CSmDbConnection::MakeInactive][][][][][][][][][][][][][][][][][][][][][Inactivate connection.]
here is the snippet from ODBC
Line 197492: 2022-03-28 13:14:08.634537 ppid=29829:pid=29831:tid=140692793849600 ENTER SQLAllocHandle
Line 197494: 2022-03-28 13:14:08.634537 ppid=29829:pid=29831:tid=140692793849600 SQLSMALLINT 3 <SQL_HANDLE_STMT>
Line 197496: 2022-03-28 13:14:08.634537 ppid=29829:pid=29831:tid=140692793849600 SQLHANDLE 0x007ff5cc000ef0
Line 197498: 2022-03-28 13:14:08.634537 ppid=29829:pid=29831:tid=140692793849600 SQLHANDLE * 0x007ff58c0015d0
Line 197500: 2022-03-28 13:14:08.634609 (Time elapsed in microseconds : 72) ppid=29829:pid=29831:tid=140692793849 EXIT SQLAllocHandle with return code 0 (SQL_SUCCESS)
and then Finally the SQLExute succeeded
Line 197548: 2022-03-28 13:14:08.634813 ppid=29829:pid=29831:tid=140692793849600 ENTER SQLPrepareW
Line 197550: 2022-03-28 13:14:08.634813 ppid=29829:pid=29831:tid=140692793849600 HSTMT 0x007ff58c001690
Line 197552: 2022-03-28 13:14:08.634813 ppid=29829:pid=29831:tid=140692793849600 WCHAR * 0x007ff58c00ee58 [4294967293] "SELECT smagentcommandoid, smtimestamp, smcommand, smdata FROM smagentcommand4 WHERE smtimestamp >= 1648412639\x00"
Line 197554: 2022-03-28 13:14:08.634813 ppid=29829:pid=29831:tid=140692793849600 SDWORD -3
Line 197556: 2022-03-28 13:14:08.634886 (Time elapsed in microseconds : 73) ppid=29829:pid=29831:tid=140692793849 EXIT SQLPrepareW with return code 0 (SQL_SUCCESS)
Line 197558: 2022-03-28 13:14:08.634886 (Time elapsed in microseconds : 73) ppid=29829:pid=29831:tid=140692793849 HSTMT 0x007ff58c001690
Line 197560: 2022-03-28 13:14:08.634886 (Time elapsed in microseconds : 73) ppid=29829:pid=29831:tid=140692793849 WCHAR * 0x007ff58c00ee58 [4294967293] "SELECT smagentcommandoid, smtimestamp, smcommand, smdata FROM smagentcommand4 WHERE smtimestamp >= 1648412639\x00"
Line 197562: 2022-03-28 13:14:08.634886 (Time elapsed in microseconds : 73) ppid=29829:pid=29831:tid=140692793849 SDWORD -3
Line 197564: 2022-03-28 13:14:08.634927 ppid=29829:pid=29831:tid=140692793849600 ENTER SQLExecute
Line 197566: 2022-03-28 13:14:08.634927 ppid=29829:pid=29831:tid=140692793849600 HSTMT 0x007ff58c001690
Line 197568: 2022-03-28 13:14:08.643590 (Time elapsed in microseconds : 8663) ppid=29829:pid=29831:tid=140692793849 EXIT SQLExecute with return code 0 (SQL_SUCCESS)
#### Testing Results with InterruptTimeout recommended by Progress DEV Team
- With the Above and After consulting with Progress Team (DataDirect Driver used by Siteminder), it was recommended to use the InterruptTimeout in the Data Source to see if it helps .
The default for query timeout is 0 which means that the query does not time out, but the driver responds to the SQL_ATTR_QUERY_TIMEOUT attribute which is set to 30 seconds. So, QueryTimeout doesn't need to be added to the ODBC data source.
Using the positive value for InterruptTimeout causes the driver to stop waiting for the cancel acknowledgement and return to the application. In the customer's situation, the query timeout is triggered but the database is not available to return a cancel acknowledgement to the driver and that is most likely why it takes so long to respond to the failover.
Using the negative value for InterruptTimeout causes the driver to stop waiting for the cancel acknowledgement and kill the socket, which drops the connection.
InterruptTimeout set to a small positive value, say 5 seconds:
QueryTimeout 30
InterruptTimeout 5
Should return to app in 35 seconds
If that doesn't help, have them set it to a negative value:
InterruptTimeout set to a small negative value, say -5 seconds:
QueryTimeout 30
InterruptTimeout -5
Socket will be killed in 35 seconds (the connection will be gone)
Same test was performed and an improvement was observed in the Failover where this time it was taking 40 seconds this time for each Retry instead of the 90 seconds observed previously, so the InterruptTimeout added and set to 5 in the Data Srource.
However , we should be expecting to have 30 Query timeout + 5 seconds InterruptTimeout = 35 seconds for each retry and yet the results where showing 40 seconds .
05/25/2022][10:33:45.047][10:33:45][1021][139814171031296][CDb.cpp:312][CSmRecordset::Execute][][][][][][][][][][][][][][][][][][][][][Start processing SQL statement.]
[05/25/2022][10:34:25.086][10:34:25][1021][139814171031296][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Error.]
[05/25/2022][10:34:25.086][10:34:25][1021][139814171031296][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Server BUSY or State with Error, so Retrying]
[05/25/2022][10:34:25.086][10:34:25][1021][139814171031296][CSmDbODBC.cpp:1555][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][][][][][][][][][][SQL Still Executing (Not expected if using syncronous connections.]
[05/25/2022][10:35:05.124][10:35:05][1021][139814171031296][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Error.]
[05/25/2022][10:35:05.124][10:35:05][1021][139814171031296][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Server BUSY or State with Error, so Retrying]
[05/25/2022][10:35:45.152][10:35:45][1021][139814171031296][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Error.]
[05/25/2022][10:35:45.152][10:35:45][1021][139814171031296][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][HY000][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Network Operation Timed Out.][][][][][][][SQL Server BUSY or State with Error, so Retrying]
2022-05-25 10:33:45.047961 ppid=1019:pid=1021:tid=139814171031296 ENTER SQLExecute
2022-05-25 10:34:25.086375 (Time elapsed in microseconds : 40038414) ppid=1019:pid=1021:tid=13981417103129 EXIT SQLExecute with return code -1 (SQL_ERROR)
2022-05-25 10:34:25.086741 ppid=1019:pid=1021:tid=139814171031296 ENTER SQLExecute
2022-05-25 10:35:05.124140 (Time elapsed in microseconds : 40037399) ppid=1019:pid=1021:tid=13981417103129 EXIT SQLExecute with return code -1 (SQL_ERROR)
2022-05-25 10:35:05.124572 ppid=1019:pid=1021:tid=139814171031296 ENTER SQLExecute
2022-05-25 10:35:45.151727 (Time elapsed in microseconds : 40027155) ppid=1019:pid=1021:tid=13981417103129 EXIT SQLExecute with return code -1 (SQL_ERROR)
2022-05-25 10:35:45.167326 ppid=1019:pid=1021:tid=139814171031296 ENTER SQLExecute
2022-05-25 10:36:25.203728 (Time elapsed in microseconds : 40036402) ppid=1019:pid=1021:tid=13981417103129 EXIT SQLExecute with return code -1 (SQL_ERROR)
##### Feedback from Progress DEV team on the additional 5 seconds observed after adding the InterruptTimeout
Progress team reproduced the issue in their Lab and determined that the extra 5 seconds taken by InterruptTimeout is a bug that would be addressed in Progress Defect XDBC-37372.
a DEV fix for the Driver was provided where the driver will now return in QueryTimeout+InterruptTimeout with no extra time taken
##### Testing the ODBC Driver DEV fix provided by Progress
After implementing the DEV fix and testing the same failover scenario ,f rom the logs listed below , you can see that before the Driver was attempting 4 times with each time taking 40 seconds (query timeout was set to 30 and the Interrupt timeout was set to 5 so our assumption was 30+ 5 = 35 but it was taking 40 . The additional 5 seconds is the reason Progress DEV team gave us the patch that is supposed to remove the 5 seconds delay)
so we have 4 retries on the query with each taking 40 seconds = 40 + 40 + 40+ 40 = 160 seconds so roughly around 3 min
however , if I look into the new logs , the Transaction is failing in around 13 seconds
If you noticed , the driver on the 3 retries is now taking microseconds to complete .
After Driver Update
-------------------------------------
[09/28/2022][19:22:06.291][19:22:06][14489][140100742665984][CDb.cpp:221][CSmRecordset::DoSelect][][][][][][][][][][][][][][][][][][][][][Start processing SQL statement.]
[09/28/2022][19:22:06.291][19:22:06][14489][140100742665984][CDb.cpp:312][CSmRecordset::Execute][][][][][][][][][][][][][][][][][][][][][Start processing SQL statement.]
[09/28/2022][19:22:19.193][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][08S01][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Connection Dead.][][][][][][][SQL Error.]
[09/28/2022][19:22:19.193][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][08S01][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]Connection Dead.][][][][][][][SQL Server BUSY or State with Error, so Retrying]
[09/28/2022][19:22:19.193][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1555][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][][][][][][][][][][SQL Still Executing (Not expected if using syncronous connections.]
[09/28/2022][19:22:19.194][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][08S01][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]ORA-03114: not connected to ORACLE][][][][][][][SQL Error.]
[09/28/2022][19:22:19.194][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][08S01][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]ORA-03114: not connected to ORACLE][][][][][][][SQL Server BUSY or State with Error, so Retrying]
[09/28/2022][19:22:19.194][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1555][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][][][][][][][][][][SQL Still Executing (Not expected if using syncronous connections.]
[09/28/2022][19:22:19.194][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1472][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][08S01][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]ORA-03114: not connected to ORACLE][][][][][][][SQL Error.]
[09/28/2022][19:22:19.194][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1521][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][08S01][-1][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]ORA-03114: not connected to ORACLE][][][][][][][SQL Server BUSY or State with Error, so Retrying]
[09/28/2022][19:22:19.194][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1555][CSmDbConnectionODBC::CheckForError][][][][][][][][][][][][][][][][][][][][][SQL Still Executing (Not expected if using syncronous connections.]
[09/28/2022][19:22:19.209][19:22:19][14489][140100742665984][CSmDbODBC.cpp:1973][][][][][][][][][][][][][][][][][][][][][][LogMessage:INFO:[sm-log-00000] Execution time exceeded threshold. (CSmDbConnectionODBC::DoExecute, 12918, 5000, agent= client= server= resource= action= user=)]
[09/28/2022][19:22:19.209][19:22:19][14489][140100742665984][CSmDbODBC.cpp:362][CSmDbConnectionODBC::MapResult][][][][][][][][][][][][-1][][[CA SSO][ODBC Oracle Wire Protocol driver][Oracle]ORA-03114: not connected to ORACLE][][][][][][][ODBC Error: State = 08S01 Internal Code = -3114 - s - MappedResult:-4017]
[09/28/2022][19:22:19.209][19:22:19][14489][140100742665984][CSmDbUtilities.cpp:806][CSmDbMonitoredClass::SetState][][][][][][][][][][][][][][][][][][][][][Connection SiteMinder Data Source: Changing object state 'Active' to state 'Hung'.]
[09/28/2022][19:22:19.209][19:22:19][14489][140100742665984][CSmDbODBC.cpp:567][CSmDbConnectionODBC::CancelSQLCall][][][][][][][][][][][][][][][][][][][][][Cancel executed, Connection marked as hung.]
[09/28/2022][19:22:19.215][19:22:19][14489][140100742665984][CDb.cpp:285][CSmRecordset::DoSelect][][][][][][][][][][][][][Exception][Internal Error: DB Transport Error. Code is -4017 (DBMSG: <<<State = 08S01 Internal Code = -3114 - s>>>)][][][][][][][Finish processing SQL statement.]
Line 293434: 2022-09-28 19:22:06.291276 ppid=14486:pid=14489:tid=140100742665984 ENTER SQLExecute
Line 293436: 2022-09-28 19:22:06.291276 ppid=14486:pid=14489:tid=140100742665984 HSTMT 0x007f6ba408f230
Line 293730: 2022-09-28 19:22:19.193357 (Time elapsed in microseconds : 12902081) ppid=14486:pid=14489:tid=140100742665 EXIT SQLExecute with return code -1 (SQL_ERROR)
Line 293798: 2022-09-28 19:22:19.193746 ppid=14486:pid=14489:tid=140100742665984 ENTER SQLExecute
Line 293800: 2022-09-28 19:22:19.193746 ppid=14486:pid=14489:tid=140100742665984 HSTMT 0x007f6ba408f230
Line 293802: 2022-09-28 19:22:19.193780 (Time elapsed in microseconds : 34) ppid=14486:pid=14489:tid=140100742665 EXIT SQLExecute with return code -1 (SQL_ERROR)
Line 293870: 2022-09-28 19:22:19.194084 ppid=14486:pid=14489:tid=140100742665984 ENTER SQLExecute
Line 293872: 2022-09-28 19:22:19.194084 ppid=14486:pid=14489:tid=140100742665984 HSTMT 0x007f6ba408f230
Line 293874: 2022-09-28 19:22:19.194108 (Time elapsed in microseconds : 24) ppid=14486:pid=14489:tid=140100742665 EXIT SQLExecute with return code -1 (SQL_ERROR)
Line 293942: 2022-09-28 19:22:19.209451 ppid=14486:pid=14489:tid=140100742665984 ENTER SQLExecute
Line 293944: 2022-09-28 19:22:19.209451 ppid=14486:pid=14489:tid=140100742665984 HSTMT 0x007f6ba408f230
Line 293946: 2022-09-28 19:22:19.209515 (Time elapsed in microseconds : 64) ppid=14486:pid=14489:tid=140100742665 EXIT SQLExecute with return code -1 (SQL_ERROR)
##### Progress team Feedback on the test results of the DEV fix
We saw a huge improvement in SQL Query failover during the DB Failover after implementing the patch provided by Progress team.
instead of 160 seconds To determine connection is bad (including retries) , now it is taking around 13 seconds .
From the Driver trace 30 seconds in the Query timeout
ENTER SQLSetStmtAttr
SQLHSTMT 0x007f6ba408f230
SQLINTEGER 0 <SQL_ATTR_QUERY_TIMEOUT>
SQLPOINTER 30
before , the driver was taking 40 seconds to mark the SQL Execute Error
2022-05-25 10:33:45.047961 ppid=1019:pid=1021:tid=139814171031296 ENTER SQLExecute
2022-05-25 10:34:25.086375 (Time elapsed in microseconds : 40038414) ppid=1019:pid=1021:tid=13981417103129 EXIT SQLExecute with return code -1 (SQL_ERROR)
And then perform 3 other retries each also taking 40 second
After the Fix , I can see that the SQL Execute is taking 12 seconds to determine the connection is bad and then the other 3 retries are roughly taking 40 microseconds each
Line 293434: 2022-09-28 19:22:06.291276 ppid=14486:pid=14489:tid=140100742665984 ENTER SQLExecute
Line 293436: 2022-09-28 19:22:06.291276 ppid=14486:pid=14489:tid=140100742665984 HSTMT 0x007f6ba408f230
Line 293730: 2022-09-28 19:22:19.193357 (Time elapsed in microseconds : 12902081) ppid=14486:pid=14489:tid=140100742665 EXIT SQLExecute with return code -1 (SQL_ERROR)
So you can see there is a huge difference in performance and marking connection as bad .
Couple of questions came out since expectations was still we are going to have 30+5 seconds as explained above after the fix but it is not the case , it is much faster now.
We sent the below questions to Progress DEV team
1) If the Query timeout is 30 seconds as seen passed from Siteminder code to Driver , why the Driver is returning Error at the 13 seconds mark ? does it now not wait for the 30 seconds to pass and rather now report issue as soon as returned by the Database or tcp layer ?
2) same for the Retries , before it used to take 40 seconds also for each retry , now it is taking around 40 microsecond to report -1 connect error , is it based on the tcp response and no longer wait for the query timeout ?
and the Answer was as follows
Development has responded that it is expected for the driver to return "Connection dead" instead of "Network Operation Timed Out" after the fix.
As part of the fix, the driver now determines whether the connection is dead during the current ODBC call instead of returning the timeout error and then returning "Connection dead" error after a subsequent ODBC call.
This means that the driver returns as soon as it determines that the connection is dead, which is why it returns sooner.
The drivers that can use InterruptTimeout are DB2, Oracle and Sybase ASE and Sybase IQ.
##### Conclusion
Policy server 12.8 SP7 and lower are impacted by the Failover delay.
Please reach out to Broadcom Siteminder support so the updated DataDrirect Driver patch can be provided based on Defect DE531927 .
The same DataDirect fix will be included in the next Siteminder official release
####### Enable DataDirect Driver logs #########
Modify your system_odbc.ini file and go to the end and set the tacing to from Trace=0 to Trace=1 which will enable the tracing
Example
[ODBC]
Trace=1
TraceFile=/opt/CA/siteminder/db/odbctrace.out
TraceDll=/opt/CA/siteminder/odbc/lib/NStrc28.so
InstallDir=/opt/CA/siteminder/odbc
DBCTraceMaxFileSize=204,800
ODBCTraceMaxNumFiles=50
To turn on the internal snoop set the following connection options in the odbc.ini file
EnablePacketLogging = 1
PacketLoggingFlush=1
PacketLoggingFilePrefix=/tmp/DB2_snoop_