This issue is still ongoing. We just experienced it again this week (on Windows side) and needed to restart TDM portal service again. The number of sessions "INACTIVE" on the DB reached more than 1200. Our limit on the DB server is set to 2000, but looks like when it reaches 1200 TDM Portal tends to get unstable and we are even getting connection to the DB issues also.
Once the TDM Portal service gets restarted, everything gets back to normal until the issue comes back.
Release : 4.9.1
Component : TDM Web Portal
I've tested out a non-kubernetes Windows TDM machine and a Unix TDM machine with these settings:
Configuration in applications.properties:
spring.datasource.tomcat.validation-query=select 1 from dual
Configuration in sqlnet.ora on Oracle server:
SQLNET.EXPIRE_TIME=3 # Minutes to wait till Oracle server tests inactive sessions
Tomcat will keep 5 connections open and active in the pool at all times. If the pool connections are idle for more than 10000 milli-seconds the validation-query will be executed. If no response is received within 10 seconds the connection will be marked as bad and flagged for closure. The time-between-eviction-runs-millis should be set low enough so connections are not seen as inactive by any firewalls or other network processes.
Connections > 5 (up to the max-idle if set), will be eligible for closure after 60 seconds. This prevents stale connections from causing data update timeouts.
The preceding tomcat configuration can result in stale/dead server-side connections if the proper close packets do not reach the database. To prevent this the SQLNET.EXPIRE_TIME should be set in the sqlnet.ora file. If EXPIRE_TIME is set to 3 minutes, for example, database server-side connections will be tested after 3 minutes of inactivity and closed after an additional minute of retry attempts (retries every 6 seconds up to 10 times). This will prevent bad connections from accumulating and destabilizing the Oracle server.
In my testing I killed the connections between the tomcat and oracle server in a variety of ways:
Send FIN packets to each side, properly terminating the sessions.
Disconnect one side and with no packets sent to the other side.
Simulated physical wire disconnect (i.e. neither side receives termination packets).
In each case connections were cleaned up with the above settings.
With those settings turned-off stale connections would accumulate and in some cases timeout minutes or hours later. The behavior depended on the state of the connection (i.e. was it idle or in the middle of a transaction). Idle connections would always hang, sometimes indefinitely. Terminated connections processing transactions would frequently clean up, but could still result in a stale session on one side or the other.
These settings were originally created to address issues in the Kubernetes test instances. Tomcat connection pool sessions were being terminated incorrectly. Additionally, the TCP close packets that normally finalize a session end were being filtered out in some cases. The above settings resolved that issue (though I'm not sure the SQLNET.ORA setting was put in place)
Because it forces the tomcat server to actively manage and maintain all the connections the additional application.settings parameters might prevent Oracle-side inactive sessions. It depends, though, on the root cause of the disconnection. In the K8 case we never found the specific along-the-wire network issue that caused the tomcat and oracle server connections to go bad. The logs themselves would only show statement timeouts long after the issue. There was never any time-of-incident log message, which was the main clue that this wasn't an issue rooted on the origin or target.
The connectivity probes add a small amount of overhead network connections. On my two systems I couldn't see any degradation in performance.
Oracle Docs: Parameters for the sqlnet.ora File - EXPIRE_TIME
SQLNET and TCP Keepalive Settings - Included for the EXPIRE_TIME setting examples. Part of this article goes one level lower into the TCP Keepalive settings. It's unlikely lower level network settings have anything to do with this problem, at least on the OS side.