Tuning TCP Keepalive for In-Progress Tasks

Article ID: 142410

Updated On:

Products

CA Identity Manager, CA Identity Governance, CA Identity Portal, CA Identity Suite

Issue/Introduction

 

We are seeing Session Timeout errors in the server log after we log in to the User Console and stay idle for 5 minutes. We can still submit tasks from the User Console, but the task status is not updated: it shows "In Progress" even though the events may have completed. In some cases, the events do not complete either.
We recently changed the session timeout to 15 minutes.

Cause

Firewall timeouts were preventing tasks from completing.

The following commands were executed on one of the IdM servers to check the tcp_keepalive and tcp_retries2 configurations:

Command: sysctl -a | grep net.ipv4.tcp_keepalive
Result:
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15

Command: sysctl -a | grep net.ipv4.tcp_retries2
Result: net.ipv4.tcp_retries2 = 15

With these settings, the system sends the first TCP keepalive probe after 7200 seconds of idle time and then a new probe every 15 seconds; after 5 unanswered probes the connection is considered broken. With tcp_retries2 at its default value of 15, it takes 924.6 seconds before a broken network link is reported to the upper layer (i.e. the application). The details are explained here: https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html.
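
The two figures above can be reproduced with a short calculation. This is a minimal sketch, assuming the Linux defaults described in the linked article (a 200 ms initial retransmission timeout that doubles on every retry and is capped at 120 s); the function and variable names are illustrative only.

```shell
# Values reported by sysctl on the IdM server
ka_time=7200     # net.ipv4.tcp_keepalive_time (seconds)
ka_intvl=15      # net.ipv4.tcp_keepalive_intvl (seconds)
ka_probes=5      # net.ipv4.tcp_keepalive_probes

# Idle connection: first probe after ka_time, then one probe per interval.
# Approximate time until the kernel declares the peer dead.
ka_detect=$(( ka_time + ka_intvl * ka_probes ))
echo "idle connection declared dead after about ${ka_detect}s"

# Connection with unacknowledged data: sum the retransmission timeouts.
# Assumption: RTO starts at 0.2s, doubles each retry, capped at 120s.
retries2_timeout() {
  awk -v n="$1" 'BEGIN {
    rto = 0.2; total = 0
    for (i = 0; i <= n; i++) {      # n retries plus the final wait
      total += rto
      rto = (rto * 2 > 120) ? 120 : rto * 2
    }
    printf "%.1f\n", total
  }'
}
echo "tcp_retries2=15 gives up after $(retries2_timeout 15)s"
```

The second result, roughly 15 minutes, is in the same range as the delays described in the next paragraph.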

Based on the VST review, most of the delays showed a time gap of 15 minutes, and a few were longer. All appeared to be related to provisioning operations. We reviewed the logs for one of the tasks that took a long time to complete and noticed that the IdM and IMPS servers were on different subnets. The etatrans log indicated that IMPS did not receive the request from IdM in a timely manner; once IMPS received the request, it processed it quickly.

If the network device between IdM and IMPS closes idle connections before the first keepalive packet is sent, we recommend testing the following configuration in the QUAL environment:

Resolution

--LINUX--

1. Log on to the IdM server as the "config" user
2. Modify the /etc/sysctl.conf file: vi /etc/sysctl.conf

Add the following line:

net.ipv4.tcp_keepalive_time=600

Note: On vApp, put the net.ipv4.tcp_keepalive_time entry (for example, net.ipv4.tcp_keepalive_time = 60), and any other required changes or additions, above the section

# Controls source route verification
###########################
# CA Technologies - START #
###########################

because anything written between the sections

# Controls source route verification
###########################
# CA Technologies - START #
###########################

and

#########################
# CA Technologies - END #
#########################

is overwritten by the vApp startup scripts.


Save the change.
3. Run the following command to reload the change: sysctl -p
4. Run the following command to verify the update: sysctl -a | grep net.ipv4.tcp_keepalive
5. Restart the IdM server (WildFly)
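
The file edit in step 2, including the vApp placement rule, could be scripted along the following lines. This is only a sketch: add_sysctl_entry is a hypothetical helper, not part of the product, and the scratch file contents are made up for the demonstration. It inserts the entry above the "# Controls source route verification" line when that line exists and appends it otherwise.

```shell
# Hypothetical helper: add ENTRY to FILE above the vApp-managed section,
# or at the end of the file when the marker line is absent.
add_sysctl_entry() {
  file=$1; entry=$2
  marker='# Controls source route verification'
  # Do nothing if the exact line is already present.
  grep -qxF -- "$entry" "$file" && return 0
  tmp=$(mktemp) || return 1
  awk -v entry="$entry" -v marker="$marker" '
    $0 == marker && !inserted { print entry; inserted = 1 }
    { print }
    END { if (!inserted) print entry }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Try it on a scratch file first, not on the live /etc/sysctl.conf.
scratch=$(mktemp)
printf '%s\n' 'kernel.sysrq = 0' \
  '# Controls source route verification' \
  '###########################' \
  '# CA Technologies - START #' \
  '###########################' > "$scratch"
add_sysctl_entry "$scratch" 'net.ipv4.tcp_keepalive_time=600'
cat "$scratch"

# On the real server the same call would target /etc/sysctl.conf,
# followed by steps 3 and 4:
#   sysctl -p
#   sysctl -a | grep net.ipv4.tcp_keepalive
```

Writing the result back with cat rather than mv keeps the ownership and permissions of the original file.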

------------
Note that net.ipv4.tcp_keepalive_time=600 is only an example. The value must be shorter than the idle timeout configured on the network between IdM and IMPS. If the network timeout is shorter than 600 seconds, reduce the tcp_keepalive_time value accordingly.
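
The rule can be expressed as a simple comparison. The sketch below assumes the idle timeout of the network device is known; the 900-second figure is a hypothetical placeholder, not a value taken from this environment, and keepalive_beats_timeout is an illustrative name.

```shell
# Succeeds when the first keepalive probe is sent before the network device
# would drop the idle connection. Both arguments are in seconds.
keepalive_beats_timeout() {
  [ "$1" -lt "$2" ]
}

device_idle_timeout=900   # hypothetical; ask the network team for the real value

for candidate in 7200 600 60; do
  if keepalive_beats_timeout "$candidate" "$device_idle_timeout"; then
    echo "tcp_keepalive_time=$candidate: probe sent before the ${device_idle_timeout}s idle timeout"
  else
    echo "tcp_keepalive_time=$candidate: connection may be dropped before the first probe"
  fi
done
```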

----------
The example above sets keepalive_time = 600 (10 minutes). If a new delay of 10 minutes appears and is related to the reduced keepalive_time, we advise reducing keepalive_time further, for example to 60 seconds, so that the impact can be measured.

The following page, like several other sites, describes typical keepalive values:

https://en.wikipedia.org/wiki/Keepalive

"Typically TCP Keepalives are sent every 45 or 60 seconds on an idle TCP connection, and the connection is dropped after 3 sequential ACKs are missed. This varies by host, e.g. by default Windows PCs send the first TCP Keepalive packet after 7200000ms (2 hours), then sends 5 Keepalives at 1000ms intervals, dropping the connection if there is no response to any of the Keepalive packets".

 

--WINDOWS--

For Windows Server, a registry entry can be added to set the TCP keepalive time. Windows also defaults to 7200 seconds unless the entry is set. Note that the time must be specified in milliseconds: 60 seconds is stored as REG_DWORD 0x0000ea60 (60000).

The registry entry is:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime
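
The seconds-to-milliseconds conversion is easy to get wrong, so the sketch below computes the REG_DWORD value for a given number of seconds and prints a matching reg add command line that could be run from an elevated Windows prompt. The helper names are illustrative; the command is printed only, nothing is written to any registry.

```shell
# Convert a keepalive time in seconds to the REG_DWORD value (milliseconds).
keepalive_dword() {
  ms=$(( $1 * 1000 ))
  printf '0x%08x (%d)\n' "$ms" "$ms"
}

# Print (not run) the Windows command that would set the value.
reg_add_command() {
  printf 'reg add "%s" /v KeepAliveTime /t REG_DWORD /d %d /f\n' \
    'HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters' "$(( $1 * 1000 ))"
}

keepalive_dword 60      # 0x0000ea60 (60000)
keepalive_dword 600     # 0x000927c0 (600000)
reg_add_command 60
```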