VMware GemFire: Resolving thread exhaustion issues and client timeouts

Products

VMware Tanzu GemFire

Issue/Introduction

This article helps resolve client connection timeouts and server-side thread exhaustion in VMware GemFire.

Such issues have many possible causes; this article covers the ones most commonly seen with VMware GemFire. Customers often open tickets reporting symptoms such as an increase in client timeouts or thread exhaustion on the server side.

These symptoms can often be resolved by tuning the VMware GemFire configuration together with your network settings.

One possible symptom is the thread exhaustion log message:

[warning 2021/03/22 13:52:36.302 EDT xxx <Handshaker 0.0.0.0/0.0.0.0:40404 Thread 1> tid=0x53] Rejected connection from Server connection from [client host address=xxx; client port=xxx] because incoming request was rejected by pool possibly due to thread exhaustion
===============

You may also see examples such as these in your logs:

[info 2021/03/22 13:47:58.751 EDT xxx<disconnect thread for xxx(xxx:xxx)<v6>:41000> tid=0x741] Timed out waiting for readerThread on xxx(xxx:xxx)<v6>:41000@1418(GEODE 1.8.0) to finish.

[warning 2021/03/22 13:47:58.782 EDT cache5 <ClientHealthMonitor Thread> tid=0x64] Server connection from [identity(xxx(15:loner):52392:xxx,connection=1; port=34810] is being terminated because its client timeout of 3000 has expired.

==========

The client timeout reported above (3000 ms in this example) is the client-side pool read-timeout value. Increasing it may be sufficient in some cases. The default is 10,000 ms (10 seconds).
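
To see how often each symptom occurs, and which read-timeout value the affected clients are using, the log messages quoted above can be scanned programmatically. The following is a minimal sketch only: LogSymptomScanner and its methods are hypothetical helpers, not part of VMware GemFire, and the match strings are taken from the example log lines in this article, so they may need adjusting for other versions.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a GemFire API): classifies server log lines by symptom.
class LogSymptomScanner {
    // Matches the ClientHealthMonitor message and captures the timeout in milliseconds.
    private static final Pattern CLIENT_TIMEOUT =
        Pattern.compile("client timeout of (\\d+) has expired");

    // Fragment of the Handshaker rejection message quoted in this article.
    private static final String THREAD_EXHAUSTION = "possibly due to thread exhaustion";

    // True when the line is a connection rejection attributed to thread exhaustion.
    static boolean isThreadExhaustion(String line) {
        return line.contains(THREAD_EXHAUSTION);
    }

    // Returns the expired client timeout in ms, or empty if the line is not that message.
    static OptionalInt clientTimeoutMs(String line) {
        Matcher m = CLIENT_TIMEOUT.matcher(line);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Tails of the two example messages from this article.
        List<String> lines = List.of(
            "because incoming request was rejected by pool possibly due to thread exhaustion",
            "is being terminated because its client timeout of 3000 has expired.");

        for (String line : lines) {
            if (isThreadExhaustion(line)) {
                System.out.println("thread exhaustion rejection");
            }
            clientTimeoutMs(line).ifPresent(
                ms -> System.out.println("client read-timeout expired after " + ms + " ms"));
        }
    }
}
```

A reported timeout well below the 10,000 ms default, as in the example above, suggests the client pool has been configured with a shorter read-timeout than the default.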

However, when the system is being overwhelmed by concurrent connection attempts, you may need to change more of your configuration.

These symptoms are caused by an insufficient accept queue at the TCP layer, and they are most likely to appear during a burst of client-to-server connection activity.
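
As an illustration of why the accept queue can be smaller than expected: on Linux, the backlog a server requests when it starts listening is capped by the kernel's net.core.somaxconn value, so a large requested backlog has no effect while somaxconn remains low. The sketch below assumes a Linux host with the value readable at /proc/sys/net/core/somaxconn; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalInt;

// Hypothetical helper: shows how the kernel limit bounds the accept queue.
class AcceptQueueCheck {
    // On Linux, the backlog requested by the application is capped at net.core.somaxconn.
    static int effectiveBacklog(int requestedBacklog, int somaxconn) {
        return Math.min(requestedBacklog, somaxconn);
    }

    // Reads the current somaxconn value; empty on non-Linux hosts or if it cannot be read.
    static OptionalInt readSomaxconn() {
        Path path = Path.of("/proc/sys/net/core/somaxconn");
        try {
            return OptionalInt.of(Integer.parseInt(Files.readString(path).trim()));
        } catch (IOException | NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        int requested = 1280; // the backlog value this article suggests for p2p.backlog
        OptionalInt somaxconn = readSomaxconn();
        if (somaxconn.isPresent()) {
            System.out.println("somaxconn=" + somaxconn.getAsInt()
                + ", effective accept queue for a requested backlog of " + requested + " is "
                + effectiveBacklog(requested, somaxconn.getAsInt()));
        } else {
            System.out.println("Could not read somaxconn (is this a Linux host?)");
        }
    }
}
```

With somaxconn at 128, a requested backlog of 1280 is still limited to 128 pending connections, which is why both values need to be raised together.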

Environment

Product Version: 9.10

Resolution

Checklist

Check the system property BridgeServer.HANDSHAKE_POOL_SIZE. If it is not set, the server is likely using the very low legacy default of 4, which is inadequate. The default is being increased in newer versions, but the recommendation is to set it to at least 20, and potentially higher if you see evidence warranting it.
Check the net.core.somaxconn setting in /etc/sysctl.conf. If it is only 128, the default on many current systems, it is much too low. The VMware GemFire Best Practices Guide suggests 1280.
You can check whether you are affected by running the "nstat -a" command and examining the TcpExtListenDrops and TcpExtListenOverflows values. If either is nonzero, you are vulnerable and should increase somaxconn.
Also set the system property p2p.backlog=1280, the same value as somaxconn.
Always consult the VMware GemFire Best Practices Guide for how best to configure many aspects of VMware GemFire.
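
The nstat check in the list above can also be automated. The sketch below parses text in the row format printed by "nstat -a" (counter name, value, rate) and reports whether the listen queue has overflowed. It assumes the counter names TcpExtListenOverflows and TcpExtListenDrops, as printed by nstat on typical Linux systems. The class name is hypothetical and the sample values in main are made up for illustration, not taken from a real system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: inspects nstat output for listen-queue overflow counters.
class ListenQueueCounters {
    // Parses "name value rate" rows; lines starting with '#' are treated as headers.
    static Map<String, Long> parse(String nstatOutput) {
        Map<String, Long> counters = new HashMap<>();
        for (String line : nstatOutput.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 2) {
                continue;
            }
            try {
                counters.put(cols[0], Long.parseLong(cols[1]));
            } catch (NumberFormatException e) {
                // Skip rows whose second column is not an integer counter value.
            }
        }
        return counters;
    }

    // A counter missing from the output is treated as zero.
    static boolean acceptQueueOverflowed(Map<String, Long> counters) {
        return counters.getOrDefault("TcpExtListenOverflows", 0L) > 0
            || counters.getOrDefault("TcpExtListenDrops", 0L) > 0;
    }

    public static void main(String[] args) {
        // Made-up sample in the nstat row format, for illustration only.
        String sample = "#kernel\n"
            + "TcpExtListenOverflows           12                 0.0\n"
            + "TcpExtListenDrops               12                 0.0\n";
        if (acceptQueueOverflowed(parse(sample))) {
            System.out.println("Listen queue overflowed: consider increasing somaxconn");
        } else {
            System.out.println("No listen queue overflows recorded");
        }
    }
}
```

In practice the text passed to parse would be the captured output of "nstat -a" on the server host.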

Follow the checklist included above, but to summarize:

BridgeServer.HANDSHAKE_POOL_SIZE=50
p2p.backlog=1280
net.core.somaxconn=1280
Increase the client-side pool read-timeout setting to eliminate timeouts and retries.
Increase the server-side max-connections setting if necessary, but this is generally not needed if the above steps resolve the issue.
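
As a quick sanity check, the two system properties in the summary can be verified from inside a JVM. The sketch below assumes they are supplied as ordinary Java system properties (for example with -D arguments on the server's command line); the class and method names are hypothetical, and the recommended values are the ones from the summary above. Note that somaxconn is a kernel setting and is not visible as a JVM property.

```java
// Hypothetical helper: checks JVM system properties against this article's summary values.
class GemFireTuningCheck {
    static final String HANDSHAKE_PROP = "BridgeServer.HANDSHAKE_POOL_SIZE";
    static final String BACKLOG_PROP = "p2p.backlog";

    // True only when the property is set to an integer at or above the recommended value.
    static boolean meetsRecommendation(String property, int recommended) {
        Integer value = Integer.getInteger(property); // null when unset or not an integer
        return value != null && value >= recommended;
    }

    static void report(String property, int recommended) {
        String status = meetsRecommendation(property, recommended) ? "ok" : "below recommendation";
        System.out.println(property + "=" + System.getProperty(property)
            + " (recommended >= " + recommended + "): " + status);
    }

    public static void main(String[] args) {
        report(HANDSHAKE_PROP, 50);
        report(BACKLOG_PROP, 1280);
    }
}
```

The result only reflects the JVM in which the check runs, so it is meaningful when started with the same -D arguments as the GemFire server, for example -DBridgeServer.HANDSHAKE_POOL_SIZE=50 -Dp2p.backlog=1280.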