How to work around the "locator request thread" stuck issue in VMware GemFire

Products

VMware Tanzu Gemfire

Issue/Introduction

All VMware GemFire clients fail to connect to the locators of the VMware GemFire cluster and throw NoAvailableLocatorsException:

org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [locator-host1:10334, locator-host2:10334]
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:XXX)
at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:XXX)
at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:XXX)

Meanwhile, a thread dump of the locator shows many "locator request thread" threads (over 90) that remain blocked indefinitely in java.net.SocketInputStream.socketRead0():

"locator request thread[559906]" #873926 daemon prio=5 os_prio=0 tid=0x00007ff7d4046800 nid=0x3852 runnable [0x00007ff78bd4a000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
	- locked <0x00000005cc033dc8> (a java.lang.Object)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
	- locked <0x00000005cc033e88> (a java.lang.Object)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)
	at org.apache.geode.internal.net.SocketCreator.configureServerSSLSocket(SocketCreator.java:1011)
	at org.apache.geode.distributed.internal.tcpserver.TcpServer$3.run(TcpServer.java:XXX)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
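
To check how many threads in a dump match this pattern, the dump file can be scanned for thread headers that start with "locator request thread" and contain a socketRead0 frame. The following is a minimal sketch in plain Java; the class name StuckThreadCounter and the parsing rules (a thread header starts with a double quote, threads are separated by blank lines) are assumptions for illustration and are not part of GemFire.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class StuckThreadCounter {

    /** Counts "locator request thread" entries whose stack contains a socketRead0 frame. */
    static int countStuckLocatorThreads(List<String> dumpLines) {
        int count = 0;
        boolean inLocatorThread = false;
        for (String line : dumpLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Assumption: a blank line ends the current thread entry
                inLocatorThread = false;
            } else if (trimmed.startsWith("\"")) {
                // Assumption: a thread header line starts with the quoted thread name
                inLocatorThread = trimmed.startsWith("\"locator request thread");
            } else if (inLocatorThread
                    && trimmed.startsWith("at java.net.SocketInputStream.socketRead0")) {
                count++;
                inLocatorThread = false; // count each thread only once
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a thread dump file saved from the locator JVM
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("stuck locator request threads: " + countStuckLocatorThreads(lines));
    }
}
```

A count close to the configured pool limit suggests that the locator has few or no threads left to serve new requests.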

Environment

Product Version: Other

Resolution

VMware GemFire locators use the JDK's java.net layer for TCP/IP socket operations. In this layer, SocketInputStream.socketRead0() is the native method that reads the data received from remote VMware GemFire clients.

The stack trace above shows each "locator request thread" blocked in java.net.SocketInputStream.socketRead0() during the SSL handshake (SSLSocketImpl.startHandshake). This suggests that the locator never receives all of the data it is waiting for from the client.
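
The blocking behavior can be reproduced with plain JDK sockets. The sketch below is an illustration only, not GemFire code: the class name ReadTimeoutSketch and the 200 ms value are assumptions. A peer connects but never sends anything, and the accepting side only gives up on the read because a socket timeout (SO_TIMEOUT) is set. Without that timeout, the read would block for as long as the connection stays open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /**
     * Accepts one connection from a peer that never sends data and tries to read from it.
     * Returns true if the read was abandoned because the timeout expired.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket silentPeer = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // With a timeout of 0 (the default) the read below would block indefinitely
            accepted.setSoTimeout(timeoutMillis);
            InputStream in = accepted.getInputStream();
            try {
                in.read(); // no data ever arrives from silentPeer
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the read gave up after timeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```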

One possible root cause is a network problem between the clients and the locators, for example a firewall that drops or blocks some packets. As stuck threads accumulate, the locator keeps creating new "locator request thread" threads until their number reaches the "gemfire.TcpServer.MAX_POOL_SIZE" [*1] threshold.
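
The effect of a bounded pool whose workers never finish can be modeled with a standard java.util.concurrent thread pool. This is a simplified model under stated assumptions, not the locator's actual implementation: the class name PoolExhaustionSketch, the use of a SynchronousQueue, and the JDK's default rejection policy are all choices made for this illustration, and the real locator may behave differently once the limit is reached.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionSketch {

    /** Returns how many requests a pool capped at maxPoolSize accepts when no handler ever finishes. */
    static int acceptedRequests(int maxPoolSize, int requests) {
        // Stands in for handshake data that never arrives
        CountDownLatch neverArrives = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, maxPoolSize, 60, TimeUnit.SECONDS, new SynchronousQueue<>());
        int accepted = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            neverArrives.await(); // models a thread stuck in a socket read
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                    accepted++;
                } catch (RejectedExecutionException e) {
                    // Every worker is blocked and the pool is at its maximum size
                }
            }
        } finally {
            pool.shutdownNow(); // interrupt the blocked workers so the JVM can exit
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("accepted: " + acceptedRequests(3, 5));
    }
}
```

In this model, a pool limited to 3 threads accepts only 3 of 5 requests, because none of the blocked workers ever becomes free.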

First, ask your network team to resolve the underlying network issue. In the meantime, you can work around the problem either by increasing "gemfire.TcpServer.MAX_POOL_SIZE" [*1] so that the locator can run more locator request threads, or by tuning "gemfire.TcpServer.READ_TIMEOUT" [*2] so that stuck locator request threads time out.

[*1]: Locator system property: "-Dgemfire.TcpServer.MAX_POOL_SIZE"
This property limits the number of threads that the locator uses to process gossip messages and server location requests. The default value is 100.

[*2]: Locator system property: "-Dgemfire.TcpServer.READ_TIMEOUT"
This property was introduced in GemFire 9.0 and sets a timeout on the SSL handshake in the locator's TcpServer. The default value is 60 seconds (60 * 1000 milliseconds).
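
Both settings are ordinary JVM system properties, so they must be present on the locator's JVM command line at startup (when the locator is started through gfsh, JVM arguments are typically passed with the --J option, for example --J=-Dgemfire.TcpServer.READ_TIMEOUT=30000). The sketch below only illustrates how such -D values and their documented defaults resolve inside a JVM. The class name TcpServerPropsSketch and the example values 200 and 30000 are assumptions for illustration, not recommendations, and the code is not taken from GemFire.

```java
public class TcpServerPropsSketch {

    /** Resolves the pool size limit, falling back to the documented default of 100. */
    static int maxPoolSize() {
        return Integer.getInteger("gemfire.TcpServer.MAX_POOL_SIZE", 100);
    }

    /** Resolves the read timeout in milliseconds, falling back to the documented default of 60 * 1000. */
    static int readTimeoutMillis() {
        return Integer.getInteger("gemfire.TcpServer.READ_TIMEOUT", 60 * 1000);
    }

    public static void main(String[] args) {
        // Example invocation (values are illustrative only):
        //   java -Dgemfire.TcpServer.MAX_POOL_SIZE=200 \
        //        -Dgemfire.TcpServer.READ_TIMEOUT=30000 TcpServerPropsSketch
        System.out.println("MAX_POOL_SIZE = " + maxPoolSize());
        System.out.println("READ_TIMEOUT  = " + readTimeoutMillis() + " ms");
    }
}
```

Running this class with the same -D flags as the locator is a quick way to confirm that the flags are spelled correctly and parse as integers.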