A GemFire cluster may hang under high load when conserve-sockets=true is set. While the cluster is hanging, you may see the following symptoms:
A. Thread stack

"ServerConnection on port 12480 Thread 929" tid=0x8fe (in native)
    java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:51)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    - locked java.lang.Object@241d89aa
    at com.gemstone.gemfire.internal.tcp.Connection.nioWriteFully(Connection.java:3277)
    - locked java.lang.Object@7d36d2bc
    at com.gemstone.gemfire.internal.tcp.Connection.sendPreserialized(Connection.java:2511)
    at com.gemstone.gemfire.internal.tcp.MsgStreamer.realFlush(MsgStreamer.java:317)
    at com.gemstone.gemfire.internal.tcp.MsgStreamer.writeMessage(MsgStreamer.java:245)
    at com.gemstone.gemfire.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:458)
    at com.gemstone.gemfire.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:310)
    at com.gemstone.gemfire.distributed.internal.direct.DirectChannel.send(DirectChannel.java:696)
    at com.gemstone.gemfire.distributed.internal.membership.jgroup.JGroupMembershipManager.directChannelSend(JGroupMembershipManager.java:2844)
    at com.gemstone.gemfire.distributed.internal.membership.jgroup.JGroupMembershipManager.send(JGroupMembershipManager.java:3078)
    at com.gemstone.gemfire.distributed.internal.DistributionChannel.send(DistributionChannel.java:79)
    at com.gemstone.gemfire.distributed.internal.DistributionManager.sendOutgoing(DistributionManager.java:3780)
    at com.gemstone.gemfire.distributed.internal.DistributionManager.sendMessage(DistributionManager.java:3821)
    at com.gemstone.gemfire.distributed.internal.DistributionManager.putOutgoing(DistributionManager.java:1957)
    at com.gemstone.gemfire.internal.cache.partitioned.DestroyMessage.send(DestroyMessage.java:213)
    at com.gemstone.gemfire.internal.cache.PartitionedRegion.destroyRemotely(PartitionedRegion.java:5734)
    at com.gemstone.gemfire.internal.cache.PartitionedRegion.destroyInBucket(PartitionedRegion.java:5552)
    at com.gemstone.gemfire.internal.cache.PartitionedRegionDataView.destroyExistingEntry(PartitionedRegionDataView.java:45)
    at com.gemstone.gemfire.internal.cache.PartitionedRegion.basicDestroy(PartitionedRegion.java:5419)
    at com.gemstone.gemfire.internal.cache.LocalRegion.validatedDestroy(LocalRegion.java:1143)
    at com.gemstone.gemfire.internal.cache.LocalRegion.destroy(LocalRegion.java:1130)
    at com.gemstone.gemfire.internal.cache.AbstractRegion.destroy(AbstractRegion.java:315)
    at com.gemstone.gemfire.internal.cache.LocalRegion.remove(LocalRegion.java:9362)
    ......

"ServerConnection on port 12480 Thread 875" tid=0x8c5 owned by "ServerConnection on port 12480 Thread 929" tid=0x8fe
    java.lang.Thread.State: BLOCKED
    at com.gemstone.gemfire.internal.tcp.Connection.nioWriteFully(Connection.java:3264)
    - blocked on java.lang.Object@7d36d2bc
    at com.gemstone.gemfire.internal.tcp.Connection.sendPreserialized(Connection.java:2511)
    at com.gemstone.gemfire.internal.tcp.MsgStreamer.realFlush(MsgStreamer.java:317)
    at com.gemstone.gemfire.internal.tcp.MsgStreamer.writeMessage(MsgStreamer.java:245)
    at com.gemstone.gemfire.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:458)
    at com.gemstone.gemfire.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:310)
    at com.gemstone.gemfire.distributed.internal.direct.DirectChannel.send(DirectChannel.java:696)
    at com.gemstone.gemfire.distributed.internal.membership.jgroup.JGroupMembershipManager.directChannelSend(JGroupMembershipManager.java:2844)
    at com.gemstone.gemfire.distributed.internal.membership.jgroup.JGroupMembershipManager.send(JGroupMembershipManager.java:3078)
    at com.gemstone.gemfire.distributed.internal.DistributionChannel.send(DistributionChannel.java:79)
    at com.gemstone.gemfire.distributed.internal.DistributionManager.sendOutgoing(DistributionManager.java:3780)
    at com.gemstone.gemfire.distributed.internal.DistributionManager.sendMessage(DistributionManager.java:3821)
    at com.gemstone.gemfire.distributed.internal.DistributionManager.putOutgoing(DistributionManager.java:1957)
    at com.gemstone.gemfire.internal.cache.partitioned.DestroyMessage.send(DestroyMessage.java:213)
    at com.gemstone.gemfire.internal.cache.PartitionedRegion.destroyRemotely(PartitionedRegion.java:5734)
    at com.gemstone.gemfire.internal.cache.PartitionedRegion.destroyInBucket(PartitionedRegion.java:5552)
    at com.gemstone.gemfire.internal.cache.PartitionedRegionDataView.destroyExistingEntry(PartitionedRegionDataView.java:45)
    at com.gemstone.gemfire.internal.cache.PartitionedRegion.basicDestroy(PartitionedRegion.java:5419)
    at com.gemstone.gemfire.internal.cache.LocalRegion.validatedDestroy(LocalRegion.java:1143)
    at com.gemstone.gemfire.internal.cache.LocalRegion.destroy(LocalRegion.java:1130)
    at com.gemstone.gemfire.internal.cache.AbstractRegion.destroy(AbstractRegion.java:315)
    at com.gemstone.gemfire.internal.cache.LocalRegion.remove(LocalRegion.java:9362)
    ......

"ServerConnection on port 12480 Thread 873" tid=0x8c3 owned by "ServerConnection on port 12480 Thread 929" tid=0x8fe
    java.lang.Thread.State: BLOCKED
    at com.gemstone.gemfire.internal.tcp.Connection.nioWriteFully(Connection.java:3264)
    - blocked on java.lang.Object@7d36d2bc
    at com.gemstone.gemfire.internal.tcp.Connection.sendPreserialized(Connection.java:2511)
    ......

"ServerConnection on port 12480 Thread 1394" tid=0xaee owned by "ServerConnection on port 12480 Thread 929" tid=0x8fe
    java.lang.Thread.State: BLOCKED
    at com.gemstone.gemfire.internal.tcp.Connection.nioWriteFully(Connection.java:3264)
    - blocked on java.lang.Object@7d36d2bc
    at com.gemstone.gemfire.internal.tcp.Connection.sendPreserialized(Connection.java:2511)
    ......

"PartitionedRegion Message Processor105" tid=0x768 owned by "ServerConnection on port 12480 Thread 929" tid=0x8fe
    java.lang.Thread.State: BLOCKED
    at com.gemstone.gemfire.internal.tcp.Connection.nioWriteFully(Connection.java:3264)
    - blocked on java.lang.Object@7d36d2bc
    at com.gemstone.gemfire.internal.tcp.Connection.sendPreserialized(Connection.java:2511)
    ......

B. The cacheserver log file contains many messages like the ones below:

[warn 2017/03/15 19:38:40.072 CST tid=0x4f4] 15 seconds have elapsed while waiting for replies: <PutMessage$PutResponse 2569 waiting for 1 replies from [......]
[warn 2017/03/15 19:38:40.072 CST tid=0x4bc] 15 seconds have elapsed while waiting for replies: <GetMessage$GetResponse 2571 waiting for 1 replies from [......]
[warn 2017/03/15 19:38:41.564 CST tid=0x5e8] 15 seconds have elapsed while waiting for replies: <com.gemstone.gemfire.internal.cache.PartitionedRegionQueryEvaluator$StreamingQueryPartitionResponse 2588 waiting for 1 replies from [......]

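To check a larger thread dump for the pattern in symptom A, it can help to count how many threads are blocked on each monitor. The following is a minimal sketch, not a GemFire tool: the class name BlockedMonitorCounter and its method are hypothetical, and it assumes the dump uses the "- blocked on <monitor>" wording shown in symptom A (standard jstack output words this differently).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockedMonitorCounter {

    // Assumption: blocked frames are reported as "- blocked on <monitor>",
    // as in the dump excerpt in symptom A.
    private static final Pattern BLOCKED_ON = Pattern.compile("- blocked on (\\S+)");

    /** Returns, for each monitor, how many "blocked on" entries refer to it. */
    static Map<String, Integer> countBlockedByMonitor(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = BLOCKED_ON.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample built from two of the BLOCKED threads in symptom A.
        String dump = String.join("\n",
            "\"ServerConnection on port 12480 Thread 875\" tid=0x8c5",
            "    java.lang.Thread.State: BLOCKED",
            "    - blocked on java.lang.Object@7d36d2bc",
            "\"ServerConnection on port 12480 Thread 873\" tid=0x8c3",
            "    java.lang.Thread.State: BLOCKED",
            "    - blocked on java.lang.Object@7d36d2bc");
        System.out.println(countBlockedByMonitor(dump));
    }
}
```

A single monitor with a high count, held by one thread that is RUNNABLE inside a socket write, matches the pattern shown in symptom A.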
The thread dump and log messages above show that the GemFire cluster is stuck at a synchronization point in Connection.nioWriteFully on a peer-to-peer connection. Thread 929 holds the lock java.lang.Object@7d36d2bc while it sits in a native socket write, and the other threads are BLOCKED waiting for that same lock. The blocking occurs because application threads share sockets when conserve-sockets=true.
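The contention pattern can be reproduced in miniature outside GemFire. The sketch below is an illustration only and does not use GemFire code: SharedConnection, writeFully and countBlockedWriters are hypothetical names, and a latch stands in for a socket write that cannot make progress. One thread takes the write lock and stalls, and every other thread that tries to write on the same shared connection ends up in the BLOCKED state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SharedSocketContentionSketch {

    /** Hypothetical stand-in for one peer-to-peer connection shared by all threads. */
    static final class SharedConnection {
        private final Object writeLock = new Object();

        /** Holds the write lock until 'drained' is released, like a write that cannot complete. */
        void writeFully(CountDownLatch lockAcquired, CountDownLatch drained) {
            synchronized (writeLock) {
                lockAcquired.countDown();
                try {
                    drained.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

    /** Starts one stalled writer plus 'waiters' more and returns how many were seen BLOCKED. */
    static int countBlockedWriters(int waiters) {
        SharedConnection conn = new SharedConnection();
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch drained = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();
        int blocked = 0;
        try {
            new Thread(() -> conn.writeFully(holderHasLock, drained), "holder").start();
            holderHasLock.await(); // the holder now owns the write lock

            for (int i = 0; i < waiters; i++) {
                Thread t = new Thread(
                    () -> conn.writeFully(new CountDownLatch(1), drained), "waiter-" + i);
                t.start();
                threads.add(t);
            }

            // Poll for up to 5 seconds until every waiter is blocked on the lock.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (System.nanoTime() - deadline < 0) {
                blocked = 0;
                for (Thread t : threads) {
                    if (t.getState() == Thread.State.BLOCKED) {
                        blocked++;
                    }
                }
                if (blocked == waiters) {
                    break;
                }
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            drained.countDown(); // let the stalled write finish so all threads can exit
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("Blocked writers: " + countBlockedWriters(4));
    }
}
```

In this sketch the waiters stay blocked only as long as the first write is stalled, which mirrors how a single slow write on a shared socket can hold up every thread that needs the same connection.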