If a member of your distributed system has been forced out because it was unresponsive, this article may apply to your environment and help you eliminate such incidents. If you have seen the following log message anywhere in your environment, consider the topics in this article to proactively manage and eliminate such issues:
[severe 2015/08/26 22:26:04.725 UTC gemfire-node-49001 tid=0x3ba] Membership service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system. Reason='did not respond to are-you-dead messages'
If you are seeing Full GCs in your GC logs, you may also want to consider the guidelines in this article. Even if members are not being forced out of the distributed system, the following log message anywhere in your logs indicates that you are at risk of GC-related issues:
[warning 2015/08/31 14:39:07.617 CET <Thread-6 StatSampler> tid=0x2e] Statistics sampling thread detected a wakeup delay of 6,073 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
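If you are not already capturing GC logs, the following sketch shows one way to enable them when starting a GemFire server through gfsh. The server name and log path are hypothetical, and these flags apply to the HotSpot JVMs (Java 6 through 8) typically used with GemFire 7 and later:

start server --name=server1 --J=-Xloggc:/var/log/gemfire/server1-gc.log --J=-XX:+PrintGCDetails --J=-XX:+PrintGCDateStamps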
Your total heap size is obviously a key component of your heap management. Hopefully, you have sized the heap to store all the data your application places in the cache, with the recommended overhead to handle bursts and the failure of other nodes. If you do not have enough capacity for all the data you will be placing into the cache, you may run into various issues such as GC thrashing, excessive eviction (if configured), or surpassing the critical-threshold, which must be avoided in any healthy environment.
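If you use the GemFire resource manager to protect the heap, the eviction and critical thresholds can be set when starting a server. A minimal sketch, assuming a server named server1 and illustrative (not prescriptive) percentages:

start server --name=server1 --eviction-heap-percentage=75 --critical-heap-percentage=90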
Thus, having enough total heap capacity is key, but having too much can also be an issue. Always test to see how much of your tenured heap capacity is consumed under the expected full production load. If consumption is considerably below 40%, consider lowering your -Xmx and -Xms settings; an unnecessarily large heap has issues of its own.
Please note that -Xms should always equal -Xmx in your environment. When they differ, the JVM may grow or shrink the heap, and that resizing can cause a Full GC, which in turn can cause the GemFire distributed system to force out an otherwise healthy member.
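For example, to pin the heap at 8 GB (a hypothetical size; substitute the value you have validated through testing) when starting a server through gfsh:

start server --name=server1 --J=-Xms8g --J=-Xmx8g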
The NewSize setting is often a key cause of GC and overall heap related issues in GemFire environments. The goal is to find the setting that is just right, rather than too big or too small. If your NewSize is too big, consuming too high a percentage of the total heap, you are subject to a number of issues, including the following.
The primary goal of your heap should be to store all of the data you want in your cache. If you make your Eden space (NewSize) too big, you take space away from your old generation (tenured) space, and therefore limit the number of long-lived objects that can be stored in the cache. If you have eviction configured, eviction can prevent overrunning your total heap, but at the expense of no longer having the data you really want in your cache. More importantly, you make your environment very susceptible to concurrent mode failures, where the system cannot promote objects from the survivor space to your tenured space. This can be due to fragmentation of your tenured heap space combined with the size of the objects being promoted: each promotion needs a contiguous chunk of free tenured heap, so if the tenured space is fragmented, a large enough chunk may not be available, and this drives a Full GC. That Full GC will almost always cause the GemFire distributed system to force out the member, which becomes unresponsive for its duration.
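With the CMS collector commonly used with GemFire in this era, one common defensive measure against concurrent mode failures is to start the concurrent cycle earlier, so the tenured space does not fill and fragment before CMS can reclaim it. A sketch, with an illustrative occupancy value that you should tune for your own workload:

start server --name=server1 --J=-XX:+UseConcMarkSweepGC --J=-XX:CMSInitiatingOccupancyFraction=60 --J=-XX:+UseCMSInitiatingOccupancyOnly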
Other issues can be created by having your Eden space too small. The rate of creation of new objects is a key component: if you fill your Eden space too quickly, driving ParNew collections very frequently, this too can overwhelm your system. A simple way to diagnose this is to use VSD to examine statistics such as the per-collector collections and collectionTime values (VMGCStats, for example for the ParNew collector) and the per-pool currentUsedMemory values (VMMemoryPoolStats, for example for the Eden space).
Some general guidelines follow. You should rarely, if ever, see more than one ParNew collection per second. If you do, your NewSize is almost certainly too small and should be increased until no sampled interval shows more than one collection per second. ParNew collection times exceeding 1000 ms (1 second) should also be investigated to establish exactly why your collections take so long; a very large NewSize can produce such times because of all the surviving objects each collection must copy and scan. A very small NewSize has issues of its own: objects get promoted into tenured space too quickly, which can leave objects in Eden referencing objects in tenured space, and vice versa, and these cross-generation references are not optimal. Finding a balance is key, which is why fine tuning your heap configuration is highly recommended.
One way to find the correct NewSize is to measure the rate of memory increase in your Eden space using VSD. Look at your patterns of usage and determine the rate at which heap is consumed by new objects. Perhaps you are filling Eden at 100 MB/second; in that case, a NewSize of 200 MB or 300 MB may prove insufficient. Look at the collections per second as suggested earlier, and if you are doing more than one per second, increase NewSize.
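As a worked sketch, assume (hypothetically) a 100 MB/s Eden fill rate, a target of at most one ParNew collection every 4 seconds, and the default SurvivorRatio of 8 (Eden is 8 times the size of each of the two survivor spaces):

required Eden = 100 MB/s x 4 s = 400 MB
NewSize = Eden x (1 + 2/8) = 400 MB x 1.25 = 500 MB

This would translate to --J=-XX:NewSize=500m --J=-XX:MaxNewSize=500m. Treat the result only as a starting point, and verify it against the ParNew collection rate observed in VSD under production-like load.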
Of course, other factors come into play, such as the survivor ratio, but this is a very good first pass at eliminating issues in your environment. Simple changes such as those addressed in this article have eliminated many issues in customer environments over the past few years.
See this article for details on using VSD to examine some of the above statistics.
GemFire 7 and later