Long Java Virtual Machine (JVM) pauses can potentially cause members of the GemFire distributed system to be removed from the cluster.
Error Message:
2016-06-30T07:17:16.701-0500: 10899296.854: [GC2016-06-30T07:17:16.701-0500: 10899296.854: [ParNew (promotion failed) Desired survivor size 85878368 bytes, new threshold 1 (max 1) - age 1: 706584 bytes, 706584 total : 839908K->839926K(943744K), 0.2458850 secs]2016-06-30T07:17:16.947-0500: 10899297.100: [CMS: 22365807K->10421671K(30408704K), 31.1687460 secs] 23205326K->10421671K(31352448K), [CMS Perm : 41841K->41830K(262144K)], 31.4150540 secs] [Times: user=31.66 sys=0.00, real=31.41 secs]
Note: We are focusing on the "promotion failed" part of the message. In some cases, the symptom shows "concurrent mode failure" instead of "promotion failed".
The primary symptom, where GemFire removes an unresponsive member from the distributed system, can have many causes. In this case, it is important to use the GC logging output to identify the promotion failure.
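To spot these events without reading the entire log by hand, you can scan the GC log for the failure markers and pull out the reported pause time. The following is a minimal Java sketch; the log file path, class name, and the regular expression for the "real=" pause time are illustrative assumptions rather than GemFire tooling.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcFailureScan {
    // Matches the "real=31.41 secs" portion of a ParNew/CMS log entry.
    private static final Pattern REAL_TIME = Pattern.compile("real=([0-9.]+) secs");

    public static void main(String[] args) throws IOException {
        String gcLog = args.length > 0 ? args[0] : "gc.log"; // log path is an assumption
        Files.lines(Paths.get(gcLog))
             .filter(line -> line.contains("promotion failed")
                          || line.contains("concurrent mode failure"))
             .forEach(line -> {
                 Matcher m = REAL_TIME.matcher(line);
                 String pause = m.find() ? m.group(1) + "s" : "unknown";
                 System.out.println("Old generation failure detected, real pause = " + pause);
             });
    }
}

Run against the log entry shown above, this would report the roughly 31-second real pause alongside the "promotion failed" event.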
In the error message above, you can see the "promotion failed" component of the message, as well as the real and user time consumed by the given GC (over 31 seconds each). Such high pause times are indicative of promotion or concurrent mode failures. Various settings make fragmentation issues more likely, including the following:
Any of the following options can be used to reduce the likelihood of heap fragmentation impacting your GemFire cluster:
There are various flags that can be used for GC logging to provide more detail in the logs and help diagnose issues. These details can prove very helpful and add confidence that the issue has been diagnosed correctly. More importantly, some flags can provide early warning that fragmentation is increasing in the JVM, which can help prevent the unplanned removal of a GemFire node from the cluster. Specifically, consider incorporating the following flags into your heap/GC configuration:
-XX:PrintFLSStatistics=2 -XX:+CMSDumpAtPromotionFailure
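Because these flags are usually added through startup scripts or wrapper configuration, it can be worth confirming at runtime that they actually reached the JVM. Below is a minimal sketch using the standard RuntimeMXBean; the class name is an illustrative assumption.

import java.lang.management.ManagementFactory;
import java.util.List;

public class GcFlagCheck {
    public static void main(String[] args) {
        // The input arguments include any -XX: options passed on the java command line.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        boolean flsStats = jvmArgs.stream()
                                  .anyMatch(arg -> arg.startsWith("-XX:PrintFLSStatistics"));
        boolean dumpAtFailure = jvmArgs.contains("-XX:+CMSDumpAtPromotionFailure");

        System.out.println("PrintFLSStatistics enabled: " + flsStats);
        System.out.println("CMSDumpAtPromotionFailure enabled: " + dumpAtFailure);
    }
}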
When using the PrintFLSStatistics option, you will find output similar to the following in your GC log files:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 382153298
Max Chunk Size: 382064598
Number of Blocks: 28
Av. Block Size: 13648332
Tree Height: 8
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 382153298
Max Chunk Size: 382064598
Number of Blocks: 28
Av. Block Size: 13648332
Tree Height: 8
Such output, if monitored proactively, can provide insight into when your tenured heap is becoming increasingly fragmented. The warning sign is a maximum chunk size that continues to shrink toward the amount of memory that might be promoted in a single GC, which is roughly the maximum survivor space size.
If, over time, the maximum chunk size available in the tenured heap decreases to something like 10 times the maximum survivor space size, a planned event to defragment the heap may be warranted, such as a bounce of the GemFire member during a planned maintenance window.
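One way to act on that rule of thumb is to scan the GC log for the Max Chunk Size values emitted by PrintFLSStatistics and flag any value that falls below ten times the maximum survivor space size. The following is a minimal sketch; the log path is an assumption, the survivor size is taken from the example log above, and in practice you would derive it from your own heap settings and feed the warning into your monitoring system.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmentationWatch {
    // "Max Chunk Size" lines emitted by -XX:PrintFLSStatistics=2 (whitespace may vary).
    private static final Pattern MAX_CHUNK = Pattern.compile("Max\\s+Chunk\\s+Size:\\s*(\\d+)");

    public static void main(String[] args) throws IOException {
        String gcLog = args.length > 0 ? args[0] : "gc.log";  // log path is an assumption
        long maxSurvivorBytes = 85_878_368L;                   // survivor size from the example log above
        long warnThreshold = 10 * maxSurvivorBytes;            // the 10x rule of thumb described here

        Files.lines(Paths.get(gcLog))
             .map(MAX_CHUNK::matcher)
             .filter(Matcher::find)
             .mapToLong(m -> Long.parseLong(m.group(1)))
             .filter(chunk -> chunk < warnThreshold)
             .forEach(chunk -> System.out.println(
                 "WARNING: tenured max chunk size " + chunk
                 + " bytes is below 10x the survivor size; consider a planned bounce."));
    }
}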
Of course, all of these recommendations require testing in your development and lab environments prior to use in production.