A common misconception is that the CMS collector runs fully concurrently with the application. This is not the case, although its stop-the-world phases are usually very short compared to its concurrent phases. Moreover, while the CMS Collector offers a mostly concurrent solution for old generation GCs, young generation GCs are still handled using a stop-the-world approach. The rationale is that young generation GCs are typically short enough that the resulting pause times are satisfactory even for latency-sensitive applications. Of the garbage collectors available at the time of writing, all use stop-the-world young generation collection, with Azul's Zing JVM being the exception.
When using the CMS Collector in real-world applications, we face two major challenges that may create a need for tuning:
The first challenge is heap fragmentation, which is possible because, unlike the Throughput Collector, the CMS Collector does not contain any mechanism for defragmentation. During the concurrent phases the application keeps running, so CMS pauses happen only during the Initial Mark and Remark phases. While Young Generation collections do not typically cause long pauses, old generation collections can, especially when large heaps are involved. Objects are not moved during a CMS collection, so whole regions of memory are never freed at once and the memory in use can become ‘fragmented’. When objects are collected in Old Space, they leave behind ‘holes’ in memory that can be reused for new objects (promoted from Young Space), provided that the new object fits in the hole. When a smaller object is placed in the hole left by a larger object, the difference in size remains as a small piece of unused memory. Over time, these small pieces accumulate and the amount of contiguous memory available for larger objects decreases. The result is fragmented memory: there may be a significant amount of free memory in total, but it is broken up into a large number of small, non-contiguous pieces.
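The hole mechanics described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the heap is a small array of equally sized slots, allocation is first-fit, and the class and method names are invented for the illustration. It is not how HotSpot's CMS free lists are actually implemented.

```java
// Toy model of a non-compacting old generation.
// Each slot holds the id of the object occupying it; 0 means the slot is free.
// Hypothetical illustration only, not HotSpot's real data structures.
class FragmentationSketch {
    final int[] slots;

    FragmentationSketch(int size) {
        slots = new int[size];
    }

    // First-fit allocation: returns the start index, or -1 if no hole is large enough.
    int allocate(int id, int size) {
        int run = 0;
        for (int i = 0; i < slots.length; i++) {
            run = (slots[i] == 0) ? run + 1 : 0;
            if (run == size) {
                int start = i - size + 1;
                for (int j = start; j <= i; j++) {
                    slots[j] = id;
                }
                return start;
            }
        }
        return -1;
    }

    // "Collect" an object: its slots become a hole, and nothing else is moved.
    void free(int id) {
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] == id) {
                slots[i] = 0;
            }
        }
    }

    // Total number of free slots, contiguous or not.
    int totalFree() {
        int free = 0;
        for (int s : slots) {
            if (s == 0) free++;
        }
        return free;
    }

    // Length of the longest run of contiguous free slots.
    int largestHole() {
        int best = 0;
        int run = 0;
        for (int s : slots) {
            run = (s == 0) ? run + 1 : 0;
            best = Math.max(best, run);
        }
        return best;
    }

    public static void main(String[] args) {
        FragmentationSketch heap = new FragmentationSketch(16);
        // Fill the heap with four objects of four slots each (ids 1 to 4).
        for (int id = 1; id <= 4; id++) {
            heap.allocate(id, 4);
        }
        // Objects 1 and 3 die, leaving two 4-slot holes.
        heap.free(1);
        heap.free(3);
        // Two 3-slot objects are promoted into the holes, stranding one slot each.
        heap.allocate(5, 3);
        heap.allocate(6, 3);
        System.out.println("total free = " + heap.totalFree()
                + ", largest hole = " + heap.largestHole());
        // A 2-slot object no longer fits, although two slots are free in total.
        System.out.println("allocate(7, 2) -> " + heap.allocate(7, 2));
    }
}
```

In this run two slots remain free in total, but the largest hole is a single slot, so a two-slot allocation fails. That is the situation that eventually forces a compacting collection.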
Once objects can no longer be allocated in the Old space, whether because of memory usage or fragmentation, a Full Garbage Collection is triggered to avoid running out of memory. A full collection includes compaction, which pauses the application while objects are being moved. Compaction shifts objects around in memory so that the small pieces of free memory are consolidated into a large, contiguous space. Full GC cycles can interrupt the application for many seconds. In the case of GemFire, this may cause the member where the Full GC is running to be considered ‘unresponsive’ or ‘lost’ by the cluster and to be removed from it. Proper tuning of the GC and GemFire membership parameters, together with proper sizing of the JVM heap, can prevent this from happening.
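What compaction achieves can be sketched on the same kind of toy heap, an array of slots where 0 marks a free slot. The class and method names are invented for the illustration, and the simple sliding strategy shown here is an assumption chosen for brevity, not a description of the JVM's actual full-GC algorithm.

```java
import java.util.Arrays;

// Toy sliding compaction over an array of slots (0 = free, non-zero = object id).
// Hypothetical illustration only; a real collector must also update references.
class CompactionSketch {

    // Slides every live slot toward index 0, preserving order, and
    // returns the size of the single contiguous free area left at the end.
    static int compact(int[] slots) {
        int next = 0; // next position to move a live slot into
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] != 0) {
                slots[next++] = slots[i];
            }
        }
        // Everything from 'next' onward is now free.
        Arrays.fill(slots, next, slots.length, 0);
        return slots.length - next;
    }

    public static void main(String[] args) {
        // A fragmented heap: two free slots, but they are not adjacent.
        int[] heap = {5, 5, 5, 0, 2, 2, 2, 2, 6, 6, 6, 0, 4, 4, 4, 4};
        int contiguousFree = compact(heap);
        System.out.println(Arrays.toString(heap));
        System.out.println("contiguous free slots = " + contiguousFree);
    }
}
```

Every live slot may have to be moved, and the application cannot safely touch objects while they are being relocated. This is why the work is done inside a stop-the-world pause, and why that pause grows with the amount of live data.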
The second challenge is a high object allocation rate in the application. If objects are instantiated faster than the collector removes dead objects from the heap, the concurrent algorithm fails once again. At some point, the old generation will not have enough space available to accommodate an object that is to be promoted from the young generation. This situation is referred to as “concurrent mode failure”, and the JVM reacts just as in the heap fragmentation scenario: it triggers a full GC.
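The race between promotion and concurrent collection can be put into rough numbers. The following is a back-of-the-envelope sketch, not a model of the real collector: it assumes a constant promotion rate, a fixed concurrent cycle duration, and that no space is reclaimed until the cycle finishes. All names and figures are hypothetical. The `occupancyAtStart` parameter loosely corresponds to the old generation occupancy at which a CMS cycle begins (HotSpot exposes this as `-XX:CMSInitiatingOccupancyFraction`).

```java
// Rough estimate of whether a concurrent cycle can finish before the
// old generation fills up. Hypothetical model with simplifying assumptions:
// constant promotion rate, and no memory reclaimed until the cycle ends.
class ConcurrentModeFailureSketch {

    // Returns true if the data promoted during the concurrent cycle would
    // exceed the free space that was available when the cycle started.
    static boolean cycleLosesRace(double oldGenMb,
                                  double occupancyAtStart,
                                  double promotionMbPerSec,
                                  double cycleSeconds) {
        double freeMb = oldGenMb * (1.0 - occupancyAtStart);
        double promotedMb = promotionMbPerSec * cycleSeconds;
        return promotedMb > freeMb;
    }

    public static void main(String[] args) {
        double oldGenMb = 4096;   // assumed old generation size
        double occupancy = 0.70;  // assumed occupancy when the cycle starts
        double cycleSeconds = 10; // assumed duration of one concurrent cycle

        // Moderate promotion rate: 50 MB/s * 10 s = 500 MB, below the ~1229 MB free.
        System.out.println(cycleLosesRace(oldGenMb, occupancy, 50, cycleSeconds));
        // High promotion rate: 150 MB/s * 10 s = 1500 MB, above the ~1229 MB free.
        System.out.println(cycleLosesRace(oldGenMb, occupancy, 150, cycleSeconds));
    }
}
```

Under these assumptions, the options are to start the concurrent cycle earlier (a lower starting occupancy), to enlarge the old generation, or to reduce the promotion rate.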
Using the CMS collector requires slightly more heap memory than the objects themselves are expected to use, and slightly more than slower Old Space GC algorithms need. One reason is that garbage may not be completely identified by the final phase of a CMS cycle, so some unreferenced objects can carry over to the next collection cycle. Another reason is memory fragmentation: memory is not compacted regularly, nor do you want it to be, since compaction causes longer pauses. Typically, for GemFire applications you want to see about 50% overhead in the heap for the Old Generation, but this factor is significantly affected by the longevity of your managed objects and by the read/write operation ratio of your application.
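As a worked example of the sizing guideline, the sketch below interprets "about 50% overhead" as an old generation that is 1.5 times the expected live data. That interpretation is an assumption, as are the method name and the sample figures. Treat the result as a starting point to validate with measurements, not as a recommendation.

```java
// Hypothetical sizing helper. It assumes "overhead" means extra old generation
// capacity on top of the expected live data (old gen = live data * (1 + overhead)).
class OldGenSizingSketch {

    // Returns the suggested old generation size in MB, rounded up.
    static long oldGenMb(long liveDataMb, double overheadFraction) {
        return (long) Math.ceil(liveDataMb * (1.0 + overheadFraction));
    }

    public static void main(String[] args) {
        long liveDataMb = 8192; // assumed long-lived data held by the member
        System.out.println("old generation target: "
                + oldGenMb(liveDataMb, 0.5) + " MB");
    }
}
```

With 8192 MB of long-lived data and 50% overhead, this yields an old generation target of 12288 MB, to which the young generation must still be added to arrive at the total heap size.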