A persistent region takes an unexpectedly long time to initialize

Products

VMware Tanzu GemFire

Issue/Introduction

Symptoms:

If you create a persistent region with indexes through cache.xml, initializing the region can take unexpectedly long (more than one hour in some cases) after its entries have been recovered from the disk stores. You can confirm how long initialization took from the timestamps of cache server log messages like the following. In this example, initializing the ExampleRegion region takes more than five hours:

[info 2018/03/26 15:36:11.123 UTC MyCacheServer1 <main> tid=0x1] Initializing region ExampleRegion
  :
[info 2018/03/26 20:42:32.548 UTC MyCacheServer1 <main> tid=0x1] Initialization of region ExampleRegion completed
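
To measure the gap between the two messages without doing the arithmetic by hand, the timestamps can be parsed and subtracted. The following is a minimal sketch, not part of the product: the class and method names are hypothetical, and it assumes both log lines use the same time zone and the timestamp layout shown above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: computes the time between two cache server log lines.
public class RegionInitDuration {
    // Timestamp layout used in the log lines quoted above.
    private static final DateTimeFormatter LOG_TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss.SSS");
    private static final Pattern TIMESTAMP_IN_LINE =
            Pattern.compile("\\d{4}/\\d{2}/\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}");

    // Extracts the first timestamp found in a log line.
    static LocalDateTime timestampOf(String logLine) {
        Matcher m = TIMESTAMP_IN_LINE.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("No timestamp in: " + logLine);
        }
        return LocalDateTime.parse(m.group(), LOG_TIMESTAMP);
    }

    // Assumes both lines were logged in the same time zone (UTC in the example).
    static Duration elapsed(String startLine, String endLine) {
        return Duration.between(timestampOf(startLine), timestampOf(endLine));
    }

    public static void main(String[] args) {
        String start = "[info 2018/03/26 15:36:11.123 UTC MyCacheServer1 <main> tid=0x1] "
                + "Initializing region ExampleRegion";
        String end = "[info 2018/03/26 20:42:32.548 UTC MyCacheServer1 <main> tid=0x1] "
                + "Initialization of region ExampleRegion completed";
        System.out.println(elapsed(start, end)); // prints PT5H6M21.425S
    }
}
```

For the example messages above, the result is 5 hours, 6 minutes, and 21.425 seconds.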

Environment

Cause

If you take thread dumps while the region is initializing, the main thread shows a stack like the following:

"main" #1 prio=5 os_prio=0 tid=0x000000001a452800 nid=0x17e2 runnable [0x00002b52eb240000]
java.lang.Thread.State: RUNNABLE
at com.gemstone.gemfire.cache.query.internal.index.MemoryIndexStore.getOldKey(MemoryIndexStore.java:246)
at com.gemstone.gemfire.cache.query.internal.index.MemoryIndexStore.basicRemoveMapping(MemoryIndexStore.java:370)
at com.gemstone.gemfire.cache.query.internal.index.MemoryIndexStore.removeMapping(MemoryIndexStore.java:270)
at com.gemstone.gemfire.cache.query.internal.index.CompactRangeIndex$IMQEvaluator.applyProjection(CompactRangeIndex.java:1682)
at com.gemstone.gemfire.cache.query.internal.index.CompactRangeIndex$IMQEvaluator.doNestedIterations(CompactRangeIndex.java:1614)
at com.gemstone.gemfire.cache.query.internal.index.CompactRangeIndex$IMQEvaluator.doNestedIterations(CompactRangeIndex.java:1624)
at com.gemstone.gemfire.cache.query.internal.index.CompactRangeIndex$IMQEvaluator.evaluate(CompactRangeIndex.java:1465)
at com.gemstone.gemfire.cache.query.internal.index.CompactRangeIndex.removeMapping(CompactRangeIndex.java:159)
at com.gemstone.gemfire.cache.query.internal.index.AbstractIndex.removeIndexMapping(AbstractIndex.java:500)
at com.gemstone.gemfire.cache.query.internal.index.IndexManager.processAction(IndexManager.java:1133)
at com.gemstone.gemfire.cache.query.internal.index.IndexManager.updateIndexes(IndexManager.java:989)
at com.gemstone.gemfire.cache.query.internal.index.IndexManager.updateIndexes(IndexManager.java:963)
at com.gemstone.gemfire.internal.cache.AbstractRegionEntry.destroy(AbstractRegionEntry.java:734)
at com.gemstone.gemfire.internal.cache.AbstractRegionMap.destroyEntry(AbstractRegionMap.java:3271)
at com.gemstone.gemfire.internal.cache.AbstractRegionMap.destroy(AbstractRegionMap.java:1540)
- locked <0x0000000c6449bf00> (a com.gemstone.gemfire.internal.cache.VersionedThinDiskRegionEntryHeapStringKey2)
at com.gemstone.gemfire.internal.cache.LocalRegion.mapDestroy(LocalRegion.java:6900)
at com.gemstone.gemfire.internal.cache.LocalRegion.destroyRecoveredEntry(LocalRegion.java:11210)
at com.gemstone.gemfire.internal.cache.DiskRegion$1.handleRegionEntry(DiskRegion.java:304)
- locked <0x0000000c6449bf00> (a com.gemstone.gemfire.internal.cache.VersionedThinDiskRegionEntryHeapStringKey2)
at com.gemstone.gemfire.internal.cache.LocalRegion.foreachRegionEntry(LocalRegion.java:7844)
at com.gemstone.gemfire.internal.cache.DiskRegion.destroyOldTomstones(DiskRegion.java:296)
at com.gemstone.gemfire.internal.cache.DiskRegion.finishInitializeOwner(DiskRegion.java:270)
at com.gemstone.gemfire.internal.cache.DistributedRegion.cleanUpDestroyedTokensAndMarkGIIComplete(DistributedRegion.java:1697)
at com.gemstone.gemfire.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1476)
at com.gemstone.gemfire.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1209)
:

This stack shows that the main thread is iterating over the old tombstones recovered from the disk stores and destroying them one by one, removing the corresponding index mapping for each tombstone as it goes. Both tasks are time consuming, and removing index mappings at this stage of initialization is especially expensive.

Resolution

If you observe this issue, the workaround is to create the indexes for the affected region with a gfsh command or the API after the target members have started, rather than defining them in cache.xml. Because the indexes do not yet exist when the old tombstones are removed, the time-consuming index removal during region initialization is avoided.
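
As a rough illustration of the gfsh route, the commands below create an index once the members are up. They assume the index definition has already been removed from cache.xml. The locator address, the index name (exampleIndex), and the indexed expression (status) are placeholders, not values taken from this article; substitute the ones from your own configuration.

```
gfsh> connect --locator=localhost[10334]
gfsh> create index --name=exampleIndex --expression=status --region=/ExampleRegion
gfsh> list indexes
```

Running list indexes afterwards is a simple way to confirm that the index now exists on the expected members.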