Node Fails to Start due to EOFException when Requesting Configuration from the Locator

Products

VMware Tanzu Gemfire

Issue/Introduction

Symptoms:

A GemFire peer may exit immediately on startup with the following exception in its logs, standard output, or, when running under Tomcat, the "catalina.out" file:

java.io.EOFException: Locator at 10.0.0.11(locator2:25967):10334 did not respond. This is normal if the locator was shutdown. If it wasn't check its log for exceptions.
at com.gemstone.org.jgroups.stack.tcpserver.TcpClient.requestToServer(TcpClient.java:125)
at com.gemstone.org.jgroups.stack.tcpserver.TcpClient.requestToServer(TcpClient.java:78)
at com.gemstone.gemfire.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:171)
at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.requestAndApplySharedConfiguration(GemFireCacheImpl.java:868)
at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.init(GemFireCacheImpl.java:1019)
at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:684)
at com.gemstone.gemfire.cache.CacheFactory.create(CacheFactory.java:182)
at com.gemstone.gemfire.cache.CacheFactory.create(CacheFactory.java:229)

Investigating the relevant locator logs reveals the following message:

[error 2017/12/07 22:09:48.877 UTC tid=0x3f] Error occurred while initializing cluster configuration
java.lang.Exception: Error occurred while initializing cluster configuration

This indicates that the actual issue is with the cluster configuration service on the locator.

Environment

Cause

There are a number of possible reasons for the cluster configuration service to fail to start, but we'll focus on one particular case. The cluster configuration service relies on a shared region called the _ConfigurationRegion. Any failure to create this region will cause the configuration service to not come up. In particular, issues with the disk store, such as a ConflictingPersistentDataException:

[info 2017/12/07 22:09:48.861 UTC tid=0x3f] Initializing region _ConfigurationRegion

[warning 2017/12/07 22:09:48.867 UTC tid=0x3f] Initialization failed for Region /_ConfigurationRegion
com.gemstone.gemfire.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion remote member 10.0.0.10(16303:locator):29734 with persistent data /10.0.0.10:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1512684585365 version 0 diskStoreId 3aba183676bc4730-99cf92cee56428a9 name was not part of the same distributed system as the local data from /10.0.0.11:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1477323218275 version 0 diskStoreId d73a8b008ea14ed1-8f53d86370cc5ab5 name 
at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:714)
at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:786)
at com.gemstone.gemfire.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:47)
at com.gemstone.gemfire.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1381)
at com.gemstone.gemfire.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1209)
at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:2994)
at com.gemstone.gemfire.distributed.internal.SharedConfiguration.getConfigurationRegion(SharedConfiguration.java:635)
at com.gemstone.gemfire.distributed.internal.SharedConfiguration.initSharedConfiguration(SharedConfiguration.java:272)
at com.gemstone.gemfire.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:726)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:693)
at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:964)
at java.lang.Thread.run(Thread.java:745)

[error 2017/12/07 22:09:48.877 UTC tid=0x3f] Error occurred while initializing cluster configuration
java.lang.Exception: Error occurred while initializing cluster configuration
at com.gemstone.gemfire.distributed.internal.SharedConfiguration.getConfigurationRegion(SharedConfiguration.java:646)
at com.gemstone.gemfire.distributed.internal.SharedConfiguration.initSharedConfiguration(SharedConfiguration.java:272)
at com.gemstone.gemfire.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:726)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:693)
at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:964)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.gemstone.gemfire.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion remote member 10.0.0.10(16303:locator):29734 with persistent data /10.0.0.10:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1512684585365 version 0 diskStoreId 3aba183676bc4730-99cf92cee56428a9 name was not part of the same distributed system as the local data from /10.0.0.11:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1477323218275 version 0 diskStoreId d73a8b008ea14ed1-8f53d86370cc5ab5 name 
at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:714)
at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:786)
at com.gemstone.gemfire.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:47)
at com.gemstone.gemfire.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1381)
at com.gemstone.gemfire.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1209)
at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:2994)
at com.gemstone.gemfire.distributed.internal.SharedConfiguration.getConfigurationRegion(SharedConfiguration.java:635)
... 9 more

Resolution

In this case, the running locator, 10.0.0.10(16303: locator):29734, will have more current data, which this locator will recover from. Hence, the solution is to shut down the problematic locator, remove its persistence files, restart it, and allow it to recover the _ConfigurationRegion from the healthy locator.

After doing so, the logs of the restarted locator should indicate that the region was successfully recovered from the other locater and the cluster configuration service started. Something like:

[info 2017/12/07 22:09:45.350 UTC tid=0x7a] Region /_ConfigurationRegion requesting initial image from 10.0.0.10(16303:locator):29734

[info 2017/12/07 22:09:45.360 UTC tid=0x7a] /_ConfigurationRegion is done getting image from 10.0.0.10(16303:locator):29734. isDeltaGII is false

[info 2017/12/07 22:09:45.375 UTC tid=0x7a] Initialization of region _ConfigurationRegion completed

[info 2017/12/07 22:09:45.391 UTC tid=0x7a] Completed writing the shared configuration to 'cluster_config' directory

[info 2017/12/07 22:09:45.391 UTC tid=0x7a] Getting Jar files from other locators

[info 2017/12/07 22:09:45.411 UTC tid=0x7a] No deployed jars found on other locators.

[info 2017/12/07 22:09:45.415 UTC tid=0x7a] ConfigRequestHandler installed

[info 2017/12/07 22:09:45.416 UTC tid=0x7a] Cluster configuration service start up completed successfully and is now running ....