Some Tanzu GemFire cache members fail to upgrade to the latest version of Tanzu GemFire after multiple failed errand attempts. The failing cache member does not come up because the version of its disk store files does not match the latest Tanzu GemFire version. The typical exception is UnsupportedVersionException.
Before following this recovery process, make sure that every region is configured for disk persistence and that each partitioned region's redundancy (redundant copies) is set to at least 1. For replicated regions, the data is already replicated across all cache members.
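To verify these prerequisites, you can inspect each region from gfsh; the locator address and region name below are placeholders:

gfsh> connect --locator=<locator-host>[<locator-port>]
gfsh> list regions
gfsh> describe region --name=/example-region

In the describe region output, confirm that the region is persistent (a disk store is configured) and, for partitioned regions, that redundant-copies is at least 1.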
gfsh> show missing-disk-stores
8. Rebalance the data using the command below:
gfsh> rebalance
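If you want to preview the effect of the rebalance first, gfsh's rebalance command accepts a --simulate option; a minimal sketch:

gfsh> rebalance --simulate
gfsh> rebalance

The simulated run reports what would be transferred without actually moving any data.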
When a cache member tries to recover its data from its disk store files, it may find that its data was never part of the same distributed system as the currently running cache members. This can happen after an incomplete upgrade or because of disk store file corruption. The cache member typically throws a ConflictingPersistentDataException with a message similar to the following:
“org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region /abc remote member 11.111.11.11(server1:103101)<v2>:10040 with persistent data /11.111.11.11:/gemfire/pivotal/diskstores/ds009/defaultgrp/pdx created at timestamp 1564761134245 version 0 diskStoreId 96c17baa79d44608-b9face0cc87b1597 name server1 was not part of the same distributed system as the local data from /22.222.22.22:/gemfire/pivotal/diskstores/xxx/default123/pdx created at timestamp 1554472286294 version 0 diskStoreId bc8137967477435e-83499e76b5537e18 name server2”
This indicates that server2 probably came up and was taken down again while server1 was down, and that server1 then came up while server2 was down. When server2 is started again, it sees members with persistent data (server1) that were not part of the cluster the last time server2 was up. It is essentially a network partition across "time".
Note: Because of the nature of this issue, there is no easy way to recover from this error; human intervention is required, and there is a chance of data loss.
To recover from this type of situation, identify the most "correct" data by comparing the timestamps in the Tanzu GemFire server logs and disk stores. You may need a utility such as date -r on Unix to convert the epoch timestamps into human-readable date-time values. Once you have identified the "correct" data, move aside or delete the persistent files of the members holding the other data, then monit stop and monit start those members. If multiple members are involved, make sure that all of them start up fully.
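For example, the timestamps in the exception above are epoch milliseconds; drop the last three digits to get seconds and convert them with date:

# GNU/Linux (e.g. on BOSH stemcells):
date -d @1564761134
# BSD/macOS:
date -r 1564761134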
If the "correct" data cannot be identified, recover the data from backup files or exported files instead. To do that, stop all the members with monit stop, remove all the persistent disk stores from all the members, start the members again, and then import the data.
Note: All the partitioned and replicated region data is recovered from other running cache members.
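If the data was previously exported with gfsh export data, a sketch of the import step; the region name, file path, and member name are placeholders:

gfsh> connect --locator=<locator-host>[<locator-port>]
gfsh> import data --region=/example-region --file=/var/vcap/store/exports/example-region.gfd --member=server1

Repeat the import for each region and exported file.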
When cache members fail with a 'missing disk stores' error while starting up, or when missing disk stores appear in the output of 'show missing-disk-stores', follow the procedure below to recover the failing cache member (a consolidated command sketch follows these steps).
1. If the cache member is down, start it with the command: monit start gemfire-server.
2. If the cache member still fails with the 'missing disk stores' issue after the restart, continue with the steps below.
3. Revoke the missing disk stores using the command below:
gfsh> revoke missing-disk-store --id=<disk-store-id>
Note: Missing disk store messages can sometimes be misleading. Before revoking any disk stores, make sure that all members of the cluster are in a stable state.
4. On the failing cache member, remove missing disk store files under:
/var/vcap/store/gemfire-server/<disk-store-files>
5. Start the cache member with the command: monit start gemfire-server. All partitioned and replicated region data is recovered from the other running cache members.
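Putting the steps above together, a minimal sketch; the disk store ID and file names come from your own 'show missing-disk-stores' output, and the backup directory is hypothetical:

# Step 1: try a plain restart of the failing member
monit start gemfire-server
# Steps 2-3: if it still reports missing disk stores, revoke them from gfsh
gfsh> show missing-disk-stores
gfsh> revoke missing-disk-store --id=<disk-store-id>
# Step 4: on the failing member, move the stale disk store files aside
mkdir -p /var/vcap/store/gemfire-server/old-diskstores
mv /var/vcap/store/gemfire-server/<disk-store-files> /var/vcap/store/gemfire-server/old-diskstores/
# Step 5: start the member again; the data is rebuilt from the other members
monit start gemfire-server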
An 'Out of Memory' error occurs when a cache member cannot find free memory in the Java heap. This can happen for a variety of reasons. The locator may also throw an 'Out of Memory' error under similar conditions.
To recover temporarily from an 'Out of Memory' error, restart the affected gemfire-server or gemfire-locator process, and scale out the cluster if the majority of cache members are beyond the critical heap-usage threshold (usually 95%).
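As an illustration, assuming the num_servers parameter is supported by your Tanzu GemFire plan (the service instance name and server count are placeholders):

# On the affected VM, restart the failed process
monit start gemfire-server
# From the cf CLI, scale out the cluster if most members are near the critical heap threshold
cf update-service SERVICE_INSTANCE -c '{"num_servers": 6}'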
Note: This problem may cause some of the other failures discussed on this page.
With proper monitoring of JVM usage in place, analyze the underlying heap usage or spikes together with the cache member statistics and logs, and take appropriate action to prevent such errors.
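As part of that monitoring, heap and JVM statistics for a member can be checked from gfsh; the member name is a placeholder:

gfsh> show metrics --member=server1

The output includes the member and JVM categories (heap size, thread counts, and so on).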
In cases of infrastructure failure, such as network or disk-drive problems or other infrastructure-related issues, one or more cache members may fail to start up.
If only one cache member is failing, restarting that cache member should bring the cluster back to a healthy state.
1. Fix the underlying infrastructure issues.
2. Depending on the issue type and its impact on the VMware Tanzu GemFire cluster, reboot the VMware Tanzu GemFire instance using bosh stop/start or monit stop all.
3. Start the cache member with the command: monit start gemfire-locator or monit start gemfire-server, as appropriate (a command sketch follows these steps).
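A minimal sketch of this single-member recovery, with placeholder deployment and instance names; on the VM, run monit as root (it is installed under /var/vcap/bosh/bin on BOSH stemcells):

# From the operator workstation, SSH to the failing VM
bosh -d <service-instance-deployment> ssh <failing-instance>
# On the VM, check which processes monit reports as not running
monit summary
# Start the failed process (locator or server, as appropriate)
monit start gemfire-server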
1. Fix the underlying infrastructure issues.
2. Stop running cache members with the command: monit stop gemfire-server.
3. Stop running locators with the command: monit stop gemfire-locator.
4. Depending on the issue type and its impact on the VMware Tanzu GemFire cluster, reboot the VMware Tanzu GemFire instance(s) using bosh stop/start.
5. Make sure that the locators and cache members are shut down.
6. Start all locators with the command: monit start gemfire-locator.
7. Start the cache members with the command: monit start gemfire-server (see the consolidated sketch below).
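A consolidated sketch of steps 2-7, with a placeholder deployment name; monit commands run as root on each VM and bosh commands run from the operator workstation:

# Steps 2-3: on every VM, stop the cache member and locator processes
monit stop gemfire-server
monit stop gemfire-locator
# Step 4: if a VM-level restart is needed, stop and start the deployment
bosh -d <service-instance-deployment> stop
bosh -d <service-instance-deployment> start
# Step 5: confirm that nothing is still running
monit summary
# Steps 6-7: bring the cluster back, locators first, then servers
monit start gemfire-locator
monit start gemfire-server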
Internal Server Error","message":"Unable to connect to any locators in the list [LocatorAddress [socketInetAddress=00aaa000-000a-00a0-0000-b00a0000000.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh/xx.x.x.xx:55221, hostname=50acc404-686f-48b6-8990-b08d6e4d0f8a.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh, isIpString=false], LocatorAddress [socketInetAddress=xx.x.x.xx/xx.x.x.xx:55221, hostname=xx.x.x.xx, isIpString=true]]; nested exception is org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [LocatorAddress [socketInetAddress=55ccc444-666f-44b4-9990-b08d6e4d0f8a.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh/xx.x.x.xx:55221, hostname=50acc404-686f-48b6-8990-b08d6e4d0f8a.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh, isIpString=false], LocatorAddress [socketInetAddress=xx.x.x.xx/xx.x.x.xx:55221, hostname=xx.x.x.xx, isIpString=true]]","path":"/brd/adhocUpdate"}
Most of the time, recreating the service instance fixes the issue; otherwise, a support ticket may be necessary. This is particularly true for locator failures in which the exception below is seen and the GemFire locator fails to start.
Exception in thread "main" java.lang.IllegalStateException: The init file "./BACKUPDEFAULT.if" does not exist
or
Exception in thread "main" java.lang.IllegalStateException: The <file> file "<.crf>" or "<.drf>" does not exist
The locator's disk store files are located under:
/var/vcap/store/gemfire-locator/ConfigDiskDir_locator-*
cf update-service may also fail with the error: Unable to render templates for job 'gemfire-locator'.
monit start gemfire-locator
cf update-service SERVICE_INSTANCE
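A minimal recovery sketch, assuming the intended sequence here is to move the stale locator disk store files aside, start the locator, and then re-run the service update; the backup directory is hypothetical:

# On the locator VM, move the stale locator disk store files aside
mkdir -p /var/vcap/store/gemfire-locator/old-configdiskdir
mv /var/vcap/store/gemfire-locator/ConfigDiskDir_locator-* /var/vcap/store/gemfire-locator/old-configdiskdir/
# Restart the locator process
monit start gemfire-locator
# From the cf CLI, re-run the service update
cf update-service SERVICE_INSTANCE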