Some Tanzu GemFire cache members fail to upgrade to the latest version of Tanzu GemFire after multiple failed errand attempts. The failing cache member does not come up because the version of its disk store files does not match the latest Tanzu GemFire version. The typical exception is UnsupportedVersionException.
Before following this recovery process, make sure that every region is configured for disk persistence and that each partitioned region's redundancy (redundant copies) is set to at least 1. For replicated regions, the data is already replicated across all cache members.
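To verify these prerequisites, you can inspect each region from gfsh; the locator address and region name below are placeholders:

gfsh> connect --locator=<locator-host>[<locator-port>]
gfsh> list regions
gfsh> describe region --name=/example-region

In the describe region output, confirm that the region is persistent (a disk store is configured) and, for partitioned regions, that redundant-copies is at least 1.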
gfsh> show missing-disk-stores
8. Rebalance the data using the command below:
gfsh> rebalance
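If you want to preview the effect of the rebalance first, gfsh's rebalance command accepts a --simulate option; a minimal sketch:

gfsh> rebalance --simulate
gfsh> rebalance

The simulated run reports what would be transferred without actually moving any data.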
When a cache member tries to recover its data from its disk store files, it may find that its data was never part of the same distributed system as the currently running cache members. This can happen after an incomplete upgrade or because of disk store file corruption. The cache member typically throws a ConflictingPersistentDataException with a message similar to the following:
“org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region /abc remote member 11.111.11.11(server1:103101)<v2>:10040 with persistent data /11.111.11.11:/gemfire/pivotal/diskstores/ds009/defaultgrp/pdx created at timestamp 1564761134245 version 0 diskStoreId 96c17baa79d44608-b9face0cc87b1597 name server1 was not part of the same distributed system as the local data from /22.222.22.22:/gemfire/pivotal/diskstores/xxx/default123/pdx created at timestamp 1554472286294 version 0 diskStoreId bc8137967477435e-83499e76b5537e18 name server2”
This indicates that server2 probably came up and was taken down again while server1 was down, and that server1 then came up while server2 was down. When server2 is started again, it sees members with persistent data (server1) that were not part of the cluster the last time server2 was up. It is essentially a network partition across "time".
Note: Because of the nature of this issue, there is no easy way to recover from this error; human intervention is required, and there is a chance of data loss.
To recover from this type of situation, identify the most "correct" data by comparing the timestamps in the Tanzu GemFire server logs and disk stores. You may need a utility such as date -r on Unix to convert the epoch timestamps into human-readable date-time values. Once you have identified the "correct" data, move aside or delete the persistent files of the members holding the other data, then monit stop and monit start those members. If multiple members are involved, make sure that all of them start up fully.
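For example, the timestamps in the exception above are epoch milliseconds; drop the last three digits to get seconds and convert them with date:

# GNU/Linux (e.g. on BOSH stemcells):
date -d @1564761134
# BSD/macOS:
date -r 1564761134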
If the "correct" data cannot be identified, recover the data from backup files or exported files instead. To do that, stop all the members with monit stop, remove all the persistent disk stores from all the members, start the members again, and then import the data.
Note: All the partitioned and replicated region data is recovered from other running cache members.
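If the data was previously exported with gfsh export data, a sketch of the import step; the region name, file path, and member name are placeholders:

gfsh> connect --locator=<locator-host>[<locator-port>]
gfsh> import data --region=/example-region --file=/var/vcap/store/exports/example-region.gfd --member=server1

Repeat the import for each region and exported file.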
When cache members fail with a 'missing disk stores' error while starting up, or when missing disk stores appear in the output of 'show missing-disk-stores', follow the procedure below to recover the failing cache member (a consolidated command sketch follows these steps).
1. If the cache member is down, start it with the command: monit start gemfire-server.
2. If the cache member still fails with the 'missing disk stores' issue after the restart, continue with the steps below.
3. Revoke the missing disk stores using the command below:
gfsh> revoke missing-disk-store --id=<disk-store-id>
Note: Missing disk store messages can sometimes be misleading. Before revoking any disk stores, make sure that all members of the cluster are in a stable state.
4. On the failing cache member, remove missing disk store files under:
/var/vcap/store/gemfire-server/<disk-store-files>
5. Start the cache member with the command: monit start gemfire-server. All partitioned and replicated region data is recovered from the other running cache members.
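Putting the steps above together, a minimal sketch; the disk store ID and file names come from your own 'show missing-disk-stores' output, and the backup directory is hypothetical:

# Step 1: try a plain restart of the failing member
monit start gemfire-server
# Steps 2-3: if it still reports missing disk stores, revoke them from gfsh
gfsh> show missing-disk-stores
gfsh> revoke missing-disk-store --id=<disk-store-id>
# Step 4: on the failing member, move the stale disk store files aside
mkdir -p /var/vcap/store/gemfire-server/old-diskstores
mv /var/vcap/store/gemfire-server/<disk-store-files> /var/vcap/store/gemfire-server/old-diskstores/
# Step 5: start the member again; the data is rebuilt from the other members
monit start gemfire-server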
An 'Out of Memory' error occurs when a cache member cannot find free memory in the Java heap. This can happen for a variety of reasons. The locator may also throw an 'Out of Memory' error under similar conditions.
To recover temporarily from an 'Out of Memory' error, restart the affected gemfire-server or gemfire-locator process, and scale out the cluster if the majority of cache members are beyond the critical heap-usage threshold (usually 95%).
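As an illustration, assuming the num_servers parameter is supported by your Tanzu GemFire plan (the service instance name and server count are placeholders):

# On the affected VM, restart the failed process
monit start gemfire-server
# From the cf CLI, scale out the cluster if most members are near the critical heap threshold
cf update-service SERVICE_INSTANCE -c '{"num_servers": 6}'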
Note: This problem may cause some of the other failures discussed on this page.
With proper monitoring of JVM usage in place, analyze the underlying heap usage or spikes together with the cache member statistics and logs, and take appropriate action to prevent such errors.
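As part of that monitoring, heap and JVM statistics for a member can be checked from gfsh; the member name is a placeholder:

gfsh> show metrics --member=server1

The output includes the member and JVM categories (heap size, thread counts, and so on).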
In cases of infrastructure failure, such as network or disk-drive problems or other infrastructure-related issues, one or more cache members may fail to start up.
If only one cache member is failing, restarting that cache member should bring the cluster back to a healthy state.
1. Fix the underlying infrastructure issues.
2. Depending on the issue type and its impact on the VMware Tanzu GemFire cluster, reboot the VMware Tanzu GemFire instance using bosh stop/start or monit stop all.
3. Start the cache member with the command: monit start gemfire-locator or monit start gemfire-server, as appropriate (a command sketch follows these steps).
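A minimal sketch of this single-member recovery, with placeholder deployment and instance names; on the VM, run monit as root (it is installed under /var/vcap/bosh/bin on BOSH stemcells):

# From the operator workstation, SSH to the failing VM
bosh -d <service-instance-deployment> ssh <failing-instance>
# On the VM, check which processes monit reports as not running
monit summary
# Start the failed process (locator or server, as appropriate)
monit start gemfire-server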
1. Fix the underlying infrastructure issues.
2. Stop running cache members with the command: monit stop gemfire-server.
3. Stop running locators with the command: monit stop gemfire-locator.
4. Depending on the issue type and its impact on the VMware Tanzu GemFire cluster, reboot the VMware Tanzu GemFire instance(s) using bosh stop/start.
5. Make sure that the locators and cache members are shut down.
6. Start all locators with the command: monit start gemfire-locator.
7. Start the cache members with the command: monit start gemfire-server (see the consolidated sketch below).
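A consolidated sketch of steps 2-7, with a placeholder deployment name; monit commands run as root on each VM and bosh commands run from the operator workstation:

# Steps 2-3: on every VM, stop the cache member and locator processes
monit stop gemfire-server
monit stop gemfire-locator
# Step 4: if a VM-level restart is needed, stop and start the deployment
bosh -d <service-instance-deployment> stop
bosh -d <service-instance-deployment> start
# Step 5: confirm that nothing is still running
monit summary
# Steps 6-7: bring the cluster back, locators first, then servers
monit start gemfire-locator
monit start gemfire-server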
Internal Server Error","message":"Unable to connect to any locators in the list [LocatorAddress [socketInetAddress=00aaa000-000a-00a0-0000-b00a0000000.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh/xx.x.x.xx:55221, hostname=50acc404-686f-48b6-8990-b08d6e4d0f8a.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh, isIpString=false], LocatorAddress [socketInetAddress=xx.x.x.xx/xx.x.x.xx:55221, hostname=xx.x.x.xx, isIpString=true]]; nested exception is org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [LocatorAddress [socketInetAddress=55ccc444-666f-44b4-9990-b08d6e4d0f8a.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh/xx.x.x.xx:55221, hostname=50acc404-686f-48b6-8990-b08d6e4d0f8a.locator-server.services.service-instance-a1111111-a111-1dd1-bb11-2c22bd4c8e0b.bosh, isIpString=false], LocatorAddress [socketInetAddress=xx.x.x.xx/xx.x.x.xx:55221, hostname=xx.x.x.xx, isIpString=true]]","path":"/brd/adhocUpdate"}
Most of the time, recreating the service instance fixes the issue; otherwise, a support ticket may be necessary. This is particularly true for locator failures in which the exception below is seen and the GemFire locator fails to start.
Exception in thread "main" java.lang.IllegalStateException: The init file "./BACKUPDEFAULT.if" does not exist
or
Exception in thread "main" java.lang.IllegalStateException: The <file> file "<.crf>" or "<.drf>" does not exist
The locator's disk store files are located under:
/var/vcap/store/gemfire-locator/ConfigDiskDir_locator-*
cf update-service may also fail with the error: Unable to render templates for job 'gemfire-locator'.
monit start gemfire-locator
cf update-service SERVICE_INSTANCE
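A minimal recovery sketch, assuming the intended sequence here is to move the stale locator disk store files aside, start the locator, and then re-run the service update; the backup directory is hypothetical:

# On the locator VM, move the stale locator disk store files aside
mkdir -p /var/vcap/store/gemfire-locator/old-configdiskdir
mv /var/vcap/store/gemfire-locator/ConfigDiskDir_locator-* /var/vcap/store/gemfire-locator/old-configdiskdir/
# Restart the locator process
monit start gemfire-locator
# From the cf CLI, re-run the service update
cf update-service SERVICE_INSTANCE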