Replace a WAN-replicated Tanzu GemFire for VMs service without data loss or downtime on the healthy service instance
Article ID: 294378
Products
VMware Tanzu GemFire
Issue/Introduction
How do I replace a WAN-replicated Tanzu GemFire for VMs service without data loss or downtime on a healthy service instance?
Environment
Product Version: 1.12
Resolution
Scenario
Given:
Two service instances, Cluster A and Cluster B, connected through WAN.
Cluster A has DSID=1 and Cluster B has DSID=2.
Cluster B is irrecoverable.
Expectation:
New Cluster C will replace Cluster B
Cluster C needs to connect to Cluster A through a WAN setup
Data for WAN-enabled regions needs to be consistent between Cluster A and Cluster C
No loss of queued events for Cluster A and Cluster C
No downtime for Cluster A
*DSID is the distributed-system-id
Perform the following steps in order to replace Cluster B with Cluster C without any data loss or downtime on Cluster A:
Run bosh stop on the irrecoverable cluster (Cluster B), or shut down all of its VMs without BOSH if necessary. This step is needed because Cluster A maintains connection information for Cluster B. Instead of deleting the problem service instance, we stop it so that the information for this service remains within the BOSH infrastructure.
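A hedged example, assuming the BOSH deployment name for Cluster B follows the usual service-instance_<guid> pattern (the service instance name and GUID below are placeholders):

# Look up the GUID of Cluster B, then stop every job in its deployment.
# A plain "bosh stop" leaves the VMs in place, so the instance information is preserved.
cf service <cluster-B-service-instance-name> --guid
bosh -d service-instance_<guid-of-cluster-B> stop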
For testing and demonstration purposes only, put a few records into Cluster A to simulate live traffic. In the production environment, Cluster A will continue to accept live traffic and WAN events will be queued:
put --key="1112" --value="1111" --region=/regionX
Or, use a client application to generate put operations.
Make sure that the new events are queued in Cluster A via the following command from gfsh:
list gateways
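While connected to Cluster A in gfsh, you can also inspect the queue of the specific sender that points at remote DSID 2 (the sender id below is a placeholder for whatever id Cluster A uses; the exact columns shown, including the queued-event count, vary by product version):

list gateways
status gateway-sender --id=<sender-id-pointing-at-DSID-2>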
Create a new Cluster C with DSID=2. This is the same DSID as Cluster B, which is irrecoverable.
Important: To ensure no loss of WAN events during backup/restore, we need to make sure that the new cluster has the same DSID as the broken/irrecoverable cluster.
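A hedged example of creating Cluster C, assuming the marketplace service is p-cloudcache and that the distributed_system_id parameter from the WAN setup procedure applies to your tile version (plan and instance names are placeholders; check cf marketplace and the WAN setup documentation for your foundation):

cf create-service p-cloudcache <plan-name> cluster_C -c '{"distributed_system_id": 2}'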
Restore Cluster C
Important: Do not create any service key for Cluster C between creating the cluster and running the BBR restore operation. The BBR restore operation will not run if Cluster C has been modified.
To prepare the restore script, we need to capture the following information:
The GUID of the new Cluster C:
cf service <cluster-C-service-instance-name> --guid
The path to the backup artifacts (from the backup archive taken of Cluster A)
Run the restore via a script.
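Below is a minimal sketch of the restore invocation, assuming BBR is run from a jumpbox that can reach the BOSH Director; the Director address, BOSH client credentials, CA certificate path, Cluster C GUID, and artifact path are placeholders to fill in from your environment and from the information captured above:

# Restore the backup artifacts onto the Cluster C deployment.
BOSH_CLIENT_SECRET=<bosh-client-secret> \
bbr deployment \
  --target <bosh-director-address> \
  --username <bosh-client> \
  --ca-cert <path-to-director-ca-cert> \
  --deployment service-instance_<guid-of-cluster-C> \
  restore --artifact-path <path-to-backup-artifacts>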
Note: The restore will report errors like the ones below. These failures are caused by the conflict between the local DSID and the remote DSID when Cluster C is restored from the backup of Cluster A. We will address this conflict in the later steps. However, the servers of Cluster C will have all the necessary data from Cluster A at this point.
error 1:
Error attempting to run post-restore-unlock for job gemfire-server on <xxx>: No Members Found - exit code 1
error 2:
Error attempting to run post-restore-unlock for job gemfire-server on <xxx>: No Members Found - exit code 1
error 3:
Error attempting to run post-restore-unlock for job gemfire-server on <xxx>: No Members Found - exit code 1
The server VMs of Cluster C will be in a “failing” state and we can check the status of all the VMs using the following command:
bosh -d <service-instance-id> is --ps
In order to restore the gemfire-server processes, perform the following sub-steps (a combined example follows this list):
ssh into the locator VMs of Cluster C and run “monit stop all” to stop the locator processes.
Clean the conflicting cluster configurations in the locator VMs by removing the cluster config directory using the following command:
rm -r /var/vcap/store/gemfire-locator/*
Repeat Step 1 and Step 2 for all the locator VMs.
ssh into the locator VMs of Cluster C and run “monit start all” to start the locator processes. This step needs to be done on all the locator VMs.
ssh into the server VMs of Cluster C and run “monit start all” to start the server processes. This step needs to be done on all the server VMs.
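A combined sketch of the sub-steps above, assuming the instance groups in the Cluster C deployment are named locator and server (verify with bosh instances) and that the deployment name follows the service-instance_<guid> pattern:

# From the jumpbox, open a shell on the first locator of Cluster C:
bosh -d service-instance_<guid-of-cluster-C> ssh locator/0

# On the locator VM, become root, stop the processes, and remove the stale cluster config:
sudo su -
monit stop all
rm -r /var/vcap/store/gemfire-locator/*

# After all locators have been stopped and cleaned, start them again (on each locator VM):
monit start all

# Finally, on each server VM of Cluster C (bosh ssh server/<n>), start the processes:
monit start all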
Create a service key for Cluster C.
cf create-service-key <service> <key>
Log in to gfsh using the connect string from the service key of Cluster C. Because of the conflicting cluster configuration caused by the BBR restore operation, we need to recreate the cluster configuration of Cluster C from scratch.
We also need to clean the WAN disk store on Cluster C, since the BBR restore process copies the WAN disk store unnecessarily; the proper WAN events need to be recovered through the WAN event replication process.
To do so, re-run any existing gfsh scripts/commands that were used to configure Cluster B, and use the gfsh destroy disk-store command to remove the stale gateway disk store.
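A hedged example of retrieving the connect information and logging in; the key name, URL, user, and password below are placeholders, and the exact connect string (including any TLS-related options) should be taken from the gfsh entry of Cluster C's service key:

cf service-key <cluster-C-service-instance-name> <key-name>

# In gfsh, using the URL and cluster_operator credentials from the service key:
connect --url=https://<cluster-C-gfsh-url>/gemfire/v1 --user=<cluster_operator-user> --password=<password>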
The following are example steps to recover region data and clean out WAN events:
Create and then destroy the disk store for the WAN gateways (this removes any stale data from the WAN disk store):
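A hedged gfsh sketch of this cleanup, followed by re-creation of the gateway sender toward Cluster A; the disk store name and directory are placeholders for whatever the original Cluster B configuration used, and remote-distributed-system-id=1 assumes Cluster A's DSID from the scenario above:

create disk-store --name=<stale-gateway-disk-store> --dir=<stale-gateway-disk-store-dir>
destroy disk-store --name=<stale-gateway-disk-store>
create gateway-sender --id=send_to_ClusterA --remote-distributed-system-id=1 --enable-persistence=true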
Create the region and use the gateway sender id send_to_ClusterA:
create region --name=regionX --type=PARTITION_PERSISTENT --disk-store=regionX_disk --gateway-sender-id=send_to_ClusterA
Describe the region to verify that the data has been loaded:
describe region --name=<region>
Update the service for Cluster A with the connection information (remote locators, trusted sender credentials, and recursors) of Cluster C. Refer to the Tanzu GemFire for VMs documentation on setting up a WAN connection for more details.
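A hedged sketch of that update, assuming the remote_clusters parameter used by the WAN setup procedure for this product; the locator addresses, trusted sender credential, and recursor values below are placeholders to be copied from Cluster C's service key, and the exact JSON key names should be verified against the WAN setup documentation for your tile version:

cf update-service <cluster-A-service-instance-name> -c '{
  "remote_clusters": [{
    "remote_locators": ["<cluster-C-locator-1>[55221]", "<cluster-C-locator-2>[55221]"],
    "trusted_sender_credentials": ["<credential-from-cluster-C-service-key>"],
    "recursors": ["<recursor-from-cluster-C-service-key>"]
  }]
}'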
At this point, Cluster C should receive all the queued events from Cluster A and fully replaces Cluster B. We can verify this by using gfsh to check that the regions in both clusters contain the same data, for example:
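For example, run the same gfsh commands against both Cluster A and Cluster C and compare the results (regionX and the sample key are taken from the steps above):

query --query="select count(*) from /regionX"
get --key="1112" --region=/regionX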