Replace a WAN-replicated Tanzu GemFire for VMs service without data loss or downtime on the healthy service instance

Article ID: 294378


Updated On:

Products

VMware Tanzu GemFire

Issue/Introduction

How do I replace a WAN-replicated Tanzu GemFire for VMs service without data loss or downtime on a healthy service instance?

Environment

Product Version: 1.12

Resolution

Scenario

Given:
 

  • Two service instances, Cluster A and Cluster B, connected through a WAN.
  • Cluster A has DSID=1 and Cluster B has DSID=2.
  • Cluster B is irrecoverable.


Expectation:
 

  • New Cluster C will replace Cluster B
  • Cluster C needs to connect to Cluster A through a WAN setup
  • Data for WAN-enabled regions needs to be consistent between Cluster A and Cluster C
  • No loss of queued events for Cluster A and Cluster C
  • No downtime for Cluster A

*DSID is the distributed-system-id

Perform the following steps in order to replace Cluster B with Cluster C without any data loss or downtime on Cluster A:
 

  1. Bosh stop the irrecoverable cluster (Cluster B), or shut down all VMs without BOSH if necessary. This step is needed because Cluster A maintains connection information for Cluster B. Instead of deleting the problem service instance, we stop it so that the information for this service remains within the BOSH infrastructure.
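    • For example, assuming the BOSH deployment for Cluster B follows the usual service-instance_<guid> naming convention, the stop could look like:

      bosh -d service-instance_<guid-of-cluster-B> stop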
  2. Back up Cluster A
    • Steps to back up the cluster can be found at https://docs.pivotal.io/p-cloud-cache/1-10/backupandrestore.html. Below is an example script:

      BOSH_CLIENT_SECRET=<secret> bbr deployment --target <ip> --username <ops_manager> --ca-cert /var/tempest/workspaces/default/root_ca_certificate --deployment <service-instance> backup

       

  3. For testing and demonstration purposes only, put a few records into Cluster A to simulate live traffic. In the production environment, Cluster A will continue to accept live traffic and WAN events will be queued:
    put --key="1112" --value="1111" --region=/regionX
    Or use a client app to generate put operations.
  4. Make sure that the new events are queued in Cluster A via the following command from gfsh:
    list gateways
  5. Create a new Cluster C with DSID=2. This is the same DSID as Cluster B, which is irrecoverable.
    • Important: To ensure no loss of WAN events during backup/restore, make sure that the new cluster has the same DSID as the broken/irrecoverable cluster. An example create command is shown below.
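    • Below is a sketch of the create command, assuming the p-cloudcache service offering and a placeholder plan from your marketplace; distributed_system_id is the create-service parameter that sets the DSID:

      cf create-service p-cloudcache <plan> ClusterC -c '{"distributed_system_id": 2}'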
  6. Restore Cluster C
    1. Important: Do not create a service key for Cluster C between creating the cluster and running the BBR restore operation. The BBR restore operation will not run if Cluster C has been modified.

    2. To prepare the restore script, we need to capture the following information:
      • The guid of the new Cluster C. 

        cf service <service-instance-id> --guid
      • The backup artifacts path (from the archive of backup)

    3. Run the restore via script. Below is an example of the restore script:

      BOSH_CLIENT_SECRET=<client_secret> bbr deployment --target <IP> --username <ops_manager> --ca-cert /var/tempest/workspaces/default/root_ca_certificate --deployment <service> restore --artifact-path <artifact_from_backup>
    4. Let this restore task finish.

      Note: It will give an error like the one below. These failures are caused by the conflict of DSID and remote DSID when we try to restore Cluster C from Cluster A. We will address this error and conflict in the later steps. However, the servers of Cluster C will have all the necessary data from Cluster A at this point.

      Error attempting to run post-restore-unlock for job gemfire-server on <xxx>: No Members Found - exit code 1
      error 2:
      Error attempting to run post-restore-unlock for job gemfire-server on <xxx>: No Members Found - exit code 1
      error 3:
    5. The server VMs of Cluster C will be in a “failing” state and we can check the status of all the VMs using the following command:

      bosh -d <service-instance-id> instances --ps

       

  7. In order to restore the gemfire-server process, perform the following steps:
    1. ssh into a locator VM of Cluster C and run “monit stop all” to stop the locator processes.

    2. Clean the conflicting cluster configurations in the locator VMs by removing the cluster config directory using the following command: 
      rm -r /var/vcap/store/gemfire-locator/*
    3. Repeat Step 1 and Step 2 for all the locator VMs.
    4. ssh into the locator VMs of Cluster C and run “monit start all” to start the locator processes. This step needs to be done for all the locator VMs.

    5. ssh into the server VMs of Cluster C and run “monit start all” to start the server processes. This step needs to be done for all the server VMs.
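    Note: Below is a sketch of this sequence for a single locator VM; the deployment name service-instance_<guid> and the instance group names locator and server are assumptions, so confirm the real names with “bosh -d <deployment> instances”:

      # open an ssh session on a locator VM of Cluster C
      bosh -d service-instance_<guid-of-cluster-C> ssh locator/0

      # on the locator VM (monit normally lives at /var/vcap/bosh/bin/monit and requires root):
      sudo /var/vcap/bosh/bin/monit stop all
      sudo rm -r /var/vcap/store/gemfire-locator/*

      # once every locator has been stopped and cleaned, start each locator again,
      # then each server, from a similar ssh session:
      sudo /var/vcap/bosh/bin/monit start all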
  8. Create a service key for Cluster C.
    cf create-service-key <service> <key>
  9. Log in to gfsh using the connect string from the service key of Cluster C. Because of the conflicting cluster configuration caused by the BBR restore operation, we need to recreate the cluster configuration of Cluster C from scratch.

    Also, we need to clean the WAN disk store on Cluster C since the BBR restore process copies the WAN disk store unnecessarily. The proper WAN events need to be recovered by the WAN event replication process.

    In doing so, we need to re-run any existing gfsh scripts/commands that were used to configure Cluster B. Use the gfsh destroy disk-store command to remove the stale gateway disk store.

    The following are example steps to recover region data and clean out WAN events:
    1. Create and destroy the disk store for the WAN gateways (this removes any stale data from the WAN disk store):

      create disk-store --name=gateway_disk --dir=./gateway
      destroy disk-store --name=gateway_disk
      create disk-store --name=gateway_disk --dir=./gateway
    2. Create disk-stores for regions:

      create disk-store --name=regionX_disk --dir=./regionX
    3. Create the gateway sender:
      create gateway-sender --id=send_to_ClusterA --disk-store-name=gateway_disk --remote-distributed-system-id=1 --enable-persistence=true
    4. Create gateway receiver:
      create gateway-receiver
    5. Create the region using the gateway sender send_to_ClusterA:
      create region --name=regionX --type=PARTITION_PERSISTENT --disk-store=regionX_disk --gateway-sender-id=send_to_ClusterA
    6. Describe the regions to verify that the data is loaded:
      describe region <region>
    7. Update the service for Cluster A with the information (remote locators + trusted senders + recursors) of Cluster C. Please refer to the following link for more details:

      https://docs.pivotal.io/p-cloud-cache/1-10/WAN-bi-TLS-setup.html

      Note: Make sure that the service instance is healthy. Below is an example of “cf update-service”:

      cf update-service ClusterA -c '{
        "remote_clusters": [
          {
            "recursors": {
              "service.service-instance-abc.bosh": [
                "12.456.78.910:1053",
                "12.345.67.890:1053",
                "12.345.67.980:1053"
              ]
            },
            "remote_locators": [
              "id1.locator.service-subnet.bosh[55221]",
              "id2.locator.service-subnet.bosh[55221]",
              "id3.locator.service-subnet.bosh[55221]"
            ],
            "trusted_sender_credentials": [
              {
                "password": "xxx",
                "username": "gateway_sender_yyy"
              }
            ]
          }
        ]
      }'
  10. Ensure that the service update of Cluster A has completed before performing any other actions so that no WAN events are lost.
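    For example, the status of the update can be polled with cf service until the last operation reports that the update succeeded (ClusterA is the example instance name used in the next step):

      cf service ClusterA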
  11. Then refresh the service key for Cluster A by deleting and recreating the service key.
    cf delete-service-key ClusterA k1
    cf create-service-key ClusterA k1
  12. Update the service for Cluster C with the information (remote locators + trusted senders + recursors) of Cluster A. Please refer to the following link for more details: https://docs.pivotal.io/p-cloud-cache/1-10/WAN-bi-TLS-setup.html
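    Below is a sketch of what this update can look like. It mirrors the payload used for Cluster A in step 9, this time carrying Cluster A's recursors, remote locators, and trusted sender credentials; every value is a placeholder to be replaced with the real values for your environment:

      cf update-service ClusterC -c '{
        "remote_clusters": [
          {
            "recursors": {
              "<cluster-A-bosh-dns-zone>": [ "<recursor-ip>:1053" ]
            },
            "remote_locators": [
              "<id>.locator.<cluster-A-subnet>.bosh[55221]"
            ],
            "trusted_sender_credentials": [
              { "username": "<gateway-sender-username>", "password": "<password>" }
            ]
          }
        ]
      }'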
  13. Then refresh the service key for Cluster C by deleting and recreating the service key 
    cf delete-service-key ClusterC k1
    cf create-service-key ClusterC k1
  14. At this point, Cluster C should have received all queued events from Cluster A and fully replaces Cluster B. We can verify this by using gfsh to check that the WAN-enabled regions on both clusters contain the same data.
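    For instance, the following gfsh checks can be run against both Cluster A and Cluster C and the results compared (regionX is the example region used above; adjust names to your environment):

      describe region --name=/regionX
      query --query="select count(*) from /regionX"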