How to migrate WAN connected GemFire clusters to new hosts with no downtime


Article ID: 294358


Products

VMware Tanzu GemFire

Issue/Introduction

Given two active-active, WAN-connected VMware GemFire clusters, how can these nodes be migrated to new hosts or VMs?

How can this be done so that, while traffic is redirected to site B and the members in site A are being migrated, the members in site B continue to service clients, and vice versa, with no loss of in-flight data?

Environment

Product Version: 9.9

Resolution

The first step of this scenario is to prepare the new clusters ahead of time, so that data can be exported from the old cluster and imported into the new cluster while the live cluster queues WAN events.

The key to avoiding data loss is to stop the receivers on the new clusters, then redirect WAN replication to them before performing the export/import on the target site. WAN events are queued while the data is being migrated.

Before starting the migration, you should have the following set of current or live clusters:

Site A - Cluster A with senders to cluster B (in site B).
Site B - Cluster B with senders to cluster A (in site A).


Prepare the new clusters for the migration

Note: The steps below for clusters A and B can be done in parallel.

1. For cluster A, follow the steps below: 
 
a. Identify all gateway senders from cluster A to cluster B:
 gfsh list gateways

b. Create the new cluster A on the new hosts/VMs with the desired configuration and a new distributed-system-id distinct from the existing clusters, with senders to cluster B corresponding to those from the original cluster A to cluster B.

c. Stop the gateway receiver on the new cluster A:
 gfsh stop gateway-receiver
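
The preparation steps above can be sketched with gfsh as follows. This is a hedged example only: the host names, ports, and distributed-system-ids (old A=1, old B=2, new A'=3) are placeholders, not values from this article, and should be replaced with your environment's settings.

```shell
# Start a locator for the new cluster A (site A') with a distributed-system-id
# distinct from both existing clusters, pointing remote-locators at site B.
gfsh start locator --name=locatorA1 --port=10334 \
  --J=-Dgemfire.distributed-system-id=3 \
  --J=-Dgemfire.remote-locators=siteB-locator1[10334]

# Recreate the senders that the original cluster A had toward cluster B.
gfsh -e "connect --locator=newA-host1[10334]" \
     -e "create gateway-sender --id=sender-to-B --remote-distributed-system-id=2 --parallel=true"

# Keep the new cluster's receiver stopped until the migration redirects
# WAN traffic here (per step 1c).
gfsh -e "connect --locator=newA-host1[10334]" -e "stop gateway-receiver"
```

The same pattern, with the ids reversed, applies when preparing the new cluster B.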

2. For cluster B, follow the steps below: 
a. Identify all gateway senders from cluster B to cluster A.
 gfsh list gateways

b. Create the new cluster B on the new hosts/VMs with the desired configuration and a new distributed-system-id distinct from the existing clusters, with senders to cluster A corresponding to those from the original cluster B to cluster A.

c. Stop the gateway receiver on the new cluster B:
 gfsh stop gateway-receiver

3. Finally, update the remote-locators property on the existing clusters.
 
Note: This step requires restarting the locators in the live clusters, which is generally safe but is best done when load is lowest.

Cluster A: Roll the cluster A locators to update the remote-locators property.
  1. Stop the locator.
  2. Add the new site B locators (site B') to the remote-locators property.
  3. Restart the locator.
  4. Repeat with the next locator, until done.
Cluster B: Roll the cluster B locators to update the remote-locators property.
  1. Stop the locator.
  2. Add the new site A locators (site A') to the remote-locators property.
  3. Restart the locator.
  4. Repeat with the next locator, until done.
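
A single iteration of the rolling restart might look like the sketch below (cluster B shown). The locator names, ports, working directories, and locator addresses are placeholders for illustration.

```shell
# Stop one live locator in cluster B.
gfsh -e "connect --locator=siteB-locator1[10334]" \
     -e "stop locator --dir=/data/locatorB1"

# Restart it with remote-locators extended to include the new site A'
# locators alongside the original site A entry.
gfsh start locator --name=locatorB1 --port=10334 --dir=/data/locatorB1 \
  --J=-Dgemfire.remote-locators=siteA-locator1[10334],siteAprime-locator1[10334]

# Verify the locator has rejoined before rolling the next one.
gfsh -e "connect --locator=siteB-locator1[10334]" -e "list members"
```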


Begin Migration

Note: The live cluster will be queuing events while the migration is being performed on the inactive site. It is strongly recommended that this be done during a window of relatively low traffic and when the live nodes have sufficient capacity to support the growing queues.


Phase I (Migrate Site A)

1. Direct all traffic to site B.

2. For each sender from cluster B to the original cluster A, create an equivalent sender from cluster B to the new cluster A, using the new cluster's distributed-system-id:
gfsh create gateway-sender --id=<sender id> --remote-distributed-system-id=<new cluster A distributed-system-id>

3. Stop the original senders from cluster B to cluster A; for each such sender, run:
gfsh stop gateway-sender --id=<sender id>
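
When there are many senders, the stop commands can be batched into a gfsh script file. The sender IDs below are hypothetical placeholders for the IDs reported by `gfsh list gateways`.

```shell
# Hypothetical sender IDs; substitute the IDs from `gfsh list gateways`.
SENDERS="sender-to-A-1 sender-to-A-2"

# Generate one stop command per sender into a gfsh script file.
for s in $SENDERS; do
  printf 'stop gateway-sender --id=%s\n' "$s"
done > stop-senders.gfsh

# Then execute it against the live cluster, for example:
#   gfsh -e "connect --locator=siteB-locator1[10334]" -e "run --file=stop-senders.gfsh"
```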

4. Export the region data from the original cluster A (site A) and import it into the new cluster A.
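
One way to perform this step is with gfsh's export data and import data commands, run per region. This is a hedged sketch: the region, member, locator names, and file paths are placeholders, and the snapshot file must be copied between hosts as needed.

```shell
# On the original cluster A, export a region to a snapshot file.
gfsh -e "connect --locator=siteA-locator1[10334]" \
     -e "export data --region=/customers --file=/backups/customers.gfd --member=serverA1"

# Copy the snapshot file to a host in the new cluster, then import it.
gfsh -e "connect --locator=siteAprime-locator1[10334]" \
     -e "import data --region=/customers --file=/backups/customers.gfd --member=serverA1-new"
```

Repeat for each region being migrated.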

5. Start the gateway receiver on the new cluster A:
gfsh start gateway-receiver

6. Configure clients to use the new cluster A for site A, and shut down the original cluster A.


Phase II (Migrate Site B)

1. Direct all traffic to site A, which now points to the new cluster A.

2. Export the region data from the original cluster B (site B) and import it into the new cluster B.

3. Start the gateway receiver on the new cluster B:
gfsh start gateway-receiver

4. Configure clients to use the new cluster B for site B, and shut down the original cluster B.

5. Finally, open up traffic to both sites and resume normal operations.