How to clear stale remote locator information from WAN enabled VMware Tanzu GemFire [VMs] clusters

Article ID: 294371


Products

VMware Tanzu GemFire

Issue/Introduction

This article helps identify an issue in which the locators of one WAN site in VMware Tanzu GemFire cannot communicate with remote locators because they hold stale references after the cluster WAN topology has changed.

Communication between local and remote locators can fail for different reasons. In this article, we consider the scenario below:

Assume you have two clusters connected via WAN (Cluster A and Cluster B):
 
  • Cluster B is unhealthy, so you create a new Cluster C with the same distributed-system-id as Cluster B in order to replace Cluster B completely.
  • Cluster B is deleted after Cluster C has started successfully.
  • Cluster A now needs to be updated with the new remote locators, and Cluster C needs to be configured. Both actions can be done using the update-service command. The update-service command restarts the whole cluster in a rolling fashion and adds the Cluster C locators to Cluster A's configuration.


Symptoms:
Replication from Cluster C to Cluster A should work, since Cluster C is newly created and holds only the information for its own locators and the remote locators. However, replication from Cluster A to Cluster C may fail, since Cluster A's locators can keep references to Cluster B's locators even after update-service has run. The references to old WAN locators are stored in the *view.dat file on each cluster locator.

Environment

Product Version: 1.10

Cause

The locators of Cluster A still contain the information of the old cluster (Cluster B) in their *view.dat files. Since Cluster B has already been deleted, there is no DNS entry for the locators of Cluster B.
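
Because the deleted Cluster B locators no longer have DNS entries, one way to spot stale hostnames is to test whether each remembered locator hostname still resolves. The following is a minimal Python sketch and not part of the product. The function name unresolvable_hosts and its injectable resolve parameter are illustrative assumptions.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> List[str]:
    """Return the hosts, in input order, for which name resolution fails.

    `resolve` defaults to socket.gethostbyname; it is a parameter only so
    that the check can be exercised without real DNS (an assumption made
    for this sketch).
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

A hostname that fails to resolve is only a hint that it may belong to the deleted cluster. IP-based entries such as 192.0.2.1 are not covered by this check and have to be compared against the known locator addresses instead.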

Below is an example of log statements that can be found in the server logs:
[warn 2021/01/15 18:02:11.753 UTC <Event Processor for GatewaySender_send_to_ClusterC> tid=0xba] send_to_ClusterC : Could not connect due to: Authentication error. Please check your credentials.
[info 2021/01/15 17:56:44.407 UTC <WAN Locator Discovery Thread2> tid=0xa9] Locator discovery task exchanged locator information xxx.locator.dynamic-services-tiles-network.service-instance-xxx.bosh[55221] with zzz.locator.servicenetwork.service-instance-zzz.bosh[55221]: {

1=[
aaa1.locator.servicenetwork.service-instance-aaa1.bosh[55221], 
192.0.2.1[55221], 
192.0.2.2[55221], 
bbb1.locator.servicenetwork.service-instance-bbb1.bosh[55221], 
ccc1.locator.servicenetwork.service-instance-ccc1.bosh[55221], 
192.0.2.3[55221]
]


The log entries above show that the Event Processor is connecting to an incorrect locator (for example, 192.0.2.1[55221]), which is a locator on Cluster B.
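
To find such entries without reading the list by eye, the locator list in a "Locator discovery task exchanged locator information" log entry can be compared against the locators you expect. The following is a minimal Python sketch. It assumes that the number before "=[" is the distributed-system-id and that each entry has the form host[port], as in the sample above. The function names parse_locator_map and stale_locators are hypothetical.

```python
import re
from typing import Dict, List, Set

# "<id>=[" opens the locator list for one distributed-system-id (assumed).
_DS_HEADER = re.compile(r"(\d+)=\[")
# One locator entry, e.g. "192.0.2.1[55221]" or "host.example.bosh[55221]".
_LOCATOR = re.compile(r"([A-Za-z0-9_.\-]+)\[(\d+)\]")


def parse_locator_map(log_text: str) -> Dict[int, List[str]]:
    """Map each distributed-system-id to its locators, in log order."""
    headers = list(_DS_HEADER.finditer(log_text))
    result: Dict[int, List[str]] = {}
    for i, header in enumerate(headers):
        # A list runs from its header to the next header or the end of text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        segment = log_text[header.end():end]
        result[int(header.group(1))] = [
            f"{host}[{port}]" for host, port in _LOCATOR.findall(segment)
        ]
    return result


def stale_locators(log_text: str, ds_id: int, expected: Set[str]) -> List[str]:
    """Return locators listed for ds_id that are not in the expected set."""
    listed = parse_locator_map(log_text).get(ds_id, [])
    return [locator for locator in listed if locator not in expected]
```

Here expected would be the set of Cluster C locators in host[port] form. Anything the function returns is a candidate stale Cluster B reference.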

Resolution

To resolve this issue, clear the references to the old Cluster B locators from Cluster A's *view.dat files. To do this, stop all locators on both sides of the WAN (Cluster A and Cluster C), delete any state files, and restart the whole cluster in a rolling fashion, starting with the locators. Please follow the steps below carefully.

Note: Performing the steps below causes an outage. The outage time depends on the size of the cluster and the region data.
 
  1. Stop all the apps, data loaders, and traffic connected to Cluster A.
  2. Stop all the GatewaySenders on both sides (gfsh>stop gateway-sender --id=send_to_ClusterC etc.).
  3. SSH to each locator VM on Cluster A via the BOSH CLI and stop the locator using "monit stop gemfire-locator".
  4. Delete the .dat file (for example, locator55221view.dat) from the locator's directory "/var/vcap/store/gemfire-locator".
  5. Repeat the above step (#4) for all the locators on Cluster A.
  6. Start the locators on Cluster A.
  7. Restart all the servers on Cluster A.
  8. Now you can start the apps and traffic to these clusters.
Finally, verify that the GatewaySenders and GatewayReceivers are running:
gfsh>list gateways

gfsh>status gateway-sender --id=send_to_ClusterC

gfsh>status gateway-receiver
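
Step 4 of the procedure above (deleting the view file on each locator VM) can be sketched in Python as shown below. This is a minimal sketch under stated assumptions, not a supported tool. The glob pattern locator*view.dat is inferred from the example file name locator55221view.dat, and the function name remove_view_files and its dry_run flag are hypothetical. Run it only after the locator has been stopped.

```python
from pathlib import Path
from typing import List


def remove_view_files(locator_dir: str, dry_run: bool = True) -> List[Path]:
    """Find locator view files in locator_dir and optionally delete them.

    With dry_run=True (the default) nothing is deleted; the matching
    paths are only returned so they can be reviewed first.
    """
    # Pattern assumed from the example name "locator55221view.dat".
    matches = sorted(Path(locator_dir).glob("locator*view.dat"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

For example, remove_view_files("/var/vcap/store/gemfire-locator") lists the files that would be removed, and passing dry_run=False deletes them. Defaulting to a dry run is a design choice for this sketch, so that nothing is deleted before the list has been reviewed.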