GemFire startup takes significantly longer after scaling or server replacement


Article ID: 417452


Updated On:

Products

VMware Tanzu GemFire

Issue/Introduction

After scaling out or replacing servers in a GemFire cluster, server startup took several hours, even after the old disk stores had been removed. Log messages such as the following indicate delays in persistent bucket recovery:

“[<PersistentBucketRecoverer for region R>] Region T (and any colocated sub-regions) has potentially stale data. Buckets [AA, XX, BB, YYY, ZZZ, CCC, DDD, EEE] are waiting for another offline member to recover the latest data.”

 

Environment

Applicable to all supported GemFire versions.

Cause

GemFire tracks persistent bucket ownership through cluster configuration metadata and disk store metadata files (.crf, .drf, .if). When servers are replaced or scaled out and disk stores are deleted manually, the cluster configuration still retains ownership metadata pointing to the old (now offline) servers.
As a result:

  • The new server hosting those bucket(s) sees no local persistent data.
  • The last known owner (old server) is offline.
  • To avoid data loss, the server waits, either indefinitely or until a long timeout expires, before recovering these buckets, which causes significant startup delays. This waiting state is visible from gfsh, as shown below.
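
This state can be confirmed from gfsh before any cleanup is attempted. A minimal check, assuming a locator reachable at locator1[10334] (the host and port are hypothetical):

  gfsh>connect --locator=locator1[10334]
  gfsh>show missing-disk-stores

Each entry reported by show missing-disk-stores includes the disk store ID, host, and directory. Entries pointing at a decommissioned host are the stale ownership records described above; they are what the surviving servers wait on during startup.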

Resolution

This is a known issue and will be resolved in a future release of Tanzu GemFire. Subscribe to this article to receive updates. Until the fix is available, follow the steps below to prevent prolonged startup times during scaling or server replacement; illustrative gfsh commands for each step are shown after the list.

  1. Avoid Manually Deleting Disk Store Files
    Always use the GFSH command destroy disk-store to safely remove disk stores. Manual file deletion can leave the cluster metadata inconsistent, causing recovery delays.
  2. Revoke Missing Disk Stores Before Server Replacement
    Before bringing up a replacement for a decommissioned server, run the GFSH command show missing-disk-stores to identify the disk store IDs that belonged to the old member, then run revoke missing-disk-store for each of those IDs. This removes the stale ownership metadata from the cluster, so the remaining members will not wait on startup for buckets last hosted by the offline member. Only revoke disk stores of members that will never return; any data present only in a revoked disk store is lost.
  3. Ordered Shutdown and Controlled Startup
    Perform an ordered shutdown with the gfsh shutdown command so that persistent data is flushed and the disk stores are left in a state that recovers quickly. When restarting, start all members that host persistent regions at roughly the same time, or in controlled batches, so they can recover their buckets in parallel rather than waiting on members that have not started yet.
  4. Monitor and Tune Disk Store Settings
    Monitor disk usage thresholds and consider file compaction and backup strategies to improve disk store health, which can affect startup performance.
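
For step 1, a sketch assuming a disk store named DATA_STORE that is no longer referenced by any region (the name is hypothetical):

  gfsh>destroy disk-store --name=DATA_STORE

Destroying the disk store through gfsh removes both the files and the corresponding cluster configuration entry, which keeps the metadata consistent, unlike deleting the .crf, .drf, and .if files by hand.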
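For step 2, a sketch assuming the cluster reports one missing disk store belonging to the retired member; the ID shown is hypothetical:

  gfsh>show missing-disk-stores
  gfsh>revoke missing-disk-store --id=60399215-532b-406f-b81f-9b5bd8d1b55a

show missing-disk-stores prints the disk store ID, host, and directory of each store the cluster is waiting for; pass the ID of the retired member's store to revoke missing-disk-store. After the revoke, the surviving members recover the affected buckets from the remaining redundant copies instead of waiting for the offline member.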
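For step 3, a sketch of an ordered shutdown followed by a roughly simultaneous restart; the member names, locator address, and directories are hypothetical:

  gfsh>connect --locator=locator1[10334]
  gfsh>shutdown --include-locators=true

  # after restarting the locators, launch the servers in parallel from the OS shell
  gfsh -e "start server --name=server1 --locators=locator1[10334] --dir=/data/server1" &
  gfsh -e "start server --name=server2 --locators=locator1[10334] --dir=/data/server2" &
  wait

Starting the persistent members together lets them satisfy each other's recovery dependencies instead of each one waiting for members that have not started yet.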
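For step 4, a sketch of routine disk store maintenance; the disk store name and backup directory are hypothetical:

  gfsh>compact disk-store --name=DATA_STORE
  gfsh>backup disk-store --dir=/backups/gemfire

Online compaction removes obsolete records from the oplog files, and regular backups keep a consistent copy of the persistent data that can be restored if a disk store is lost.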

Additional Information

References:

 

Resolving Server Startup Issues Due to Missing or Corrupted Disk-Store Files in GemFire

GemFire slow startup after deleting regions without revoking disk store

Recovery and Scaling Procedures for Tanzu GemFire Clusters with Large Partitioned Regions

Optimizing a System with Disk Stores