Recovery and Scaling Procedures for Tanzu GemFire Clusters with Large Partitioned Regions


Article ID: 416519


Products

VMware Tanzu GemFire

Issue/Introduction

Adding new servers in parallel to clusters that use redundancy zones can cause cluster bring-up failures, particularly in clusters with large partitioned regions. Common symptoms include:

  • Members unexpectedly shutting down shared, unordered connections
  • New servers exhibiting stuck GII (get initial image) sender threads, blocking region bucket distribution

These issues can lead to cluster instability and degraded application performance.

Environment

Tanzu GemFire version 10.1.0

Cause

The problems typically stem from disk store overflow conditions, where region data spills to disk after exceeding memory limits, combined with GII storms, that is, simultaneous bucket image transfers to multiple joining members.

By default (startup-recovery-delay set to 0), redundancy recovery begins as soon as a new member joins. If multiple servers join in parallel, GemFire may assign almost all redundancy recovery tasks to the first member started, causing uneven load distribution, bucket file corruption, or lock contention. Disk I/O saturation further exacerbates recovery stalls and data inconsistencies.

Resolution

Option A: Controlled Parallelism Using startup-recovery-delay

  • Set the property startup-recovery-delay to -1 at the region or cluster level. This disables automatic redundancy recovery immediately after a new member joins.
  • With this setting, redundancy recovery is only performed via explicit rebalance or redundancy restore operations, allowing administrators to fully control when workload is distributed.
  • After expanding the cluster, run the following gfsh commands to restore redundancy and rebalance buckets across members while respecting redundancy zone placement rules:

gfsh> rebalance
gfsh> restore redundancy
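
If the region is created with gfsh rather than cache XML, the startup recovery delay can also be set at creation time. A reference sketch, assuming a partitioned region named PR1 (the region name and type shown here are illustrative):

gfsh> create region --name=PR1 --type=PARTITION_REDUNDANT --startup-recovery-delay=-1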

 

Option B: Manual Cleanup and Staged Bringup

  • Stop all new servers to clear stuck GII senders, resolve lock contention, and reset cluster state.
  • Delete disk stores on stopped servers to remove corrupted, bloated, or inconsistent bucket files from failed GII attempts. Note: deleting disk stores results in loss of local bucket data, so make sure backups or a data recovery plan are in place first.
  • Restart servers in small batches (recommended 2–3 nodes at a time) to reduce parallel GII and rebalance load and minimize disk overflow or lock contention risks.
  • Continuously monitor disk and memory usage during rebalance and region initialization through GemFire metrics, for example:

gfsh> show metrics --categories=diskstore
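
The staged bringup above can also be driven from gfsh. A reference sketch for one batch of two servers, assuming a locator at locator1[10334] (server names and the locator address are illustrative); verify that redundancy has settled before starting the next batch:

gfsh> start server --name=server4 --locators=locator1[10334]
gfsh> start server --name=server5 --locators=locator1[10334]
gfsh> status redundancy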

 

Option C: Fine-Grained Control of Redundancy Recovery

  • You can disable automatic redundancy recovery at the region configuration level for precise control, for example:

 

<region name="PR1">
  <region-attributes refid="PARTITION">
    <partition-attributes startup-recovery-delay="-1"/>
  </region-attributes>
</region>

 

  • With this setting, redundancy recovery occurs only through manual or automated rebalance operations.
  • Automate safe rebalance operations by integrating health checks into scheduled scripts or SOPs. For example:
    • Verify whether any buckets lack redundancy.
    • Check for bucket imbalances across members or members holding zero buckets.
    • Assess memory imbalance across servers against defined thresholds.
    • Confirm that a majority of members are present and that redundancy zones are healthy before rebalancing.
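
Such a health check can be staged entirely in gfsh before committing to a rebalance. A reference sequence (a dry run via --simulate first, then the actual operation); adapt the checks to your own thresholds:

gfsh> status redundancy
gfsh> list members
gfsh> rebalance --simulate=true
gfsh> rebalance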

Recommendation

  • Upgrade to Tanzu GemFire 10.2, which includes critical fixes that mitigate some of these recovery and redundancy stalls:

    • GEM-15458: Resolved deadlocks caused by circular wait during bucket assignment and primary election.
    • GEM-7306: Fixed member hang during bucket recovery synchronization with killed members.
    • GEM-13963: Prevented data mismatch when members shut down during GII.

 
