Parallel addition of new servers in clusters that use redundancy zones may cause cluster bringup failures, particularly affecting large partitioned regions. Common symptoms include:
These issues can lead to cluster instability and degraded application performance.
Tanzu GemFire version 10.1.0
The problems typically stem from disk store overflow (region data spilling to disk once memory limits are exceeded) combined with GII (Get Initial Image) storms, in which bucket images are transferred simultaneously to multiple joining members.
When multiple servers join in parallel and the startup-recovery-delay property is left at its default of 0, GemFire may assign almost all redundancy recovery tasks to the first member that starts. This can cause uneven load distribution, bucket file corruption, or lock contention. Disk I/O saturation further exacerbates recovery stalls and data inconsistencies.
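The skew described above can be illustrated with a toy model. The sketch below is not GemFire's actual recovery algorithm: it ignores redundancy zones, primaries, and disk I/O, and the names `simulate` and `restore_redundancy` are hypothetical helpers chosen for illustration. It assumes one redundant copy per bucket and that join events are processed one at a time, and it only shows why immediate recovery on each join favors the first joiner while deferred recovery can spread copies evenly.

```python
def restore_redundancy(hosts, members, load, copies=2):
    """Toy recovery pass: place each missing bucket copy on the
    least-loaded member that does not already host that bucket."""
    for owners in hosts.values():
        while len(owners) < copies:
            candidates = [m for m in members if m not in owners]
            if not candidates:
                break  # no eligible member yet; bucket stays under-replicated
            target = min(candidates, key=lambda m: load[m])
            owners.append(target)
            load[target] += 1


def simulate(num_buckets, joiners, startup_recovery_delay):
    """Return bucket-copy counts per member after all joiners have started.

    startup_recovery_delay == 0  -> recovery runs immediately on every join
    startup_recovery_delay == -1 -> recovery is deferred until all members
                                    are up, then run once (modelling a manual
                                    redundancy restore)
    """
    # Assumption: a single existing member initially hosts every bucket.
    hosts = {b: ["server-0"] for b in range(num_buckets)}
    load = {"server-0": num_buckets}
    members = ["server-0"]

    for name in joiners:
        members.append(name)
        load[name] = 0
        if startup_recovery_delay == 0:
            restore_redundancy(hosts, members, load)

    if startup_recovery_delay == -1:
        restore_redundancy(hosts, members, load)
    return load


if __name__ == "__main__":
    new_servers = ["server-1", "server-2", "server-3"]
    print(simulate(12, new_servers, 0))
    # {'server-0': 12, 'server-1': 12, 'server-2': 0, 'server-3': 0}
    print(simulate(12, new_servers, -1))
    # {'server-0': 12, 'server-1': 4, 'server-2': 4, 'server-3': 4}
```

In this model, immediate recovery hands every redundant copy to the first joiner because it is the only eligible target when recovery runs, whereas deferring recovery lets the copies be spread across all new members.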
gfsh> rebalance
gfsh> restore redundancy
gfsh> show metrics --categories=diskstore
<region name="PR1">
  <region-attributes refid="PARTITION">
    <partition-attributes startup-recovery-delay="-1"/>
  </region-attributes>
</region>