In multi-site GemFire clusters, users may experience significant delays in data synchronization between clusters (e.g., Primary to Secondary or across geographic regions).
Common Symptoms:
High Latency: Data updates take minutes or hours to reach remote sites.
Queue Backups: Large gatewayQueueSize metrics that fail to drain during peak traffic.
Resource Imbalance: Primary nodes show high ioWait, while receiver nodes remain underutilized.
The bottleneck is most often an insufficient number of dispatcher threads on the Gateway Sender.
Each dispatcher thread handles a batch of events and must wait for an ACK from the remote receiver before processing the next batch. In high-volume environments or those with high network round-trip times, a low thread count (e.g., the default) cannot "push" data fast enough to keep up with the rate of local updates, leading to a bottleneck even if the receiver cluster has capacity.
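The effect of the ACK wait can be estimated with a back-of-envelope model. The sketch below is not a GemFire API. It assumes each dispatcher thread sends exactly one full batch per ACK round trip and ignores serialization, receiver processing time, and conflation. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def max_dispatch_rate(dispatcher_threads, batch_size, ack_round_trip_ms):
    """Rough upper bound on events/second a sender can push.

    Assumes every thread has one batch in flight at a time and
    cannot start the next batch until the ACK returns.
    """
    batches_per_second_per_thread = 1000 / ack_round_trip_ms
    return dispatcher_threads * batch_size * batches_per_second_per_thread


# Hypothetical figures: 100-event batches over a 200 ms round trip.
few_threads = max_dispatch_rate(5, 100, 200)
many_threads = max_dispatch_rate(30, 100, 200)
print(few_threads, many_threads)
```

Under these assumptions the ceiling scales linearly with the thread count, so if local updates arrive faster than the ceiling for the current thread count, the queue grows regardless of how much capacity the receiver has.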
Increase the concurrency of the Gateway Sender by raising its dispatcher-threads setting, so that throughput keeps up during traffic bursts.
<gateway-sender id="ExampleSender"
    dispatcher-threads="30"
    enable-batch-conflation="true"
    batch-size="5000"
    ... />
Please note: This is a tuning exercise. If the queue continues to grow during bursts despite the initial increase, keep raising the dispatcher thread count incrementally until throughput matches the event generation rate.
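The incremental loop described above can be sketched as a simple check on queue-size samples collected during a burst. This is an illustrative helper, not part of GemFire: the sample format, the function names, and the step size of 5 are all assumptions, and the samples would need to come from whatever monitoring already records the sender queue size.

```python
def queue_growth_per_second(samples):
    """Net queue growth between the first and last sample.

    samples: list of (timestamp_seconds, queue_size), ordered by time.
    A positive result means the queue is still backing up.
    """
    (t_first, q_first), (t_last, q_last) = samples[0], samples[-1]
    return (q_last - q_first) / (t_last - t_first)


def next_thread_count(current_threads, samples, step=5):
    """Suggest a thread count for the next tuning round (hypothetical policy)."""
    if queue_growth_per_second(samples) > 0:
        return current_threads + step  # queue still growing: add threads
    return current_threads             # queue flat or draining: keep setting


# Hypothetical burst: queue grows from 1,000 to 4,000 events in 60 seconds.
burst = [(0, 1000), (30, 2600), (60, 4000)]
print(next_thread_count(30, burst))
```

Comparing only the first and last sample keeps the check robust to short spikes within the window. Each round should be measured over a comparable burst, since a queue that drains during quiet periods says little about peak behavior.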