Client Timeout and Duplicate Event Messages in GemFire
Article ID: 408067
Products
VMware Tanzu GemFire
Issue/Introduction
GemFire clients may disconnect from servers or retry requests when performing region operations such as PutAll and RemoveAll.
Typical symptoms in the logs include:
- Client timeouts / forced disconnection by server
- Slow server response detection
  [warn ... <Timer-0>] 15 seconds have elapsed waiting for a response from … for .. thread .. ServerConnection ..
- Event replay warnings
- Client-side retries / failures
Cause
This behavior generally occurs when clients are unable to process server events fast enough. Contributing factors include:
- Very small async queue size (async-max-queue-size=8), leading to queue overflow under heavy load.
- No async distribution timeout (async-distribution-timeout=0), so slow receivers are never timed out.
- Low client timeout threshold (default 10s), so clients disconnect during transient delays.
- Unresponsive clients not removed (remove-unresponsive-client=false), so they keep holding server resources.
- Large batch operations (PutAll, RemoveAll) increasing processing latency.
Resolution
To mitigate these issues, apply the following changes:
- Increase async queue size
  - Set async-max-queue-size to a larger value to buffer bursts of events.
- Enable async distribution timeout
  - Configure async-distribution-timeout to disconnect slow receivers gracefully.
- Allow cleanup of unresponsive clients
  - Set remove-unresponsive-client=true so servers can reclaim resources.
- Tune client/server timeouts
  - Increase client socket/connection timeouts to accommodate GC pauses or network delays.
- Optimize large operations
  - Consider breaking large PutAll requests into smaller batches.
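The recommendations above can be sketched in Java. This is a minimal illustration under stated assumptions, not a definitive configuration: the class name GemFireTuningSketch, both method names, and the values 64 and 5000 are hypothetical and chosen only for the example. The property names are the ones listed above. Check their units and valid ranges in the product documentation before choosing real values. The batching helper takes any java.util.Map as its target, so it runs without GemFire on the classpath. A GemFire Region implements java.util.Map and could be passed as the target.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class GemFireTuningSketch {

    // Collects the server-side properties named in the Resolution list.
    // The values are illustrative assumptions, not recommendations.
    static Properties serverProperties() {
        Properties props = new Properties();
        props.setProperty("async-max-queue-size", "64");          // assumed value, raised from 8
        props.setProperty("async-distribution-timeout", "5000");  // assumed value, replaces 0
        props.setProperty("remove-unresponsive-client", "true");  // from the Resolution list
        return props;
    }

    // Copies source into target in chunks of at most batchSize entries
    // and returns the number of putAll calls that were made.
    static <K, V> int putAllInBatches(Map<K, V> target, Map<K, V> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        Map<K, V> batch = new HashMap<>();
        for (Map.Entry<K, V> entry : source.entrySet()) {
            batch.put(entry.getKey(), entry.getValue());
            if (batch.size() == batchSize) {
                target.putAll(batch);   // one smaller PutAll request
                calls++;
                batch = new HashMap<>();
            }
        }
        if (!batch.isEmpty()) {
            target.putAll(batch);       // final partial batch
            calls++;
        }
        return calls;
    }
}
```

Smaller batches shorten each individual request, which may help it complete within the client timeout. The right batch size depends on entry size and load, and has to be found by testing.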
Please Note: Apply these changes iteratively and test each one thoroughly before rolling it out.