Client Timeout and Duplicate Event Messages in GemFire

Article ID: 408067

Products

VMware Tanzu GemFire

Issue/Introduction

GemFire clients may disconnect from servers or retry requests when performing region operations such as PutAll and RemoveAll.

Typical symptoms in the logs include:

  • Client timeouts / forced disconnection by server

    [warn ... <ClientHealthMonitor Thread>] Server connection from [identity(...)] is being terminated because its client timeout of 10000 has expired.

  • Slow server response detection

    [warn ... <Timer-0>] 15 seconds have elapsed waiting for a response from … for .. thread .. ServerConnection ..

  • Event replay warnings

    [info ... <ServerConnection>] Event has previously been seen for region=...; operation=PUTALL_CREATE

  • Client-side retries / failures

    Server unreachable: could not connect after 3 attempts at org.apache.geode.cache.client.internal.OpExecutorImpl.handleException(...)
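
To gauge how often each symptom occurs, the log lines can be tallied by category. The following is a minimal sketch, not a GemFire utility: the class name, method names, and category labels are hypothetical, and the matched substrings are copied from the sample messages above, so they may need adjusting to the exact wording in your GemFire version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: tallies log lines by the symptom they indicate.
final class GemFireSymptomScanner {
    // Substrings taken from the sample log lines, mapped to an illustrative label.
    private static final Map<String, String> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("is being terminated because its client timeout", "client-timeout");
        MARKERS.put("seconds have elapsed waiting for a response", "slow-response");
        MARKERS.put("Event has previously been seen", "event-replay");
        MARKERS.put("Server unreachable: could not connect", "client-retry-failure");
    }

    /** Returns the label of the first marker found in the line, or null if none match. */
    static String classify(String line) {
        for (Map.Entry<String, String> marker : MARKERS.entrySet()) {
            if (line.contains(marker.getKey())) {
                return marker.getValue();
            }
        }
        return null;
    }

    /** Counts how many of the given lines fall into each symptom category. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String label = classify(line);
            if (label != null) {
                counts.merge(label, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Feeding the lines of a server or client log to `count` (for example via `java.nio.file.Files.readAllLines`) gives a rough picture of which symptom dominates before any settings are changed.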

Cause

This behavior generally occurs when clients are unable to process server events fast enough. Contributing factors include:

  • Very small async queue size (async-max-queue-size=8, in megabytes): the queue overflows under heavy load.
  • No async distribution timeout (async-distribution-timeout=0): slow receivers are never timed out.
  • Low client timeout threshold (default 10 seconds): clients disconnect during transient delays.
  • Unresponsive clients not removed (remove-unresponsive-client=false): they continue to hold server resources.
  • Large batch operations (PutAll, RemoveAll): these increase processing latency.

 

Resolution

To mitigate these issues, apply the following changes:

  1. Increase async queue size
    • Set async-max-queue-size to a larger value to buffer bursts of events.
  2. Enable async distribution timeout
    • Set async-distribution-timeout to a non-zero value (in milliseconds) to disconnect slow receivers gracefully.
  3. Allow cleanup of unresponsive clients
    • Set remove-unresponsive-client=true so servers can reclaim resources.
  4. Tune client/server timeouts
    • Increase the client socket/connection timeouts so that GC pauses and network delays do not trigger disconnects.
  5. Optimize large operations
    • Consider breaking large PutAll and RemoveAll requests into smaller batches.
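
Step 5 can be sketched as a small helper that splits a large map into fixed-size chunks and hands each chunk to a caller-supplied sink. This is an illustration under assumptions, not part of the GemFire API: the class name, method name, and batch size are hypothetical, and the sink is a plain `Consumer` so the code compiles without GemFire on the classpath. With a real region, `region::putAll` could be passed as the sink.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper: sends a large map to a sink in smaller batches.
final class BatchedPutAll {
    /**
     * Splits entries into batches of at most batchSize and passes each batch
     * to the sink in iteration order. Returns the number of batches sent.
     */
    static <K, V> int putAllInBatches(Map<K, V> entries, int batchSize,
                                      Consumer<Map<K, V>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        Map<K, V> batch = new LinkedHashMap<>();
        for (Map.Entry<K, V> entry : entries.entrySet()) {
            batch.put(entry.getKey(), entry.getValue());
            if (batch.size() == batchSize) {
                sink.accept(batch);            // e.g. region.putAll(batch)
                batches++;
                batch = new LinkedHashMap<>(); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {                // send the final partial batch
            sink.accept(batch);
            batches++;
        }
        return batches;
    }
}
```

A call might look like `BatchedPutAll.putAllInBatches(data, 1000, region::putAll)`, where 1000 is only a placeholder. A suitable batch size depends on entry size and server load and has to be found by testing.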

Note: Apply these changes iteratively, and test each one thoroughly before relying on it in production.
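
The server-side settings from steps 1 to 3 above can be collected in one place so that they are changed and tested together. The sketch below only builds a `java.util.Properties` object; the class name, method name, and all three values are illustrative placeholders, not recommendations. The same keys could instead be set in gemfire.properties.

```java
import java.util.Properties;

// Hypothetical holder for the server-side settings discussed in this article.
final class ServerTuningSketch {
    /** Builds the three properties; every value here is a placeholder to tune by testing. */
    static Properties tunedServerProperties() {
        Properties props = new Properties();
        // Step 1: larger async queue, in megabytes (placeholder value).
        props.setProperty("async-max-queue-size", "64");
        // Step 2: non-zero async distribution timeout, in milliseconds (placeholder value).
        props.setProperty("async-distribution-timeout", "5000");
        // Step 3: let the server clean up unresponsive clients.
        props.setProperty("remove-unresponsive-client", "true");
        return props;
    }
}
```

On a server, the result would typically be passed to `new CacheFactory(props)`. For step 4, the client-side read timeout is set separately on the client pool, for example through `ClientCacheFactory.setPoolReadTimeout(int)`, with a value chosen from observed GC pause and network delay figures.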