GemFire: Deadlock During Bucket Assignment Due to Circular Wait on Primary Election During Recovery

Article ID: 413988

Updated On:

Products

VMware Tanzu GemFire

Issue/Introduction

During region recovery or startup, users may observe that bucket assignment is stalled and partitioned regions fail to fully initialize. The affected servers appear to be waiting indefinitely for primary bucket election to complete. Cluster logs may show messages indicating delays or waiting threads related to bucket creation or primary assignment.

This condition can cause the affected region to remain in a recovering or initializing state for an extended period, preventing normal data operations on that region.
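
One way to look for the waiting threads mentioned above is to capture a thread dump from each affected server (for example with jstack) and filter it for waiting threads whose stacks mention bucket creation or primary election. The following is a minimal sketch, not a GemFire tool: the class name StuckThreadScanner and the keyword list passed to it are assumptions, and the keywords should be replaced with frame names that actually appear in your own dumps.

```java
import java.util.ArrayList;
import java.util.List;

/** Finds waiting threads in jstack-style thread dump text whose stack mentions a keyword. */
final class StuckThreadScanner {

    /**
     * Returns the header line of every thread that is WAITING or TIMED_WAITING
     * and has at least one stack line containing one of the given keywords.
     * The keywords are supplied by the caller; none are built in.
     */
    static List<String> findWaitingThreads(String threadDump, List<String> keywords) {
        List<String> matches = new ArrayList<>();
        // jstack separates thread entries with a blank line.
        for (String entry : threadDump.split("\\R\\s*\\R")) {
            String[] lines = entry.trim().split("\\R");
            // A thread entry starts with the quoted thread name, then the state line.
            if (lines.length < 2 || !lines[0].startsWith("\"")) {
                continue;
            }
            boolean waiting = lines[1].contains("WAITING"); // also matches TIMED_WAITING
            boolean relevant = false;
            for (int i = 2; i < lines.length && !relevant; i++) {
                for (String keyword : keywords) {
                    if (lines[i].contains(keyword)) {
                        relevant = true;
                        break;
                    }
                }
            }
            if (waiting && relevant) {
                matches.add(lines[0]);
            }
        }
        return matches;
    }
}
```

Running the scan on dumps taken a few minutes apart, and comparing which threads appear in both, helps separate a genuine stall from a thread that is only briefly waiting.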

Environment

GemFire 10.1.4 and earlier.

Cause

This issue is caused by a deadlock during bucket assignment among multiple cache servers participating in partitioned region recovery.

  • Each server is attempting to elect primaries for different sets of buckets.
  • Due to timing and synchronization during recovery, two or more servers may end up waiting on primary election responses from each other for the same or dependent buckets.
  • As a result, bucket assignment stalls, causing the region to never complete initialization.

Additionally, incomplete or delayed PartitionListener callbacks can prolong the deadlock if they hold locks or perform blocking operations.
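
To make the circular wait described above concrete, it can be modelled as a cycle in a wait-for graph, where an edge from one server to another means "is waiting on a primary election response from". The sketch below is purely illustrative and does not reflect GemFire's internal implementation; the class WaitForGraph and the server names used with it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy wait-for graph: an edge A -> B means "member A is waiting on a reply from member B". */
final class WaitForGraph {
    private final Map<String, List<String>> waitsOn = new HashMap<>();

    void addWait(String waiter, String target) {
        waitsOn.computeIfAbsent(waiter, k -> new ArrayList<>()).add(target);
    }

    /** Returns true if some chain of waits leads back to a member already on that chain. */
    boolean hasCircularWait() {
        Set<String> finished = new HashSet<>();
        for (String member : waitsOn.keySet()) {
            if (visit(member, new HashSet<>(), finished)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first search; 'onPath' holds the members on the current chain of waits.
    private boolean visit(String member, Set<String> onPath, Set<String> finished) {
        if (onPath.contains(member)) {
            return true; // reached a member that is already on this chain: a cycle
        }
        if (finished.contains(member)) {
            return false; // fully explored earlier, no cycle reachable from here
        }
        onPath.add(member);
        for (String next : waitsOn.getOrDefault(member, Collections.emptyList())) {
            if (visit(next, onPath, finished)) {
                return true;
            }
        }
        onPath.remove(member);
        finished.add(member);
        return false;
    }
}
```

For example, if server1 waits on server2 and server2 waits on server3, there is no cycle and recovery can still progress. Once server3 also waits on server1, hasCircularWait() returns true, which corresponds to the stalled bucket assignment described in this article.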

Resolution

If you encounter this condition in your environment:

  1. Check server logs for threads stuck in bucket creation or primary election routines.
  2. Review any PartitionListener implementations to ensure they handle exceptions gracefully and do not hold locks or perform blocking operations.
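
As an illustration of the second step, one common pattern is to hand callback work to a separate thread and catch any exception there, so the thread that invokes the listener is never blocked or interrupted by user code. The sketch below is self-contained and deliberately does not import GemFire classes; SafeBucketCallback and its methods are hypothetical names, and whether asynchronous handling is acceptable depends on what your listener needs to do.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

/** Runs per-bucket callback work off the calling thread and contains any failure. */
final class SafeBucketCallback {
    // A single worker thread keeps events in the order they were submitted.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final AtomicInteger failures = new AtomicInteger();
    private final IntConsumer work;

    SafeBucketCallback(IntConsumer work) {
        this.work = work;
    }

    /** Intended to be called from a listener callback; returns without waiting for the work. */
    void onBucketEvent(int bucketId) {
        worker.execute(() -> {
            try {
                work.accept(bucketId);
            } catch (RuntimeException e) {
                // Record and log the failure instead of letting it escape the callback.
                failures.incrementAndGet();
                System.err.println("Callback failed for bucket " + bucketId + ": " + e);
            }
        });
    }

    int failureCount() {
        return failures.get();
    }

    /** Stops accepting work and waits up to the timeout for queued work to finish. */
    boolean shutdown(long timeoutMillis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In a GemFire application, a PartitionListener callback such as afterPrimary(int bucketId) could delegate to onBucketEvent(bucketId) and return immediately. The work passed to the constructor should still avoid taking locks that other cache operations depend on.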

Additional Information

This fix will be available starting with GemFire 10.1.5, and in 10.2.x releases.

These versions include improvements that help detect and recover from situations where servers get stuck waiting on each other during bucket recovery:

  • Better handling of primary bucket election: prevents servers from getting into a circular wait while deciding which one should be the primary.

  • Safer handling of PartitionListener callbacks: ensures that any errors or delays in user code do not block bucket assignment or cause the recovery process to hang.