During region recovery or startup, users may observe that bucket assignment is stalled and partitioned regions fail to fully initialize. The affected servers appear to be waiting indefinitely for primary bucket election to complete. Cluster logs may show messages indicating delays or waiting threads related to bucket creation or primary assignment.
This condition can cause the affected region to remain in a recovering or initializing state for an extended period, preventing normal data operations on that region.
GemFire 10.1.4 and earlier.
This issue is caused by a deadlock during bucket assignment: multiple cache servers participating in partitioned region recovery end up waiting on each other while primary bucket election is in progress.
Additionally, PartitionListener callbacks that do not complete promptly can prolong the deadlock if they hold locks or perform blocking operations.
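To illustrate the point about callbacks, the following is a minimal sketch of a listener that returns immediately and hands slow work to a separate thread. It does not use the GemFire API: `PrimaryCallback`, `OffloadingListener`, and the `slowWork` parameter are hypothetical names introduced here, and `PrimaryCallback.afterPrimary(int)` is only a stand-in for the corresponding `PartitionListener` callback. Adapting it to a real listener is left as an assumption to verify against the API documentation for your GemFire version.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

public class NonBlockingListenerSketch {

    /** Hypothetical stand-in for a partition listener's afterPrimary callback. */
    public interface PrimaryCallback {
        void afterPrimary(int bucketId);
    }

    /** Queues slow user work on its own thread so the callback returns at once. */
    public static class OffloadingListener implements PrimaryCallback {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();
        private final AtomicInteger handled = new AtomicInteger();
        private final IntConsumer slowWork;

        public OffloadingListener(IntConsumer slowWork) {
            this.slowWork = slowWork;
        }

        @Override
        public void afterPrimary(int bucketId) {
            // The calling thread takes no locks and does not wait for the work.
            worker.submit(() -> runSafely(bucketId));
        }

        private void runSafely(int bucketId) {
            try {
                slowWork.accept(bucketId);
                handled.incrementAndGet();
            } catch (RuntimeException e) {
                // Errors in user code are contained on the worker thread.
                System.err.println("work for bucket " + bucketId + " failed: " + e);
            }
        }

        /** Number of buckets whose slow work has completed successfully. */
        public int handledCount() {
            return handled.get();
        }

        /** Stops accepting work and waits for queued work; false on timeout or interrupt. */
        public boolean shutdownAndWait(long timeoutMillis) {
            worker.shutdown();
            try {
                return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        OffloadingListener listener = new OffloadingListener(bucketId -> {
            try {
                Thread.sleep(200); // placeholder for slow user code
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        listener.afterPrimary(3); // returns without waiting for the sleep
        System.out.println("callback returned; completed so far: " + listener.handledCount());
        listener.shutdownAndWait(5000);
        System.out.println("completed after shutdown: " + listener.handledCount());
    }
}
```

The single-thread executor keeps the work ordered; whether ordering matters, and how failures should be reported, depends on what the listener actually does.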
If you encounter this condition in your environment, upgrade to a release that contains the fix.
The fix will be available starting with GemFire 10.1.5, and in 10.2.x releases.
These versions include improvements that help detect and recover from situations where servers get stuck waiting on each other during bucket recovery:
Better handling of primary bucket election: prevents servers from getting into a circular wait while deciding which one should be the primary.
Safer handling of PartitionListener callbacks: ensures that any errors or delays in user code do not block bucket assignment or cause the recovery process to hang.