After scaling out or replacing servers in a GemFire cluster, startup took several hours even though the disk stores of the removed servers had been deleted. Log messages such as the following indicate delays in persistent bucket recovery:
“[<PersistentBucketRecoverer for region R>] Region T (and any colocated sub-regions) has potentially stale data. Buckets [AA, XX, BB, YYY, ZZZ, CCC, DDD, EEE] are waiting for another offline member to recover the latest data.”
Applicable to all supported GemFire versions.
GemFire tracks persistent bucket ownership through cluster configuration metadata and disk store metadata files (.crf, .drf, .if). When servers are replaced or scaled out and disk stores are deleted manually, the cluster configuration still retains ownership metadata pointing to the old (now offline) servers.
As a result, restarting members wait for the offline servers recorded in that metadata to return with the latest bucket data, and startup of the affected regions (and any colocated sub-regions) is blocked until those members rejoin or the stale metadata is cleared.
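Whether the cluster is still waiting on disk stores that belonged to removed servers can be confirmed from gfsh. The snippet below is a minimal illustration using standard gfsh commands; the locator host and port are placeholders for your environment:

    gfsh> connect --locator=locator-host[10334]
    gfsh> show missing-disk-stores

If the output lists disk store IDs, hosts, and directories for servers that no longer exist, the remaining members are waiting for those disk stores before completing bucket recovery.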
This is a known issue and will be resolved in a future release of Tanzu GemFire. Subscribe to this article to receive updates. Until the fix is available, follow the steps below to prevent prolonged startup times during scaling or server replacement.
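For illustration only (the authoritative procedure is the set of steps in this article), one common way to stop remaining members from waiting on a server that has been permanently removed is to revoke its missing disk store with standard gfsh commands. Revoke only disk stores of members that will never rejoin, because any data held only by them is discarded. The --id value is taken from the show missing-disk-stores output:

    gfsh> show missing-disk-stores
    gfsh> revoke missing-disk-store --id=<disk-store-id>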