GemFire: Operations to Region Slow Due to Single Timer Design

Products

VMware Tanzu GemFire, GemFire, Pivotal GemFire, VMware GemFire Enterprise Edition

Issue/Introduction

Customers can set a Time To Live (TTL) on one or more regions while the cluster is running in production. (Understanding Time To Live in GemFire: https://docs.vmware.com/en/VMware-GemFire/10.1/gf/developing-expiration-chapter_overview.html)
While the configuration is being applied, users may see operations (for example, puts) slow down significantly.

In the collected artifacts, CPU and memory consumption looks erratic, with none of the regular patterns seen in a stable cluster.

Search tags:
time to live,
entryTimeToLive(),
getAttributesMutator().setEntryTimeToLive(),
getAttributesMutator().setCustomEntryTimeToLive.

Environment

According to the R&D team, this can happen in any current version of GemFire.
Product version: 9.15, 10.0, 10.1
OS: ALL

Cause

The investigation traced the problem to the single-Timer design.

When a custom expiration is set, we have to iterate over all entries in the region and reset their expiration. Expiration is handled using Java's Timers, and Timer.schedule takes a lock when it is called.

Currently, we effectively use a single Timer for the whole cache so if multiple regions are changing their expiration simultaneously, they will all end up contending for this one lock.

Additionally, when a custom expiration is being set, the loop that iterates over the entries is fairly tight, so the thread doing this work appears to hold the Timer lock for a relatively long time. This is why users see their puts taking longer even when only a single region is having its expiration updated. The thread doing the put also needs to set expiration and therefore also contends for the Timer lock. Since we use a single Timer across the whole cache, it makes no difference which region is having its expiration changed and which region the put is performed on: both end up contending for the same lock.
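The contention pattern described above can be illustrated with a small plain-Java sketch. This is not GemFire code: it only uses java.util.Timer, and the class name SharedTimerContention, its methods, and the task counts are hypothetical names chosen for illustration. One thread stands in for the expiration mutation (a tight loop of Timer.schedule calls) while the calling thread stands in for puts that also need to schedule expiration on the same Timer.

```java
import java.util.Timer;
import java.util.TimerTask;

// Illustration only: mimics many schedule() calls sharing one Timer.
class SharedTimerContention {

    // A task that does nothing; only the cost of scheduling it matters here.
    static TimerTask noop() {
        return new TimerTask() {
            @Override public void run() { }
        };
    }

    // Schedules `count` far-future tasks and returns the elapsed nanoseconds.
    static long timeSchedules(Timer timer, int count) {
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            // Each schedule() call synchronizes on the Timer's internal queue.
            timer.schedule(noop(), 3_600_000L);
        }
        return System.nanoTime() - start;
    }

    // Returns {nanos without a competing thread, nanos with one}.
    static long[] measure(int putCount, int bulkCount) {
        Timer shared = new Timer("shared-expiry-timer", true);
        try {
            long quiet = timeSchedules(shared, putCount);
            // Stand-in for the tight loop that resets expiration on every entry.
            Thread mutator = new Thread(() -> timeSchedules(shared, bulkCount));
            mutator.start();
            // Stand-in for puts that must also schedule on the same Timer.
            long contended = timeSchedules(shared, putCount);
            try {
                mutator.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return new long[] { quiet, contended };
        } finally {
            shared.cancel();
        }
    }

    public static void main(String[] args) {
        long[] nanos = measure(50_000, 2_000_000);
        System.out.println("no competing thread: " + nanos[0] / 1_000 + " us");
        System.out.println("with competing thread: " + nanos[1] / 1_000 + " us");
    }
}
```

The measured numbers depend on the JVM, JIT warm-up, and hardware, so treat them as indicative only. The point is that both loops go through the same lock, regardless of which "region" the work belongs to.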

Resolution

To determine whether the reported cluster is affected by this issue, collect the following information so that it is available if escalation is needed:
- Artifacts collected both while the configuration change is being applied and while no change is being applied, together with a timeline of the tests (including the time zone). Make sure the artifacts cover the entire test period.
- How the customer applies the change. If any code is involved, collect it too.
- The name of the region where the customer observes the delay.

Also check this KB article, which may be related: https://knowledge.broadcom.com/external/article/294463/gemfire-set-expirythreads-when-using-exp.html

Long term:
R&D has confirmed that this will be improved, probably by giving each region its own Timer. Follow-up Jira tickets have already been filed, but there is no ETA yet. The change is planned for upcoming patch releases of both 9.15 and 10.1.

Short term:
It is strongly recommended not to apply configuration changes while production workloads are running. Schedule a maintenance window, or at least make the change during off-peak hours.

However, some users run GemFire clusters in 24x7 environments and cannot schedule maintenance frequently. While the expiration attributes are being mutated, cache write operations such as put/remove on any region that uses expiration will be impacted. Once the mutation is complete, performance should return to normal. Mutating one region at a time should reduce resource contention across threads, giving put/remove operations a better chance of completing sooner.
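One way to apply the "one region at a time" advice is sketched below. It is a minimal illustration under stated assumptions, not a supported tool: the class SequentialTtlChange, the TtlMutator interface, the region names, and the pause between regions are all hypothetical. In a real member, the TtlMutator implementation would look up the region and call region.getAttributesMutator().setEntryTimeToLive(new ExpirationAttributes(ttlSeconds, ExpirationAction.DESTROY)); the sketch keeps that call behind an interface so the code runs without GemFire on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: apply TTL changes strictly one region after another.
class SequentialTtlChange {

    // Hypothetical hook. A real implementation would wrap the GemFire call
    // region.getAttributesMutator().setEntryTimeToLive(...), which is assumed
    // here to return once that region's entries have been rescheduled.
    interface TtlMutator {
        void apply(String regionName, int ttlSeconds);
    }

    // Applies each change in map order, pausing between regions so that
    // puts/removes get a window without a mutation competing for the Timer.
    static List<String> applyOneAtATime(Map<String, Integer> ttlByRegion,
                                        TtlMutator mutator,
                                        long pauseMillis) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Integer> change : ttlByRegion.entrySet()) {
            mutator.apply(change.getKey(), change.getValue());
            completed.add(change.getKey());
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop early; the remaining regions stay unchanged
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        Map<String, Integer> plan = new LinkedHashMap<>(); // keeps insertion order
        plan.put("orders", 3600);   // hypothetical region names and TTLs
        plan.put("sessions", 900);
        applyOneAtATime(plan,
            (name, ttl) -> System.out.println("mutating " + name + " ttl=" + ttl + "s"),
            1_000L);
    }
}
```

The pause length is a judgment call; the only property the sketch relies on is that no two regions are mutated concurrently.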