HA function execution internals
search cancel

HA function execution internals

book

Article ID: 294028

calendar_today

Updated On:

Products

VMware Tanzu Gemfire

Issue/Introduction

The purpose of this article is to explain the details of how function HA works.


Resolution

Highly-available functions (i.e. those that return Function.isHA as true) are required to be idempotent because they will be retried on all the nodes after a node failure. This is due to the fact that buckets can be rebalanced and the re-execution may need to run on a different set of servers than before, so the system cannot simply exclude just the members on which function was successful. Even if the set of servers is same, the routing object to server mapping can be different due to primary bucket rebalancing.

Moreover, the execution cannot differentiate between a set of changes from one server and those from another in the custom ResultCollector. For this reason ResultCollector.clearResults must be implemented correctly to clear all the results. In addition, for functions that have some side-affects (e.g. updates), it is always possible that the failure occured mid-function on a server. Hence, that again means partial retry is not possible, since the re-execution will have no knowledge of how much went through on the previous try on any failed servers.

For a function that is not idempotent, some possible options are:

  1. Design the function in a way so that it can accurately determine what needs to be re-done in case of partial retry. In other words, keep track of all changes that went through (e.g. recording in an external file/DB) --- especially for a partial apply on a failed node. Then, skip those changes in a retry when FunctionContext.isPossibleDuplicate() is true. Again, you should note that the set of servers (or routing object to server mapping) in a retry may be completely different from the previous try due to bucket/primary rebalancing, so the function cannot rely on skipping successful nodes and must to go through each change to determine if it went through or not.

    The FunctionContext#isPossibleDuplicate() method is used to identify whether this is a re-execution and needs to be used in conjunction with Function#isHA() as true. It specifies whether the function is eligible for re-execution.

    When a failure occurs (such as an execution error or member crash while executing), the system responds as follows:

    1. Wait for all calls to return,
    2. set a boolean indicating a re-execution is being done,
    3. call the result collector’s clearResults method, and
    4. execute the function
  2. Use transactions with isHA() as false. Then rollback and retry the entire transaction at the application level. This assumes that any side-affects done by the function are also transactional and will all be rolled back on failure. GemFire transactions are still required to be colocated, unlike in GemFireXD, so this is not a valid option if the function is not routed to a single node with colocated data.

Additional Information

Applies To

GemFire 7 and 8