Why shouldn't one run more than two process engines for a Clarity environment's BG services / cluster when there are lots of processes running at the same time?
Is there any whitepaper on why this is not recommended and on the impact of having multiple process engines?
Release: All
Component: Clarity Processes
This KB explains the potential pitfalls of having too many Process Engines. This is definitely a case where more is not better. Here's why:
2. In addition, there are 2 other thread pools that are not shown in the Process Engine statistics.
3. Next, here is some additional information about how the process engines choose which processes to run:
The process engine's load balancing capabilities are essentially a foot race. When an update event is generated (e.g., a user updates a process-enabled object instance such as a project), a row containing the details of the event is inserted into the NMS_MESSAGES table. A multicast message is then sent to notify the process engines that an event has occurred.
Each process engine has a thread called the NMS Message Receiver that receives this multicast message and acts on it. The first thing it does is execute a SELECT against NMS_MESSAGES to retrieve any undelivered messages. It "delivers" them to itself by inserting rows into NMS_MESSAGE_DELIVERY keyed to that process engine.
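For illustration, here is a minimal JDBC sketch of that "deliver to self" step. Only the table names NMS_MESSAGES and NMS_MESSAGE_DELIVERY come from the description above; the column names (MSG_ID, PROCESS_ENGINE_ID) and the exact SQL are assumptions made for the sketch, not Clarity's actual schema or implementation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of the "deliver to self" step performed by the NMS Message Receiver.
// MSG_ID and PROCESS_ENGINE_ID are assumed column names for illustration only.
public class NmsMessageReceiver {

    private final Connection conn;
    private final String engineId; // identifies this process engine instance

    NmsMessageReceiver(Connection conn, String engineId) {
        this.conn = conn;
        this.engineId = engineId;
    }

    /** Called when the UDP multicast "new event" notification arrives. */
    void onMulticastNotification() throws SQLException {
        // 1. Find messages in NMS_MESSAGES not yet delivered to THIS engine.
        String select =
            "SELECT m.MSG_ID FROM NMS_MESSAGES m " +
            "WHERE NOT EXISTS (SELECT 1 FROM NMS_MESSAGE_DELIVERY d " +
            "                  WHERE d.MSG_ID = m.MSG_ID AND d.PROCESS_ENGINE_ID = ?)";
        // 2. "Deliver" each message by inserting a delivery row keyed to this engine.
        String insert =
            "INSERT INTO NMS_MESSAGE_DELIVERY (MSG_ID, PROCESS_ENGINE_ID) VALUES (?, ?)";

        try (PreparedStatement sel = conn.prepareStatement(select)) {
            sel.setString(1, engineId);
            try (ResultSet rs = sel.executeQuery();
                 PreparedStatement ins = conn.prepareStatement(insert)) {
                while (rs.next()) {
                    ins.setLong(1, rs.getLong("MSG_ID"));
                    ins.setString(2, engineId);
                    ins.executeUpdate();
                    // The delivered message is then queued for evaluation (next step).
                }
            }
        }
    }
}
```

Note that every process engine in the cluster runs this same delivery logic against the same tables, which is where the contention described below begins.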
Once the messages are delivered, each process engine iterates over the new messages it received and determines whether any action should be taken. In each case it must evaluate the event, so in a cluster with N process engines, all N engines evaluate every new message simultaneously. For INSERT or UPDATE events, the process engine iterates through the list of eligible Process Definitions for that type of event and evaluates each one's Start Condition expression (which also performs a rights check via an SQL query). This evaluation is dispatched to the Condition Evaluation thread pool, which can grow to 15 threads. If the condition evaluates to true, the process engine attempts to start the process by inserting a row into the BPM_EM_EVENT_PROCESS_LOCKS table. Whichever process engine succeeds in inserting the lock is the one that starts the process.
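The race can be pictured with a short Java sketch. The ProcessDefinition type, the column names, and the fixed 15-thread pool are assumptions made for illustration (the real Condition Evaluation pool grows on demand up to 15 threads); the point it shows is that every engine dispatches the same evaluations, and a unique-key insert into BPM_EM_EVENT_PROCESS_LOCKS decides the single winner.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the start-condition evaluation and the lock "foot race".
// A real implementation would use a connection pool rather than one shared Connection.
public class ConditionEvaluator {

    // At most 15 start conditions are evaluated concurrently per process engine.
    private final ExecutorService conditionPool = Executors.newFixedThreadPool(15);
    private final Connection conn;

    ConditionEvaluator(Connection conn) {
        this.conn = conn;
    }

    /** Every engine in the cluster runs this for every new message it delivered to itself. */
    void onEvent(long eventId, List<ProcessDefinition> eligibleDefinitions) {
        for (ProcessDefinition def : eligibleDefinitions) {
            conditionPool.submit(() -> {
                // Evaluating the start condition runs SQL, including a rights check.
                if (def.startConditionIsTrue(eventId)) {
                    tryToWinStartLock(eventId, def);
                }
            });
        }
    }

    private void tryToWinStartLock(long eventId, ProcessDefinition def) {
        // All engines race to insert the same lock row; a unique constraint means
        // only one INSERT succeeds, and that engine starts the process.
        String sql = "INSERT INTO BPM_EM_EVENT_PROCESS_LOCKS (EVENT_ID, PROCESS_DEF_ID) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, eventId);
            ps.setLong(2, def.id());
            ps.executeUpdate();
            startProcess(eventId, def);      // this engine won the race
        } catch (SQLException duplicateKey) {
            // Another process engine inserted the lock first; do nothing.
        }
    }

    private void startProcess(long eventId, ProcessDefinition def) { /* start the process here */ }

    interface ProcessDefinition {
        long id();
        boolean startConditionIsTrue(long eventId);
    }
}
```

The engines that lose the race have still paid the full cost of delivering the message and evaluating the start conditions; only the final insert is cheap to lose, which is why adding engines multiplies work without adding throughput.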
NOTE: The more process engines are present, the greater the database contention on the NMS_MESSAGES, NMS_MESSAGE_DELIVERY, and BPM_EM_EVENT_PROCESS_LOCKS tables. This affects the database as a whole as well, because all of the process engines are simultaneously evaluating the start conditions of the process definitions tied to the object type receiving the event.
Now we have enough information to explain why we tell customers to run a maximum of two Process Engines. Two process engines provide redundancy in case one of them goes down, while keeping the contention described above far lower than a larger cluster would.
Let's walk through the math that shows why you should not run more than two process engines, using an example scenario.
EXAMPLE SCENARIO:
For this scenario, suppose we have an interface into Clarity that imports Requisitions, and that the environment is running 8 process engines. Each hour, a batch of requisitions is XOGged into Clarity, and each batch contains 1000 requisition instances for update.
In the Clarity design, there are 25 active process definitions defined on the Requisition object for "update", each with its own start condition. Of the 25 definitions, only 1 will ever evaluate to true at any given time, so only a single process is started.
When the XOG run happens, 1000 requisition instances are XOGged in for update. The requisition XOG is very quick because the object is relatively lightweight. Each instance update fires a corresponding BPM "update" event, which amounts to one row inserted into NMS_MESSAGES per instance updated plus a multicast event message sent by the app instance that handled the XOG request.
This results in 1000 rows inserted into NMS_MESSAGES and 1000 multicast messages received by all 8 process engines simultaneously. The NMS Message Receiver thread on each process engine picks up these lightweight UDP multicast messages almost immediately and begins retrieving rows from NMS_MESSAGES to deliver to itself by inserting rows into NMS_MESSAGE_DELIVERY. It also begins dispatching the event messages pulled from the database to the Condition Evaluation thread pool. On each process engine, the queue for the Condition Evaluation thread pool quickly grows to handle the backlog of 1000 new messages, resulting in all 15 possible Condition Evaluation threads being active on every engine.
Let's look at the numbers now:
8 process engines * 15 Condition Evaluation threads = 120 concurrent threads evaluating the 1000 new messages
The load climbs to 120 concurrent evaluation threads and stays there for some time, because there are not just 1000 new messages to process. There are 25 process definitions to evaluate for each of the 1000 messages:
8 process engines * 25 process start conditions * 1000 messages = 200,000 total queries executed.
Assuming the database server has 24 CPUs and that each condition expression query takes 200 ms (CPU bound on the database), we are looking at roughly 667 minutes of total execution time. 667 minutes / 24 CPUs ≈ 28 minutes of solid CPU activity on the database server just to handle the incoming load of messages. The load average on the database server can rise above 120 and stay there for a long period of time. Other parts of Clarity then suffer CPU starvation on the database server, and all activities are affected and slowed down.
As you can see, this is not a good situation.
Let's look at the same scenario with only 2 process engines:
2 PEs * 15 eval threads = 30 concurrent threads
2 PEs * 25 start conds * 1000 messages = 50,000 queries
50,000 queries * 200 ms ≈ 167 minutes of CPU; 167 minutes / 24 CPUs ≈ 6.9 minutes of solid CPU activity on the database server (whose load average could still sit at 30 or higher during this time, due to the 30 concurrent eval threads)
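If you want to re-run this comparison with your own numbers, here is a small sketch that reproduces the arithmetic above; the 25 definitions, 1000 messages, 200 ms per query, 24 database CPUs, and 15 evaluation threads per engine are the same assumptions used in the text.

```java
// Reproduces the back-of-the-envelope numbers above for any engine count.
public class ProcessEngineLoadEstimate {

    static void estimate(int engines, int defsPerEvent, int messages,
                         double secondsPerQuery, int dbCpus, int evalThreadsPerEngine) {
        int concurrentThreads = engines * evalThreadsPerEngine;
        long totalQueries = (long) engines * defsPerEvent * messages;
        double totalCpuMinutes = totalQueries * secondsPerQuery / 60.0;
        double wallClockMinutes = totalCpuMinutes / dbCpus;

        System.out.printf(
            "%d engines: %d concurrent eval threads, %,d queries, %.0f CPU-minutes, ~%.1f minutes on %d CPUs%n",
            engines, concurrentThreads, totalQueries, totalCpuMinutes, wallClockMinutes, dbCpus);
    }

    public static void main(String[] args) {
        estimate(8, 25, 1000, 0.2, 24, 15); // -> 120 threads, 200,000 queries, ~667 CPU-min, ~27.8 min
        estimate(2, 25, 1000, 0.2, 24, 15); // ->  30 threads,  50,000 queries, ~167 CPU-min,  ~6.9 min
    }
}
```

Every term in the formula except the CPU count scales with the number of engines, so the database load grows linearly with every engine you add.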
6.9 minutes of CPU at a load of 30 is still bad. Here is what you can do to reduce this load:
It turns out the 2 biggest variables at play are: