NSE Registration and Processing in SMP 8.x

Products

IT Management Suite

Issue/Introduction

Problem statement

On heavily populated systems, there has always been a bottleneck between the NSE registration logic and the NSE processing logic. It is caused by the shared statistics table [EventQueue], which both sides populate and modify.

Environment

ITMS 8.0, 8.1, 8.5, 8.6

Resolution

Solution

In ITMS 8.0, we removed the absolute (real-time) synchronization of the [EventQueue] table for both sides. Instead, the table is updated periodically and, in some corner cases, by an explicit call to the “update query”.

Pros

There will be no SQL locks between the registration/processing logic sides
Transactions will be shorter and faster
Fewer consistency problems

Cons

Statistics table content will not be real-time

EventQueueEntry – processing state column

“ProcessingState” of the event entry in this table will only have two values:

0 – pending
1 – processing

Event Registration without metadata

Registration is a simple insert of a single row into the “EventQueueEntry” table.

Note: This call is executed without a transaction, but with deadlock retries.
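The single-row registration with retries can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the SMP database, the column names other than “ProcessingState” are hypothetical, and `sqlite3.OperationalError` stands in for whatever deadlock error the real database reports. The function name `register_event` and the retry parameters are likewise invented for the example.

```python
import sqlite3
import time

PENDING = 0  # "ProcessingState" value for a newly registered event


def make_queue_db():
    # isolation_level=None selects autocommit mode: each statement commits
    # on its own and no transaction is opened implicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE EventQueueEntry ("
        " Id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " QueueId INTEGER NOT NULL,"       # hypothetical column
        " Payload TEXT NOT NULL,"          # hypothetical column
        " ProcessingState INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def register_event(conn, queue_id, payload, max_retries=3, delay_s=0.05):
    """Insert one pending row, retrying when the database reports contention."""
    for attempt in range(1, max_retries + 1):
        try:
            cur = conn.execute(
                "INSERT INTO EventQueueEntry (QueueId, Payload, ProcessingState)"
                " VALUES (?, ?, ?)",
                (queue_id, payload, PENDING),
            )
            return cur.lastrowid
        except sqlite3.OperationalError:
            # Stand-in for a deadlock error: back off, then retry.
            if attempt == max_retries:
                raise
            time.sleep(delay_s * attempt)
```

Because only one row is written, a failed attempt leaves nothing to roll back, which is why a retry loop is sufficient here and no transaction is needed.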

Event Registration with metadata

It will be an insert of two rows:

1 row into “EventQueueEntry”
1 row into “EventQueueEntryMetadata”

Note: Because it’s a multiple-row operation, the call is always executed inside a transaction.
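A minimal sketch of the two-row registration, again assuming an in-memory SQLite database as a stand-in for the SMP database. The column names and the function `register_event_with_metadata` are hypothetical. The point is only that both inserts succeed or fail together.

```python
import sqlite3


def make_queue_db():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only the table names come from this article.
    conn.executescript(
        """
        CREATE TABLE EventQueueEntry (
            Id INTEGER PRIMARY KEY AUTOINCREMENT,
            QueueId INTEGER NOT NULL,
            Payload TEXT NOT NULL,
            ProcessingState INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE EventQueueEntryMetadata (
            EntryId INTEGER NOT NULL REFERENCES EventQueueEntry(Id),
            Metadata TEXT NOT NULL
        );
        """
    )
    return conn


def register_event_with_metadata(conn, queue_id, payload, metadata):
    """Insert the entry row and its metadata row in one transaction."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute(
            "INSERT INTO EventQueueEntry (QueueId, Payload) VALUES (?, ?)",
            (queue_id, payload),
        )
        entry_id = cur.lastrowid
        conn.execute(
            "INSERT INTO EventQueueEntryMetadata (EntryId, Metadata) VALUES (?, ?)",
            (entry_id, metadata),
        )
    return entry_id
```

If the metadata insert fails, the entry row is rolled back as well, so the queue never contains an entry whose metadata is missing.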

Event Pull for processing

A queue pull is triggered in the following cases:

When an event is registered: the SMP API notifies Dispatcher by setting global events, which forces Dispatcher to wake up.
When Dispatcher receives events from a queue: it queries the same queue again once all events of the current round are dispatched.
When Dispatcher has “completed” some events for particular queues: these queues are queried in the current round.
When Dispatcher is notified that some of the worker threads are free.
When Dispatcher enters “idle” mode for the first time (no work after the last event was processed).

When Dispatcher performs a pull, it will:

Not update the statistics table
Pull pending events from the DB in “batch” mode; the batch size is 4 times the number of the queue’s processing threads
Mark pulled events as “ProcessingState = 1” until completion
Execute the pull only after pending event completions have been processed

Note: Because it’s a multiple-row operation, the call is always executed inside a transaction.
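The pull described above can be sketched as follows. This is a sketch under assumptions, not the Dispatcher’s actual query: SQLite stands in for the SMP database, the “QueueId” column and the function `pull_batch` are hypothetical, and selecting the oldest rows first is an assumption about ordering. It shows the 4x batch sizing and the marking of pulled rows as processing inside one transaction, without touching any statistics table.

```python
import sqlite3

PENDING, PROCESSING = 0, 1
BATCH_FACTOR = 4  # batch size = 4 x number of queue processing threads


def make_queue_db(pending_events):
    """Create a hypothetical queue table holding N pending events for queue 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EventQueueEntry ("
        " Id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " QueueId INTEGER NOT NULL,"
        " ProcessingState INTEGER NOT NULL DEFAULT 0)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO EventQueueEntry (QueueId) VALUES (?)",
            [(1,)] * pending_events,
        )
    return conn


def pull_batch(conn, queue_id, worker_threads):
    """Pull up to 4 x worker_threads pending events and mark them as processing."""
    batch_size = BATCH_FACTOR * worker_threads
    with conn:  # commit on success, roll back on error
        rows = conn.execute(
            "SELECT Id FROM EventQueueEntry"
            " WHERE QueueId = ? AND ProcessingState = ?"
            " ORDER BY Id LIMIT ?",  # oldest first: an assumption
            (queue_id, PENDING, batch_size),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE EventQueueEntry SET ProcessingState = ? WHERE Id = ?",
            [(PROCESSING, event_id) for event_id in ids],
        )
    return ids
```

With 10 pending events and 2 worker threads, the first pull returns 8 events and the second returns the remaining 2.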

Fixing stale queue entries

It is possible (but highly unlikely) that a service crash or a code bug leaves events marked as “processing” while Dispatcher has no running activity for them.

To fix such stale queue entries, a timed action (adjustable, Core Setting: “EvtQueueFixupMinutes”) looks for queues that have no processing activity in Dispatcher.

By default, the action runs every 10 minutes. The fix-up timeout is checked in these situations:

When the Dispatcher starts
At the end of any Dispatcher wake-up
On every “idle” cycle

Note: Fix-up is performed for a particular queue only when the queue has no pending events, no completions are queued, and no workers are active.
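The fix-up conditions can be expressed as a small decision function. This sketch covers only the eligibility check. What the fix-up then does to the stale rows is not described here and is not modeled. The class `QueueActivity` and the function names are hypothetical; only the 10-minute default and the three per-queue conditions come from the text above.

```python
from dataclasses import dataclass

FIXUP_MINUTES_DEFAULT = 10  # default of Core Setting "EvtQueueFixupMinutes"


@dataclass
class QueueActivity:
    """Dispatcher's in-memory view of one queue (hypothetical structure)."""
    pending_events: int
    queued_completions: int
    active_workers: int


def fixup_due(minutes_since_last_fixup, fixup_minutes=FIXUP_MINUTES_DEFAULT):
    """Checked at Dispatcher start, at the end of a wake-up, and on idle cycles."""
    return minutes_since_last_fixup >= fixup_minutes


def queue_eligible_for_fixup(activity):
    """A queue is fixed up only when Dispatcher has no activity at all for it."""
    return (
        activity.pending_events == 0
        and activity.queued_completions == 0
        and activity.active_workers == 0
    )


def queues_to_fix(queues, minutes_since_last_fixup,
                  fixup_minutes=FIXUP_MINUTES_DEFAULT):
    """Return the names of queues that should be fixed up right now."""
    if not fixup_due(minutes_since_last_fixup, fixup_minutes):
        return []
    return [name for name, act in queues.items() if queue_eligible_for_fixup(act)]
```

For example, a queue with one active worker is skipped even when the timeout has elapsed, because its “processing” entries may still be legitimately in flight.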

Event Completion

The Dispatcher completes events in batch mode
Completion is done in the same thread as “Query Candidates”

How do we update statistics

Since Dispatcher’s logic also depends on the knowledge of what is really “pending”, it is still a good idea to update the statistics in the “EventQueue” table.

The queues are recalculated automatically by all parties, both registration and processing, but no longer in the same queries as before (spRegister.. / spGetCandidates…).

Recalculation occurs in the following situations:

Periodically: every 5 minutes (adjustable, Core Setting: “EvtQueueReloadTimeout”)
In flood situations: when the size of the processed data exceeds 250 MB (adjustable, Core Setting: “EvtQueueReloadFlowMB”), either on the registration side or on the processing side

Note: The main Core service (AeXSvc) always uses a reload timeout that is 1 minute shorter than the Core Setting value, which eventually makes it the “update master”.
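The two recalculation triggers and the shorter AeXSvc timeout can be sketched as a simple policy object. The class `ReloadPolicy` and its method names are hypothetical. The defaults (5 minutes, 250 MB, 1 minute less for the main service) are the values stated above, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass
class ReloadPolicy:
    reload_timeout_min: float = 5.0    # Core Setting "EvtQueueReloadTimeout"
    flood_threshold_mb: float = 250.0  # Core Setting "EvtQueueReloadFlowMB"
    is_main_service: bool = False      # True only for AeXSvc

    def effective_timeout_min(self):
        # AeXSvc uses a timeout 1 minute shorter than the configured value,
        # so it normally reaches the timeout before any other process.
        if self.is_main_service:
            return self.reload_timeout_min - 1
        return self.reload_timeout_min

    def should_recalculate(self, minutes_since_update, bytes_since_update):
        timed_out = minutes_since_update >= self.effective_timeout_min()
        flooded = bytes_since_update > self.flood_threshold_mb * MB
        return timed_out or flooded
```

With the defaults, AeXSvc recalculates after 4 minutes while every other process would wait 5, so the other processes usually receive fresh statistics before their own timeout fires.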

Cross-process statistics update

There are several event sources:

Main service (AeXSvc – hosting Receiver and Dispatcher)
IIS (w3wp) – hosting web API for the agent to post events
Task Management Service (AtrsHost) – some tasks produce events as their result
Any other custom code in any process, which will call SMP API to register events

All these processes can be a significant source of events, so logic is needed to minimize the load caused by statistics recalculation.

It is accomplished by the “update master” approach:

When one of the sources detects either an update timeout or a flood, it recalculates the statistics in the EventQueue table
The result of the recalculation is sent to all parties that are interested in this information (a party is bound automatically, for example, when it registers an event)
When the parties receive the statistics, they update their own local cache and reset the “update” & “flood” counters.

Effectively, there should be only one “update master”, AeXSvc, but any of the parties can trigger the logic if it becomes a flooding source.
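The “update master” flow can be illustrated with an in-process simulation. The real mechanism works across processes and its transport is not described here, so everything in this sketch is hypothetical: the classes `StatsChannel` and `Party`, their methods, and the `recalculate` callback that stands in for the query against the EventQueue table.

```python
class StatsChannel:
    """Delivers recalculated statistics to every bound party."""

    def __init__(self):
        self._parties = []

    def bind(self, party):
        self._parties.append(party)

    def publish(self, stats):
        for party in self._parties:
            party.on_statistics(stats)


class Party:
    """One event source, for example AeXSvc, w3wp, or AtrsHost."""

    def __init__(self, name, channel, recalculate):
        self.name = name
        self.channel = channel
        self.recalculate = recalculate  # stand-in for the EventQueue query
        self.cache = {}
        self.minutes_since_update = 0.0  # "update" counter
        self.bytes_since_update = 0      # "flood" counter
        channel.bind(self)

    def trigger_recalculation(self):
        # Called by whichever party detects a timeout or a flood first.
        stats = self.recalculate()
        self.channel.publish(stats)

    def on_statistics(self, stats):
        # Every party, including the sender, refreshes its local cache
        # and resets its counters, so it will not recalculate again soon.
        self.cache = dict(stats)
        self.minutes_since_update = 0.0
        self.bytes_since_update = 0
```

Because receiving statistics resets each party’s counters, one recalculation by the master postpones the timeout in all other processes, which keeps the number of recalculation queries low.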