
NSE Registration and Processing in SMP 8.x


Article ID: 150463



Products

IT Management Suite

Issue/Introduction

Problem statement

On heavily populated systems there has always been a bottleneck between the NSE registration logic and the NSE processing logic. It is related to the shared statistics table [EventQueue], which both sides populate and modify.

Environment

ITMS 8.0, 8.1, 8.5, 8.6

Resolution

Solution

In ITMS 8.0 we removed the absolute (real-time) synchronization of the [EventQueue] table on both sides. Instead, the table is updated periodically and, in some corner cases, by an explicit call to the “update query”.

Pros

  • There will be no SQL locks between the registration and processing sides
  • Transactions will be shorter and faster
  • Fewer consistency problems

Cons

  • Statistics table content will not be real-time

     

EventQueueEntry – processing state column

The “ProcessingState” column of an event entry in this table has only two values:

  • 0 – pending
  • 1 – processing

Event Registration without metadata

Registration is a simple insert of a single row into the “EventQueueEntry” table.

Note: This call is executed without a transaction, but with deadlock retries.
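
The retry behavior described in the note can be sketched as follows. This is only an illustration, not the SMP implementation: DeadlockError, run_with_deadlock_retries, the attempt limit and the simulated insert are all hypothetical names and values.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's 'deadlock victim' error."""


def run_with_deadlock_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Run operation(); retry only when it fails with DeadlockError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            time.sleep(delay_seconds)


# Simulated single-row insert that loses a deadlock twice, then succeeds.
attempts = []


def insert_event_row():
    attempts.append(1)
    if len(attempts) < 3:
        raise DeadlockError("chosen as deadlock victim")
    return "inserted"


result = run_with_deadlock_retries(insert_event_row)
```

Because the operation is a single-row insert, repeating it after a deadlock is safe: a failed attempt leaves nothing behind that would need to be rolled back.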

Event Registration with metadata

Registration is an insert of two rows:

  • 1 row into “EventQueueEntry”
  • 1 row into “EventQueueEntryMetadata”

Note: Because this is a multi-row operation, the call always runs inside a transaction.
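
A minimal sketch of why the two-row registration needs a transaction. It uses an in-memory SQLite database as a stand-in for the SMP database; the column names (Id, QueueId, EntryId, Metadata) and the function register_event_with_metadata are assumptions made for illustration, only the table names come from this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EventQueueEntry (
    Id INTEGER PRIMARY KEY,
    QueueId INTEGER NOT NULL,
    ProcessingState INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE EventQueueEntryMetadata (
    EntryId INTEGER NOT NULL,
    Metadata TEXT NOT NULL
);
""")


def register_event_with_metadata(conn, queue_id, metadata):
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        cur = conn.execute(
            "INSERT INTO EventQueueEntry (QueueId, ProcessingState) VALUES (?, 0)",
            (queue_id,),
        )
        conn.execute(
            "INSERT INTO EventQueueEntryMetadata (EntryId, Metadata) VALUES (?, ?)",
            (cur.lastrowid, metadata),
        )
    return cur.lastrowid


def count_rows(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


entry_id = register_event_with_metadata(conn, 1, "<meta/>")

# A failing metadata insert must not leave an orphaned queue entry behind.
try:
    register_event_with_metadata(conn, 1, None)
except sqlite3.IntegrityError:
    pass
```

After the failed second call, both tables still contain exactly one row: the entry row inserted in the failed attempt was rolled back together with the rejected metadata row.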

Event Pull for processing

A queue pull is triggered in the following cases:

  1. An event is registered: the SMP API notifies Dispatcher by setting global events, which forces Dispatcher to wake up.
  2. Dispatcher has received events from a queue: once all events of the current round are dispatched, it queries the same queue again.
  3. Dispatcher has “completed” some events for particular queues: these queues are queried in the same round.
  4. Dispatcher is notified that some of the worker threads are free.
  5. Dispatcher enters “idle” mode for the first time (no work after the last event was processed).

When Dispatcher performs a pull, it will:

  • Not update the statistics table
  • Pull pending events from the DB in “batch” mode, with a batch size of four times the number of processing threads of the queue
  • Mark pulled events with “ProcessingState = 1” until completion
  • Run the pull only after pending event completions have been processed

Note: Because this is a multi-row operation, the call always runs inside a transaction.
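
The batch pull can be sketched like this, again with in-memory SQLite as a stand-in. The function pull_pending, the column names and the query shape are assumptions; the actual SMP stored procedures are not shown in this article. Only the 4x batch factor and the ProcessingState values are taken from the text above.

```python
import sqlite3

BATCH_FACTOR = 4  # batch size = 4 x processing threads of the queue


def pull_pending(conn, queue_id, thread_count):
    """Select up to one batch of pending events and mark them as processing."""
    limit = BATCH_FACTOR * thread_count
    with conn:  # the state changes for the whole batch are committed as one unit
        rows = conn.execute(
            "SELECT Id FROM EventQueueEntry "
            "WHERE QueueId = ? AND ProcessingState = 0 "
            "ORDER BY Id LIMIT ?",
            (queue_id, limit),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE EventQueueEntry SET ProcessingState = 1 WHERE Id = ?",
            [(event_id,) for event_id in ids],
        )
    return ids


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EventQueueEntry ("
    "Id INTEGER PRIMARY KEY, QueueId INTEGER, ProcessingState INTEGER)"
)
with conn:
    conn.executemany(
        "INSERT INTO EventQueueEntry (QueueId, ProcessingState) VALUES (?, 0)",
        [(1,)] * 10,
    )

first = pull_pending(conn, 1, thread_count=2)   # up to 8 events
second = pull_pending(conn, 1, thread_count=2)  # the remaining 2
third = pull_pending(conn, 1, thread_count=2)   # nothing left to pull
```

With ten pending events and two processing threads, the first pull takes eight events, the second takes the remaining two, and the third returns an empty batch because every entry is already marked as processing.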

Fixing stale queue entries

It is possible (but highly unlikely) that a service crash or a code bug leaves events marked as “processing” while Dispatcher has no activity running for them.

To fix stale queue entries, a timed action (adjustable, Core Setting: “EvtQueueFixupMinutes”) looks for queues without any processing activity in Dispatcher.

The default interval for this action is 10 minutes. The fix-up timeout is checked under these conditions:

  1. When the Dispatcher starts
  2. On any Dispatcher wake-up: at the end, it checks for a fix-up timeout
  3. On every “idle” cycle, which also checks for a fix-up timeout

Note: Fix-up is performed for a particular queue only when the queue has no pending events, no completions are queued, and no workers are active.
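
The fix-up decision can be expressed as a small predicate. This is a sketch under assumptions: QueueActivity, its field names and should_fix_up are hypothetical, and the "pending" count is read here as events that Dispatcher already holds for the queue.

```python
from dataclasses import dataclass


@dataclass
class QueueActivity:
    """Hypothetical snapshot of what Dispatcher knows about one queue."""
    pending_events: int
    queued_completions: int
    active_workers: int


def should_fix_up(queue, minutes_since_last_fixup, fixup_minutes=10):
    """True when the fix-up timeout has elapsed and the queue is fully quiet."""
    timeout_elapsed = minutes_since_last_fixup >= fixup_minutes
    queue_is_quiet = (
        queue.pending_events == 0
        and queue.queued_completions == 0
        and queue.active_workers == 0
    )
    return timeout_elapsed and queue_is_quiet


quiet = QueueActivity(pending_events=0, queued_completions=0, active_workers=0)
busy = QueueActivity(pending_events=0, queued_completions=0, active_workers=1)
```

Here should_fix_up(quiet, 10) is true, while should_fix_up(busy, 10) and should_fix_up(quiet, 9) are false: a queue with an active worker is never touched, and a quiet queue is left alone until the timeout has elapsed.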

Event Completion

  • Dispatcher completes events in batch mode
  • Completion is done in the same thread as “Query Candidates”

How we update statistics

Since Dispatcher’s logic also depends on the knowledge of what is really “pending”, it is still a good idea to update the statistics in the “EventQueue” table.

All parties, both registration and processing, recalculate the queue statistics automatically, but no longer in the same queries as before (spRegister.. / spGetCandidates…).

Recalculation occurs in the following situations:

  • Periodically: every 5 minutes (adjustable, Core Setting: “EvtQueueReloadTimeout”)
  • In flood situations: when the processed data size is over 250 MB (adjustable, Core Setting: “EvtQueueReloadFlowMB”), on either the registration or the processing side

Note: The main Core service (AeXSvc) always uses a reload timeout that is 1 minute shorter than the Core Setting value, which eventually makes it the “update master”.
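
A sketch of the two recalculation triggers and of why the shorter timeout makes AeXSvc the usual “update master”. ReloadPolicy and its members are hypothetical names; only the defaults (5 minutes, 250 MB, 1 minute less for AeXSvc) come from this article.

```python
from dataclasses import dataclass


@dataclass
class ReloadPolicy:
    """Hypothetical per-process view of the two reload Core Settings."""
    reload_timeout_minutes: float = 5    # EvtQueueReloadTimeout
    reload_flow_mb: float = 250          # EvtQueueReloadFlowMB
    is_main_service: bool = False        # True only for AeXSvc

    def effective_timeout(self):
        # AeXSvc uses a timeout 1 minute shorter than the configured value.
        if self.is_main_service:
            return self.reload_timeout_minutes - 1
        return self.reload_timeout_minutes

    def should_recalculate(self, minutes_since_update, processed_mb):
        timed_out = minutes_since_update >= self.effective_timeout()
        flooded = processed_mb > self.reload_flow_mb
        return timed_out or flooded


aexsvc = ReloadPolicy(is_main_service=True)
w3wp = ReloadPolicy()
```

Four minutes after the last update, aexsvc.should_recalculate(4, 0) is already true while w3wp.should_recalculate(4, 0) is still false, so the main service normally recalculates first. A flood changes that: w3wp.should_recalculate(1, 300) is true regardless of the timer.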

Cross process statistics update

There are several event sources:

  • Main service (AeXSvc – hosting Receiver and Dispatcher)
  • IIS (w3wp) – hosting the web API that agents use to post events
  • Task Management service (AtrsHost) – some tasks produce events as a result
  • Any other custom code in any process that calls the SMP API to register events

All these processes can be a significant source of events, so there has to be logic that minimizes the load caused by statistics recalculation.

It is accomplished by the “update master” approach:

  • When one of the sources detects either an update timeout or a flood, it recalculates the statistics in the EventQueue table
  • The result of the recalculation is sent to all parties interested in this information (a party is bound automatically, for example, when it registers an event)
  • When parties receive the statistics, they update their own local cache and reset their “update” and “flood” counters.

Effectively, there should be only one “update master”, AeXSvc, but in some cases any of the parties can trigger the logic if it becomes a flooding source.
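
The “update master” flow can be sketched as a simple broadcast. The Party class, its counters and broadcast_statistics are hypothetical; how SMP actually delivers statistics between processes is not described in this article.

```python
class Party:
    """Hypothetical event source (for example AeXSvc, w3wp or AtrsHost)."""

    def __init__(self, name):
        self.name = name
        self.cached_stats = {}
        self.minutes_since_update = 0.0  # "update" counter
        self.flow_mb = 0.0               # "flood" counter

    def on_statistics(self, stats):
        # Receiving fresh statistics refreshes the cache and resets both counters,
        # so this party will not trigger its own recalculation right away.
        self.cached_stats = dict(stats)
        self.minutes_since_update = 0.0
        self.flow_mb = 0.0


def broadcast_statistics(stats, parties):
    """Send one recalculation result to every bound party."""
    for party in parties:
        party.on_statistics(stats)


parties = [Party("AeXSvc"), Party("w3wp"), Party("AtrsHost")]
parties[1].minutes_since_update = 4.5  # w3wp is close to its own timeout
parties[2].flow_mb = 120.0

# AeXSvc times out first, recalculates, and shares the result.
broadcast_statistics({"queue-1": 42}, parties)
```

After the broadcast every party holds the same cached statistics and starts counting from zero again, which is what keeps the other processes from running their own recalculation shortly after the master did.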