alarm messages are being processed out of order

Article ID: 262494

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

We have some pre-processor/auto-operator rules in NAS which operate on alarms as they come in, and we have noticed that in some cases, alarms are not being processed in the same order that they were sent.

This causes problems for us because some of the scripts which run against these alarms depend on information which is obtained by running scripts against alarms which came just before them.

If alarms come out of order, the scripts will fail to populate the appropriate data.

Additionally, we are concerned that during an outage, a large flood of alarms will come in at once, and in some cases a "clear" alarm will be received before an "open", resulting in a situation where an alarm that should have cleared is left open erroneously.

Environment

Release: 20.4

Cause

This is related to the way alarm_enrichment handles messages: it uses a thread pool, and the threads read messages off the bus and put them back on concurrently, which can cause the order to become jumbled.

In the majority of circumstances, a slight variation in the order of processing alarms will not be a major issue. Alarms come in at intervals, and two "related" alarms should generally not come in simultaneously; by the time a "clear" arrives, the alarm it is clearing has usually been present for at least one minute.

However, in the case of an outage or queue backup, where a number of alarms wait in the hub queues and are then released suddenly, related alarms may be consumed in the same cycle and processed out of order.

Additionally, the UIM message bus queues are expected in most cases to be FIFO (First In, First Out) queues, and under heavy load alarm_enrichment can violate this principle through the behavior described above.
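The reordering described above can be illustrated with a small, self-contained simulation. This is only a sketch of the general thread-pool effect, not alarm_enrichment's actual code: the `relay` and `open_alarms` functions, the `disk-1` alarm ID, and the artificial per-message delays are all assumptions made up for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def relay(messages, threadpool_size):
    """Process each message on a worker thread and return the order in
    which messages were put back on the (simulated) bus."""
    forwarded = []
    lock = Lock()

    def enrich(message):
        kind, alarm_id, delay = message
        time.sleep(delay)  # stand-in for per-message enrichment work
        with lock:
            forwarded.append((kind, alarm_id))

    # Leaving the "with" block waits for all submitted work to finish.
    with ThreadPoolExecutor(max_workers=threadpool_size) as pool:
        for message in messages:
            pool.submit(enrich, message)
    return forwarded


def open_alarms(forwarded):
    """Apply open/clear messages in the given order; return IDs left open."""
    open_ids = set()
    for kind, alarm_id in forwarded:
        if kind == "open":
            open_ids.add(alarm_id)
        else:
            open_ids.discard(alarm_id)
    return open_ids


# Hypothetical burst: an "open" that is slow to process, followed
# immediately by its "clear".
burst = [("open", "disk-1", 0.2), ("clear", "disk-1", 0.0)]

multi = relay(burst, threadpool_size=4)   # workers run concurrently
single = relay(burst, threadpool_size=1)  # one worker, FIFO

print(multi, open_alarms(multi))
print(single, open_alarms(single))
```

With several workers, the "clear" should overtake the slower "open", so the alarm is left open after both messages have been applied. With a single worker, the messages are forwarded in the order they were submitted and the alarm ends up cleared.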

Resolution

Broadcom Support/Engineering are investigating the implications of this and the possibility of resolving it long-term, but in the meantime a workaround is available.

It is not yet known whether the workaround will impact performance, but in our internal testing it does not appear to have a significant impact on alarm throughput.

The workaround is to add the following key/value pair to the <setup> section of nas.cfg (alarm_enrichment is configured via the nas probe configuration file):

threadpool_size = 1

This makes the alarm routing process single-threaded, so messages are processed in the order received.

After making this change, deactivate alarm_enrichment and NAS, then activate them again to apply the setting.
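To confirm that the key landed directly under <setup> and not in a nested section, a small check along the following lines may help. It is a rough sketch that assumes nas.cfg uses the usual `<section>` ... `</section>` layout with `key = value` lines. The `setup_value` helper, the `example_subsection` name, and the sample text are hypothetical and are not taken from a real nas.cfg.

```python
def setup_value(cfg_text, key):
    """Return the value of `key` found directly under <setup>, else None."""
    stack = []  # names of the sections currently open
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("</") and line.endswith(">"):
            if stack:
                stack.pop()          # closing tag: leave the section
        elif line.startswith("<") and line.endswith(">"):
            stack.append(line[1:-1])  # opening tag: enter the section
        elif "=" in line and stack == ["setup"]:
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None


# Made-up sample: the nested key must be ignored, the top-level one found.
sample = """
<setup>
   <example_subsection>
      threadpool_size = 8
   </example_subsection>
   threadpool_size = 1
</setup>
"""

print(setup_value(sample, "threadpool_size"))
```

In practice the text would be read from the nas.cfg file on the system where the nas probe runs. A result other than `1` suggests the key is missing or was placed in the wrong section.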