Alarms are delayed and alarm_enrichment queue is backed up with messages.

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

The alarm_enrichment queue is being processed, but not fast enough to keep up, so the number of queued messages keeps growing. Alarms in the console lag behind: newly displayed alarms were generated hours earlier.

Environment

Release: UIM 20.x/23.4x or higher
Component: Alarm Enrichment

Cause

Several factors can cause this queuing problem, such as a problem alarm, an out-of-memory condition, or another processing error.

Check the alarm_enrichment log for errors:

C:\Program Files (x86)\Nimsoft\probes\service\nas\alarm_enrichment\alarm_enrichment.log

Samples:

Jul 13 15:35:55:299 [pool-1-thread-20, alarm_enrichment] InboundAlarmProcessor:Error: processing alarm: cc3... 
Jul 13 15:35:55:299 [pool-1-thread-20, alarm_enrichment] java.lang.OutOfMemoryError: GC overhead limit exceeded 

No element found for key level
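
To see how often each of the sample errors above occurs, the log can be scanned with a short script. The following is a minimal sketch, not a supported tool: the signature names and the function names count_signatures and scan_log are illustrative, and the patterns are taken only from the sample lines shown in this article.

```python
import re
from collections import Counter

# Patterns copied from the sample log lines in this article.
# The dictionary keys are illustrative labels, not UIM terminology.
SIGNATURES = {
    "out_of_memory": re.compile(r"java\.lang\.OutOfMemoryError"),
    "alarm_processing_error": re.compile(r"InboundAlarmProcessor:Error"),
    "missing_level_key": re.compile(r"No element found for key level"),
}


def count_signatures(lines):
    """Count how many lines match each known error signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path):
    """Read a log file line by line and count the error signatures in it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_signatures(handle)
```

Call scan_log with the full path to alarm_enrichment.log. A high out_of_memory count points to the memory section of the Resolution, while missing_level_key points to a problem alarm in the queue.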

Resolution

Error "No element found for key level"

In the case of a problem alarm, the DrNimBUS utility (Tools->DrNimBUS) can be used to locate and delete the problem alarm message from the queue.

Use the DrNimBUS tool to fetch unread messages from the alarm_enrichment queue. Messages that are missing any field other than md5sum, or that do not have the expected alarm message Subject, could be holding up the queue. To access the DrNimBUS tool on the Primary hub, select Start button->All Programs->Nimsoft Monitoring and then Tools->DrNimBUS.

With the tool open on the primary hub, choose a queue, and then fetch single messages using the + button.

Considerations:

1) It can help to reduce the bulk size on nas/alarm_enrichment to 20 or 10. If the probe still cannot load the first batch (for example, the first ten messages), then one or more of those messages is causing the issue.

2) Messages with a subject of QOS_MESSAGE have the wrong subject and cannot be processed by alarm_enrichment. Fetch the QOS_MESSAGE messages out of the queue until the subjects are 'alarm', or filter on 'alarm' as the Subject.

3) The DrNimBUS tool permanently marks messages as read. Once messages have been viewed in DrNimBUS, they can no longer be delivered to the probe.

4) Using DrNimBUS to read the alarm_enrichment queue may disconnect alarm_enrichment. In this case, the queue will show as yellow, and the probe will need to be restarted.
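
The checks described above (unexpected Subject, fields missing other than md5sum) can be expressed as a small triage function. This is a sketch under stated assumptions: it assumes each fetched message has been copied by hand into a Python dictionary, the function name triage is illustrative, and the set of expected fields must be supplied by you from a known-good alarm message in your environment. It does not talk to DrNimBUS or the hub.

```python
def triage(message, expected_fields, expected_subject="alarm"):
    """Return a list of reasons a message looks suspect (empty if none).

    message: dict of field name -> value, including a "subject" entry.
    expected_fields: field names seen on a known-good alarm message
    (an assumption supplied by the caller, not a documented schema).
    """
    reasons = []
    subject = message.get("subject")
    if subject != expected_subject:
        reasons.append(f"unexpected subject: {subject}")
    # md5sum is allowed to be absent, so it is excluded from the check.
    missing = sorted(set(expected_fields) - set(message) - {"md5sum"})
    if missing:
        reasons.append("missing fields: " + ", ".join(missing))
    return reasons
```

An empty list means the message passes both checks. Any non-empty result marks a candidate for the kind of message that could be holding up the queue.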

Memory issues

Memory allocation for alarm_enrichment can be modified by selecting the probe in the IM Console, then right-click > Edit > Arguments.

The default is:

-Xms64m -Xmx1024m -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar

Check system memory usage (for example, via Task Manager on Windows) to ensure there is enough free RAM. At least 4 GB should remain free after the increase.

Provided there is enough free memory, the probe allocation can be increased to:

-Xms1024m -Xmx2048m -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar
or
-Xms2048m -Xmx4096m -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar

After making the change, deactivate and then activate the probe.
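
The free-memory check can be worked through as simple arithmetic. The sketch below parses the -Xmx value from the current and proposed argument strings and tests whether 4 GB would remain free if the probe used its entire additional heap. The function names are illustrative, and the calculation is a conservative approximation: a JVM also uses memory outside the heap, so treat the result as a rough guide.

```python
import re

# Multipliers for the size suffixes accepted by -Xmx (k, m, g).
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def max_heap_bytes(arguments):
    """Extract the -Xmx value from a JVM argument string, in bytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", arguments)
    if not match:
        raise ValueError("no -Xmx option found")
    value, unit = match.groups()
    return int(value) * _UNITS.get(unit.lower(), 1)


def increase_is_safe(free_bytes, current_args, proposed_args,
                     reserve_bytes=4 * 1024**3):
    """True if at least reserve_bytes stay free after the heap increase."""
    extra = max_heap_bytes(proposed_args) - max_heap_bytes(current_args)
    return free_bytes - extra >= reserve_bytes
```

For example, moving from the default -Xmx1024m to -Xmx4096m adds up to 3 GB of heap, so the host needs at least 7 GB free beforehand to keep 4 GB in reserve.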

alarm_enrichment queue backing up

The alarm_enrichment probe reads alarm messages from the bus in batches whose size is set by the bulk_read_size value in nas.cfg.

This value should be considered a maximum. For example, if you set it to 1200, the alarm_enrichment probe will request 1200 messages in each transmission. However, if fewer messages exist on the bus, it will not wait until 1200 messages are pending: each transmission reads "up to" 1200 messages.
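
The "up to" behavior can be illustrated with a toy model. This is only a sketch of the semantics described above, using an in-memory deque in place of the message bus. It is not the probe's implementation, and the function name bulk_read is illustrative.

```python
from collections import deque


def bulk_read(queue, bulk_read_size):
    """Remove and return up to bulk_read_size messages without waiting."""
    count = min(bulk_read_size, len(queue))
    return [queue.popleft() for _ in range(count)]


# With 500 messages pending and a bulk size of 1200, one read takes all 500.
pending = deque(range(500))
batch = bulk_read(pending, 1200)
print(len(batch), len(pending))
```

With 3000 messages pending instead, the same call would return 1200 messages and leave 1800 for later transmissions.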

After consuming the alarm messages, alarm_enrichment outputs alarm2 messages, which are consumed by the NAS probe.

The bulk_read_size key applies only to alarm_enrichment's consumption of alarm messages from the bus. The NAS itself always reads alarm2 messages from the message bus one at a time, as fast as it can. Raising the bulk size for NAS would bring no advantage, because alarms are processed sequentially as quickly as they reach the pre-processor, so the NAS is hardcoded to always request a bulk of '1' when active.

The local ATTACH queue on the Primary hub nas will always display a Bulk Size of 1 when it is Activated.

If this queue is backing up, it is not necessarily slowing down alarm throughput. Check whether the "sent" value is increasing as quickly as the "queued" value and whether "queued" is trending down over time. If so, and alarms are being received in a timely fashion, the backlog may not represent a problem.

If the queue continues to grow over time instead of trending downward, focus on either reducing the inflow of alarms or improving NAS throughput by reducing pre-processing, optimizing Auto-Operators and scripts, and so on. It is not possible to increase the speed at which NAS reads from the bus, and even if it were, doing so would not improve the speed at which NAS processes alarms, because they would still be queued internally waiting for the pre-processor.
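
Whether the queue is draining or growing can be judged from a few readings of the "queued" and "sent" values noted at regular intervals. The sketch below fits a least-squares slope to the queued readings and checks that sent is advancing. It assumes the readings are entered manually as (queued, sent) pairs, and the function names are illustrative.

```python
def slope(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def queue_is_draining(samples):
    """samples: (queued, sent) pairs taken at regular intervals.

    True when queued trends downward and sent has advanced.
    """
    queued = [q for q, _ in samples]
    sent = [s for _, s in samples]
    return slope(queued) < 0 and sent[-1] > sent[0]
```

A False result over a representative period suggests the backlog is growing, in which case the focus shifts to reducing alarm inflow or NAS pre-processing as described above.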