alarm_enrichment queue keeps growing and creates a heavy backlog



Article ID: 272987



Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)

  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)

  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

The alarm_enrichment queue keeps increasing, which delays the clearing of alarms in the UIM console (alarm subconsole).

Environment

  • Release: 20.4

Resolution

  • nas preprocessing rules were in place for Hitachi alarms, but they were no longer needed. The associated connections were no longer valid (possibly because of an IP address change or some other environmental factor), so they kept failing and generated many "Failed to" alarms, which added unnecessary overhead to the nas. We deactivated those connections to stop the "Failed to" alarms and then safely disabled the nas preprocessing -> Hitachi alarm exclusion rules.

  • However, the main issue in this case was monitoring governance:

    • cdm iostat alarms were being generated every 5 minutes for every cdm instance and then cleared.

    • Disk alarms were also being generated and cleared within a few minutes.

    • url_response probe profiles with alarm counts in the 'tens of thousands' were also generating alarms.

    • Too many alarms had very high alarm counts, and alarms were being generated very frequently. Many of them were cleared and then generated again (a vicious cycle), as seen via the DrNimBUS Sniffer.

  • In addition, we increased the alarm_enrichment Java min/max memory to 3g and 5g, respectively, and cold started the nas and alarm_enrichment probes.

  • We had to empty the queue many times until we had worked through the alarms that were queued up and still backing up, as shown by the hub queue.

    • Throughout these efforts, alarm_enrichment kept processing messages, and as we chipped away at some of the messages with high alarm counts, it was able to send an increasing number of messages.

  • Eventually, the alarm_enrichment queue repeatedly fell to 0 and/or remained at a low value, so the problem was resolved.
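
The generate/clear cycle described under monitoring governance can be quantified offline once alarm traffic has been captured. The following is a minimal Python sketch, not a UIM tool: it assumes the captured messages have already been exported into simple records, and the AlarmEvent layout, its field names, and the rank_flapping helper are all hypothetical. It counts how often each alarm is raised and then cleared, so the noisiest sources (for example cdm iostat or disk checks) can be reviewed first.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class AlarmEvent(NamedTuple):
    """One captured alarm message (hypothetical export layout, not a UIM format)."""
    source: str    # robot or host that sent the alarm
    probe: str     # probe name, e.g. "cdm" or "url_response"
    target: str    # what was checked, e.g. a disk, "iostat", or a profile name
    cleared: bool  # True for a clear message, False for a raise


AlarmKey = tuple[str, str, str]


def rank_flapping(events: Iterable[AlarmEvent],
                  min_cycles: int = 3) -> list[tuple[AlarmKey, int]]:
    """Return (alarm key, cycle count) pairs, noisiest first.

    A cycle is one raise followed later by a clear for the same
    (source, probe, target). Events must be in chronological order.
    """
    open_alarms: set[AlarmKey] = set()
    cycles: Counter[AlarmKey] = Counter()
    for ev in events:
        key = (ev.source, ev.probe, ev.target)
        if not ev.cleared:
            # Alarm raised (or repeated while already open).
            open_alarms.add(key)
        elif key in open_alarms:
            # Matching clear closes the cycle; stray clears are ignored.
            open_alarms.discard(key)
            cycles[key] += 1
    return [(key, n) for key, n in cycles.most_common() if n >= min_cycles]


if __name__ == "__main__":
    # Made-up sample data: one flapping cdm check and one alarm that stays open.
    sample = [
        AlarmEvent("host-a", "cdm", "iostat", cleared=False),
        AlarmEvent("host-a", "cdm", "iostat", cleared=True),
    ] * 4
    sample.append(AlarmEvent("host-b", "url_response", "profile-1", cleared=False))
    for key, count in rank_flapping(sample):
        print(key, count)
```

Alarms that rank high in such a list are candidates for threshold tuning or for a longer sampling interval, which reduces the load on alarm_enrichment at the source instead of repeatedly emptying the queue.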