alarm_enrichment queue getting stuck and queuing alarms



Article ID: 250600


Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

The alarm_enrichment queue spikes on the same interval as the prepopulation query interval ('enrichment_cache_prepopulation_interval_in_seconds').

- Alarm messages are not being sent to the ticketing system, so the customer is unaware of potential problems.
- An alarm_enrichment queue graph created in Grafana shows a consistently high peak for alarm_enrichment, which then clears up on its own.

Environment

  • Release : 20.4
  • Component : UIM NAS
  • Environment:
  • UIM 20.4 on Windows with Hub/Robot 9.35
  • nas/ae 9.34Hf1
  • ems 10.29
  • discovery_server 20.41
  • data_engine with Partitioning Enabled
  • UIM Database - Microsoft SQL Server 2017 (RTM-CU26) (KB5005226)
  • Approximately 6,500 alarm updates generated every hour
  • Alarm enrichment performed on most, if not all, incoming alarms
  • NAS local .db files: database.db is 41 MB, transactionlog.db is 1.1 GB
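
As a rough sanity check, the steady-state alarm load above can be compared with the drain rate observed when the queue flurry was reproduced (over 20,000 alarms cleared in under 4 minutes, per the Resolution section). The figures come from this article; the calculation itself is only back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope check using figures from this article:
# steady-state alarm load vs. observed alarm_enrichment drain rate.

alarms_per_hour = 6_500                      # approx. alarm updates per hour
steady_rate = alarms_per_hour / 3600         # ~1.8 alarms/sec arriving

drained_alarms = 20_000                      # queue flurry cleared in test
drain_seconds = 4 * 60                       # under 4 minutes
drain_rate = drained_alarms / drain_seconds  # ~83 alarms/sec processed

# Processing capacity exceeds steady-state load by a wide margin,
# consistent with the conclusion that queue backups clear efficiently.
headroom = drain_rate / steady_rate
print(f"steady-state: {steady_rate:.1f}/s, "
      f"drain: {drain_rate:.1f}/s, headroom: ~{headroom:.0f}x")
```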

Cause

- Grafana graph interval setting

- Need for monitoring governance

Resolution

The nas and alarm_enrichment probes are working as designed, and queue backups are being cleared efficiently.

Monitoring governance should be taken into consideration:

- Implement methods of reducing unnecessary or high-frequency alarm noise.

- Improve the decommissioning process for robots.

- Reduce monitoring interval frequency when and where appropriate.

- Review and follow the KB article linked below to help eliminate or reduce robot-inactive alarms:

What is the time interval or delay for a Robot inactive alert?
https://knowledge.broadcom.com/external/article/139793

- Grafana graphs: upon investigation and discussion, it was found that the shared graphs of alarm queue backups did not provide the level of detail required to conclude there was a problem processing the queue. The graph data was aggregated and therefore misleading. This was proven by reproducing the queue alarm flurry and watching the queue being processed: processing was very fast, clearing over 20,000 alarms in under 4 minutes.
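
To illustrate why aggregated graph data can mislead here, the sketch below (hypothetical numbers modeled on this case, not actual UIM or Grafana data) compares a fine-grained view of a short-lived queue burst with a coarse averaged view. A burst that fully drains in under 4 minutes still produces a large nonzero value in a 5-minute averaged bucket, which reads like a sustained backlog:

```python
# Hypothetical illustration (not a UIM API): how coarse graph
# aggregation can misrepresent a short-lived queue backup.

def queue_depth(t_seconds, burst_size=20_000, drain_rate=100):
    """Queue depth at time t: a burst of 20k alarms arrives at t=0
    and drains at ~100 alarms/sec (cleared in ~200s, under 4 min)."""
    return max(0, burst_size - drain_rate * t_seconds)

# Fine-grained view (10 s samples): spike is visible and drains quickly.
fine = [queue_depth(t) for t in range(0, 300, 10)]

# Coarse view (one 5-minute bucket, averaged): a single large value
# that suggests a sustained backlog which never actually existed.
coarse_avg = sum(queue_depth(t) for t in range(0, 300)) / 300

print(f"peak depth:        {fine[0]}")
print(f"depth after 4 min: {queue_depth(240)}")
print(f"5-minute average:  {coarse_avg:.0f}")
```

In the fine-grained view the queue is empty well before the 4-minute mark; the 5-minute average is still several thousand, which is the kind of "consistent high peak" the shared Grafana graphs showed.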
Monitoring Governance
As we have seen in other UIM environments, ultra-high counts (200-500k) of frequently issued alarms can also be due to gaps in the change-control or decommissioning process and/or a lack of interdepartmental communication. When VMs/robots are decommissioned over time without being removed from the UIM domain, these alarms continue to be generated, and their frequency and total count will continue to grow.

Additional Information

How to use a nas preprocessing rule to prevent updates to the transactionlog.db and nas_transaction_log
https://knowledge.broadcom.com/external/article?articleId=250612