Threshold monitoring settings for DA Rollups not working in CA Performance Management (CAPM)

Article ID: 264086

Products

Network Observability CA Performance Management

Issue/Introduction

A threshold profile is set up to check and alarm on the number of completed rollup processes:

Threshold profile setting [Event rules]

 Metric family:      Data Aggregator rollup calculation time
 Metric:             number of completed rollup processes
 Duration (seconds): 300
 Window (seconds):   300
 Importance:         Major

[Conditions for Violation]

 Event type:     Violation
 Metric:         number of completed rollup processes
 Threshold:      <= 0.0 "less than or equal to"
 Condition type: fixed value

[Conditions for clearing the violation]

 Event type:      Clear
 Metric:          number of completed rollup processes
 Threshold:       > 0.0 "greater than"
 Condition type:  fixed value

However, the profile never generates an event or alarm.

Environment

DX NetOps CAPM all currently supported releases

Cause

This is working as intended. 

Rollups run hourly, so the self-monitoring data carries hourly timestamps. In this case, between 21 and 40 rollups complete each hour. All of those values are > 0.0, so the violation condition (<= 0.0) is never met. No events are created, which is why no alarms are generated.

Thresholding only works off the metric data that is collected, and rollup self-monitoring is metric data. When rollups run, CAPM marks them in the DB to indicate that they ran. However, if CAPM does not run rollups for a Metric Family (MF), it records no self-monitoring data for that MF, which is the case here. With no data point recorded, there is no value of 0 for the threshold to evaluate.
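The behavior described above can be illustrated with a small Python sketch. This is a simplified model for explanation only, not CAPM's actual event engine; the function name `evaluate` and the sample values are assumptions made up for the example.

```python
from typing import Optional, Sequence

# From the threshold profile: a violation is raised when the value is <= 0.0.
VIOLATION_THRESHOLD = 0.0


def evaluate(samples: Sequence[float]) -> Optional[bool]:
    """Simplified model of the violation check (hypothetical helper).

    Returns True if any sample violates the threshold, False if none do,
    and None when there are no samples to evaluate at all.
    """
    if not samples:
        # No self-monitoring data was recorded, so nothing can be compared.
        return None
    return any(value <= VIOLATION_THRESHOLD for value in samples)


# Rollups ran: every hourly count is above 0, so there is no violation.
print(evaluate([21, 34, 40]))

# Rollups did not run for the MF: no data points exist, so no event is possible.
print(evaluate([]))
```

In this model a violation could only occur if a data point with a value of 0 were recorded, which is not what happens when rollups stop.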

CAPM does not cycle through all metric families every rollup period to see whether there is data to roll up. It relies on end-of-cycle messages in the AMQ Rollup_<MF> queues. As a result, there is currently no real way to determine, from a monitoring point of view, that rollups are not running for an MF.

As a workaround, you would probably need to run a script every hour that uses vsql to query the DR (Vertica) and checks the _ltd (hourly) or _eqd (daily) tables for each MF. If the latest timestamp in a table is more than, say, 2-3 hours old, the script can send an email or otherwise notify administrators.
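A minimal sketch of such a script is shown here in Python. It is not a supported tool. The table names, the `tstamp` column name, and the assumption that the column is returned as a `YYYY-MM-DD HH:MM:SS` timestamp are all placeholders that must be checked against your own DR schema. vsql connection options (host, user, password, database) are omitted and need to be added for your environment, and the notification step is left to the caller.

```python
import subprocess
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical names: replace with the real _ltd/_eqd tables for each MF
# and the real timestamp column in your DR schema.
TABLES = ["example_mf_ltd", "example_mf_eqd"]
TSTAMP_COLUMN = "tstamp"
MAX_AGE = timedelta(hours=3)  # "more than say 2-3 hrs old"


def parse_timestamp(text: str) -> Optional[datetime]:
    """Parse vsql output; an empty result means the table has no rows."""
    text = text.strip()
    # Assumes output such as "2024-01-01 10:00:00"; adjust if the column
    # stores a different type (for example, epoch seconds).
    return datetime.fromisoformat(text) if text else None


def is_stale(latest: Optional[datetime], now: datetime,
             max_age: timedelta = MAX_AGE) -> bool:
    """A table is stale if it is empty or its newest row is older than max_age."""
    return latest is None or now - latest > max_age


def latest_timestamp(table: str) -> Optional[datetime]:
    """Query the newest timestamp in one table through the vsql client."""
    query = f"SELECT MAX({TSTAMP_COLUMN}) FROM {table};"
    # -A: unaligned output, -t: rows only, -c: run a single command.
    result = subprocess.run(
        ["vsql", "-A", "-t", "-c", query],
        capture_output=True, text=True, check=True,
    )
    return parse_timestamp(result.stdout)


def find_stale_tables(now: Optional[datetime] = None) -> List[str]:
    """Return the tables whose latest rollup data is too old."""
    # Assumes the DB timestamps and the local clock use the same time zone.
    now = now or datetime.now()
    return [t for t in TABLES if is_stale(latest_timestamp(t), now)]
```

Scheduled hourly (for example from cron), the script could call `find_stale_tables()` and email administrators whenever the returned list is not empty.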

Resolution

You can view the Health Monitoring data as per:

TechDocs : DX NetOps CAPM 24.3 : View Health Monitoring Information

On that page, view the last 24 hours' worth of data and check the rollups.

  1. Rollup self-monitoring data has a 1-hour resolution, so the threshold can only pass or fail once per hour, when the data is checked. You could change the Duration and Window to 3600 seconds to encompass this.

  2. Check the last 8 or 24 hours on the self-monitoring page that contains the rollups and see whether it has any data.

    If the rollup process has stopped working, then there will be no self-monitoring data to check and this threshold may not do anything due to lack of data.