Time Over Threshold and Baseline Concept

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Environment

Release : 8.51

Component : UNIFIED INFRASTRUCTURE MGMT

Resolution

Baseline_engine:

"A baseline expresses normalized QoS levels on an hour-of-day and day-of-the-week basis. The baseline_engine probe follows QoS messages on the Nimsoft bus and samples this data up to 30 times during each one-hour interval. This sampling rate provides a statistically accurate baseline while minimizing system resource use.

At the top of each hour, a baseline data point is calculated for each QoS monitor and sent to the qos_processor probe, which, after processing, writes this data to the NIS database. This first baseline approximation for the hour interval is available after the hour has concluded, and is improved with succeeding baseline data points from corresponding intervals gathered over a four-week period.

In a single Hub configuration of NMS, simply deploying and activating the baseline_engine probe on the primary Hub will result in baselines being calculated--no other configuration is required."
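
As a rough illustration of the hour-of-day and day-of-the-week bucketing described above, the following Python sketch groups QoS samples by weekday and hour and averages each bucket. The function name hourly_baseline and the use of a plain mean are assumptions made for this example; the quoted documentation does not say which statistic baseline_engine actually uses.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_baseline(samples):
    """Average (timestamp, value) samples per (weekday, hour) slot.

    Hypothetical helper: the real baseline_engine calculation is not
    documented here, so a simple mean stands in for it.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # weekday(): Monday == 0 ... Sunday == 6
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: mean(values) for slot, values in buckets.items()}


samples = [
    (datetime(2024, 1, 1, 9, 0), 40.0),   # Monday, 09:00
    (datetime(2024, 1, 1, 9, 30), 50.0),  # Monday, 09:30
    (datetime(2024, 1, 8, 9, 15), 60.0),  # the following Monday, 09:15
    (datetime(2024, 1, 2, 9, 0), 10.0),   # Tuesday, 09:00
]
baseline = hourly_baseline(samples)
print(baseline[(0, 9)])  # 50.0 (all Monday 09:xx samples, across both weeks)
print(baseline[(1, 9)])  # 10.0
```

Samples from the same weekday and hour in different weeks land in the same slot, which mirrors how a baseline point is refined by corresponding intervals over the four-week period.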

TOT short explanation:

The following summarizes information provided by development. The short version could be stated as "Time TO Threshold requires 2 hours of baseline data; Time OVER Threshold should not require any baseline data."

Time To Threshold (TTT). TTT is owned by the Analytics team and implemented by the prediction_engine probe. The probe initially observes a QoS for two hours, storing the median value for each hour. Once two or more hours of data are available, it calculates a linear regression over the recent data. If the QoS shows a trend that will cross a configured numeric threshold within a configured time frame, an alarm is issued.
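
The regression step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the prediction_engine implementation: the function name hours_to_threshold is hypothetical, the fit is an ordinary least-squares line over hourly medians, and only a rising trend toward an upper threshold is handled.

```python
def hours_to_threshold(hourly_medians, threshold):
    """Estimate hours from the latest sample until the fitted line
    reaches the threshold; return None if there is no rising trend
    or fewer than two hours of data."""
    n = len(hourly_medians)
    if n < 2:
        return None  # TTT needs at least two hourly medians
    xs = range(n)  # hour index: 0, 1, 2, ...
    x_mean = sum(xs) / n
    y_mean = sum(hourly_medians) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean)
              for x, y in zip(xs, hourly_medians))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    if slope <= 0:
        return None  # flat or falling: never reaches an upper threshold
    crossing_hour = (threshold - intercept) / slope
    # Measure from the most recent hour; clamp if already crossed.
    return max(crossing_hour - (n - 1), 0.0)


eta = hours_to_threshold([70.0, 75.0, 80.0], threshold=90.0)
print(eta)  # 2.0

horizon_hours = 3  # assumed configured time frame
if eta is not None and eta <= horizon_hours:
    print("raise predictive alarm")
```

With medians rising 5 units per hour from 70, the line reaches 90 at hour index 4, which is two hours after the latest sample, so an alarm would be raised for any configured time frame of two hours or more.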

Time Over Threshold (TOT). TOT is owned by the Events team and implemented by the nas and alarm_enrichment probes. If a QoS metric exceeds a threshold, at any severity level, for longer than the specified Time Over Threshold <TOT> within the specified sliding time window <TW>, an alarm is raised. Optionally, the alarm is cleared if the metric is below all of the thresholds after the specified delay <TC>.

Here is some additional detail on how TOT (Time Over Threshold) works:

The alarm_enrichment (AE) probe is where the actual time-over-threshold logic resides. When a user sets the TOT parameters, a callback is issued to alarm_enrichment with the TOT rule data. A TOT rule consists of the following items:

1. Key: The key of the rule, which typically follows the met_id:et_id format.
2. Active: A boolean value that dictates whether the rule is active.
3. Time: The amount of time (in seconds) that the metric must be over threshold before an alarm is fired.
4. Window: The window during which the time condition must be met to fire an alarm.
5. AutoClear: 0 or 1, indicating whether there is an auto-clear timer.
6. ClearTime: The duration of the auto-clear timer, which closes the alarm if the conditions are no longer met.
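
To make the rule fields concrete, here is a small Python sketch that models a TOT rule as a data class. The class name TotRule, the snake_case field names, the sample key, and the is_valid check are all hypothetical. The units of Window and ClearTime are assumed to be seconds, which the list above states only for Time.

```python
from dataclasses import dataclass


@dataclass
class TotRule:
    key: str         # typically "met_id:et_id"
    active: bool     # whether the rule is evaluated at all
    time: int        # seconds over threshold required to fire
    window: int      # sliding window length (seconds assumed)
    auto_clear: int  # 0 or 1
    clear_time: int  # auto-clear delay (seconds assumed)

    def is_valid(self):
        """Illustrative sanity check, not a documented UIM rule."""
        return (
            0 < self.time <= self.window
            and self.auto_clear in (0, 1)
            and self.clear_time >= 0
        )


# Placeholder key, 5 minutes over threshold in a 10-minute window.
rule = TotRule(key="M123:E456", active=True, time=300,
               window=600, auto_clear=1, clear_time=120)
print(rule.is_valid())  # True
```

The check that Time does not exceed Window reflects the idea that the time condition has to be satisfiable inside the window; whether the product enforces this is not stated in the source.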

These rules are stored in a rule_config.xml file in the Nimsoft/probes/service/nas/alarm_enrichment directory. They can be viewed there in XML form or queried via the list_tot_rules callback. Single rules can also be queried by key using the get_tot_rule callback.

An alarm really isn't an alarm until it arrives at the NAS. Using AE, we hold alarms back until they meet a specific time over threshold.

As for the inner workings of AE, it receives alarms from PPM if they match the threshold. Each alarm represents a certain amount of time over threshold, based on the polling frequency. For example, if a metric has a polling frequency of 1 minute and a rule of 5 minutes over threshold in a 10-minute window, AE looks back 10 minutes in its history and counts the alarms it received. If it received 5 or more alarms (5 alarms x 1 minute each = 5 minutes over threshold), it forwards the alarm to the NAS. If autoClear is enabled, a timer is set that expires after the clearTime and sends a close alarm to the NAS. This timer is reset if the condition continues to be met with subsequent alarms.
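
The look-back counting described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the alarm_enrichment code: the class name TotTracker and its parameters are hypothetical, each incoming alarm is assumed to represent one polling interval over threshold, and the auto-clear timer is left out.

```python
from collections import deque


class TotTracker:
    """Count alarms inside a sliding window and decide when to forward."""

    def __init__(self, time_s, window_s, poll_s):
        # Alarms needed = ceil(time over threshold / polling interval).
        self.required = -(-time_s // poll_s)
        self.window_s = window_s
        self.arrivals = deque()  # arrival times, in seconds

    def on_alarm(self, now_s):
        """Record an alarm; return True if it should be forwarded."""
        self.arrivals.append(now_s)
        # Drop alarms that have fallen out of the look-back window.
        while self.arrivals and self.arrivals[0] <= now_s - self.window_s:
            self.arrivals.popleft()
        return len(self.arrivals) >= self.required


# 5 minutes over threshold in a 10-minute window, polled every minute.
tracker = TotTracker(time_s=300, window_s=600, poll_s=60)
results = [tracker.on_alarm(t) for t in (0, 60, 120, 180, 240)]
print(results)  # [False, False, False, False, True]

# Alarms 5 minutes apart never accumulate enough hits in the window.
sparse = TotTracker(time_s=300, window_s=600, poll_s=60)
print(any(sparse.on_alarm(t) for t in (0, 300, 600, 900, 1200)))  # False
```

In the first run the fifth consecutive alarm reaches the required count, matching the 1-minute polling example; in the second, older alarms expire from the window before five can accumulate, so nothing is forwarded.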