NAS Auto-Operator set for 'overdue age' executed on an alarm that was closed before it reached that age

Article ID: 369222

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

A NAS Auto-Operator profile on the DX UIM primary hub is configured to run a script against certain alarms once they reach a particular age, using the "overdue age" functionality.

The script performs some activity and sends an email if the alarm has been open for longer than a set number of seconds (e.g. 80 seconds).

We received an email from this script, but upon investigation the alarm was already closed. The script appears to have been triggered in error, because the difference between the alarm's "open" and "closed" times was less than the configured "overdue age" value. According to its timestamps, the alarm closed after 59 seconds, yet it still triggered the script.

Why would this occur? Can it be prevented?

Shouldn't closing the alarm prevent it from becoming "overdue"?

Environment

DX UIM Any version

Multiple NAS probes distributed to secondary hubs with replication enabled

Cause

NAS replication occurs in batches. There is a small and unavoidable delay between the time an alarm arrives at a secondary NAS and the time it is replicated to the primary hub NAS.

An alarm that arrives via replication carries a timestamp that corresponds to the time it was originally opened. Once it is replicated to the primary hub, the "overdue age" is counted from that original timestamp, not from the time the alarm arrived at the primary.

In some rare cases, an alarm may close before it reaches the configured "overdue age", but the primary hub does not learn about the closure until after that age has passed. By then the primary hub has already triggered the corresponding Auto-Operator profile.

Then, when the alarm closure is replicated, the "closed" time is also replicated - so it reflects the time that the alarm was closed on the original/sending NAS.

If there is a delay in replicating the closure, so that the Auto-Operator has already fired, the alarm will be updated at closure with a timestamp that is now "in the past".

This means that the original/sending NAS saw the alarm as alive for a shorter time than the receiving/replicated NAS did. The behavior is working as designed: the alarm had been "open" on the primary NAS long enough to trigger the Auto-Operator, because the primary hub did not yet know about the closure. The timestamps on the alarm represent its actual lifetime (at the original/sending NAS), but the replication delay can sometimes cause "overdue age" profiles to fire when you would not expect them to, based on the recorded timestamps.

This can cause confusion when investigating after the fact, as there is no way to tell from the timestamps that replication was delayed. The timestamps in this case are accurate to the lifecycle of the actual event. The Auto-Operator (AO) fires because the primary NAS has not yet "seen" the closure and therefore treats the alarm as overdue. When the alarm is eventually closed on the primary, its timestamp reflects the "true" closure time of the alarm, not the time at which the primary NAS finally processed the closure.
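
The timing described above can be illustrated with a small Python model. This is only a sketch of the reasoning, not NAS code: the `Alarm` class, the `overdue_profile_fires` function, and the replication delay values are all hypothetical and chosen for illustration. Only the 80-second overdue age and the 59-second alarm lifetime come from the scenario in this article.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    opened_at: float  # time the alarm was opened on the sending NAS (seconds)
    closed_at: float  # time the alarm was closed on the sending NAS (seconds)


def overdue_profile_fires(alarm, overdue_age, open_delay, close_delay):
    """Return True if an 'overdue age' profile would fire on the receiving NAS.

    open_delay and close_delay are hypothetical replication delays (seconds)
    for the alarm's creation and its closure. Zero delays model a profile
    that runs on the NAS that receives the alarm directly.
    """
    # The receiving NAS cannot act before the alarm arrives, but it measures
    # age from the original open timestamp, not from the arrival time.
    fire_time = max(alarm.opened_at + open_delay, alarm.opened_at + overdue_age)
    # The receiving NAS only learns about the closure once it is replicated.
    closure_seen_at = alarm.closed_at + close_delay
    return fire_time < closure_seen_at


alarm = Alarm(opened_at=0, closed_at=59)
recorded_lifetime = alarm.closed_at - alarm.opened_at  # what the timestamps show

print(recorded_lifetime)
# Assumed delays: closure reaches the primary 30 seconds late.
print(overdue_profile_fires(alarm, overdue_age=80, open_delay=5, close_delay=30))
# No replication delay, as on the NAS that receives the alarm directly.
print(overdue_profile_fires(alarm, overdue_age=80, open_delay=0, close_delay=0))
```

Under these assumed delays, the primary would see the closure at 89 seconds, after the profile has already fired at 80 seconds, even though the recorded lifetime is 59 seconds. With no delay, the closure is seen at 59 seconds and the profile does not fire.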

Resolution

In order to avoid this confusion, it is important to reduce delays between the time an alarm is generated and the time it is received by a NAS that runs scripts/profiles against that alarm.

When possible, time-sensitive scripts/AO profiles should be run "locally", that is, on the NAS that receives the alarm directly, before any replication takes place. Because that NAS sees the closure without replication delay, the "overdue age" AO profile will not fire for an alarm that has already closed.

Additional Information

The NAS Best Practices Guide contains additional information on improving the speed and stability of alarm processing.