How does nas Alarm Protection work?
The nas probe (3.60 or higher) supports a built-in storm protection feature that prevents large, continuous event storms from a robot or probe from causing problems for the nas. The nas maintains a “quarantine list” of possible offenders. The size of this list is configurable (storm_capacity), and entries are added, removed, or moved to the top depending on the message frequency.
The event “signature” is constructed from the source, domain, robot [, probe-id [, supp_key]] elements of the inbound alarm message. If the number of alarms matching the “signature” exceeds a threshold (storm_threshold) within a specified time window (storm_timewindow), succeeding alarms are quarantined by re-publishing the message to a configured subject (storm_subject). The default subject is NAS_QUARANTINE.
A quarantined alarm is not registered with the nas, and a log entry is generated when the first message is placed in quarantine. The alarm message text and severity level can be overridden (storm_message, storm_severity_level).
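The threshold and time-window logic described above can be sketched in Python. This is only an illustration, not the nas implementation: the class name StormGuard, the "alarm" placeholder subject, the small demo values, and the choices that the alarm pushing the count above the threshold is itself quarantined and that quarantined alarms still count toward the window are all assumptions.

```python
from collections import defaultdict, deque

STORM_SUBJECT = "NAS_QUARANTINE"  # default subject named in the article


class StormGuard:
    """Hypothetical per-signature storm detector (not the nas source code)."""

    def __init__(self, threshold, timewindow, subject=STORM_SUBJECT):
        self.threshold = threshold      # plays the role of storm_threshold
        self.timewindow = timewindow    # plays the role of storm_timewindow, in seconds
        self.subject = subject          # plays the role of storm_subject
        self.arrivals = defaultdict(deque)  # signature -> recent arrival times
        self.in_storm = set()           # signatures currently quarantined
        self.log = []                   # one entry per storm, on the first quarantined alarm

    def handle(self, signature, now):
        """Return the subject this alarm would be published to."""
        times = self.arrivals[signature]
        # Forget arrivals that have slid out of the time window.
        while times and now - times[0] >= self.timewindow:
            times.popleft()
        times.append(now)
        if len(times) > self.threshold:
            if signature not in self.in_storm:
                self.in_storm.add(signature)
                self.log.append(f"quarantining alarms from {signature}")
            return self.subject
        self.in_storm.discard(signature)
        return "alarm"  # placeholder for normal processing


guard = StormGuard(threshold=3, timewindow=60.0)
sig = ("10.0.0.5", "lab", "robot01")
routed = [guard.handle(sig, t) for t in (0, 1, 2, 3, 4)]
print(routed)
print(guard.log)
```

With these demo values the first three alarms are processed normally and the fourth and fifth go to the quarantine subject, with a single log entry for the storm.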
The storm_protection value determines which elements make up the “signature” key:
1. source, domain, robot, probe-id and supp_key
2. source, domain, robot, probe-id
3. source, domain, robot
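As a rough illustration of how the three levels change what counts as “the same” alarm, the sketch below builds a signature tuple from an alarm represented as a dictionary. It assumes the list numbers 1 to 3 are the storm_protection values; the function name and the dictionary keys (prid for the probe-id, supp_key for the suppression key) are hypothetical.

```python
# Fields used for the signature at each storm_protection level (assumed mapping).
SIGNATURE_FIELDS = {
    1: ("source", "domain", "robot", "prid", "supp_key"),
    2: ("source", "domain", "robot", "prid"),
    3: ("source", "domain", "robot"),
}


def storm_signature(alarm, storm_protection):
    """Return the tuple of header values that identifies an alarm stream."""
    fields = SIGNATURE_FIELDS[storm_protection]
    return tuple(alarm.get(name, "") for name in fields)


disk_c = {"source": "10.0.0.5", "domain": "lab", "robot": "robot01",
          "prid": "cdm", "supp_key": "disk/C"}
disk_d = dict(disk_c, supp_key="disk/D")

# Level 1 keeps the two alarms apart; level 2 counts them together.
print(storm_signature(disk_c, 1) == storm_signature(disk_d, 1))
print(storm_signature(disk_c, 2) == storm_signature(disk_d, 2))
```

The broader the signature (level 3), the more alarms share one counter, so the threshold is reached sooner for a noisy robot.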
The storm_message string supports variable expansion from the message header, e.g.
Placing alarm(s) from $domain:$origin:$robot:$prid:suppkey=$supp_key, total:%d
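To show what the expanded text might look like, the following sketch imitates the expansion with Python's string.Template. How the nas performs the expansion internally is not described here; the function name, the sample header values, and the reading of %d as a running alarm count are assumptions.

```python
from string import Template

STORM_MESSAGE = ("Placing alarm(s) from "
                 "$domain:$origin:$robot:$prid:suppkey=$supp_key, total:%d")


def expand_storm_message(template, header, total):
    """Fill $name variables from the header, then the %d count placeholder."""
    # safe_substitute leaves unknown $variables untouched instead of raising.
    text = Template(template).safe_substitute(header)
    return text.replace("%d", str(total), 1)


header = {"domain": "lab", "origin": "hub1", "robot": "robot01",
          "prid": "cdm", "supp_key": "disk/C"}
print(expand_storm_message(STORM_MESSAGE, header, 42))
# Placing alarm(s) from lab:hub1:robot01:cdm:suppkey=disk/C, total:42
```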
You enable nas storm protection by opening the nas GUI, selecting the General tab, and picking a type of protection from the Storm protection dropdown menu. You can choose between Suppression-ID, Robot, or Probe as the basis of your message filter.
Once it is enabled, you can choose your own Storm Subject, which modifies the message header for messages that exceed the threshold. You can then set the threshold at which the nas considers traffic to be an alarm storm, and the interval of time over which it is measured. The Storm capacity determines how many messages are retained in the transaction log and how many will be discarded.
The nas decides that the storm has died down using the same logic (for example, 3000 messages within 5 minutes); when this condition is no longer true, it returns to its normal state. Keep in mind that these times are asymmetric. If 2990 alarms arrive in the first 10 seconds and 10 more alarms arrive at 4 minutes 50 seconds, the threshold is reached at that point, yet the storm will be over 10 seconds after it started. This is because the arrival times of the first batch were heavily biased toward the start of the time window, so those alarms begin to fall out of the 5-minute window almost as soon as the storm is declared.
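The timing in this example can be checked with a small simulation. It is a sketch under stated assumptions: one-second resolution, a storm counted as active whenever the trailing 300-second window holds at least 3000 alarms, and the first 2990 alarms spread evenly over seconds 0 to 9. The function name is hypothetical.

```python
from collections import deque


def storm_seconds(arrivals, threshold, window, horizon):
    """Return each second t in [0, horizon) at which the trailing window
    (t - window, t] contains at least `threshold` arrivals."""
    arrivals = sorted(arrivals)
    recent = deque()
    nxt = 0
    active = []
    for t in range(horizon):
        # Admit alarms that have arrived by second t.
        while nxt < len(arrivals) and arrivals[nxt] <= t:
            recent.append(arrivals[nxt])
            nxt += 1
        # Drop alarms that have slid out of the window.
        while recent and recent[0] <= t - window:
            recent.popleft()
        if len(recent) >= threshold:
            active.append(t)
    return active


first_batch = [s for s in range(10) for _ in range(299)]  # 2990 alarms, seconds 0..9
late_batch = [290] * 10                                   # 10 alarms at 4:50
active = storm_seconds(first_batch + late_batch, 3000, 300, 600)
print(active[0], active[-1])  # 290 299
```

The storm condition first holds at second 290 (4:50) and last holds at second 299; at second 300 the alarms from second 0 leave the window and the count drops below 3000.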
This time window is the duration for quarantined messages to be published back to the Nimsoft bus. It is like the samples value in the cdm probe: it determines when the storm is considered to have died down. It is a sliding window.