DX UIM nas storm protection

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
Unified Infrastructure Management for Mainframe
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)

Issue/Introduction

How does NAS storm protection work?
How can I prevent alarm storms from affecting my DX UIM NAS and monitoring environment?
After a message flood affected the alarms in UIM, how can I prevent probes from sending a massive number of alarms to the NAS?

Environment

DX UIM 20.4.* / 23.4.* or higher

Cause

Guidance

Resolution

How does NAS Storm protection work:

The NAS probe supports a built-in storm protection feature that can prevent large continuous event storms from a robot or probe from causing problems for the NAS.
- The algorithm is constructed in a way that the NAS maintains a “quarantine list” for possible offenders.
The size of this list is configurable (storm_capacity) and elements will be added or removed or moved to the top depending on the message frequency.
- The event “signature” is constructed from elements of the inbound alarm message: source, domain, robot [, probe-id [, supp_key]].
If the number of alarms matching the “signature” exceeds a threshold (storm_threshold) within a specified time window (storm_timewindow), then succeeding alarms will be quarantined by re-publishing the message to the configured message subject (storm_subject).
- The default message subject is NAS_QUARANTINE.
- The quarantined alarm will not be registered with the nas and a log entry is generated when the first set of messages is placed in quarantine.
- The alarm message text and severity level can be overridden via raw configure edit of the nas probe:

setup > storm_message
setup > storm_severity_level

storm_message supports variable expansion from the message header, e.g.,

Placing alarm(s) from $domain:$origin:$robot:$prid:suppkey=$supp_key, total:%d

storm_severity_level would be represented as:

storm_severity_level = 5

This would represent changing the alarm severity to Critical.
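
As an illustration only, the variable expansion described above can be sketched in Python. This is not the nas implementation: the function name expand_storm_message, the sample header values, and the assumption that %d is replaced by the running total of quarantined alarms are all hypothetical. The severity table uses the standard UIM alarm levels, of which only 5 = Critical is stated in this article.

```python
from string import Template

# Standard UIM alarm severity levels; this article only confirms 5 = Critical.
SEVERITY_NAMES = {0: "Clear", 1: "Informational", 2: "Warning",
                  3: "Minor", 4: "Major", 5: "Critical"}

# Template text taken from the example above.
STORM_MESSAGE = ("Placing alarm(s) from "
                 "$domain:$origin:$robot:$prid:suppkey=$supp_key, total:%d")


def expand_storm_message(template, header, total):
    """Hypothetical helper: fill $name fields from the message header,
    then replace the first %d with the alarm count."""
    # safe_substitute leaves unknown $names untouched instead of raising.
    text = Template(template).safe_substitute(header)
    return text.replace("%d", str(total), 1)


# Sample header values are made up for the illustration.
header = {"domain": "PROD", "origin": "hub01", "robot": "web01",
          "prid": "cdm", "supp_key": "disk/C"}

storm_severity_level = 5
print(expand_storm_message(STORM_MESSAGE, header, 1500))
print(SEVERITY_NAMES[storm_severity_level])
```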

The storm_protection value determines which elements make up the “signature” key:

0. disabled

1. source, domain, robot, probe-id and supp_key

2. source, domain, robot, probe-id

3. source, domain, robot
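
To make the effect of the storm_protection value concrete, here is a minimal Python sketch of how a signature key could be built for each mode. It is an illustration, not the nas code: the function name storm_signature, the dictionary field names (for example prid for probe-id), and the sample alarm values are assumptions.

```python
# Which alarm fields form the signature for each storm_protection value.
# Field names are illustrative; "prid" stands in for probe-id.
SIGNATURE_FIELDS = {
    0: None,  # storm protection disabled
    1: ("source", "domain", "robot", "prid", "supp_key"),
    2: ("source", "domain", "robot", "prid"),
    3: ("source", "domain", "robot"),
}


def storm_signature(alarm, storm_protection):
    """Return the signature tuple for an alarm, or None when disabled."""
    fields = SIGNATURE_FIELDS[storm_protection]
    if fields is None:
        return None
    return tuple(alarm.get(name, "") for name in fields)


# Two made-up alarms from the same probe that differ only in supp_key.
alarm_a = {"source": "10.0.0.5", "domain": "PROD", "robot": "web01",
           "prid": "cdm", "supp_key": "disk/C"}
alarm_b = dict(alarm_a, supp_key="disk/D")

# Mode 1 counts them separately; modes 2 and 3 count them together.
print(storm_signature(alarm_a, 1) == storm_signature(alarm_b, 1))
print(storm_signature(alarm_a, 2) == storm_signature(alarm_b, 2))
```

In this sketch, a lower-numbered mode produces a narrower signature, so fewer alarms share one counter; mode 3 groups everything coming from a robot under a single counter.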

How to enable NAS Storm Protection:

Open the NAS GUI, select the General tab
Pick a type of protection from the Storm protection dropdown menu.
Once enabled, you will be able to choose your own Storm Subject header which will modify the message header for messages exceeding the threshold.
Set the threshold (number of alarms) and the interval of time within which NAS will treat matching alarms as an alarm storm.

Note:

The Storm capacity determines how many messages are retained in the transaction log and how many will be discarded.

The NAS determines that the storm has died down based on the same logic, e.g., 1000 messages per 5 minutes; when this condition is no longer true, it returns to a normal state. Keep in mind that these times are asymmetric. If you had a storm of 2990 alarms in the first 10 seconds and then 10 more alarms occurred at 4:50 (minutes:seconds), the storm will be over 10 seconds after it started. This is because the arrival time of the first batch was heavily biased toward the start of the storm.

The time window (storm_timewindow) is the duration for quarantined messages to be published back to the message bus (NimBUS). It is a sliding window.
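
The sliding-window behavior described in this note can be sketched in Python for a single signature. This is a simplified model under stated assumptions, not the nas algorithm: the class name SlidingWindowStorm is hypothetical, quarantined alarms are assumed to keep counting toward the window, and the small threshold and window values are chosen only to keep the example short.

```python
from collections import deque


class SlidingWindowStorm:
    """Tracks arrival times for one signature (illustrative model only)."""

    def __init__(self, storm_threshold, storm_timewindow):
        self.threshold = storm_threshold
        self.window = storm_timewindow  # seconds
        self.arrivals = deque()

    def _prune(self, now):
        # Drop arrivals that have slid out of the time window.
        while self.arrivals and self.arrivals[0] <= now - self.window:
            self.arrivals.popleft()

    def in_storm(self, now):
        """True while the count inside the window exceeds the threshold."""
        self._prune(now)
        return len(self.arrivals) > self.threshold

    def on_alarm(self, now):
        """Record an alarm; return True if it would be quarantined."""
        quarantine = self.in_storm(now)  # decided by the alarms seen so far
        self.arrivals.append(now)        # assumption: it still counts
        return quarantine


storm = SlidingWindowStorm(storm_threshold=3, storm_timewindow=60)
decisions = [storm.on_alarm(t) for t in (0, 1, 2, 3, 4, 5)]
print(decisions)           # [False, False, False, False, True, True]
print(storm.in_storm(30))  # True: all six alarms are still in the window
print(storm.in_storm(70))  # False: the early burst has aged out
```

Because the whole burst arrived in the first few seconds, the model leaves the storm state shortly after one full window has passed since the burst, even with no further change in input. That is the asymmetry the note refers to.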

Additional Information

Infrastructure Manager (IM) and nas is very slow to respond