Alarm Policy alarms keep getting cleared although the alarm condition persists

Article ID: 193799

Products

NIMSOFT PROBES
DX Infrastructure Management

Issue/Introduction

Hi, 

In our production environment, we are noticing strange behavior.
We are monitoring numerous services on remote devices using the RSP probe.
On some of these devices, the services are currently down permanently. RSP is configured to raise an alarm if a service is down, and the alarms are being raised as expected.


The problem is that these alarms are always cleared automatically by the robot after about 30-35 minutes (the robot icon is already green again).
If we wait another 5 minutes, a new alarm is raised for the same issue, and it is again cleared automatically by a clear alarm after roughly 30 minutes. (We are NOT clearing this alarm ourselves; see the attached spreadsheet for proof that the clear alarm is issued by the robot.)

Attached is an Excel spreadsheet with the NAS_TRANSACTION_SUMMARY and the NAS_TRANSACTION_LOG for just this service on this device. 

Please note: This problem is not limited to this service; it also appears for other services on the same device. It is also not limited to this device; we observe it on other devices too.

We are running: 
robot version 9.20HF13
rsp version 5.35

Please advise on how to resolve this issue. 
This is a massive problem for us because each new alarm creates a new ticket in our Incident Management system, so far more tickets are created than necessary.


 

Environment

Release : 9.2.0

Component : UIM - ALARM POLICY

Resolution

We analyzed the sample data attached to the case.
The QoS messages sometimes arrive at intervals of up to 12 minutes instead of the configured 5 minutes, that is, they are delayed. For a remote probe such as rsp, QoS messages can be delayed by network latency.
Increase the sliding window from 20 minutes to 30 minutes. The larger window accommodates the delayed QoS messages, so the Time Over Threshold (TOT) alarms are no longer cleared while the condition persists.

Please use the configuration below and let us know the result:
TOT window = 15 min (no change needed)
Sliding window = 20 min (change to 30 min)
QoS interval = 5 min (no change needed)
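
The effect of the sliding window can be illustrated with a small Python sketch. This is a simplified model, not the actual UIM implementation. It assumes that every breaching QoS sample counts as one QoS interval of time over threshold, and that the alarm is cleared as soon as the samples inside the sliding window no longer add up to the TOT window. The function names and the sample arrival times are hypothetical; the arrival times represent one QoS message that arrives 12 minutes after the previous one instead of 5.

```python
import math

def samples_in_window(arrivals, now, window):
    """Count QoS samples that arrived in the half-open range (now - window, now]."""
    return sum(1 for t in arrivals if now - window < t <= now)

def alarm_would_clear(arrivals, tot, window, interval):
    """Return True if the modeled TOT alarm is cleared at any minute.

    Simplifying assumption: each sample represents `interval` minutes over
    threshold, so ceil(tot / interval) samples must be inside the window.
    """
    required = math.ceil(tot / interval)
    # Start checking once enough samples exist for the alarm to be raised.
    start = arrivals[required - 1]
    for now in range(start, arrivals[-1] + 1):
        if samples_in_window(arrivals, now, window) < required:
            return True
    return False

# Hypothetical arrival times in minutes: nominal 5-minute interval,
# with one sample arriving at minute 32 instead of minute 25.
arrivals = [0, 5, 10, 15, 20, 32, 37, 42, 47]

print(alarm_would_clear(arrivals, tot=15, window=20, interval=5))  # True
print(alarm_would_clear(arrivals, tot=15, window=30, interval=5))  # False
```

In this model, the 20-minute window holds only two samples at minute 30, which is fewer than the three needed to cover the 15-minute TOT window, so the alarm is cleared. The 30-minute window still holds enough samples across the gap.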