Why do some alarms not clear automatically and how can I clear them?

Article ID: 35332

Products

DX Infrastructure Management NIMSOFT PROBES

Issue/Introduction

Background/Issue

Some probes generate alarms that are not cleared automatically out of the box. The list below is not complete, but it covers some of the most common alarms that are left unacknowledged.

- e2e_appmon/e2e_appmon_dev (default messages, e.g., did not complete on time)

- interface_traffic (SNMP connection alarms, max interface speed)

- cdm (boot alarms)

- data_engine (ADO layer alarms)

- hub (corrupted queues)

- vmware (guest OS shut down; this was true at the time of this update, so check it against the current probe version)

- ntevl and logmon: the profiles are set to alarm on an event - there is no 'clear' condition.

- snmptd (some traps, e.g., HP and Dell)

- cisco_monitor (SNMP agent is not responding)

- All probes ('max restarts' alarms)

- sla_engine compliance reporting alarms

 

Solution

How to clear some alarms that are not cleared automatically:

interface_traffic (max interface speed alarms)

...The max interface speed (ifSpeed) of 'GigabitEthernetx/x/x' on 'xx.xx.xxx.xxx' could not be determined! Traffic alarms in percent of max speed cannot be issued. Please override interface speed or set alarm options on actual values.

This depends on how you wish to process the alarm. To ignore this alarm completely, set up an 'exclude' AutoOperator pre-processing rule with the filters needed to catch this type of alarm (probe, message matched with a regex, e.g., /.*<substring>.*/, subsystem, etc.). If you want to acknowledge the alarm and still keep a record of it in your alarm history, set up an AutoOperator profile with the acknowledgement (clear) action instead.
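Before saving a message filter, it can help to check the pattern against sample alarm text. The sketch below uses Python's `re` module as a stand-in; the nas applies its own pattern matching, so treat the result as an approximation only. The pattern and the function name `matches_filter` are illustrative assumptions, not values taken from the product.

```python
import re

# Sample alarm text taken from the message quoted above (placeholders kept as-is).
SAMPLE = (
    "The max interface speed (ifSpeed) of 'GigabitEthernetx/x/x' on "
    "'xx.xx.xxx.xxx' could not be determined! Traffic alarms in percent "
    "of max speed cannot be issued."
)

# Hypothetical filter pattern. The /.../ delimiters used in the rule are
# dropped here because Python's re module takes the bare pattern.
PATTERN = r".*max interface speed.*could not be determined.*"


def matches_filter(message: str, pattern: str = PATTERN) -> bool:
    """Return True when the whole message matches the pattern."""
    return re.fullmatch(pattern, message, flags=re.DOTALL) is not None


if __name__ == "__main__":
    print(matches_filter(SAMPLE))               # the ifSpeed alarm text
    print(matches_filter("Interface is down"))  # an unrelated alarm text
```

Keeping the pattern specific (two distinctive substrings rather than one) reduces the chance that an 'exclude' rule silently drops unrelated alarms.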

 

logmon

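The basic approach is to create two watchers that share a suppression key:

Watcher #1:

- Watches for Y (the condition that should raise the alarm)

- Sends the appropriate alarm

- Has a suppression key of A

 

Watcher #2:

- Watches for N (the condition that indicates recovery)

- Sends the clear alarm

- Has the same suppression key A

This way the nas correlates the two alarms through the shared suppression key, and the clear from watcher #2 closes the alarm raised by watcher #1.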
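The pairing of the two watchers can be pictured with a small model. This is a simplified sketch, not how the nas is implemented: the class `AlarmStore`, its method `receive`, and the sample messages are hypothetical, and the real nas takes more into account than the suppression key alone. Severity 0 is used for the clear, matching the nimAlarmSimple(0, ...) call shown later in this article.

```python
from dataclasses import dataclass, field

CLEAR = 0  # severity of a clear alarm in this sketch


@dataclass
class AlarmStore:
    """Toy model of open alarms, indexed by suppression key."""

    open_alarms: dict = field(default_factory=dict)

    def receive(self, suppression_key: str, severity: int, message: str) -> None:
        if severity == CLEAR:
            # A clear closes whatever is open under the same key.
            self.open_alarms.pop(suppression_key, None)
        else:
            # A repeat with the same key updates the open alarm
            # instead of creating a second one.
            self.open_alarms[suppression_key] = (severity, message)


if __name__ == "__main__":
    store = AlarmStore()
    store.receive("A", 5, "Y found in log")      # watcher #1 (any non-zero severity)
    store.receive("A", CLEAR, "N found in log")  # watcher #2
    print(store.open_alarms)                     # nothing left open under key A
```

If the two watchers used different keys, the clear would find nothing to close and the original alarm would stay open, which is the behavior this article describes.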

 

Controller (Robot inactive)

The 'robot xxx is inactive' alarm is generated by the hub, not by the robot itself. Each robot's controller has a 'hub update interval', which is set to 15 minutes by default. When the hub has not heard from the controller for roughly 1.5x that interval, it generates the 'robot xxx is inactive' alarm. Once the controller is back up and re-registers with the hub, this alarm should clear (usually within a couple of seconds).
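As a quick worked example of the timing, the sketch below computes the approximate silence period after which the hub reports a robot as inactive. The 15-minute default and the ~1.5x factor come from the description above; the function and constant names are hypothetical, and the factor is approximate, so the real trigger time may differ slightly.

```python
from datetime import timedelta

DEFAULT_HUB_UPDATE_INTERVAL = timedelta(minutes=15)  # default stated above
INACTIVE_FACTOR = 1.5  # approximate multiplier stated above


def inactive_after(interval: timedelta = DEFAULT_HUB_UPDATE_INTERVAL) -> timedelta:
    """Approximate silence period before the hub reports the robot inactive."""
    return interval * INACTIVE_FACTOR


def is_reported_inactive(
    silence: timedelta, interval: timedelta = DEFAULT_HUB_UPDATE_INTERVAL
) -> bool:
    """True when the controller has been silent longer than the threshold."""
    return silence > inactive_after(interval)


if __name__ == "__main__":
    print(inactive_after())                             # 0:22:30 with the default
    print(is_reported_inactive(timedelta(minutes=10)))  # False: a short outage
    print(is_reported_inactive(timedelta(minutes=30)))  # True: a long outage
```

With the default interval, a robot that is back and checking in within about 22 minutes would not be expected to trigger the alarm.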

 

cdm

The BootAlarm is generated by the cdm probe when the system is rebooted. It is not related to the hub alarm that is raised when a robot/controller stops sending its "alive" messages. 'BootAlarm' alerts are system-level alarms raised when a reboot takes place, while 'robot inactive' alarms occur when a controller has not checked in with the hub within ~1.5x its configured update interval. It is not uncommon to get a boot alarm without a 'robot inactive' alarm: if the robot comes back up and checks in before that period has elapsed, no 'robot inactive' alarm is generated.

 

BootAlarm: Computer has been rebooted at <unable to determine>

'<unable to determine>' is returned for the uptime for one of these reasons: a) the monitored OS is unsupported, b) the perfmon counters (e.g., System Up Time) are corrupt, or c) a timing issue, such as a connectivity problem or latency, prevented the uptime from being determined.

 

InternalAlarm: Unable to get CPU data

This usually indicates that the perfmon counters on a Windows machine need to be rebuilt. The alarm will not clear automatically.

 

e2e_appmon_dev

------------------------

nimQoSStop() 'Stop the QoS timer.
resultTime = nimQoSGetTimer() 'Get the measured response time.
nimQoSSendTimer(target$) 'Send the response time measurement.

'Raise an alarm when a threshold is exceeded.
if resultTime > criticalThreshold then
    nimAlarmSimple(Major, target$ + " exceeded 10 seconds")
else
    if resultTime > warningThreshold then
        nimAlarmSimple(Minor, target$ + " exceeded 5 seconds")
    endif
endif

'Send a clear when the result is at or below the warning threshold.
if resultTime <= warningThreshold then
    nimAlarmSimple(Clear, target$ + " did not exceed 5 seconds")
endif

---------------------------

The following line is expected to clear the earlier failure alarm once the script subsequently succeeds:

nimAlarmSimple(0,script$ + " has completed successfully")
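The important property of the script above is that every run ends in exactly one outcome, so a successful run always sends a clear. The sketch below restates that decision in Python so the boundary cases are easy to check. The function name `severity_for` is hypothetical, and the 5 and 10 second defaults are assumed from the alarm messages in the script.

```python
def severity_for(result_time: float, warning: float = 5.0, critical: float = 10.0) -> str:
    """Map a measured response time to the alarm the script would send."""
    if result_time > critical:
        return "Major"
    if result_time > warning:
        return "Minor"
    # At or below the warning threshold: always send a clear.
    return "Clear"


if __name__ == "__main__":
    for seconds in (12.0, 7.0, 5.0, 1.0):
        print(seconds, severity_for(seconds))
```

A result exactly equal to the warning threshold falls into the clear branch here, so no run is left without either an alarm or a clear.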

 

ntevl

In the 'alarm/post' tab of an ntevl profile, you can set a suppression key.

The purpose of this key is to tie separate profiles together. You would have one profile that generates a critical severity alarm and another profile that generates a clear alarm, both with the same suppression key. The key can be any arbitrary string, as long as both profiles use the same one. The nas matches the two alarms on that key, so the clear alarm clears the earlier alarm.

From the ntevl help doc:

Activate messages suppression features, to avoid multiple instances of the same alarm-event (variables may be used)

 

Custom probes

Alarms generated by custom-developed probes must be closed explicitly, either by logic in the probe itself or by some other means.

 

Additional Information

The clear message in a probe's default Message Pool may have been altered. If someone has changed it, the associated alarm may not clear. In that case, restore the default clear message.

 

Environment

Release: CNMSPP99000-8.31-Unified Infrastructure Mgmt-Server Pack-- On Prem
Component: