In certain scenarios, alarm messages may not clear as expected, e.g., custom messages, invalid alarm messages, and various other edge cases too numerous to cover in a single KB article.
An automatic clear message is issued when an alarm has previously been raised on a breached threshold and the newly measured value falls back within the threshold setting/range.
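The breach/clear cycle described above can be sketched conceptually. This is a minimal Python illustration, not the actual nas/probe implementation; the function and variable names are assumptions:

```python
# Conceptual sketch of automatic clear behavior: an alarm is raised when a
# measured value breaches a threshold, and a clear is issued only if an
# alarm was previously raised and the value is back within range.
# Illustration only - not the actual nas/probe code.

def evaluate(value, threshold, alarm_open):
    """Check one sample; return (message_or_None, alarm_open)."""
    if value > threshold:
        return ("alarm: value %s breached threshold %s" % (value, threshold), True)
    if alarm_open:
        return ("clear: value %s back within threshold %s" % (value, threshold), False)
    return (None, False)  # in range and nothing to clear: stay silent

# Walk a series of samples against a threshold of 90
alarm_open = False
for sample in (50, 95, 97, 80, 70):
    msg, alarm_open = evaluate(sample, 90, alarm_open)
```

Note that the clear is only sent when an alarm is actually open; an in-range sample with no prior breach produces no message at all.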
Alarm clearing is currently not 100% consistent across all probes. In general, an alarm is cleared when the nas receives the same alarm with a severity of '0' (CLEAR), but different probes have different criteria for when they send a clear alarm: some probes send a clear when a condition no longer exists, others do not.
For instance, if you define custom messages and the variables are not expanding because of an invalid variable format in the message text, the clear message will not 'match'. Technically, the nas receives a severity-0 alarm for an alarm that never existed, while the alarm that does exist remains open, because the nas never received an exactly matching 0-severity alarm.
Some probes generate alarms that are not automatically cleared out of the box. A list is provided below. Note that this is not a complete list, but it covers some of the most common uncleared alarms.
- e2e_appmon/e2e_appmon_dev (default messages, e.g., did not complete on time)
- interface_traffic (SNMP connection alarms, max interface speed)
- cdm (boot alarms)
- data_engine (ADO layer alarms), e.g.:
  Failed to insert QoS data into the database, check that the database is running.
  [Microsoft SQL Server Native Client 11.0] TCP Provider: An existing connection was forcibly closed by the remote host.
- hub (corrupted queues)
- vmware (guest OS shut down; accurate at least as of the time of this update - verify against the current version)
- ntevl and logmon: the profiles are set to alarm on an event - there is no 'clear' condition.
- snmptd (some traps, e.g., HP and Dell)
- cisco_monitor (SNMP agent is not responding)
- All probes' 'Max. restarts' alarms
- sla_engine compliance reporting alarms
- various other probe- and environment-specific cases
How to clear some alarms that are not cleared automatically:
interface_traffic (max interface speed alarms)
...The max interface speed (ifSpeed) of 'GigabitEthernetx/x/x' on 'xx.xx.xxx.xxx' could not be determined! Traffic alarms in percent of max speed cannot be issued. Please override interface speed or set alarm options on actual values.
This depends on exactly how you wish to process the alarm. To ignore this alarm completely, set up an 'exclude' AutoOperator pre-processing rule with the necessary filters to catch these types of alarms (probe, message using a regex, e.g., /.*<substring>.*/, subsystem, etc.). If you want to acknowledge the alarm and still keep a record of it in your history, set up an AutoOperator profile with the acknowledgement (clear) action.
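The kind of message filter described above can be sketched with an ordinary regular expression. This is a Python illustration only; the pattern and sample messages are assumptions built from the example alarm text above, not the actual AutoOperator engine:

```python
import re

# Sketch of how an AutoOperator-style message filter might match alarms.
# The pattern mirrors the /.*<substring>.*/ form used in nas filters;
# "could not be determined" stands in for the substring you want to catch.
pattern = re.compile(r".*could not be determined.*")

messages = [
    "The max interface speed (ifSpeed) of 'GigabitEthernet0/0/1' could not be determined!",
    "Traffic on interface is above threshold",
]
# Keep only the alarms the filter would catch (i.e., the ones to exclude)
excluded = [m for m in messages if pattern.match(m)]
```

The leading and trailing `.*` make the match substring-based, so the filter catches the alarm regardless of which interface or host appears in the message.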
The basic process is that you create 2 watchers:
- Watcher 1: watches for Y, sends the appropriate alarm, and has a suppression key of A.
- Watcher 2: watches for N, sends the clear alarm, and has the same suppression key A.
This way the nas puts the 2 alarms together.
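The pairing the nas performs can be sketched conceptually: a severity-0 (clear) alarm closes whichever open alarm carries the same suppression key. This is an illustration of the matching idea only, not the actual nas code:

```python
# Conceptual sketch of suppression-key matching: the nas pairs a clear
# (severity 0) alarm with the open alarm that has the same suppression key.
# Illustration only - not the actual nas implementation.

open_alarms = {}  # suppression key -> alarm message

def receive(severity, supp_key, message):
    if severity == 0:
        # Clear: close the open alarm with the matching key, if any
        open_alarms.pop(supp_key, None)
    else:
        open_alarms[supp_key] = message

receive(4, "A", "condition Y detected")  # watcher 1: raises the alarm
receive(0, "A", "condition N detected")  # watcher 2: clears it (same key A)
```

The message text of the clear does not need to match the original alarm here; only the shared suppression key ties the two together.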
controller (Robot inactive)
The 'robot xxx is inactive' alarm is hub-generated. Each robot's controller has a 'hub update interval', which is set to 15 minutes by default. Once roughly 1.5x the set interval has elapsed without the robot checking in, a 'robot xxx is inactive' alert is generated - it is the hub, not the robot itself, that generates this alert. Once the controller is back up and re-registers with the hub, this alarm should clear, usually within a couple of seconds.
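The hub-side timeout described above amounts to a simple comparison: with the 15-minute default interval, the alarm threshold works out to 22.5 minutes. A minimal sketch, assuming the ~1.5x factor and default interval from the text (not the actual hub code):

```python
# Sketch of the hub-side check: a robot is flagged inactive once ~1.5x its
# 'hub update interval' passes without a check-in. Illustration only; the
# 15-minute default matches the controller's default setting.

def is_inactive(last_checkin_min_ago, update_interval_min=15, factor=1.5):
    return last_checkin_min_ago > update_interval_min * factor

# With the 15-minute default, the threshold is 15 * 1.5 = 22.5 minutes
```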
The BootAlarm is generated by the cdm probe when the system is rebooted; it is not related to the hub alarm raised when a robot/controller stops sending its "alive" messages. 'BootAlarm' alerts are system-level alarms raised when a reboot takes place, while 'robot inactive' alarms occur when a controller hasn't checked in with the hub within ~1.5x its configured update interval. It is not uncommon to get a boot alarm, but if the robot comes back up and checks in before the update time has elapsed, no 'robot inactive' alarm is raised.
BootAlarm: Computer has been rebooted at <unable to determine>
…being returned for the uptime is due to either a) an unsupported OS being monitored, b) corrupt perfmon counters (e.g., System Up Time), or c) some sort of timing issue where the uptime could not be determined, e.g., due to a connectivity issue or latency.
InternalAlarm: Unable to get CPU data
This usually indicates that the perfmon counters for a Windows machine need to be rebuilt; this alarm will not clear automatically.
nimQoSStop() 'Stop the QoS timer
resultTime = nimQoSGetTimer() 'Get the measured value
nimQoSSendTimer(target$) 'Send the response time measurement
if resultTime > criticalThreshold then
   nimAlarmSimple(Major, target$ + " exceeded 10 seconds")
elseif resultTime > warningThreshold then
   nimAlarmSimple(Minor, target$ + " exceeded 5 seconds")
else
   nimAlarmSimple(Clear, target$ + " less than 5 seconds")
end if
The following line is expected to clear the previously failed alert upon subsequent success.
nimAlarmSimple(0,script$ + " has completed successfully")
In the 'alarm/post' tab of an ntevl profile, you can set a suppression key.
The purpose of this key is to match up separate profiles. So, you would have one profile that generates a critical-severity alarm and another profile that generates a clear alarm. What ties them together is a common suppression key. The key can be any arbitrary string, as long as both profiles have the same string. The nas matches up the two alarms and causes the clearing alarm to clear the previous alarm.
From the ntevl help doc:
Activate messages suppression features, to avoid multiple instances of the same alarm-event (variables may be used)
Alarms generated by custom-developed probes require special handling in the probe itself, or via some other means, to close the alarms.
The clear message in a probe's default Message Pool may have been altered. If someone has changed it, the associated alarm may not clear; the default clear message should be restored.