logmon alarms are missing - suppressed/combined with other alarms

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Symptoms include:

some alarms generated by the logmon probe appear to be missing as if the probe is not even sending them
other alarms from the same instance of logmon are present during the same timeframe
the alarms were successfully generated earlier but now we do not see them even though the matching messages appear in the log which is monitored
the logmon probe log shows the alarms being generated but they are missing in NAS/alarm console
queues are working fine and the messages can be seen using DrNimbus but seem to be missing from the alarm console
when looking at other logmon alarms around the same time, and viewing the transaction history, the alarms are found incorrectly mapped as part of the history of a previous alarm

Environment

DX UIM 23.4
logmon probe - any version

Cause

Each alarm in UIM has a suppression key defined which is used by NAS to differentiate between different "types" of alarms. Generally speaking, all monitoring probes send a suppression key with each alarm.

The NAS probe checks whether each incoming alarm's source IP, source robot, source probe, and suppression key match those of a previous alarm. If they do, the alarm is suppressed: it is treated as another "transaction" of the existing alarm and added to that alarm's history instead of being created as a new alarm.

The logmon probe, unlike other probes, does not include default suppression keys. Instead, the user defines them at configuration time when setting up the profiles in the probe. This gives the user more granular control over which alarms "belong together", depending on the suppression keys defined in the profiles.

Example:

logmon is configured to alert on the keyword "Critical" in the log file and send a Critical alert containing the matched line.
Suppression key is defined as ${source}${profilename}
An alert is sent matching a line like "Critical alert received - server is running slow" with the given suppression key
Later, an alert is sent matching a line like "Critical alert received - server is low on disk space" with the same suppression key from the same source/robot
The second alert will be counted as an instance of the first alert and the alarm message on the first alert will be updated to the "low on disk space" message
The "server is running slow" alert will then seem to be "missing" because the message has been overwritten

You can confirm this by looking at the Transaction History for the other alarms which were received from the same logmon probe around the same time, and verifying that you see the expected alarm messages as part of another alarm's transactions.
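The example above can be modeled with a short Python sketch. This is only a toy illustration of the matching rule described in this article, not NAS code: the `Alarm` class, the `process` function, the IP address, robot name, and the expanded key value `web01app_log` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    """Hypothetical stand-in for one alarm shown in the alarm console."""
    message: str
    history: list = field(default_factory=list)


def process(alarms, source_ip, robot, probe, suppression_key, message):
    """Toy model: alarms sharing all four values are treated as one alarm."""
    key = (source_ip, robot, probe, suppression_key)
    if key in alarms:
        # Suppressed: the new message replaces the visible one and is
        # recorded as another transaction of the existing alarm.
        alarms[key].message = message
        alarms[key].history.append(message)
    else:
        alarms[key] = Alarm(message, [message])
    return alarms[key]


alarms = {}
# Assumed expansion of ${source}${profilename}; the real value depends on the setup.
supp_key = "web01" + "app_log"

process(alarms, "10.0.0.5", "web01", "logmon", supp_key,
        "Critical alert received - server is running slow")
process(alarms, "10.0.0.5", "web01", "logmon", supp_key,
        "Critical alert received - server is low on disk space")

only = next(iter(alarms.values()))
print(len(alarms))    # number of distinct alarms
print(only.message)   # message currently shown on the alarm
print(only.history)   # both messages, as transactions of one alarm
```

In this model only one alarm exists after both messages arrive. The "running slow" text survives only in the history list, which mirrors what the Transaction History shows.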

Resolution

To resolve this issue, each "type" of alarm needs a unique suppression key so that it will only be combined with alarms of the same type. There are several ways to accomplish this.

each type of alarm can have its own watcher, with a unique regex to differentiate "types" of alarms, and then the suppression key can incorporate the ${WATCHER} variable or a combination of variables like ${PROFILE}${WATCHER}
a custom variable can be captured from the log message using regex capturing groups, and that variable can then be used in the suppression key as ${VAR} (or a user-defined variable name)
you can use ${WATCHERMATCHEDLINE} in the suppression key so that each matched line will be a unique instance of an alarm (note that this means even alarms of the same "type" will be unique if the matched line in the log differs each time - for example if the lines are timestamped - so essentially this option disables suppression unless the log lines are identical in every way)
A simpler approach would be to have a single watcher for every specific type of alarm, and hardcode part of the suppression key. For example, if the watcher is for a "Server is Down" message, try a suppression key like "server_down" - then any alarm from the same source, robot, and logmon probe instance with the same "server_down" suppression key will be considered as an instance of the same alarm.
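To compare the options above, the following Python sketch derives suppression keys from a few sample lines in two ways: from a regex capturing group, and from the whole matched line. It is a rough illustration only. The sample lines, timestamps, pattern, and function names are hypothetical, and Python's `re` module stands in for the probe's own regex handling, which may differ.

```python
import re

# Hypothetical timestamped log lines; two of them report the same condition.
lines = [
    "2024-01-01 10:00:00 Critical alert received - server is running slow",
    "2024-01-01 10:05:00 Critical alert received - server is low on disk space",
    "2024-01-01 10:10:00 Critical alert received - server is running slow",
]

# The capturing group isolates the part of the line that identifies the alarm "type".
pattern = re.compile(r"Critical alert received - (.+)$")


def key_from_capture(line):
    """Key built from the captured text, similar in spirit to using ${VAR}."""
    match = pattern.search(line)
    if match is None:
        return None
    return "critical_" + match.group(1).replace(" ", "_")


def key_from_matched_line(line):
    """Key built from the entire line, similar in spirit to ${WATCHERMATCHEDLINE}."""
    return line


capture_keys = {key_from_capture(line) for line in lines}
line_keys = {key_from_matched_line(line) for line in lines}

print(len(capture_keys))  # distinct keys when keyed on the captured text
print(len(line_keys))     # distinct keys when keyed on the full line
```

Keying on the captured text gives two keys here, so both "running slow" lines would be combined into one alarm while "low on disk space" stays separate. Keying on the full line gives three keys because the timestamps differ, so nothing would be combined.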

There is no true "best practice" for this, as it would depend entirely on the user's preferences for how to monitor the alarms, how to differentiate different "types" of alarms, and what strategy would best meet the needs of the organization/administrators.

In the end, the goal is for alarms of the same "type" to always share the same suppression key so they are combined appropriately, and for alarms of different types to carry suppression keys that distinguish one type from another.

Additional Information

For more information on suppression keys see the logmon IM Configuration page.

See also "Use and configure variables in logmon" and the "Variable Expansion in Alarms" section of "logmon Advanced IM Configuration".

Further discussion about suppression keys specifically related to "Clear" alarms is available here.