logmon - shared folder/file monitoring on a cluster causes duplicate alarms

Article ID: 438692

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

When the logmon probe monitors a log file on a shared cluster resource, and the logmon profile is managed by the cluster probe, a failover to another node causes duplicate alarms for entries already processed by the previously active node.

Environment

DX Unified Infrastructure Management (UIM) - any version
logmon (all versions), cluster (all versions)
Cluster environments (e.g., Red Hat Pacemaker, Windows Cluster) using shared storage for the monitored logs

Cause

The logmon probe is not fully cluster-aware. Each probe instance maintains its own local `logmon.dta` file on its node to track the last read position in the monitored file. Because these files are not synchronized between nodes, a newly active instance may re-read log data from its own last known position, leading to duplicate alarms for entries the previous node already processed.
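To illustrate the mechanism, here is a minimal Python sketch of two nodes that each keep a private offset file against a shared log, analogous to each logmon instance keeping its own `logmon.dta`. The file names and helper function are hypothetical stand-ins, not logmon's actual implementation:

```python
import os

def read_new_lines(log_path, offset_path):
    """Read lines appended since the offset recorded in this node's local file."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read() or 0)
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    with open(offset_path, "w") as f:
        f.write(str(offset + len(data)))  # saved locally, never shared between nodes
    return data.decode().splitlines()

# Shared log on cluster storage, but per-node offset files: node A processes
# the log, then fails over to node B, whose local offset is still 0, so node B
# re-reads (and re-alarms) everything node A already handled.
with open("shared.log", "w") as f:
    f.write("ERROR disk full\n")

print(read_new_lines("shared.log", "nodeA.offset"))  # ['ERROR disk full'] -> alarm sent
print(read_new_lines("shared.log", "nodeB.offset"))  # same line again -> duplicate alarm
```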

Resolution

There are two possible workarounds. Each has drawbacks that should be weighed carefully when deciding which one is appropriate.

Workaround 1: "End of file" configuration

  1. Open the logmon configuration.
  2. Set the profile Mode to "updates".
  3. Set initial file read position to "end of file".
  4. Set resume file read position to "end of file".

With this configuration, each logmon instance reads only entries written after it starts up and first reads the file. When a failover occurs, the newly active instance records the current "end of file" position and reads forward from that point, so it does not re-send alarms for entries the previous node already processed.

Drawback: The probe may miss any log entries written during the brief window while it is starting up on the new node, before it has had the opportunity to record the "end of file" position. If this is unacceptable, consider Workaround 2.
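For illustration, here is a minimal Python sketch of the "end of file" behavior under the same assumptions as above (hypothetical file names and helpers, not logmon code). It shows both why nothing is re-sent after failover and where the missed-entry window comes from:

```python
import os

def start_at_end(log_path):
    """Record the current end-of-file position when the probe instance starts."""
    return os.path.getsize(log_path)

def read_since(log_path, offset):
    """Return lines appended after `offset`, plus the new position."""
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data.decode().splitlines(), offset + len(data)

with open("shared.log", "w") as f:
    f.write("ERROR already alarmed by node A\n")

offset = start_at_end("shared.log")   # the new node starts up after failover

with open("shared.log", "a") as f:
    f.write("ERROR new entry after failover\n")

lines, offset = read_since("shared.log", offset)
print(lines)  # only the post-failover entry; the old alarm is not re-sent
# Any entry appended between the failover and start_at_end() running would
# be skipped as well; that is the drawback described above.
```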

Workaround 2: Alarm Suppression

When configuring the profile, you can set options so that alarms re-sent by the newly active node increment the count of the alarms the previous node already sent, instead of generating new alarms.

  1. Do not implement the "end of file" workaround; instead, leave the defaults ("start of file" and "last read position").
  2. Standardize the Source: Override the "Source" field in the logmon profile with the Cluster Name instead of the individual robot name, so that alarms keep the same Source after a failover to the other node.
  3. Use a unique suppression key: Use a suppression key that includes the specific log line, such as ${PROFILE}${WATCHER}${WATCHERMATCHEDLINE}.

    Result: The newly active node still re-sends the alarms that the first node already sent, but because the Source and suppression key are identical, the NAS increments the count on the existing open alarms instead of opening new ones.

Drawback: No alarms will be missed, but alarm counts will be inaccurate: after a failover, every open alarm with a count of "1" is incremented to "2", even though the underlying event occurred only once. (Further failovers will not increment the counts beyond 2, because the logmon instance on each node will have recorded the new file position by then.)
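For illustration, the de-duplication can be sketched in Python as follows. The alarm store, function, and cluster name "MYCLUSTER" are hypothetical stand-ins, not the actual NAS implementation; the point is that a standardized Source plus a line-based suppression key makes the re-sent alarm collide with the existing open alarm:

```python
open_alarms = {}  # (source, suppression_key) -> count

def receive_alarm(source, profile, watcher, matched_line):
    """Increment the count if an identical open alarm exists, else open a new one."""
    # mirrors a suppression key of ${PROFILE}${WATCHER}${WATCHERMATCHEDLINE}
    key = (source, profile + watcher + matched_line)
    open_alarms[key] = open_alarms.get(key, 0) + 1

# Node A alarms first; node B re-sends after failover. Because both use the
# cluster name as Source, the keys collide and the count goes to 2 instead
# of a second alarm being opened.
receive_alarm("MYCLUSTER", "app-log", "errors", "ERROR disk full")
receive_alarm("MYCLUSTER", "app-log", "errors", "ERROR disk full")
print(open_alarms)  # {('MYCLUSTER', 'app-logerrorsERROR disk full'): 2}
```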

Additional Information

An enhancement request has been filed for native cluster awareness. You can track and upvote the request here: Logmon profiles should be cluster-aware