Deactivate DX UIM probes self-monitoring alarms

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

A Self-Monitoring alarm is triggered when Automonitor generation fails, a Static monitor fails, or Data collection fails.

Example of an alarm in the vmware probe:

Self-Monitoring Failures for 'ESX01:VM.GuestMemoryUsage': Data Collection (2 of 10 failed). See vmware.log for more details

Example of an alarm in the HP_3Par probe:

Failed to update/fix details for static monitor 'HO-3PAR2.Storage Total Capacity'

Environment

DX UIM 20.4.* / 23.4.*
Probes with self-monitoring enabled (any version):
- websphere_mq
- XenServer
- OpenShift
- vmware
- HP_3Par
- hitachi
- sap_basis
- vnxe_monitor

Cause

Guidance

Resolution

This alarm feature is available in some monitoring probes.

The purpose of the alarm:

The alarm indicates a problem with data collection for a monitor (metric).

How to disable the alarm feature:

Open your probe in Raw Configure and add the following key under the <setup> section.

enable_self_monitoring_alarm = false

How to change the severity of the alarm:

Open your probe in Raw Configure and add the following key under the <setup> section.

self_monitoring_alarm_severity = <Desired number>

Valid values: 5 (Critical), 4 (Major), 3 (Minor), 2 (Warning), 1 (Informational). The default is 4.

How to generate the alarm per failed metric rather than per failed metric type:

Open your probe in Raw Configure and add the following key under the <setup> section.

enable_self_monitoring_alarm_aggregation = false

By default, the probe aggregates self-monitoring alarms based on monitor type.

For example, if data collection for the "GuestMemoryUsage" metric fails for 2 VMs, the probe aggregates the failures and generates only one alarm:

Self-Monitoring Failures for 'ESX01:VM.GuestMemoryUsage': Data Collection (2 of 10 failed). See vmware.log for more details

The aggregated alarm indicates how many data collections failed (for example, 2 of 10). With enable_self_monitoring_alarm_aggregation set to false, the probe generates a separate alarm for each failure, as shown below.

Failed to collect data for monitor 'VM01.GuestMemoryUsage'. Updated value will not be available.
Failed to collect data for monitor 'VM02.GuestMemoryUsage'. Updated value will not be available.
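
When these alarms are processed outside the probe, the aggregated message can be split into its parts. The following Python sketch is illustrative only: the pattern is inferred from the single example alarm in this article, other probes or versions may word the message differently, and the function name parse_aggregated_alarm is hypothetical.

```python
import re

# Pattern inferred from the one example alarm in this article (assumption);
# it is not taken from any probe specification.
AGGREGATED_ALARM = re.compile(
    r"Self-Monitoring Failures for '(?P<resource>[^:']+):(?P<metric>[^']+)': "
    r"(?P<failure_type>.+?) \((?P<failed>\d+) of (?P<total>\d+) failed\)"
)


def parse_aggregated_alarm(message):
    """Return the fields of an aggregated alarm, or None if it does not match."""
    match = AGGREGATED_ALARM.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the two counters from text to integers.
    fields["failed"] = int(fields["failed"])
    fields["total"] = int(fields["total"])
    return fields


example = (
    "Self-Monitoring Failures for 'ESX01:VM.GuestMemoryUsage': "
    "Data Collection (2 of 10 failed). See vmware.log for more details"
)
print(parse_aggregated_alarm(example))
```

Per-metric alarms such as "Failed to collect data for monitor ..." do not match this pattern, so the function returns None for them.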

How to avoid resending the same alarm while the data collection failure persists:

Open your probe in Raw Configure and add the following key under the <setup> section.

enable_self_monitoring_alarm_same_error_suppression = true

By default, the probe resends the same failing self-monitoring alarm, with the same suppression key, on every probe collection cycle.
With this setting, the alarm is sent only on the first occurrence, when the number of errors changes, or when the probe is restarted.
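
The suppression rule can be pictured with a small Python model. This is a sketch of the behavior as described in this article, not the probe's actual implementation; the class name SameErrorSuppressor and the alarm key are hypothetical, and creating a new instance stands in for a probe restart.

```python
class SameErrorSuppressor:
    """Illustrative model only; not the probe's actual implementation."""

    def __init__(self):
        # Last failure count sent per alarm key. A fresh instance models a
        # probe restart, which starts with no remembered alarms.
        self._last_sent = {}

    def should_send(self, alarm_key, failed_count):
        """Return True when an alarm would be sent for this collection cycle."""
        if self._last_sent.get(alarm_key) == failed_count:
            return False  # same error with the same count: suppressed
        self._last_sent[alarm_key] = failed_count
        return True


suppressor = SameErrorSuppressor()
key = "ESX01:VM.GuestMemoryUsage"  # hypothetical alarm key
print(suppressor.should_send(key, 2))  # True: first occurrence
print(suppressor.should_send(key, 2))  # False: nothing changed
print(suppressor.should_send(key, 3))  # True: number of errors changed
```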

Finally, this error can also be caused by outdated static alarm definitions.

Additional Information

Note 1: These Self-Monitoring Alarm Failures are aggregated for an element.metric type per resource. Individual failure details or related exceptions should precede this log entry.
Note 2: 'Monitor Correlation' failures occur when a monitor does not find its specific element in the inventory, or no metric value is available for the element.
With static monitors and changing inventory, these are sometimes expected and may be transitory.
Note 3: The count of 'Data Collection' failures often correlates with the count of 'Monitor Correlation' failures.
When there are only 'Data Collection' failures, or when they exceed 'Monitor Correlation' failures, that usually indicates a problem in collecting that metric value.
Some metric values are only available with additional system administration.
Some metric values are only available for specific element types. For instance, one type of storage might have a metric, while another does not.
Generally, it is desirable to understand 'Data Collection' failures for desired metrics, and sometimes the probe needs to be tuned for them.
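
The rule of thumb in Note 3 can be written as a simple comparison. The Python sketch below assumes the two failure counts have already been read from the probe log or alarms; the function name suggests_collection_problem is hypothetical, and the result is only a hint for where to look, not a diagnosis.

```python
def suggests_collection_problem(data_collection_failures, monitor_correlation_failures):
    """Apply the Note 3 rule of thumb to two non-negative failure counts.

    True means 'Data Collection' failures occur alone or exceed
    'Monitor Correlation' failures, which usually points to a problem
    collecting that metric value.
    """
    return data_collection_failures > monitor_correlation_failures


print(suggests_collection_problem(2, 0))  # True: only Data Collection failures
print(suggests_collection_problem(5, 2))  # True: exceeds Monitor Correlation failures
print(suggests_collection_problem(2, 2))  # False: counts match, likely inventory-related
```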