cdm probe alarm severity is inconsistent across machines for the same alarm type

Article ID: 424686

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Symptoms:

  • Multiple instances of the CDM probe have been configured through a superpackage to use identical alarm settings.
  • Some instances send a given alarm with major severity, while others send the same alarm with critical severity.
  • The thresholds are the same and the "high" and "low" alarms are configured identically, but the message severity still differs.

Environment

DX UIM - Any Version
CDM probe 5.41 or later

Cause

The severity of some CDM messages has changed between probe versions and across platforms. Depending on which version was installed first and which versions have been applied since, the message definitions on different machines may become inconsistent.

Resolution

In CDM, the alarm definitions essentially have two parts:

1. The thresholds which are set
2. The messages and severities which are used

Examining cdm.cfg, in particular the <messages> section, should reveal the discrepancy.

For example, a filesystem may be set to alarm at 10% for the "high" threshold and 20% for the "low" threshold. In cdm.cfg, the entry for that disk appears as follows:

            <error>
               active = yes
               message = DiskError
               threshold = 10
            </error>
            <warning>
               active = yes
               message = DiskWarning
               threshold = 20
            </warning>

The "message" field defines which message is used for each threshold type, for example "message = DiskError".

Lower in the cdm.cfg file you will find the <messages> section, which contains the DiskError message definition. This is where the severity is defined.

For example:

<DiskError>
   text = Average ($value_number samples) disk free on $drive is now $value$unit, which is <= error threshold ($value_limit$unit) out of total size $size_gb GB
   level = major
   token = disk_error
   i18n_token = as#system.cdm.avrg_drive_diskfree_below_err_threshold
   clear_text = Disk clear - $drive average $value$unit
</DiskError>

Here the message is defined with "level = major", which means that any alarm that references the "DiskError" message will be sent as a "major" alarm.
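To find which message definitions differ between two machines, the "level" values in each cdm.cfg can be extracted and compared. The following Python is only a minimal sketch: the function names are hypothetical, the regex parsing is a simplification and not an official UIM parser, and it assumes that both files have been copied locally and that message blocks inside <messages> are not nested.

```python
import re


def message_levels(cfg_text):
    """Return {message_name: level} for each block inside <messages>.

    Assumption: the section is delimited by <messages> ... </messages> and
    every message is a flat <Name> ... </Name> block with a "level = x" line.
    """
    section = re.search(r"<messages>(.*?)</messages>", cfg_text, re.DOTALL)
    if section is None:
        return {}
    levels = {}
    # The backreference \1 pairs each opening tag with its closing tag.
    for name, body in re.findall(r"<(\w+)>(.*?)</\1>", section.group(1), re.DOTALL):
        match = re.search(r"^\s*level\s*=\s*(\S+)", body, re.MULTILINE)
        if match:
            levels[name] = match.group(1)
    return levels


def level_differences(cfg_a, cfg_b):
    """Return {message_name: (level_in_a, level_in_b)} where the levels differ."""
    a, b = message_levels(cfg_a), message_levels(cfg_b)
    return {
        name: (a[name], b[name])
        for name in a.keys() & b.keys()
        if a[name] != b[name]
    }
```

Passing the contents of cdm.cfg from two machines to level_differences returns only the messages whose severity differs, for example DiskError mapped to ("major", "critical").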

If you wish to change the severity of the alarm, change the "level" value here, and then all alarms which use this message definition will have the severity changed.

So, for a critical alert, you would set it as follows:

<DiskError>
   text = Average ($value_number samples) disk free on $drive is now $value$unit, which is <= error threshold ($value_limit$unit) out of total size $size_gb GB
   level = critical
   token = disk_error
   i18n_token = as#system.cdm.avrg_drive_diskfree_below_err_threshold
   clear_text = Disk clear - $drive average $value$unit
</DiskError>

Valid values are:

info, warning, minor, major, critical
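
When the same correction has to be made in many cdm.cfg files, the edit described above could be scripted. The following Python is a rough sketch under stated assumptions: set_message_level is a hypothetical helper and not part of UIM, it works on the text of a cdm.cfg that has already been read into memory, and it accepts only the values listed above. Writing the file back and deploying it to the probe are outside the sketch.

```python
import re

# Severity values taken from the list of valid values in this article.
VALID_LEVELS = {"info", "warning", "minor", "major", "critical"}


def set_message_level(cfg_text, message, level):
    """Return cfg_text with the "level" of one message definition replaced.

    Assumption: the message is a flat <Name> ... </Name> block that
    contains a single "level = x" line.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported level: {level}")
    name = re.escape(message)
    block = re.search(rf"<{name}>.*?</{name}>", cfg_text, re.DOTALL)
    if block is None:
        raise KeyError(f"message not found: {message}")
    # Replace only the value, keeping the original indentation and spacing.
    updated, count = re.subn(
        r"^(\s*level\s*=\s*)\S+",
        lambda m: m.group(1) + level,
        block.group(0),
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise KeyError(f"no level line in message: {message}")
    return cfg_text[:block.start()] + updated + cfg_text[block.end():]
```

Only the named message block is modified, so other messages such as DiskWarning keep their existing severity.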