MCS Alarm Policies Not Alarming

Article ID: 205952

Products

CA Unified Infrastructure Management On-Premise (Nimsoft / UIM), NIMSOFT PROBES, DX Infrastructure Management

Issue/Introduction

An alarm policy can be defined, but it does not raise alarms on devices whose metrics violate the threshold values. The policy can be enabled and disabled, but deleting it does not work.

Installed versions: mon_config_service and mon_config_service_recon 20.31hf1, and OC 20.32 (OC 20.3.2 patch fix installed).

 

Cause

1. Threshold/baseline settings

> The screenshots show the alarm threshold for Swap Memory Usage (Megabytes) with these settings: warning, static, > 1 MB, immediate (no time over threshold). The metric value for server1 in the screenshot was 3 MB yesterday and is now 7 MB, yet no alarm has been raised.
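As a quick sanity check on the settings above, the comparison can be sketched in Python. This is only an illustration of a static threshold test, not the probe's actual logic; the operator code GE is taken from plugin_metric.cfg, while GT, LT and LE are assumed names for the other comparisons.

```python
import operator

# GE appears in plugin_metric.cfg; GT, LT and LE are assumed codes for illustration.
OPERATORS = {
    "GT": operator.gt,
    "GE": operator.ge,
    "LT": operator.lt,
    "LE": operator.le,
}


def violates(value, op, threshold):
    """Return True when a sample breaches a static threshold."""
    return OPERATORS[op](value, threshold)


# Values from this case: threshold "> 1 MB", samples of 3 MB and 7 MB.
for sample_mb in (3, 7):
    print(sample_mb, violates(sample_mb, "GT", 1))
```

Both samples breach the threshold, so the missing alarm points at the policy not being applied rather than at the threshold values themselves.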

 

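2. Plugin metrics: go to the plugin_metric directory under the UIM installation (plugins/plugin_metric) and check plugin_metric.cfg to see whether the policy's threshold value is present.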

[[email protected] ~]# cat /opt/nimsoft/plugins/plugin_metric/plugin_metric.cfg   
 
<spooler-metrics>                                                              
   <url_response>
      <2688>
         <QOS_URL_DOWNLOAD_TIME>
            subsys =
            publish_qos = true
            qos_target = ~.*
            qos_source = ~.*
            qos_label = Download time : Milliseconds
            alarm = false
            publish_baseline = false
            qos_name = QOS_URL_DOWNLOAD_TIME
         </QOS_URL_DOWNLOAD_TIME>
         <QOS_URL_LASTBYTE_TIME>
            alarm = false
            qos_source = ~.*
            publish_baseline = false
            qos_target = ~.*
            subsys =
            qos_name = QOS_URL_LASTBYTE_TIME
            publish_qos = true
            qos_label = Time to last byte : Milliseconds
         </QOS_URL_LASTBYTE_TIME>
         <QOS_URL_FIRSTBYTE_TIME>
            alarm = false
            publish_baseline = false
            subsys =
            qos_source = ~.*
            publish_qos = true
            qos_name = QOS_URL_FIRSTBYTE_TIME
            qos_label = Time to first byte : Milliseconds
            qos_target = ~.*
         </QOS_URL_FIRSTBYTE_TIME>
         <QOS_URL_DNSRESOLVE_TIME>
            subsys =
            qos_target = ~.*
            publish_qos = true
            alarm = false
            qos_name = QOS_URL_DNSRESOLVE_TIME
            qos_source = ~.*
            qos_label = DNS resolution time : Milliseconds
            publish_baseline = false
         </QOS_URL_DNSRESOLVE_TIME>
         <QOS_URL_STRINGFOUND>
            qos_label = Sub-string found : State
            qos_target = ~.*
            qos_name = QOS_URL_STRINGFOUND
            alarm = false
            publish_baseline = false
            publish_qos = true
            subsys =
            qos_source = ~.*
         </QOS_URL_STRINGFOUND>
         <QOS_URL_BYTES_SEC>
            alarm = false
            qos_label = Fetch time bytes per second : Bytes/second
            qos_name = QOS_URL_BYTES_SEC
            publish_baseline = false
            subsys =
            qos_target = ~.*
            qos_source = ~.*
            publish_qos = true
         </QOS_URL_BYTES_SEC>
         <QOS_URL_TCPCONNECT_TIME>
            qos_name = QOS_URL_TCPCONNECT_TIME
            alarm = false
            qos_target = ~.*
            qos_label = TCP connect time : Milliseconds
            publish_qos = true
            publish_baseline = false
            qos_source = ~.*
            subsys =
         </QOS_URL_TCPCONNECT_TIME>
         <QOS_URL_REDIRECT_TIME>
            qos_source = ~.*
            subsys =
            qos_name = QOS_URL_REDIRECT_TIME
            publish_baseline = false
            qos_target = ~.*
            publish_qos = true
            alarm = false
            qos_label = Redirect Time : Milliseconds
         </QOS_URL_REDIRECT_TIME>
         <QOS_URL_BYTES>
            alarm = false
            qos_label = Fetch size in  bytes : Bytes
            qos_name = QOS_URL_BYTES
            qos_source = ~.*
            publish_qos = true
            subsys =
            publish_baseline = false
            qos_target = ~.*
         </QOS_URL_BYTES>
         <QOS_URL_RESPONSE>
            alarm = false
            qos_name = QOS_URL_RESPONSE
            publish_baseline = false
            subsys =
            qos_source = ~.*
            qos_target = ~.*
            publish_qos = true
            qos_label = Response time : Milliseconds
         </QOS_URL_RESPONSE>
      </2688>
   </url_response>
   <policy_27>
      <url_response>
         <metric_85>
            alarm = true
            qos_target = ~.*
            qos_name = QOS_URL_DNSRESOLVE_TIME
            policy_id = 27
            qos_source = ~.*
            metric_type_id = 2.2.2.2:4
            metric_precedence = 100
            <alarms>
               <threshold_90>
                  custom_message = ${metric_name} on ${component_name} for ${device_name} is at ${metric_value} ${metric_unit}.
                  ttt = false
                  tot = false
                  custom_clear_message = ${metric_name} on ${component_name} for ${device_name} is OK.
                  threshold = 20.0
                  severity = 1
                  operator = GE
                  thresh_type = static
               </threshold_90>
            </alarms>
         </metric_85>
      </url_response>
   </policy_27>
</spooler-metrics>
<setup>
   send_baselines = 1
</setup>
<default_alarms>
   <static_1>
      thresh_type = static
      name = static_1
      tot = 0
      ttt = 0
      clear = 0
      group = standard_1
      <alarm_fields>
         subject = alarm
         udata.token = as#standard.alarm.format.hi.plain
         udata.message = ${metric_name} on ${component_name} for ${device_name} is at ${metric_value} ${metric_unit}.
         udata.values.metric_name = ${metric_name}
         udata.values.component_name = ${component_name}
         udata.values.device_name = ${device_name}
         udata.values.metric_value = ${metric_value}
         udata.values.metric_unit = ${metric_unit}
         udata.values.threshold_value = ${threshold_value}
         udata.values.op = ${threshold_operator}
         udata.values.sampletime = [msg.udata.sampletime]
         udata.values.samplevalue = [msg.udata.samplevalue]
         udata.source = ${device_name}
         udata.policy_id = ${policy_id}
      </alarm_fields>
   </static_1>
   <tot_1>
      thresh_type = static
      tot = 1
      ttt = 0
      clear = 0
      name = static_tot
      group = standard_1
      <alarm_fields>
         udata.token = as#standard.alarm.format.tot
         udata.message = ${metric_name} on ${component_name} for ${device_name} is at ${metric_value} ${metric_unit}. It has violated the threshold for at least ${tot_slider_window} ${tot_sliding_window_unit} out of ${tot_time_window} ${tot_time_window_unit}.
         udata.values.metric_name = ${metric_name}
         udata.values.component_name = ${component_name}
         udata.values.device_name = ${device_name}
         udata.values.metric_value = ${metric_value}
         udata.values.metric_unit = ${metric_unit}
         udata.values.tot_time_frame = ${tot_time_window}
         udata.values.tot_slider = ${tot_slider_window}
         udata.values.tot_time_frame_units = ${tot_time_window_unit}
         udata.values.tot_slider_units = ${tot_sliding_window_unit}
         udata.values.threshold_value = ${threshold_value}
         udata.values.op = ${threshold_operator}
         udata.policy_id = ${policy_id}
         udata.values.sampletime = [msg.udata.sampletime]
         udata.values.samplevalue = [msg.udata.samplevalue]
         udata.source = ${device_name}
      </alarm_fields>
   </tot_1>
   <clear_1>
      thresh_type = static
      tot = 0
      ttt = 0
      clear = 1
      name = static_clear2
      group = standard_1
      <alarm_fields>
         udata.token = as#standard.alarm.format.clear
         udata.message = ${metric_name} on ${component_name} for ${device_name} is OK.
         udata.values.metric_name = ${metric_name}
         udata.values.component_name = ${component_name}
         udata.values.device_name = ${device_name}
         udata.values.sampletime = [msg.udata.sampletime]
         udata.values.samplevalue = [msg.udata.samplevalue]
         udata.source = ${device_name}
      </alarm_fields>
   </clear_1>
</default_alarms>
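To see at a glance which policy thresholds have reached a robot, a file like the one above can be scanned with a small script. The following is a minimal sketch, assuming the file uses only `<section>` / `</section>` lines and `key = value` lines as shown here; the function names parse_cfg and alarming_metrics are hypothetical helpers, not part of UIM.

```python
def parse_cfg(text):
    """Parse <section> ... </section> blocks with 'key = value' lines into nested dicts."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</"):
            # Closing tag: step back to the parent section.
            if len(stack) > 1:
                stack.pop()
        elif line.startswith("<") and line.endswith(">"):
            # Opening tag: start a nested section.
            child = {}
            stack[-1][line[1:-1]] = child
            stack.append(child)
        elif "=" in line:
            key, _, value = line.partition("=")
            stack[-1][key.strip()] = value.strip()
    return root


def alarming_metrics(cfg):
    """Yield (policy, qos_name, operator, threshold) for metrics with alarm = true."""
    for policy, probes in cfg.get("spooler-metrics", {}).items():
        if not policy.startswith("policy_") or not isinstance(probes, dict):
            continue
        for probe in probes.values():
            if not isinstance(probe, dict):
                continue
            for metric in probe.values():
                if not isinstance(metric, dict) or metric.get("alarm") != "true":
                    continue
                for th in metric.get("alarms", {}).values():
                    if isinstance(th, dict):
                        yield (policy, metric.get("qos_name"),
                               th.get("operator"), th.get("threshold"))


# Trimmed copy of the policy_27 section from the file above.
SAMPLE = """
<spooler-metrics>
   <policy_27>
      <url_response>
         <metric_85>
            alarm = true
            qos_name = QOS_URL_DNSRESOLVE_TIME
            <alarms>
               <threshold_90>
                  threshold = 20.0
                  operator = GE
               </threshold_90>
            </alarms>
         </metric_85>
      </url_response>
   </policy_27>
</spooler-metrics>
"""

print(list(alarming_metrics(parse_cfg(SAMPLE))))
```

If a policy defined in OC does not show up in this list for the target robot, the policy has likely not been deployed to that robot.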
 
 

Environment

Release : 20.3

Component : UIM - ALARM POLICY

Windows Server 2016

Red Hat Enterprise Linux Server release 7.9

Microsoft SQL Server 2016

Resolution

The following entry in policy_management.log on the OC system indicated a problem reaching adminconsole on the primary hub via HTTP:

C:\Program Files (x86)\Nimsoft\probes\service\wasp\policy_management.log

DEBUG message:

2021-01-08 15:47:14,737 DEBUG com.ca.uim.policy.management.events.service.HeartBeatService:registerThisNode:243 [Timer-1]   - Registering the policy node to [email protected]://10.40.130.10:0/adminconsoleapp
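One reading of the entry above is that the policy node registers against port 0, which would fit a blank http_port in wasp.cfg; this is an interpretation rather than something the log states outright. A minimal Python sketch for pulling the host and port out of such a line follows; the helper name registration_endpoint is hypothetical.

```python
import re

# Matches the "://host:port/" part of a registration line.
URL_RE = re.compile(r"://(?P<host>[^:/\s]+):(?P<port>\d+)/")


def registration_endpoint(line):
    """Return (host, port) from a registration log line, or None if absent."""
    match = URL_RE.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


# Tail of the log entry; the text before "://" is garbled in the article and omitted here.
LINE = "Registering the policy node to ://10.40.130.10:0/adminconsoleapp"

host, port = registration_endpoint(LINE)
print(host, port)
if port == 0:
    print("Port is 0: check http_port in wasp.cfg on the primary hub")
```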

The policy state was then checked in the SQL database.

A. SQL DB: ran a query to check whether the policies were pending or OK

select * from Policy

- Policies were showing as pending (not working)

B. On the primary hub:

1. Open wasp.cfg in Raw Configure.

2. Change http_port from blank to 80.

3. Restart wasp. (Note: the preference is to have HTTP disabled and use only HTTPS if possible.)
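Before editing, the current value can be confirmed from a copy of wasp.cfg. The following Python sketch assumes the key appears as a plain `http_port = value` line, as in the other probe configuration files; the helper name http_port_setting is hypothetical.

```python
def http_port_setting(cfg_text):
    """Return the first http_port value, '' if the key is blank, None if the key is absent."""
    for raw in cfg_text.splitlines():
        key, sep, value = raw.partition("=")
        if sep and key.strip() == "http_port":
            return value.strip()
    return None


# Illustrative fragments only, not a full wasp.cfg.
BEFORE = "<setup>\n   http_port =\n</setup>"
AFTER = "<setup>\n   http_port = 80\n</setup>"

print(repr(http_port_setting(BEFORE)))
print(repr(http_port_setting(AFTER)))
```

An empty string corresponds to the blank setting seen in this case.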

C. Outcome after these steps

1. /opt/nimsoft/plugins/plugin_metric/plugin_metric.cfg is now being updated.

2. Alarms are generated, and alarm policies can be deleted.

3. SQL DB: ran the same query to check whether the policies were pending or OK

select * from Policy

- Policies were showing as OK (all working)

Additional Information

The Delete button for alarm policies is now working.

1. Enable the alarm policy.

2. Click the delete button for the alarm policy.

 

Attachments

1610379010946__alarm policy setup.PNG