SQLServer Probe - AgentJob Failure alarming on all failed jobs

Article ID: 138076

Updated On:

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

We configured SQL Server monitoring and enabled the agent_job_failure checkpoint, but it does not generate the expected alarms.

We tried adjusting both the minimum and maximum thresholds, but the number of alarms never decreases.

The checkpoint generates a flood of alarms, which impacts UIM performance.

We compared the alarm results with the results of the SQL query described in the following article:

KB Article 34961

 

Environment

Environment Details:

  • UIM version 9.0.2
  • SQLServer probe 5.42
  • Robot 7.96

Cause

  • An incorrect configuration package had been deployed to the affected robots.

 

Resolution

It appears that a configuration package was applied to the probe. The package had a default setting of unit = days in the agent_job_failure threshold, which caused the issue.

Once we reset the value back to unit =

we were able to test the checkpoint and confirm that it works as expected.

The agent_job_failure section with the unit = days threshold:

 <agent_job_failure>
    active = no
    send_alarm = yes
    description = Monitors failed agents jobs in defined interval (in minutes).
    qos = no
    qos_list = no
    clear_msg = failed_jobs_1
    clear_sev = clear
    scheduling = rules
    column = elapsed_time
    key = $job_id.$category_name.$rundate
    exclude_defs = yes
    include_defs = yes
    use_exclude = no
    use_include = no
    condition = <=
    samples = 1
    clear_alarms = 1
    msg_variables = $check.x;$profile.x;$instance.x;$job_id.x;$job_name.x;$category_name.x;$rundate.x;$elapsed_time.n
    interval = 5 min
    type = 2
    sql_timeout =  
    <thresholds>
       <default>
          <0>
             tagid = 0
             value = 5
             unit = days
             sev = critical
etc
etc
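
When the same package has been pushed to many robots, it can help to scan each probe configuration for the threshold unit rather than opening every file by hand. The following is a minimal sketch in Python, not part of UIM tooling. It assumes the configuration text uses the nesting shown above (a `<name>` line opens a section, a `</name>` line closes it, and settings are `key = value` lines). The names `threshold_units` and `flag_units` are hypothetical, and the only unit treated as suspect is `days`, because that is the value this article identifies as the cause.

```python
import re

# A section opens with "<name>" and closes with "</name>" (assumed format).
SECTION_OPEN = re.compile(r"^<([^/][^>]*)>$")
SECTION_CLOSE = re.compile(r"^</([^>]*)>$")


def threshold_units(cfg_text, checkpoint="agent_job_failure"):
    """Return (section_path, value, unit) for each threshold of a checkpoint."""
    path = []      # stack of currently open section names
    values = {}    # section path -> last "value" setting seen there
    found = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if SECTION_CLOSE.match(line):
            if path:
                path.pop()
            continue
        opened = SECTION_OPEN.match(line)
        if opened:
            path.append(opened.group(1))
            continue
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        key, val = key.strip(), val.strip()
        # Only look at settings nested under <checkpoint> ... <thresholds>.
        if checkpoint not in path or "thresholds" not in path:
            continue
        where = "/".join(path)
        if key == "value":
            values[where] = val
        elif key == "unit":
            found.append((where, values.get(where), val))
    return found


def flag_units(hits, suspect_units=("days",)):
    """Keep only the thresholds whose unit is in the suspect list."""
    return [hit for hit in hits if hit[2] in suspect_units]


# Fragment modeled on the section shown in this article (truncated the same way).
SAMPLE = """
<agent_job_failure>
    active = no
    condition = <=
    interval = 5 min
    <thresholds>
       <default>
          <0>
             tagid = 0
             value = 5
             unit = days
             sev = critical
"""

for where, value, unit in flag_units(threshold_units(SAMPLE)):
    print(f"{where}: value={value} unit={unit}")
```

To check a real robot, read the probe's configuration file into a string and pass it to `threshold_units`. The file location depends on the installation, so it is left out here.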