How does the sqlserver "agent_job_failure" checkpoint work?


Article ID: 34961


Updated On:


NIMSOFT PROBES DX Infrastructure Management


This article provides a simple example so you can have a better understanding on how the "agent_job_failure" checkpoint works.

How does the "sqlserver" UIM probe determine "agent_job_failure"?


Component: UIMSQS


So, let's take a look at the query:

select h.job_id as job_id, as job_name, category_name,
dateadd(hh,run_time/10000,dateadd(mi,run_time%10000/100,dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))) as rundate, datediff(minute, dateadd(hh,run_time/10000, dateadd(mi,run_time%10000/100, dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))), getdate()) as elapsed_time
from msdb.dbo.sysjobhistory h INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id
inner join msdb.dbo.syscategories c on j.category_id = c.category_id
where c.category_class = 1 and h.run_status <> 1 and h.step_id = 0 and
dateadd(hh,run_time/10000, dateadd(mi,run_time%10000/100, dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))) > dateadd(ss,-60*60*36,getdate())
order by elapsed_time

This is the query used by the checkpoint and the one we'll be using in our example to see values such as:

- job_id, job_name, category_name, rundate and elapsed_time.

Above all, "elapsed_time" is the most important one here.

By definition, elapsed_time is equal to Current time - Job started time

When we set a threshold in our agent_job_failure checkpoint, what we are doing in reality is set a value (in minutes) which will generate an alarm for every job failure whose elapsed_time is lower than the threshold.

In other words, let's say we set a threshold of 5 (5 minutes).

We create a job failure that will be running for 5 minutes (it will fail once per minute) as follows:

1) We create a job failure and set a schedule that goes from 5:00 p.m to 5:05 p.m with an interval of 1 minute. So, we'll be having a job failing 5 times within 5 minutes.

2) Let's say we set our "check_interval" parameter in our agent_job_failure checkpoint to 1 (it will check the checkpoint every 1 minute).

3) Let's write down the times at which our job is failing:

Let's assume our current time is 5:06 p.m.

At 5:00, job failed, elapsed_time (current time - time job failed) = 6?
At 5:01, job failed, elapsed_time = 5
At 5:02, job failed, elapsed_time = 4
At 5:03, job failed, elapsed_time = 3
At 5:04, job failed, elapsed_time = 2
At 5:05, job failed, elapsed_time = 1

With this scenario and our checkpoint's check interval equal to 1, the first job will be cleared at 5:06.

Why? The same way that in order to have an alarm for a job failure the elapsed_time has to be lower than the threshold, for a job failure to have a clear alarm, the elapsed_time must be higher than the threshold.

So, at 5:06, there's just one elapsed_time higher (6) than the threshold, which is 5, and this is the time when the first job that failed will be cleared.

At 5:07, we'll have, elapsed_time from the first job failure equal to 7, from the second, equal to 6, from the third, equal to 5, and so on.

So, at 5:07, we'll just have 4 alarms remaining.

At 5:08, we'll have 3 alarms remaining and eventually, at 5:11, we should have no alarms remaining in our Alarm Window since all the jobs that failed have their elapsed_time higher than the threshold (5 in this example).

Additional Information

Important key points to mention:

The Hint Editor mentions that this checkpoints gets the failed jobs data within the last 36 hours. This means that the checkpoints retrieves the last 36 hours of failed jobs from the SQL Server database every time it runs. And, that also means that, you can't keep track of job that failed two days ago. Maximum time you can keep track of a job failing is 36 hours and, in order to get to that time, you have to increase the threshold to 2160 (number of minutes in 36 hours).

Jobs get cleared when their elapsed_time is higher than the threshold, not when the job turns from failure to success.

Finally, above all, in order to have alarms cleared, we must set the key "clear_alarms" to 1 in the probe's Raw Configure.