How does the sqlserver "agent_job_failure" checkpoint work?

Products

DX Unified Infrastructure Management (Nimsoft / UIM) CA Unified Infrastructure Management On-Premise (Nimsoft / UIM) CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

This article provides a simple example so you can have a better understanding on how the "agent_job_failure" checkpoint works.

How does the "sqlserver" UIM probe determine "agent_job_failure"?

Environment

Release:
Component: UIMSQS

Resolution

So, let's take a look at the query:

MS SQL Server 2012 or higher:

SET DEADLOCK_PRIORITY LOW  SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED select h.job_id as job_id, j.name as job_name, c.name category_name, dateadd(hh,run_time/10000,dateadd(mi,run_time%10000/100,dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))) as rundate, datediff(minute, dateadd(hh,run_time/10000, dateadd(mi,run_time%10000/100, dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))), getdate()) as elapsed_time from msdb.dbo.sysjobhistory h  INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id inner join msdb.dbo.syscategories c on j.category_id = c.category_id where c.category_class = 1 and h.run_status <> 1 and h.step_id = 0 and  dateadd(hh,run_time/10000, dateadd(mi,run_time%10000/100, dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))) > dateadd(ss,-606036,getdate()) order by elapsed_time

This is the query used by the checkpoint and the one we'll be using in our example to see values such as:

- job_id, job_name, category_name, rundate and elapsed_time.

Above all, "elapsed_time" is the most important one here.

By definition, elapsed_time is equal to Current time - Job started time

When we set a threshold in our agent_job_failure checkpoint, what we are doing in reality is set a value (in minutes) which will generate an alarm for every job failure whose elapsed_time is lower than the threshold.

In other words, let's say we set a threshold of 5 (5 minutes).

We create a job failure that will be running for 5 minutes (it will fail once per minute) as follows:

1) We create a job failure and set a schedule that goes from 5:00 p.m to 5:05 p.m with an interval of 1 minute. So, we'll be having a job failing 5 times within 5 minutes.

2) Let's say we set our "check_interval" parameter in our agent_job_failure checkpoint to 1 (it will check the checkpoint every 1 minute).

3) Let's write down the times at which our job is failing:

Let's assume our current time is 5:06 p.m.

At 5:00, job failed, elapsed_time (current time - time job failed) = 6?
At 5:01, job failed, elapsed_time = 5
At 5:02, job failed, elapsed_time = 4
At 5:03, job failed, elapsed_time = 3
At 5:04, job failed, elapsed_time = 2
At 5:05, job failed, elapsed_time = 1

With this scenario and our checkpoint's check interval equal to 1, the first job will be cleared at 5:06.

Why? The same way that in order to have an alarm for a job failure the elapsed_time has to be lower than the threshold, for a job failure to have a clear alarm, the elapsed_time must be higher than the threshold.

So, at 5:06, there's just one elapsed_time higher (6) than the threshold, which is 5, and this is the time when the first job that failed will be cleared.

At 5:07, we'll have, elapsed_time from the first job failure equal to 7, from the second, equal to 6, from the third, equal to 5, and so on.

So, at 5:07, we'll just have 4 alarms remaining.

At 5:08, we'll have 3 alarms remaining and eventually, at 5:11, we should have no alarms remaining in our Alarm Window since all the jobs that failed have their elapsed_time higher than the threshold (5 in this example).

Additional Information

Important key points to mention:

The Hint Editor mentions that this checkpoints gets the failed jobs data within the last 36 hours. This means that the checkpoints retrieve the last 36 hours of failed jobs from the SQL Server database every time it runs. And, that also means that you can't keep track of a job that failed two days ago. Maximum time you can keep track of a job failing is 36 hours and, in order to get to that time, you have to increase the threshold to 2160 (number of minutes in 36 hours).

Points to keep in mind:

Jobs get cleared when their elapsed_time is higher than the threshold value, NOT when the job turns from failure to success
The underlying query monitors job failures - it does NOT monitor successful jobs
Finally, above all, to have alarms cleared, set the key "clear_alarms" to 1 in the probe's Raw Configure

Alternatively, the sqlserver probe allows you to create your own custom query.

https://knowledge.broadcom.com/external/article/39422/how-to-create-a-new-custom-checkpoint-us.html

You might be able to clear alarms with some more flexibility using a stored procedure that queries job activity like the one described at this link-> https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-help-jobactivity-transact-sql?view=sql-server-ver16

It does track the success or failure of a job.

run_status

int Status returned from the last run of the job:

0 = Error failed

1 = Succeeded

3 = Canceled

5 = Status unknown