Problem:
The sqlserver probe is installed on a robot to monitor MS SQL Servers remotely. When the database becomes unavailable, for example after a hardware crash, the probe fails to send the alarm for the check_dbalive checkpoint.
Instead, the customer gets many messages like this:
2016-01-04 07:58:05 | OPEN | Profile <Profile Name>, failed to execute in scheduled time interval, delayed by 308 seconds | 1 | SQL-Server | minor
'failed to execute in scheduled time interval' indicates the probe failed to complete all the checkpoints within the configured timeouts.
Potentially this will affect all revisions of the probe and SQL Server.
The sqlserver probe has several configurable timeouts which limit the time to process all checkpoints. When a timeout is reached, the probe stops processing the checkpoints and generates the above message. Because check_dbalive is one of those checkpoints, it can be skipped when a timeout is hit, in which case its alarm is never sent.
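The effect of checkpoint order under a time limit can be illustrated with a toy model. This is only a sketch, not the probe's actual scheduler: the function name, the checkpoint names other than check_dbalive, the durations, and the single 300-second budget are all assumptions made for illustration.

```python
def run_checkpoints(checkpoints, budget_seconds):
    """Process checkpoints in order until the time budget is used up.

    checkpoints: list of (name, duration_seconds) pairs; the durations are
    made-up stand-ins for how long each checkpoint takes.
    Returns (completed, skipped) as two lists of checkpoint names.
    """
    completed, skipped = [], []
    elapsed = 0
    for index, (name, duration) in enumerate(checkpoints):
        if elapsed >= budget_seconds:
            # Budget exhausted: everything from here on is never processed.
            skipped = [n for n, _ in checkpoints[index:]]
            break
        elapsed += duration
        completed.append(name)
    return completed, skipped


# Hypothetical profile with check_dbalive last, then first.
dbalive_last = [("database_size", 200), ("long_queries", 150),
                ("fg_freespace", 100), ("check_dbalive", 5)]
dbalive_first = [dbalive_last[-1]] + dbalive_last[:-1]

print(run_checkpoints(dbalive_last, 300))
print(run_checkpoints(dbalive_first, 300))
```

In this model check_dbalive lands in the skipped list when it is last, and in the completed list when it is first, even though the budget is the same in both runs.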
The timeouts can be increased or, since checkpoints are processed in order, the check_dbalive checkpoint can be moved to the top so it is processed first.
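For anyone who prefers to script the reordering rather than edit by hand, the following is a minimal sketch. It assumes the configuration keeps each opening and closing tag on its own line, and it moves only the first matching section found after the parent tag. The function name and its parameters are hypothetical and not part of any Nimsoft tooling.

```python
def move_section_to_top(cfg_text, parent="checkpoints", section="check_dbalive"):
    """Return cfg_text with <section>...</section> moved directly under <parent>.

    Assumes every tag sits on its own line. Raises ValueError if the parent
    or the section cannot be found.
    """
    lines = cfg_text.splitlines()
    stripped = [line.strip() for line in lines]

    parent_open = stripped.index(f"<{parent}>")
    start = stripped.index(f"<{section}>", parent_open + 1)
    end = stripped.index(f"</{section}>", start + 1)

    block = lines[start:end + 1]
    rest = lines[:start] + lines[end + 1:]

    # The parent tag comes before the removed block, so its index is unchanged.
    insert_at = parent_open + 1
    return "\n".join(rest[:insert_at] + block + rest[insert_at:]) + "\n"
```

The function works on text only, so reading and writing sqlserver_monitor.cfg is left to the caller. Running it against a copy of the file first makes it easy to compare the result with the original.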
Edit C:\Program Files (x86)\Nimsoft\probes\database\sqlserver\sqlserver_monitor.cfg by moving the section for <check_dbalive> to the top of the <checkpoints> section like this:
<groups>
<UMP>
description = To fill default UMP dashboards
<checkpoints>
<check_dbalive>
active = yes
description = Monitors connectivity to the database instance
qos = yes
qos_list = yes
clear_msg = check_dbalive_1
clear_sev = clear
interval = 5 min
sql_timeout =
scheduling = rules
use_exclude = no
use_include = no
samples = 1
<thresholds>
<default>
<0>
tagid = 0
value = 1
unit =
sev = major
msg = check_dbalive_2
condition = !=
clear_msg = check_dbalive_1
scheduling =
key_col_name =
key_col_value = default
</0>
</default>
</thresholds>
<qos_lists>
<0>
qos_name = check_dbalive
qos_desc = SQL Server Availability
qos_unit = Availability
qos_abbr = Avail.
qos_max = 1
qos_value = status
qos_key =
</0>
</qos_lists>