Automatically validate hdb and spooler probes via script

Article ID: 34372

Updated On:

Products

DX Unified Infrastructure Management (Nimsoft / UIM) NIMSOFT PROBES

Issue/Introduction

The following script verifies and reactivates any probe for which the controller has generated an alarm like the following:
 
Jun 3 10:59:48:951 [3086726848] Controller: Probe 'spooler' FAILED to start, file check determines changes in the probe?
 
This type of issue is usually seen following an IP address or hardware change in the environment.
Probes on a robot start in a defined order, and some of them depend on other probes.
Occasionally the probes start out of order and cause issues like this. This usually happens on systems with resource issues.


Environment

Release:
Component: CAUIM

Resolution

Step 1: Create script

In the nas Auto-Operator -> Scripts section, create a new script with the following contents:

-- Start of script

local al = alarm.list() -- Get alarm list
local re = "'([^']+)'" -- Lua pattern that captures the probe name between the first pair of single quotes

if al ~= nil then
   for i = 1,#al do
      if al[i].prid == "controller" then -- First, filter to get alarms from controller probe only
         if string.match(al[i].message,"FAILED to start") then -- Second, filter to get controller alarms with specific text, i.e. "FAILED to start"
            local probe = string.match(al[i].message,re) -- Get probe name (without quotes) from alarm message to use in probe_verify callback
            if probe ~= nil then -- Skip alarms that do not contain a quoted probe name
               --print(al[i].message.."! Probe-> "..probe) -- View alarms with probe names which failed to start

               local addr = "/"..al[i].domain.."/"..al[i].hub.."/"..al[i].robot.."/".."controller" -- Build Nimsoft address
               -- printf("/"..al[i].domain.."/"..al[i].hub.."/"..al[i].robot.."/".."controller".."<->Probe="..al[i].prid) -- Print Nimsoft address(es)

               -- Now run the probe_verify and probe_activate callbacks on each probe which FAILED to start
               local args = pds.create()
               pds.putString(args,"name",probe)
               nimbus.request(addr,"probe_verify",args)
               nimbus.request(addr,"probe_activate",args)
               pds.delete(args)
               sleep(100) -- A little delay between each probe callback
            end
         end
      end
   end
end

-- End of script

Note: To troubleshoot the script, remove the leading '--' from the following two lines so that the script prints each alarm message with its probe name and the Nimsoft address of the robot's controller.
--print(al[i].message.."! Probe-> "..probe) -- View alarms with probe names which failed to start
-- printf("/"..al[i].domain.."/"..al[i].hub.."/"..al[i].robot.."/".."controller".."<->Probe="..al[i].prid) -- Print Nimsoft address(es)
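
The probe-name extraction can be tried outside nas before the script is deployed. The following is a minimal sketch for a standalone Lua interpreter, not part of the original script. The helper name probe_name and the pattern are illustrative assumptions, and the hdb and mon_config_service messages are hypothetical samples modeled on the spooler alarm shown in the introduction.

```lua
-- Standalone check in plain Lua; no nas functions are required.
-- Assumed pattern: capture the text between the first pair of single quotes.
local pattern = "'([^']+)'"

-- Hypothetical helper: returns the quoted probe name, or nil if there is none.
local function probe_name(message)
  return string.match(message, pattern)
end

local samples = {
  "Probe 'spooler' FAILED to start, file check determines changes in the probe",
  "Probe 'hdb' FAILED to start, file check determines changes in the probe",            -- hypothetical sample
  "Probe 'mon_config_service' FAILED to start, file check determines changes in the probe", -- hypothetical sample
  "Robot is inactive",                                                                   -- no quoted name
}

for _, message in ipairs(samples) do
  print(probe_name(message))
end
-- prints: spooler, hdb, mon_config_service, nil (one per line)
```

A name with more than one underscore is included on purpose, because the underscore counts as punctuation (%p) in Lua patterns and can end a punctuation-delimited match early.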
 


 


Step 2: Setup nas profile

Set up a nas Auto-Operator profile with the following settings:

a- Matching criteria:
   severity = major
   probe = controller
   matching text = /.*Probe.*\sFAILED\sto\sstart.*/

b- Action type: script
   Script = choose the script created in Step 1 from the drop-down list

c- Action mode: On overdue age, with your desired time settings

Click OK and Apply to save the changes and restart the nas probe. With these settings, nas automatically verifies probes that failed to start due to checksum change detection.
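
To get a feel for which alarms the matching criteria select, the filter can be approximated in plain Lua. This is only a sketch under stated assumptions: Lua patterns are not regular expressions, so %s stands in for \s, the function name matches_profile is hypothetical, and the alarm tables (including the severity field) are made-up stand-ins rather than real nas alarm records.

```lua
-- Offline approximation of the profile's matching criteria (plain Lua).
-- The table layout is a hypothetical stand-in for a nas alarm.
local function matches_profile(a)
  return a.severity == "major"
     and a.prid == "controller"
     and string.find(a.message, "Probe.*%sFAILED%sto%sstart") ~= nil
end

local hit = {
  severity = "major", prid = "controller",
  message = "Probe 'spooler' FAILED to start, file check determines changes in the probe",
}
local miss = {
  severity = "major", prid = "cdm",          -- hypothetical alarm from another probe
  message = "Disk usage is above threshold", -- hypothetical message
}

print(matches_profile(hit))   -- true
print(matches_profile(miss))  -- false
```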

Additional Information

The following is an alternate, simplified version of this script. It can be attached to an Auto-Operator profile, e.g. "On arrival" or (for more reliability) "Overdue age 5m", and automatically responds to such an alarm by extracting the robot and probe name and validating the probe:

--

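local myalarm = alarm.get() -- Alarm that triggered the Auto-Operator profile
if myalarm ~= nil then
   local nimaddress = "/"..myalarm.domain.."/"..myalarm.hub.."/"..myalarm.robot.."/".."controller"
   local probename = string.match(myalarm.message,"'([^']+)'") -- Text between the first pair of single quotes
   if probename ~= nil then -- Skip alarms without a quoted probe name
      local args = pds.create()
      pds.putString(args,"name",probename)
      nimbus.request(nimaddress,"probe_verify",args)
      nimbus.request(nimaddress,"probe_activate",args)
      pds.delete(args)
   end
end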

--

Pair this script with an Auto-Operator (AO) profile that matches the alarm message with a regex such as:  /^Probe.*FAILED.*/
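
The callback sequence of the simplified script can be dry-run in a standalone Lua interpreter by replacing the nas-provided pds and nimbus tables with stubs. The sketch below is an illustration only: the stubs are hypothetical and do not reproduce real nas behavior, and the domain, hub and robot names are made up.

```lua
-- Dry-run harness: hypothetical stand-ins for the nas-provided pds and
-- nimbus tables, so the request sequence can be inspected without a live hub.
local calls = {}
local pds = {
  create    = function() return {} end,
  putString = function(t, key, value) t[key] = value end,
  delete    = function(t) end, -- nothing to free in the stub
}
local nimbus = {
  request = function(address, callback, args)
    calls[#calls + 1] = callback .. " " .. address .. " name=" .. tostring(args.name)
  end,
}

-- Hypothetical alarm; domain, hub and robot names are made up.
local myalarm = {
  domain = "ExampleDomain", hub = "ExampleHub", robot = "examplerobot",
  message = "Probe 'spooler' FAILED to start, file check determines changes in the probe",
}

local nimaddress = "/" .. myalarm.domain .. "/" .. myalarm.hub .. "/" .. myalarm.robot .. "/controller"
local probename = string.match(myalarm.message, "'([^']+)'")
if probename ~= nil then
  local args = pds.create()
  pds.putString(args, "name", probename)
  nimbus.request(nimaddress, "probe_verify", args)
  nimbus.request(nimaddress, "probe_activate", args)
  pds.delete(args)
end

for _, line in ipairs(calls) do
  print(line)
end
-- prints:
-- probe_verify /ExampleDomain/ExampleHub/examplerobot/controller name=spooler
-- probe_activate /ExampleDomain/ExampleHub/examplerobot/controller name=spooler
```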