Automatically validate hdb and spooler probes or other probes via script


Article ID: 34372


Products

- DX Unified Infrastructure Management (Nimsoft / UIM)
- CA Unified Infrastructure Management SaaS (Nimsoft / UIM)
- Unified Infrastructure Management for Mainframe

Issue/Introduction

The following script can verify any probe for which the controller has generated an alarm like the following:

Controller: Probe 'spooler' FAILED to start, file check determines changes in the probe

  • This type of issue is usually seen following an IP address or hardware change in the environment.
  • There is a start-up order for probes on a robot, and some probes have dependencies on other probes.
  • Occasionally, probes start out of order and cause issues like this. That is usually seen on systems with resource constraints, but it can also be caused by the controller's robotip key being set to the wrong IP address.
  • hdb and spooler may fail, but other probes on the robot (e.g. cdm, ntservices, ntevl, nexec) may also fail and cause alarms to be generated.
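
The repair the script automates boils down to two controller callbacks, probe_verify followed by probe_activate. As a minimal sketch, the pair can be issued from a nas script for a single robot and probe (the address components and probe name below are placeholders; this requires the nas Lua runtime):

```lua
-- Hypothetical address: substitute your own /domain/hub/robot
addr = "/mydomain/myhub/myrobot/controller"

args = pds.create()                          -- build the callback argument list
pds.putString(args, "name", "spooler")       -- probe to re-validate
nimbus.request(addr, "probe_verify", args)   -- re-check the probe's file checksum
nimbus.request(addr, "probe_activate", args) -- start the probe again
pds.delete(args)
```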

Environment

Release:
Component: CAUIM

- UIM installations/upgrades/migration

Resolution

Step 1: Create script
 
In the nas Auto Operator -> Scripts section, create a new script with the following contents:
 

-- Start of script

al = alarm.list() -- Get the alarm list

re = "%p%a+%d*_*%a*%d*%p" -- Lua pattern to match a quoted probe name (letters, digits and underscores)

if al ~= nil then
   for i = 1, #al do
      if al[i].prid == "controller" then -- First, filter to alarms from the controller probe only
         if string.match(al[i].message, "FAILED to start") then -- Second, filter to controller alarms containing the text "FAILED to start"
            probe = string.gsub(string.match(al[i].message, re), "'", "") -- Extract the probe name from the alarm message, then strip the quotes so it can be used in the probe_verify callback
            --print(al[i].message.."! Probe-> "..probe) -- View alarms with probe names which failed to start
            addr = "/"..al[i].domain.."/"..al[i].hub.."/"..al[i].robot.."/".."controller" -- Build the Nimsoft address of the robot's controller
            -- printf("/"..al[i].domain.."/"..al[i].hub.."/"..al[i].robot.."/".."controller".."<->Probe="..al[i].prid) -- Print Nimsoft address(es)
            -- Now run the probe_verify/probe_activate callbacks on each probe which FAILED to start
            local args = pds.create()
            pds.putString(args, "name", probe)
            nimbus.request(addr, "probe_verify", args)
            nimbus.request(addr, "probe_activate", args)
            pds.delete(args)
            sleep(100) -- A short delay between each probe callback
         end
      end
   end
end

-- End of script
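
To sanity-check the extraction logic, the two patterns used in this article can be exercised in plain Lua (outside nas) against a sample alarm message:

```lua
-- Sample message mirroring the controller alarm text above
local msg = "Probe 'spooler' FAILED to start, file check determines changes in the probe"

-- Pattern from the script above: a quoted name of letters/digits/underscores
local quoted = string.match(msg, "%p%a+%d*_*%a*%d*%p") -- "'spooler'"
local probe = string.gsub(quoted, "'", "")             -- strip quotes -> "spooler"

-- Capture used in the simplified script under Additional Information
local probe2 = string.match(msg, "'(.*)'")             -- "spooler"

print(probe, probe2)
```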

Note: for troubleshooting, the leading '--' can be removed from the following lines so the script prints the robot addresses and the names of the probes that failed to start:

--print(al[i].message.."! Probe-> "..probe) -- View alarms with probe names which failed to start

-- printf("/"..al[i].domain.."/"..al[i].hub.."/"..al[i].robot.."/".."controller".."<->Probe="..al[i].prid) -- Print Nimsoft address(es)

 
Step 2: Set up the nas profile
 
Set up a nas Auto Operator profile with the following settings:
 
a. matching criteria:
 
severity = major
 
probe = controller
matching text = /.*Probe.*\sFAILED\sto\sstart.*/
 
b. Action type: script
 
Script = choose the script created in step 1 from the drop down list
 
c. Action Mode: 'On overdue age' with your desired time settings

Click OK and Apply to save the changes, then restart the nas probe. From then on, these settings will automatically verify probes that failed to start due to checksum change detection.
 
Note: the AO profile's matching text can also use a simpler regex such as: /^Probe.*FAILED.*/

 

IMPORTANT:

These alarms are generated only upon the initial failure, so if they are acknowledged they will not be available or regenerated, and the script will have nothing to trigger on. If you used a probe package to reconfigure your robots to point to new hubs, you can regenerate the alarms by redeploying the probe/robot_update package that caused them to the subset of robots throwing the Controller: Probe '<probe_name>' FAILED to start, file check determines changes in the probe alarms.

Tip: If you enable the Group node in IM via Tools->Options and create an Infrastructure Group containing the robots where probes are throwing the alarm mentioned above, you can deploy your robot_update package only to those robots to force the Probe FAILED... alarms to regenerate if you had already acknowledged them by accident. The script will then pick up those alarms, repair the probes, and the alarms should clear within five minutes or so.

Additional Information

The script stub below is an alternate, simplified version which can be attached to an Auto Operator profile, e.g. "On arrival" or (for more reliability) "On overdue age" of 5 minutes. It responds to such an alarm by extracting the robot and probe name and validating the probe:

--
myalarm = alarm.get() -- The alarm that triggered this Auto Operator profile
local nimaddress = "/"..myalarm.domain.."/"..myalarm.hub.."/"..myalarm.robot.."/".."controller" -- Address of the affected robot's controller
probename = string.match(myalarm.message,"'(.*)'") -- Extract the quoted probe name from the alarm message
if probename ~= nil then
   local args = pds.create()
   pds.putString(args,"name",probename)
   nimbus.request(nimaddress,"probe_verify",args) -- Re-validate the probe's file checksum
   nimbus.request(nimaddress,"probe_activate",args) -- Start the probe again
   pds.delete(args)
end
--
 
See also: How to move hundreds of robots to a new hub