Code a rule to email when deadlock messages repeat for DB2 Systems

Article ID: 215319

Products

OPS/MVS Event Management & Automation

Issue/Introduction

Create an OPS/MVS rule that sends an email when 2 or more of the messages below appear within one hour for either of two specific DB2 subsystems running on two specific LPARs. These messages indicate DEADLOCK, TIMEOUT LOCK REQUEST COULDN’T BE GRANTED, and DRAIN FAILED.

Scanning the syslog for the message text alone is not enough, because another DB2 subsystem also runs on these LPARs. The rule must also check the DB2 subsystem, which is identified in the first line of the message block, and trap only messages from the two specific DB2 subsystems.

Only one email per reason per hour is needed, even if the message occurs repeatedly during that hour.

The email alert goes to multiple addresses.

The three reason strings are listed below in single quotes. The quotes are not part of the message.

‘REASON 00C90088’

‘REASON 00C9008E’

‘REASON 00C900BA’

 

Environment

Release : 13.5

Component : OPS/MVS

Resolution

The guidance below outlines one approach. Support is happy to provide troubleshooting and guidance as needed, but coding should be pursued under a paid services contract arranged with the account representative. If troubleshooting is required, please provide a copy of the rule as currently coded, along with a tersed copy of the archived OPSLOG for evaluation.

For the above request, the rule should be an MLWTO rule, so that all lines of the message are available to the rule code at the same time. Points to consider:


)MSG DSNT501I MLWTO

 

-In MLWTO rules, each line of the message is available in a variable of the form msg.text.i, where "i" is the line number.
-If only DBR0 and DBR1 are to be monitored, the rule should first check that one of these strings is present in the first line of the message, that is, in the variable msg.text.1:

monitored_subsystems = 'DBR0 DBR1'                 
firstline = msg.text.1       /* msg.text.1 because we know the subsystem is in line 1 */                      
parse var firstline . "-" subsys .       /* parses the subsystem to a variable named 'subsys' */ 
  
Next, check whether subsys is among the monitored subsystems. If it is not, end the rule execution:


if pos(subsys,monitored_subsystems) = 0 then return
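
The subsystem check can be tried outside OPS/MVS with a small standalone REXX sketch such as the one below. The two sample first lines and the DBX9 subsystem name are hypothetical and only illustrate the layout assumed above (message ID, then "-" followed by the subsystem name). The sketch uses WORDPOS rather than POS so that only a whole-word match counts; WORDPOS is a standard REXX built-in, but confirm that it is available in your OPS/REXX release before using it in a rule.

```rexx
/* REXX - standalone sketch of the subsystem check (not an OPS/MVS rule) */
monitored_subsystems = 'DBR0 DBR1'

/* Hypothetical sample first lines; real text may differ at your site */
sample.1 = 'DSNT501I -DBR0 DSNILMCL RESOURCE UNAVAILABLE'
sample.2 = 'DSNT501I -DBX9 DSNILMCL RESOURCE UNAVAILABLE'
sample.0 = 2

do i = 1 to sample.0
  firstline = sample.i
  /* Discard everything up to the "-", then take the next word */
  parse var firstline . '-' subsys .
  /* WORDPOS returns 0 when subsys is empty or not a whole-word match */
  if wordpos(subsys, monitored_subsystems) = 0 then
    say subsys 'is not monitored, the rule would end here'
  else
    say subsys 'is monitored, the rule would continue'
end
```

With POS, a partial string such as DBR would also be treated as a match, which is why the whole-word comparison may be the safer choice.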

 

-To alert only when 2 or more messages with the same reason code appear within an hour, you can use the OPSTHRSH function:

reason = msg.text.5        /* the reason code is on line 5 of the message */

cnt = OPSTHRSH('A','3600',reason,,5)

cnt contains the number of times this same reason code has appeared within an interval of 3600 seconds (1 hour).
The last argument, 5, is the number of the MLWTO line that contains the reason code.

if cnt >= 2 then do       /* 2 or more occurrences, as stated in the request */
            /* include here the code to send the alert */
            end
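
Putting the fragments together, a rule skeleton might look like the sketch below. It is an outline under stated assumptions, not a tested rule: the )PROC and )END section markers, the monitored_reasons list, the code variable, and the use of WORDPOS are additions for illustration, and the line positions (1 for the subsystem, 5 for the reason code) are taken from the guidance above and should be verified against your own OPSLOG. The OPSTHRSH call is copied unchanged from this article. The email step is left as a placeholder because it is site-specific.

```rexx
)MSG DSNT501I MLWTO
)PROC
/* Sketch only: assembles the fragments described in this article */
monitored_subsystems = 'DBR0 DBR1'
monitored_reasons    = '00C90088 00C9008E 00C900BA'  /* from the request */

/* Line 1 identifies the DB2 subsystem */
firstline = msg.text.1
parse var firstline . '-' subsys .
if wordpos(subsys, monitored_subsystems) = 0 then return

/* Line 5 is assumed to read "REASON xxxxxxxx" */
reason = msg.text.5
parse var reason 'REASON' code .
if wordpos(code, monitored_reasons) = 0 then return

/* Count occurrences of this reason text within 3600 seconds */
cnt = OPSTHRSH('A','3600',reason,,5)

if cnt >= 2 then do   /* 2 or more occurrences, per the Issue section */
  /* placeholder: site-specific code to email the alert goes here */
  nop
end
return
)END
```

As written, the alert branch runs on every occurrence from the second one onward, and the count is keyed on the reason text only, not on the subsystem. If cnt increases by one for each occurrence within the interval, testing cnt = 2 instead might approximate the requirement of one email per reason per hour; check the counting behaviour in the OPSTHRSH documentation referenced below before relying on this.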

Reference to the OPSTHRSH function:

https://techdocs.broadcom.com/us/en/ca-mainframe-software/automation/ca-ops-mvs-event-management-and-automation/14-0/reference-information/command-and-function-reference/ops-rexx-built-in-functions/opsthrsh-function.html