Introduction:
This document describes how deadlocks are detected and handled in CA-IDMS Central Versions which participate in a Data Sharing Group.
Environment:
IBM z/OS
Answer:
In a Data Sharing environment, deadlocks can occur between tasks running on different CVs that are members of the same Data Sharing Group (DSmembers).
The process for initiating a global deadlock detection cycle begins when the deadlock detection interval on one of the DS members expires, causing the local deadlock detector on that CV to "wake up".
This deadlock detector then locates all stalled (waiting) tasks within its CV and examines each one to determine what type of resource it is waiting on. If no tasks are waiting on global resources (resources within the sysplex), the deadlock detector remains a local deadlock detector and proceeds to detect and resolve deadlocks within its own CV.
If at least one task is stalled on a global resource, the deadlock detector tries to become the global deadlock detector (GDD). It does this by trying to acquire a global deadlock detector resource lock. If the lock can be acquired, no GDD is active within the Data Sharing Group, so this CV (by successfully acquiring the lock) has just become the active GDD. If the lock cannot be acquired, another DSmember is currently acting as the GDD, and this CV goes into a wait until the GDD requests information on the tasks within this CV.
After the GDD finishes and there are no more deadlocks to resolve, it releases the global lock it had acquired and goes back into a wait until the next deadlock detection interval expires, and the process starts all over again.
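The decision made at each interval expiry can be sketched in a few lines of Python. This is only an illustrative model, not IDMS code: the function names, the `"local"`/`"global"` scope labels, and the use of an in-process `threading.Lock` as a stand-in for the global deadlock detector resource lock are all assumptions made for the example.

```python
import threading

# Hypothetical stand-in for the global deadlock detector resource lock.
# In a real Data Sharing Group this lock is shared across CVs, not in-process.
gdd_lock = threading.Lock()


def choose_role(stalled_tasks, lock=gdd_lock):
    """Return the role a CV's deadlock detector takes for this cycle.

    stalled_tasks: list of (task_id, scope) pairs, scope being "local" or
    "global" depending on the resource the task is waiting on.
    Returns "local", "gdd", or "wait".
    """
    # No task waits on a global resource: stay a local deadlock detector.
    if not any(scope == "global" for _, scope in stalled_tasks):
        return "local"
    # Try to take the global lock without blocking.
    if lock.acquire(blocking=False):
        return "gdd"   # lock acquired: this CV is now the active GDD
    return "wait"      # another member is the GDD: wait for its request


def finish_cycle(lock=gdd_lock):
    """Release the global lock once no more deadlocks remain."""
    lock.release()
```

In this model, a second member that calls `choose_role` while the lock is held gets `"wait"`, and it can only become the GDD in a later cycle, after `finish_cycle` has released the lock.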
The (DS member acting as) GDD does its job in 4 distinct steps.
Steps 3) and 4) are repeated until no more tasks are deadlocked, at which point the deadlock cycle ends.
As a result, messages written during deadlock detection may be spread over the DClogs of all DSmembers.
To make this easier to follow, some of these messages have been "extended" with the system-id that identifies the DSmember involved. These "extended" messages are DC001004, DC001005 and DC001006.
The Global Deadlock Detector writes out the following messages:
DC001000 - Local stalled tasks and what they're waiting on
DC001004 - Remote stalled tasks and what they're waiting on
DC001002 - Local victims
DC001005 - Remote victims
The Local Deadlock Detector writes out the following messages:
DC001006 - Local tasks that were selected as victims by the global deadlock detector
Note:
"Local stalled tasks" are tasks running on the DSmember that acts as GDD.
"Remote stalled tasks" are tasks running on another DSmember in the Data Sharing group.
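For scripts that post-process the DClogs, the message list above can be held in a small lookup table. The sketch below is one possible Python representation; the dictionary layout and the field names (`writer`, `kind`, `task_location`, `extended`) are invented for illustration and are not part of IDMS.

```python
# Lookup table built from the message list above.
# writer: which deadlock detector writes the message ("global" or "local").
# task_location: where the task runs relative to the member writing the message.
# extended: True if the message carries a system-id.
DEADLOCK_MESSAGES = {
    "DC001000": {"writer": "global", "kind": "stalled", "task_location": "local", "extended": False},
    "DC001004": {"writer": "global", "kind": "stalled", "task_location": "remote", "extended": True},
    "DC001002": {"writer": "global", "kind": "victim", "task_location": "local", "extended": False},
    "DC001005": {"writer": "global", "kind": "victim", "task_location": "remote", "extended": True},
    "DC001006": {"writer": "local", "kind": "victim", "task_location": "local", "extended": True},
}


def messages_written_by(writer):
    """Return the sorted message IDs written by the given detector."""
    return sorted(
        msg_id for msg_id, info in DEADLOCK_MESSAGES.items()
        if info["writer"] == writer
    )


def extended_messages():
    """Return the sorted message IDs that carry a system-id."""
    return sorted(
        msg_id for msg_id, info in DEADLOCK_MESSAGES.items() if info["extended"]
    )
```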
Example:
A Data Sharing group consists of two DSmembers: SYSTEM25 and SYSTEM45
At a given moment, a user task (taskid 98317) runs on SYSTEM25 and another one (taskid 215854) runs on SYSTEM45. Both are accessing the same database, whose areas are defined in Data Sharing.
Assume that the first Deadlock detection Interval which expires is the one on SYSTEM25.
At that time, SYSTEM25 has a stalled task, and this task is waiting on a global resource, i.e. a dbkey of a record which is maintained in Data Sharing. SYSTEM25 succeeds in acquiring the global deadlock detector resource lock and becomes the Global Deadlock Detector (GDD).
It collects information from the other DSmember, SYSTEM45, and analyses it.
SYSTEM25 detects a global deadlock, determines which task to select as victim and writes the following messages into its DClog:
DC001000 V025 T15 T:000098317 TSKPURCH P:DLPURCH C:DEAD WAITING ON...
which is a local task (i.e. running on SYSTEM25)
DC001004 V025 T15 SYSTEM45 T:000215854 TSKORDER P:DLORDERD C:DEAD WAITING ON...
which is a remote task, running on the other DSmember, SYSTEM45
DC001005 V025 T15 SYSTEM45 T:000215854 TSKORDER P:DLORDERD C:DEAD WAITING ON...
which is chosen as victim (= to be abended). In this case, task 215854 was running on DSmember SYSTEM45.
The GDD informs the other DSmember (= Local Deadlock Detector) about this and requests it to cancel that remote task.
DSmember SYSTEM45 terminates that task and writes the following message into its DClog:
DC001006 V045 T15 SYSTEM25 T:000215854 TSKORDER P:DLORDERD C:DEAD...
This message is usually followed by messages describing the rolled out transaction, such as:
DC203005 V045 T215854 Program-ID DLORDERD Transaction-ID 55606289 has been Rolled Out!
DC203005 V045 T215854 SUBSCNAM User id FE - ID1 FE - ID2 FE - ID3 FE Tskcd
DC203005 V045 T215854 DBORDSSC U12345
DC173008 V045 APPLICATION ABORTED. BAD IDMS STATUS RETURNED; STATUS=0229
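Because the deadlock messages for one cycle end up in the DClogs of several DSmembers, it can help to extract the message ID, system-id and task ID from each line so the logs can be correlated. The following Python sketch parses the DC0010xx sample lines shown in this example. The regular expression and the field names are assumptions derived only from those samples; the real message layout may contain variations that are not handled here.

```python
import re

# Pattern inferred from the sample lines in this example only (an assumption).
# The system-id is optional: it appears only in the "extended" messages.
LINE_RE = re.compile(
    r"^(?P<msg_id>DC00100[0-9])\s+V(?P<version>\d+)\s+T\d+\s+"
    r"(?:(?P<system_id>(?!T:)\S+)\s+)?"
    r"T:(?P<task_id>\d+)\s+(?P<task_code>\S+)\s+P:(?P<program>\S+)"
)


def parse_deadlock_line(line):
    """Parse one DC0010xx deadlock message; return a dict, or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["task_id"] = int(fields["task_id"])  # drop leading zeros
    return fields


def tasks_by_message(lines):
    """Group task IDs by message ID across lines from one or more DClogs."""
    grouped = {}
    for line in lines:
        fields = parse_deadlock_line(line)
        if fields is not None:
            grouped.setdefault(fields["msg_id"], []).append(fields["task_id"])
    return grouped
```

Applied to the sample lines, task 215854 appears under DC001004 and DC001005 in the SYSTEM25 log and under DC001006 in the SYSTEM45 log, which ties the victim selection on the GDD to the cancellation on the other member. Note that the system-id in DC001006 (SYSTEM25 in this example) is not the member on which the task runs.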
Notes: