Deadlock detection in a Data Sharing Environment

Products

IDMS, IDMS - Database

Issue/Introduction

This document describes how deadlocks are detected and handled in CA-IDMS Central Versions (CVs) that participate in a Data Sharing Group.

Environment

Release: All supported releases.

Resolution

In a Data Sharing environment, deadlocks can occur between tasks running on different CVs that are members of the same Data Sharing Group (DSmembers).

A global deadlock detection cycle begins when the deadlock detection interval on one of the DSmembers expires, causing the local deadlock detector on that CV to "wake up".

This deadlock detector then locates all stalled (waiting) tasks within its CV and examines each one to determine what type of resource it is waiting on. If no task is waiting on a global resource (a resource within the sysplex), the deadlock detector remains a local deadlock detector and goes through the process of detecting and resolving deadlocks within its own CV.

If at least one task is stalled on a global resource, the deadlock detector tries to become the global deadlock detector (GDD). It does this by trying to acquire a global deadlock detector resource lock. If the lock can be acquired, no GDD is currently active within the Data Sharing Group, so this CV (by successfully acquiring the lock) has just become the active GDD. If the lock cannot be acquired, another DSmember is currently acting as the GDD, and this CV waits for the GDD to request information on its tasks.

After the GDD finishes and there are no more deadlocks to resolve, it releases the global lock it had acquired and goes back into a wait until the next deadlock detection interval expires, and the process starts all over again.
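
The election described above can be sketched in Python. This is a minimal illustration under stated assumptions, not IDMS code: the global deadlock detector resource lock is modeled by an in-process threading.Lock (the real lock spans the Data Sharing Group), and the names gdd_lock, on_interval_expired and end_cycle are hypothetical.

```python
import threading

# Assumption: an in-process lock stands in for the global deadlock
# detector resource lock shared by all DSmembers.
gdd_lock = threading.Lock()


def on_interval_expired(stalled_tasks):
    """Return the role a CV takes when its deadlock detection interval expires.

    stalled_tasks is a list of dicts with a boolean "global" key that says
    whether the task is waiting on a global resource (hypothetical layout).
    """
    if not any(task["global"] for task in stalled_tasks):
        # No task waits on a global resource: stay a local detector.
        return "local"
    if gdd_lock.acquire(blocking=False):
        # Lock acquired: no GDD was active, so this CV becomes the GDD.
        return "gdd"
    # Another DSmember holds the lock and is acting as GDD.
    return "waiting"


def end_cycle():
    """Called by the GDD once no deadlocks remain: release the global lock."""
    gdd_lock.release()
```

The non-blocking acquire is the point of the sketch: a CV never queues for the GDD role, it either gets the lock immediately or steps back and answers the active GDD's requests.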

The (DS member acting as) GDD does its job in four distinct steps.

1. It asks the other DSmembers whether they have stalled tasks. The other DSmembers return that information.
2. It then asks the other DSmembers who owns the resources their stalled tasks are waiting on. The other DSmembers return that information too.
3. The GDD analyzes the task interdependencies gathered in steps 1 and 2 to identify deadlocks. If no deadlocks exist, the deadlock cycle ends. Otherwise, one or more tasks must be abended in order to resolve the deadlock.
4. The GDD chooses a victim and
   a. abends the task directly if it runs on the GDD's own DSmember, or
   b. informs the other DSmember(s) to do so.

Steps 3 and 4 are repeated until no more tasks are deadlocked, at which point the deadlock cycle ends.
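
The analyze-and-resolve loop of steps 3 and 4 can be illustrated with a wait-for graph. The following Python sketch is an assumption-laden illustration, not the IDMS implementation: tasks are (system-id, taskid) tuples, waits_for maps each stalled task to the tasks that own the resource it waits on, and the victim policy (max of the cycle) is a placeholder, since this document does not state how the GDD selects its victim.

```python
def find_cycle(waits_for):
    """Return a list of tasks forming a wait-for cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(task):
        color[task] = GREY
        stack.append(task)
        for owner in waits_for.get(task, ()):
            state = color.get(owner, WHITE)
            if state == GREY:
                # Owner is already on the current path: cycle found.
                return stack[stack.index(owner):]
            if state == WHITE:
                found = visit(owner)
                if found:
                    return found
        stack.pop()
        color[task] = BLACK
        return None

    for task in list(waits_for):
        if color.get(task, WHITE) == WHITE:
            cycle = visit(task)
            if cycle:
                return cycle
    return None


def resolve(waits_for, gdd_member):
    """Repeat steps 3 and 4 until no deadlock remains; return the actions taken."""
    graph = {task: set(owners) for task, owners in waits_for.items()}
    actions = []
    while True:
        cycle = find_cycle(graph)
        if cycle is None:
            return actions
        victim = max(cycle)  # placeholder policy, NOT the IDMS rule
        member, task_id = victim
        how = "abend locally" if member == gdd_member else "ask member to abend"
        actions.append((how, member, task_id))
        # The victim no longer waits on anything and no longer blocks anyone.
        graph.pop(victim, None)
        for owners in graph.values():
            owners.discard(victim)
```

Usage, with two tasks on different DSmembers waiting on each other:

```python
a = ("SYSTEM01", 98317)
b = ("SYSTEM02", 215854)
print(resolve({a: {b}, b: {a}}, gdd_member="SYSTEM01"))
```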

As a result, the messages written during deadlock detection may be spread over the DClogs of all DSmembers.

To clarify this, some of these messages have been "extended" with the system-id which identifies the involved DSmember. These "extended" messages are DC001004, DC001005 and DC001006.

The Global Deadlock Detector writes the following messages:

DC001000 - Local stalled tasks and what they're waiting on
DC001004 - Remote stalled tasks and what they're waiting on
DC001002 - Local victims
DC001005 - Remote victims

The Local Deadlock Detector writes the following message:

DC001006 - Local tasks that were selected as victims by the global deadlock detector

Note:

"Local stalled tasks" are tasks running on the DSmember which acts as GDD.
"Remote stalled tasks" are tasks running on another DSmember in the Data Sharing Group.

Example:

A Data Sharing group consists of two DSmembers: SYSTEM01 and SYSTEM02

At a given moment in time, a user task (taskid 98317) runs on SYSTEM01 and another one (taskid 215854) runs on SYSTEM02. Both are accessing the same database whose areas are defined in Data Sharing to be shared by both CVs.

Assume that the first deadlock detection interval to expire is the one on SYSTEM01.

At that time, SYSTEM01 has a stalled task, and this task is waiting on a global resource, i.e. a dbkey of a record which is maintained in Data Sharing. SYSTEM01 succeeds in acquiring the global deadlock detector resource lock and becomes the Global Deadlock Detector (GDD).

It collects information from the other DSmember, SYSTEM02, and analyses it.

SYSTEM01 detects a global deadlock, determines which task to select as victim, and writes the following messages to its DClog:

DC001000 V1 T15 T:000098317 TSKPURCH P:DLPURCH C:DEAD WAITING ON...
which is a local task (i.e. running on SYSTEM01)

DC001004 V1 T15 SYSTEM02 T:000215854 TSKORDER P:DLORDERD C:DEAD WAITING ON...
which is a remote task, running on the other DSmember, SYSTEM02

DC001005 V1 T15 SYSTEM02 T:000215854 TSKORDER P:DLORDERD C:DEAD WAITING ON...
which is chosen as victim (= to be abended). In this case, task 215854 was running on DSmember SYSTEM02.

The GDD informs the other DSmember (= Local Deadlock Detector) about this and requests it to cancel that remote task.

DSmember SYSTEM02 terminates that task and writes the following message to its DClog:

DC001006 V2 T15 SYSTEM01 T:000215854 TSKORDER P:DLORDERD C:DEAD...

This message is usually followed by messages describing the rolled out transaction, such as:

DC203005 V2 T215854 Program-ID DLORDERD Transaction-ID 55606289 has been Rolled Out!
DC203005 V2 T215854 SUBSCNAM User id FE - ID1 FE - ID2 FE - ID3 FE Tskcd
DC203005 V2 T215854 DBORDSSC U12345
DC173008 V2 APPLICATION ABORTED. BAD IDMS STATUS RETURNED; STATUS=0229

Notes:

The "SYSTEM01" system-id in the DC001006 message identifies the other DSmember, which was acting as GDD in this case.
The next deadlock detection cycle could be handled by the other DSmember, SYSTEM02, or again SYSTEM01.
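
Because the messages of one deadlock cycle may end up in several DClogs, it can help to classify them when the logs are reviewed together. The Python sketch below is one possible way to do that. It is based only on the sample lines shown in this document: the regular expression and the field names (system_id, task_id, task_code, program) are an interpretation of those samples, not a documented message layout, and the function name parse_deadlock_message is hypothetical.

```python
import re

# Message ids and their meaning, as listed in this document.
ROLES = {
    "DC001000": "local stalled task (written by the GDD)",
    "DC001004": "remote stalled task (written by the GDD)",
    "DC001002": "local victim (written by the GDD)",
    "DC001005": "remote victim (written by the GDD)",
    "DC001006": "local task selected as victim by the GDD",
}

# Assumption: layout inferred from the sample lines only. The system-id
# is optional because only the "extended" messages carry it.
LINE = re.compile(
    r"^(DC\d{6}) V\d+ T\d+ (?:(?!T:)(\S+) )?T:(\d+) (\S+) P:(\S+)"
)


def parse_deadlock_message(line):
    """Return a dict describing a deadlock message, or None for other lines."""
    match = LINE.match(line)
    if match is None or match.group(1) not in ROLES:
        return None
    msg_id, system_id, task_id, task_code, program = match.groups()
    return {
        "msg_id": msg_id,
        "role": ROLES[msg_id],
        "system_id": system_id,  # None for non-extended messages
        "task_id": int(task_id),
        "task_code": task_code,
        "program": program,
    }
```

Matching DC001005 lines in the GDD's log with DC001006 lines in the other DSmembers' logs on task_id is one way to tie the two halves of a remote abend together.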

Additional Information

CA IDMS/DC Messages DC000000 - DC010007