DC015007 SOS abends or over-use of resources by OLQ

Products

IDMS IDMS - Database

Issue/Introduction

How to prevent DC015007 SOS abends or over-use of resources by OLQ.

Environment

Release: All supported releases.
Component: OLQ.

Resolution

OLQ satisfies logical data requests by performing physical I/Os. The amount of resources used by any task (including OLQ) can be controlled by limits specified in the sysgen for that task definition. OLQ in addition has an interrupt count which can be specified to control the frequency with which rows are returned, at which point the user has the option of continuing or ending the transaction.

In some situations, OLQ can use up very many resources before hitting an interrupt count, causing it to reach DBIO or DCCALLS limits. This can occur because the limits and the interrupt count track very different things: the I/O counter (used to check the DBIO limits) tracks the number of I/Os performed, and the Interrupt Count tracks the number of rows returned to OLQ. The relationship between these is most easily explained with an example. Assume someone wants to find all the clients within a certain zip code, and then to find out which offices exist within that state to satisfy an order. That would be easy to set up on OLQ, particularly menu mode; simply pick the client and office records, specify a WHERE criteria on the client zip, and a join clause between the client state and the state in the office record. That would generate a query that follows this general form:

SELECT * FROM CLIENT, OFFICE
WHERE CLIENT_ZIP = nnnnn and CLIENT.STATE = OFFICE.STATE;

Now, assume that there are 1,000,000 clients, and 5 offices in each state. In this query, zip and state are unlikely to be key fields, so there are no indexes or calc keys to help in this retrieval. That means there will be area sweeps on two records, and joining them in a nested loop. The optimizer will determine the most efficient way to retrieve these records. For the sake of discussion, assume that the path will be to sweep the Client area for clients within this zip code, and then walk the office area for offices that satisfy the join criteria on state. That means the path will be:

Walk the Client area until a client with the specified zip code is found, then walk the Office area until an office with the client's state is found. (Assume the first client record with the desired zip code is found about 1/2 way through the area);
Walk the office area until an office whose state matches the state in the selected client record is found. (Assume most of the office records have to be read before finding the first one within the client's state.)
At this point. the I/O counter is set to some-where around 500,200 because that is the total number of records read; however, the interrupt count counter is set to 1 because only only one record occurrence has been returned to OLQ.
After returning this first complete row, OLQ continues to walk the office area, and it returns one record to OLQ for every office record with a match on that state. It will increment the I/O counter for every record it reads, and it will increment the interrupt count counter for every row it returns to OLQ.
When the end of the office area is reached, return to walking the client area, resuming with the next client record.
When the next client in this zip code is found, start over doing an area sweep on the office records to find one within that state.
Because a new walk of the office area is being started for every client found (nested loop), and because it's a large area containing the clients, this could end up with a very large number of I/Os (and calls to the I/O routines) while the interrupt count counter remains relatively low.

The interrupt count provides an opportunity for the end user to terminate the task and thus avoid consuming too many resources, but often the user is completely unaware of the resources required to satisfy the logical request. That's because the amount of resources required to satisfy the user query depends completely on the query that's coded, the database statistics available, and the database design. So the interrupt count is not a reliable method to manage resource utilization; instead, to avoid high resource utilization by OLQ, most sites establish limits for OLQ just the same as for other user applications. The only way to ensure that long-running OLQ queries do not consume substantial resources at run-time is to establish limits, either at the system or task level; and to establish an interrupt count that does not interfere with these limits being reached (see more on this below). The same approach is often used for Server tasks and tasks that process generalized SQL queries, since these requests are all logical and do not specify any actual physical I/O DML commands. Often users are unaware of what resources are required to satisfy such a query, so setting limits is the only way to avoid tasks that use too many resources.

The unique factor which differentiates OLQ in this regard is that both the call limits and the interrupt count can have an impact on this process. That's because when the interrupt count is reached, the resource counters for the task (calls, I/Os, etc) are reset to zero. Thus if the interrupt count is set too low, it is possible for the resource counters to be continually reinitialized, so that the specified limits are never reached. A few actions will help prevent this from happening:

In the sysgen OLQ statement, specify a moderately high interrupt count; this will help increase the odds that for resource-intensive tasks, the task or system limits are reached first;
Do not use exits to extend the limits for OLQ when they are reached (this is sometimes done for system tasks, but OLQ functions as a user application, so it should not be done here);
Specify limits specific to OLQ on the task statement, to avoid using the more generic system-level limits.
Monitor the success of how the interrupt count and OLQ task limits interact. For some sites, depending on database designs and the types of queries that are issued, the interrupt count will need to be adjusted so that the limit counters are not reset before the limits are reached by resource-intensive tasks.

Additional Information

System Generation
OLQ