Forward Recovery philosophy for CA 7

Products

Datacom

Datacom/AD

CA 7 Workload Automation

Issue/Introduction

We are looking at the use of hot backups (CAL2JCL member AL2DBHOT) and how this works when using Datacom/AD Forward Recovery processing. We have reviewed Knowledge Base article 18722, titled "Overview of Datacom Forward and Backward Recovery," and Knowledge Base article 96542, titled "CA Workload Automation CA 7 Edition r12.0/12.1 Hot and Stable Backup and Recovery Procedures," but still have some questions about this process.

This job uses Datacom DBUTLTY LOCK and UNLOCK OPTION=MOVER functions before and after the backup step, but it appears that this utility does not prevent CA 7 from continuing to make updates during the backup. Do I need to do anything else to stop update processes?
In doing the Forward Recovery, how do we determine the starting time for the RXX (Recovery File) that needs to be used?
Can 'rolled back' LUWs containing multiple updates lead to inconsistencies in the recovered database?
The DBUTLTY RECOVERY statement has an option 'UPDATE=NO' which expects errors to happen during the recovery. Can these errors lead to inconsistent data in the tables, and if so, how does CA 7 deal with the errors?
Does CA 7 allow the use of the DBUTLTY QUIESCE TXN function to stop processing, and could a long-running LUW from CA 7 prevent this QUIESCE from becoming active?

Environment

Release : 15.1

Component : Datacom/AD

Component : CA Workload Automation CA 7 Edition

Resolution

First, the concept of data recovery for applications using Datacom/AD involves both the physical backup of the data areas at some point in time prior to an event, and the use of the Log File (LXX) and Recovery Files (RXX) to apply all the updates forward from the time of the backup to the time of the event. For your question about the rolled-back LUWs, every update is kept on the LXX with a before and after image and is maintained in a time-based sequence. Therefore if a table update is made and then rolled back, and then another update is made, the LXX has them in this sequence, and when doing a recovery, it will apply them to the database in that same sequence.
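The sequencing described above can be illustrated with a minimal Python sketch. This is a conceptual model only, not Datacom code: the LogRecord class, its fields, and the replay function are hypothetical names invented for the illustration, and the rollback is modeled as an ordinary log record that restores the earlier image, which is an assumption about how such an entry could be represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    """Hypothetical stand-in for one logged change (not a real LXX layout)."""
    seq: int                # position in the time-ordered log
    key: str                # row identifier
    before: Optional[str]   # row image before the change (None = row absent)
    after: Optional[str]    # row image after the change (None = row deleted)


def replay(table: dict, log: list) -> dict:
    """Apply every record in its original sequence to a copy of the table."""
    result = dict(table)
    for rec in sorted(log, key=lambda r: r.seq):
        if rec.after is None:
            result.pop(rec.key, None)
        else:
            result[rec.key] = rec.after
    return result


backup = {"JOB1": "v0"}
log = [
    LogRecord(1, "JOB1", "v0", "v1"),  # update
    LogRecord(2, "JOB1", "v1", "v0"),  # rollback, modeled as its own change
    LogRecord(3, "JOB1", "v0", "v2"),  # later update
]

recovered = replay(backup, log)
print(recovered)  # {'JOB1': 'v2'}
```

Because the records are applied in the order in which they were logged, the update, its rollback, and the later update leave the row in the same final state that it had originally.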

Under normal conditions, you will not need to run a Forward Recovery. This utility is used if you have a system failure (LPAR crash), a hardware failure (like a disk crash), or if something unusual happens in processing, such as running a major set of update jobs twice instead of once. These are all catastrophic events, and you would use a backup file together with the Recovery Files (RXX) to restore your database to the desired point. A MUF failure during processing is a different case. If you have a Shadow MUF on another LPAR, it assumes control when the Primary MUF fails, so that your application processing is not interrupted. If you do not use a Shadow MUF, resolve the problem that caused the MUF failure and restart the MUF; any in-flight transactions are then reprocessed automatically.

In your normal processing, you will run with application updates taking place throughout the day. Every add, change, or delete made to tables that are set up to support recovery is logged in the LXX (Logging Area). When the file reaches a certain point, all eligible records are dumped in an orderly fashion to the RXX. This spill process should happen through your automation routines, triggered by specific console messages. In your DBUTLTY Spill job, we would recommend you use this as the first input statement before the SPILL command:

  COMM OPTION=CONSOLE,OPTION2='WRITE_PENDS_LOG_STABLE'

This will ensure that the greatest number of LXX records that can be spilled is processed, with the result being a much quicker restart if your MUF fails.

Now, let's consider the case where your system fails due to a CPU or hardware failure and you need to recover your database. Once the system has restarted and your MUF has restarted and before starting your application, you will perform the following steps to recover your database.

Run a manual spill to extract any final committed log records for reprocessing
Restore your databases from the backup taken closest to the failure point
Gather all the RXX file names/generations from the closest spill job before the backup until the most recent manual spill mentioned above
Create the Forward Recovery job using the time of the spill job that created the earliest RXX as the "from" time. You can use the current time as the "to" time to ensure that you have all the updates.
Note that because the Recovery command timestamps only have a granularity of one second, and because today's processors can perform many thousands of transactions in each second, it is not possible, for example, to tell the Recovery to skip records 1 to 4873 and start with record 4874 for recovery. This is the reason for using UPDATE=NO. This parameter says that it is acceptable to have errors reported because the "before" image on the RXX does not match the data in the database, which would be the case for all RXX records taken before the backup. These are not database errors, but Recovery processing errors; they would be expected for all RXX records at the beginning of the file until DBUTLTY gets to the first record on the RXX where the "before" image matches the database. From that point forward, all RXX transactions would be processed until the "to" time is reached.
Once recovery is complete, you can restart the MUF as usual, so it would then process those in-flight transactions that were skipped earlier. Now your database recovery is complete to the point of the failure.
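The "before" image matching described in the UPDATE=NO note above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the DBUTLTY implementation: the forward_recover function and the (key, before, after) record layout are hypothetical, and the real utility works with RXX records and timestamps rather than Python tuples.

```python
def forward_recover(restored: dict, rxx: list) -> tuple:
    """Apply RXX-style records to a restored table (conceptual model only).

    Each record is a hypothetical (key, before, after) tuple. A record is
    applied only when its "before" image matches the current row; otherwise
    it is counted as an expected recovery error and skipped.
    """
    db = dict(restored)
    mismatches = 0
    for key, before, after in rxx:
        if db.get(key) != before:
            mismatches += 1          # expected for records logged before the backup
            continue
        if after is None:
            db.pop(key, None)        # delete
        else:
            db[key] = after          # add or change
    return db, mismatches


# The backup was taken when row A held "a2".
restored = {"A": "a2"}
rxx = [
    ("A", "a0", "a1"),   # logged before the backup: already in the backup
    ("A", "a1", "a2"),   # logged before the backup: already in the backup
    ("A", "a2", "a3"),   # first record whose "before" image matches
    ("B", None, "b1"),   # add made after the backup
]

recovered, mismatches = forward_recover(restored, rxx)
print(recovered, mismatches)  # {'A': 'a3', 'B': 'b1'} 2
```

In this model, the two mismatches correspond to the recovery processing errors that are expected at the beginning of the RXX, and the recovered table contains every update made after the backup.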

So for your questions:

[re: use of LOCK and UNLOCK] These do not prevent updates from happening during the backup process. As has already been discussed, since any changes are captured on the LXX, a recovery will successfully handle the changes.
[re: Recovery starting time] As mentioned above, the "from" time should be the time of the spill job that created the first RXX file used in the recovery process.
[re: multiple updates and backouts] Since all updates are processed in the correct sequence as they originally happened, there is no concern about updates then backouts then further updates.
[re: Recovery errors and inconsistent data] As discussed above, the errors are only in the Recovery process due to Recovery encountering RXX "before" images that do not match the current database state. Once the correct starting point is found, there should be no more errors. Regarding CA 7 and these errors, CA 7 is not involved, as these are only in the recovery process which will complete before CA 7 is started.
[re: Quiesce] In general, there is no absolute control over when the Quiesce takes effect. If you issue the Quiesce TXN while a task is performing a long activity, the Quiesce function will wait until that task is complete or reaches a syncpoint, and the length of this wait cannot be known in advance. CA 7 was designed to keep each work unit as short as possible, so a long-running LUW from CA 7 is unlikely to be a factor.
In an ideal situation, you would freeze all your processing to take a backup. The Quiesce function provides this, but be aware that it also stops all database updates until you turn off the Quiesce. If the backup job runs for 20 minutes, all your applications will still appear to be running, but they will be in a wait state for those 20 minutes until the database is available for update once again. Your applications will therefore have a 20-minute delay every time a backup runs: the initiators will be held up, and all the jobs (and users) that need the database will be executing but likely swapped out, waiting for the database to become available. Many of our CA 7 customers do not use the Quiesce function.

In general, using the SEQ=PHYSICAL backup allows a hot backup to be taken quickly and without application interruption. Then, using the backup along with the RXX files, it is possible to recover the database completely and correctly to the desired point in time.
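The choice of RXX files and of the "from" time described in the recovery steps can be sketched in Python as follows. This is a planning aid under stated assumptions, not part of DBUTLTY: the select_rxx function, the dataset names, and all the timestamps are hypothetical examples.

```python
from datetime import datetime


def select_rxx(spills: list, backup_time: datetime) -> tuple:
    """Pick the RXX files needed for a Forward Recovery (illustrative only).

    spills is a list of hypothetical (dataset_name, spill_time) tuples.
    The selection starts with the closest spill before the backup and
    includes every later spill, up to the final manual spill.
    """
    ordered = sorted(spills, key=lambda s: s[1])
    before = [s for s in ordered if s[1] <= backup_time]
    after = [s for s in ordered if s[1] > backup_time]
    selected = before[-1:] + after
    if not selected:
        raise ValueError("no spill files available")
    from_time = selected[0][1]   # time of the spill that created the earliest RXX
    return selected, from_time


# Hypothetical dataset names and times, for illustration only.
spills = [
    ("EXAMPLE.RXX.G0001", datetime(2024, 1, 1, 6, 0)),
    ("EXAMPLE.RXX.G0002", datetime(2024, 1, 1, 12, 0)),
    ("EXAMPLE.RXX.G0003", datetime(2024, 1, 1, 18, 0)),
    ("EXAMPLE.RXX.G0004", datetime(2024, 1, 1, 23, 30)),  # manual spill after restart
]
backup_time = datetime(2024, 1, 1, 14, 0)

selected, from_time = select_rxx(spills, backup_time)
names = [name for name, _ in selected]
print(names, from_time)
```

With these example values, the selection starts at the 12:00 spill (the closest one before the 14:00 backup), and that spill time becomes the "from" time; the current time can then serve as the "to" time.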

Additional Information

For more information about Forward or Backward recovery, please see the Datacom Core documentation:

Datacom/DB Database and System Administration section "Using Recovery"

Datacom/DB DBUTLTY Reference section "RECOVERY (Rebuild a Database)"

Also see the Knowledge Base articles mentioned in the introduction:

Knowledge Base article 18722, titled "Overview of Datacom Forward and Backward Recovery"

Knowledge Base article 96542, titled "CA Workload Automation CA 7 Edition r12.0/12.1 Hot and Stable Backup and Recovery Procedures"

As always, please contact Broadcom support for Datacom if you have further questions.