Datacom DR: Recovery & Crash Test Guide

Article ID: 436333

Products

Datacom Datacom/AD Datacom/DB

Issue/Introduction

This article outlines how Broadcom Datacom on z/OS handles various failure scenarios, provides a high-level game plan for recovery, and offers a reliable, repeatable method for testing DR procedures without compromising the data.

Symptoms / Keywords:

  • RC 74(81) during MUF startup or restart
  • Disaster Recovery (DR) testing best practices
  • Forward Recovery, DBUTLTY SPILL, DBUTLTY RECOVERY
  • BCP/DR architecture

(Note: When users simulate a crash improperly, for example by deleting database files before starting the MUF, the MUF cannot open the files it needs to back out or reapply pending transactions during Restart processing. The usual result is an RC 74(81) error. The testing methodology below avoids this problem.)

Resolution

Note: This is a mid-level explanation and may require additional details tailored to your specific environment.

What happens when things break?

Understanding how the Datacom Multi-User Facility (MUF) and your applications respond to different outages is the first step in recovery planning:

Application CPU Failure:

  • If the application is running: The application terminates abnormally. Datacom MUF detects the loss of the application connection and automatically rolls back any in-flight, uncommitted transactions, ensuring data consistency.
  • If the application is down: No immediate database impact.

Datacom MUF CPU Failure:

  • If MUF is running: The MUF address space terminates abnormally. Upon restart, MUF automatically scans the Log Area (LXX) and rolls back any in-flight transactions that were interrupted by the failure.
  • If MUF is down: No immediate database impact.

Disk Device Failure for Application Databases:

  • Loss of the DASD housing the application data requires restoring the affected data areas from the last known valid backup and rolling forward using the archived Recovery Files (RXX) to catch up to the moment things failed.

Disk Device Failure for Core Datacom Files:

  • Directory (CXX) or Force Area (FXX): These can be restored from backups.
  • Log Area (LXX): If the active LXX is destroyed before being spilled to the RXX, any transactions recorded only in that active log, whether in flight or completed but not yet spilled, are lost. Forward recovery is only possible up to the last successful LXX spill to an RXX.

What about a true disaster (losing the whole data center)?

If the primary site goes completely dark, your recovery at the DR site depends on your infrastructure setup:

Active Data Replication (e.g., IBM PPRC, EMC SRDF):

  • If your storage team actively mirrors everything to the DR site, waking up Datacom looks like a standard MUF crash, significantly reducing your Recovery Time Objective (RTO).
  • Crucial Requirement: The CXX, LXX, RXX, and all Database/Index volumes must be mirrored together as a "consistency group." If they are out of sync, the database could be corrupted.
  • System Behavior: Datacom handles a startup on the mirrored DASD the same way it handles a restart after a MUF CPU failure. Upon startup, the MUF executes an emergency restart, scans the mirrored LXX, and automatically backs out interrupted in-flight transactions.
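
The consistency-group requirement above can be checked mechanically before a DR test. The following Python sketch is illustrative only: the inventory format, the volume serials, and the group names are all hypothetical, and in practice the mapping would come from your storage team's replication configuration rather than from Datacom itself.

```python
def volumes_outside_group(inventory, expected_group):
    """Return Datacom volumes that are not mirrored in the expected group.

    inventory maps a volume serial to a (file_role, consistency_group) pair.
    A group of None means the volume is not mirrored at all.
    """
    problems = []
    for volser, (role, group) in sorted(inventory.items()):
        if group != expected_group:
            problems.append((volser, role, group))
    return problems


# Hypothetical inventory: every name below is made up for illustration.
inventory = {
    "DCM001": ("CXX", "CG_DATACOM"),
    "DCM002": ("LXX", "CG_DATACOM"),
    "DCM003": ("RXX", "CG_OTHER"),   # mirrored, but in a different group
    "DCM004": ("DATA", "CG_DATACOM"),
    "DCM005": ("INDEX", None),       # not mirrored at all
}

for volser, role, group in volumes_outside_group(inventory, "CG_DATACOM"):
    print(f"{volser} ({role}) is in group {group!r}, expected 'CG_DATACOM'")
```

An empty result means every listed volume sits in the same group. It does not prove that the inventory itself is complete.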

Offsite Tape / Virtual Tape Vaulting:

  • If your architecture relies on vaulted backups without active DASD replication, you are performing a "cold" restore. Your Recovery Point Objective (RPO) is limited to the last archived RXX successfully transmitted offsite.
  • System Behavior: Nothing is rolled forward automatically. You must restore the CXX and core databases from the last offsite backup, then execute a Forward Recovery that applies all offsite RXX files in sequence.
  • Note: In a true disaster, do not attempt to salvage the last few seconds of data at the primary site. Work with known safe data at the DR site to guarantee integrity.
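
As a worked example of the RPO point above, the data-loss window in a vaulting setup is roughly the time between the newest RXX that reached the DR site and the moment of failure. The Python sketch below shows that arithmetic. The function name and the timestamps are hypothetical, and it assumes you record when each spill completed for every archive that was shipped offsite.

```python
from datetime import datetime


def data_loss_window(failure_time, offsite_rxx_times):
    """Return the time between the newest usable offsite RXX and the failure."""
    usable = [t for t in offsite_rxx_times if t <= failure_time]
    if not usable:
        raise ValueError("no RXX archive reached the DR site before the failure")
    return failure_time - max(usable)


# Hypothetical spill-completion times for archives that were vaulted offsite.
offsite = [
    datetime(2024, 5, 1, 2, 0),
    datetime(2024, 5, 1, 8, 0),
    datetime(2024, 5, 1, 14, 0),
]
failure = datetime(2024, 5, 1, 17, 30)

print("Work at risk:", data_loss_window(failure, offsite))
```

If the resulting window is larger than the RPO the business has agreed to, the usual levers are spilling and shipping the RXX more often or moving to active replication.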

The Game Plan: How to recover the applications

When restoring a database and rolling it forward, follow these steps:

  1. Stop the bleeding: Shut down any batch jobs or online regions connected to the database to prevent further corruption.
  2. Dump the log (if possible): If the MUF is still functional, run a DBUTLTY SPILL to write all completed, unarchived transactions to the RXX.
  3. Lay down the backup: Use DBUTLTY LOAD FORMAT=BACKUP to restore your last known good backup.
  4. Fast forward to the present: Run DBUTLTY RECOVERY OPTION=FORWARD using all RXX archive files in chronological order (oldest to newest).
  5. Double-check and go: Verify the data, restart the MUF, and have the application teams validate their systems.
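
Step 4 depends on feeding the RXX archives to the recovery in the right order. The Python sketch below shows one way to build that list from a site's own archive records. The `RxxArchive` record, the dataset names, and the rule of selecting archives spilled after the backup began are assumptions for illustration, not Datacom behavior. Confirm the correct starting archive for your backup type before relying on such a filter.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RxxArchive:
    dataset_name: str      # hypothetical dataset name
    spilled_at: datetime   # spill completion time from the site's own records


def forward_recovery_order(archives, backup_started_at):
    """Return the archives spilled after the backup began, oldest first."""
    needed = [a for a in archives if a.spilled_at > backup_started_at]
    ordered = sorted(needed, key=lambda a: a.spilled_at)
    times = [a.spilled_at for a in ordered]
    if len(times) != len(set(times)):
        raise ValueError("two archives share a spill time; order is ambiguous")
    return ordered


# Hypothetical archive records, deliberately listed out of order.
archives = [
    RxxArchive("PROD.RXX.G0003", datetime(2024, 5, 1, 14, 0)),
    RxxArchive("PROD.RXX.G0001", datetime(2024, 5, 1, 2, 0)),
    RxxArchive("PROD.RXX.G0002", datetime(2024, 5, 1, 8, 0)),
]
backup_started_at = datetime(2024, 5, 1, 6, 0)

for archive in forward_recovery_order(archives, backup_started_at):
    print(archive.dataset_name)
```

Raising an error on duplicate timestamps is a deliberate choice: an ambiguous order is safer to resolve by hand than to guess.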

How to test this process

To run a realistic, repeatable "crash test":

  1. Yank the plug: Cancel the MUF while it is processing normally to simulate a hard crash.
  2. Take a snapshot: While the MUF is down, run a DBUTLTY EXTRACT of each table to create a logical "current state" baseline.
  3. Run the recovery play: Attempt a final DBUTLTY SPILL to grab data from the active log, then perform the full recovery (restore an older backup and roll forward using RXX files).
  4. Check your work: Run another DBUTLTY EXTRACT on the recovered tables and perform a file compare against the baseline from Step 2.
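
The compare in Step 4 can be scripted once both sets of extracts are available as ordinary files. The Python sketch below is one possible approach, not a Datacom-supplied tool. It assumes the baseline and recovered extracts have been copied into two directories with one same-named file per table, and that the extract writes rows in a deterministic order. The directory layout and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_extracts(baseline_dir, recovered_dir):
    """Compare same-named files in two directories; return a status per file."""
    baseline = {p.name: p for p in Path(baseline_dir).iterdir() if p.is_file()}
    recovered = {p.name: p for p in Path(recovered_dir).iterdir() if p.is_file()}
    results = {}
    for name in sorted(set(baseline) | set(recovered)):
        if name not in recovered:
            results[name] = "missing after recovery"
        elif name not in baseline:
            results[name] = "not in baseline"
        elif file_digest(baseline[name]) == file_digest(recovered[name]):
            results[name] = "match"
        else:
            results[name] = "MISMATCH"
    return results
```

Calling `compare_extracts("baseline", "recovered")` returns a dictionary keyed by file name. Any value other than "match" marks a table to investigate before declaring the test a success. A byte-level compare on the mainframe itself, using whatever compare utility your site already has, serves the same purpose.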

Quick Terminology Cheat Sheet:

  • RPO (Recovery Point Objective): The maximum acceptable data loss measured in time.
  • RTO (Recovery Time Objective): The maximum acceptable time to get systems back online.
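  • PPRC / SRDF: Hardware-level storage replication technologies: IBM Peer-to-Peer Remote Copy (PPRC) and EMC Symmetrix Remote Data Facility (SRDF).
  • Consistency Group: A set of volumes that the storage system replicates or snapshots as a single unit, so that the copy of every volume reflects the same point in time.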

Additional Information

For more information about Datacom Forward and Backward recovery processing, please see KB 18722, titled Overview of Datacom Forward and Backward Recovery.

Please also see KB 279845, titled Tips for Datacom Backups, Log Spills, and Forward Recovery.

As always, please contact Broadcom support for Datacom if you have further questions.