Note: This is a mid-level explanation and may require additional details tailored to your specific environment.
What happens when things break?
Understanding how the Datacom Multi-User Facility (MUF) and your applications respond to different outages is the first step in recovery planning:
Application CPU Failure:
- If the application is running: The application terminates abnormally. Datacom MUF detects the loss of the application connection and automatically rolls back any in-flight, uncommitted transactions, ensuring data consistency.
- If the application is down: No immediate database impact.
Datacom MUF CPU Failure:
- If MUF is running: The MUF address space terminates abnormally. Upon restart, MUF automatically scans the Log Area (LXX) and rolls back any in-flight transactions that were interrupted by the failure.
- If MUF is down: No immediate database impact.
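The restart behavior described above can be pictured with a toy model. The sketch below is not Datacom's actual log format or restart logic; the record kinds and the `find_in_flight` helper are assumptions made purely for illustration. It scans a list of log records and reports which transactions began but never finished, which are the ones a restart would need to back out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    txn_id: str
    kind: str  # hypothetical record kinds: "BEGIN", "UPDATE", "COMMIT", "ROLLBACK"


def find_in_flight(log):
    """Return IDs of transactions that began but never committed or rolled back."""
    open_txns = []
    for rec in log:
        if rec.kind == "BEGIN":
            open_txns.append(rec.txn_id)
        elif rec.kind in ("COMMIT", "ROLLBACK") and rec.txn_id in open_txns:
            open_txns.remove(rec.txn_id)
    return open_txns


log = [
    LogRecord("T1", "BEGIN"),
    LogRecord("T1", "UPDATE"),
    LogRecord("T2", "BEGIN"),
    LogRecord("T1", "COMMIT"),
    LogRecord("T2", "UPDATE"),
    # Simulated crash here: T2 never reached COMMIT.
]

print(find_in_flight(log))  # ['T2']
```

In this model T1 survives the crash because its commit was logged, while T2 is the interrupted work that gets rolled back.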
Disk Device Failure for Application Databases:
- Loss of the DASD housing the application data requires restoring the affected data areas from the last known valid backup and rolling forward using the archived Recovery Files (RXX) to catch up to the moment things failed.
Disk Device Failure for Core Datacom Files:
- Directory (CXX) or Force Area (FXX): These can be restored from backups.
- Log Area (LXX): If the active LXX is destroyed before it has been spilled to the RXX, every log record that existed only in that active log is gone, including committed work that had not yet been archived. Forward recovery is only possible up to the last successful LXX spill to an RXX.
What about a true disaster (losing the whole data center)?
If the primary site goes completely dark, your recovery at the DR site depends on your infrastructure setup:
Active Data Replication (e.g., IBM PPRC, EMC SRDF):
- If your storage team actively mirrors everything to the DR site, waking up Datacom looks like a standard MUF crash, significantly reducing your Recovery Time Objective (RTO).
- Crucial Requirement: The CXX, LXX, RXX, and all Database/Index volumes must be mirrored together as a "consistency group." If they are out of sync, the database could be corrupted.
- System Behavior: Datacom treats the mirrored DASD identically to a MUF CPU failure. Upon startup, the MUF executes an emergency restart, scans the mirrored LXX, and automatically backs out interrupted in-flight transactions.
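The consistency-group requirement above lends itself to a pre-flight check. The following is a minimal sketch, assuming you can export a list of volumes with the Datacom file type each one holds and the replication group it belongs to. The volume serials, group names, and the `DATA` / `INDEX` role labels are hypothetical; how you obtain this inventory depends on your storage tooling.

```python
# Roles that must all be mirrored together (DATA and INDEX are illustrative labels).
REQUIRED_ROLES = {"CXX", "LXX", "RXX", "DATA", "INDEX"}


def check_consistency_group(volumes):
    """volumes maps a volume serial to a (role, group_name) tuple of strings.

    Returns a list of problem descriptions; an empty list means the check passed.
    """
    problems = []
    groups = {group for _, group in volumes.values()}
    if len(groups) > 1:
        problems.append(f"volumes span multiple groups: {sorted(groups)}")
    missing = REQUIRED_ROLES - {role for role, _ in volumes.values()}
    if missing:
        problems.append(f"roles not mirrored: {sorted(missing)}")
    return problems


# Hypothetical inventory: the RXX volume sits in a different group and no index volume is listed.
inventory = {
    "DCM001": ("CXX", "CG_DATACOM"),
    "DCM002": ("LXX", "CG_DATACOM"),
    "DCM003": ("RXX", "CG_OTHER"),
    "DCM004": ("DATA", "CG_DATACOM"),
}

for problem in check_consistency_group(inventory):
    print(problem)
```

A check like this only confirms group membership on paper. It does not prove the mirror is actually in sync at the moment of failure.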
Offsite Tape / Virtual Tape Vaulting:
- If your architecture relies on vaulted backups without active DASD replication, you are performing a "cold" restore. Your Recovery Point Objective (RPO) is limited to the last archived RXX successfully transmitted offsite.
- System Behavior: Nothing restarts automatically here. You must restore the CXX and core databases from the last offsite backup, then run a Forward Recovery that applies all offsite RXX files in sequence.
- Note: In a true disaster, do not attempt to salvage the last few seconds of data at the primary site. Work with known safe data at the DR site to guarantee integrity.
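To make the RPO point concrete for the vaulting scenario, here is a small arithmetic sketch. It assumes you know the time of the failure and the time covered by the last RXX that reached the offsite location; the function names and timestamps are illustrative only.

```python
from datetime import datetime, timedelta


def data_loss_window(failure_time, last_offsite_rxx_end):
    """Time span of work that cannot be recovered at the DR site."""
    return failure_time - last_offsite_rxx_end


def meets_rpo(failure_time, last_offsite_rxx_end, rpo):
    """True if the unrecoverable window is within the agreed RPO."""
    return data_loss_window(failure_time, last_offsite_rxx_end) <= rpo


# Hypothetical timeline: last RXX shipped offsite at 12:00, site lost at 14:30.
failure = datetime(2024, 1, 15, 14, 30)
last_rxx = datetime(2024, 1, 15, 12, 0)

print(data_loss_window(failure, last_rxx))               # 2:30:00
print(meets_rpo(failure, last_rxx, timedelta(hours=4)))  # True
print(meets_rpo(failure, last_rxx, timedelta(hours=1)))  # False
```

The practical takeaway is that the interval between offsite RXX shipments sets a floor on the RPO you can promise.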
The Game Plan: How to recover the applications
When restoring a database and rolling it forward, follow these steps:
- Stop the bleeding: Shut down any batch jobs or online regions connected to the database to prevent further corruption.
- Dump the log (if possible): If the MUF is still functional, run a DBUTLTY SPILL to write all completed, unarchived transactions to the RXX.
- Lay down the backup: Use DBUTLTY LOAD FORMAT=BACKUP to restore your last known good backup.
- Fast forward to the present: Run DBUTLTY RECOVERY OPTION=FORWARD using all RXX archive files in chronological order (oldest to newest).
- Double-check and go: Verify the data, restart the MUF, and have the application teams validate their systems before resuming normal processing.
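The "oldest to newest" rule in the roll-forward step is worth checking before you submit the job, since a missing or misordered RXX can derail the recovery. The sketch below assumes your site tracks a sequence number and creation time for each archived RXX. The `RxxArchive` record, the `spill_seq` field, and the dataset names are hypothetical and are not something DBUTLTY provides in this form.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RxxArchive:
    dataset: str       # hypothetical dataset name
    spill_seq: int     # hypothetical site-tracked spill sequence number
    created: datetime  # when the spill to this RXX completed


def plan_forward_recovery(archives, backup_time):
    """Return the archives created after the backup, oldest first.

    Raises ValueError if the sequence numbers show a gap.
    """
    needed = sorted(
        (a for a in archives if a.created > backup_time),
        key=lambda a: a.spill_seq,
    )
    for prev, cur in zip(needed, needed[1:]):
        if cur.spill_seq != prev.spill_seq + 1:
            raise ValueError(
                f"missing RXX between {prev.dataset} and {cur.dataset}"
            )
    return needed


archives = [
    RxxArchive("RXX.G0003", 3, datetime(2024, 1, 15, 18, 0)),
    RxxArchive("RXX.G0001", 1, datetime(2024, 1, 15, 6, 0)),
    RxxArchive("RXX.G0002", 2, datetime(2024, 1, 15, 12, 0)),
]
backup_time = datetime(2024, 1, 15, 8, 0)

plan = plan_forward_recovery(archives, backup_time)
print([a.dataset for a in plan])  # ['RXX.G0002', 'RXX.G0003']
```

Filtering by creation time is a simplification: an archive that straddles the backup is still included because its spill completed after the backup was taken.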
How to test this process
To run a realistic, repeatable "crash test":
- Yank the plug: Cancel the MUF while it is processing normally to simulate a hard crash.
- Take a snapshot: While the MUF is down, run a DBUTLTY EXTRACT of each table to create a logical "current state" baseline.
- Run the recovery play: Attempt a final DBUTLTY SPILL to grab data from the active log, then perform the full recovery (restore an older backup and roll forward using RXX files).
- Check your work: Run another DBUTLTY EXTRACT on the recovered tables and compare each output file against the baseline taken in the snapshot step.
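The final comparison can be automated once both sets of extract files are available on a file system. This is a minimal sketch, assuming the baseline and recovered extracts have been written to two directories with one file per table under matching names; the directory layout and the `compare_extracts` helper are assumptions, not part of DBUTLTY.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large extracts do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_extracts(baseline_dir, recovered_dir):
    """Return a sorted list of file names that differ or exist on only one side."""
    baseline = {p.name: p for p in Path(baseline_dir).iterdir() if p.is_file()}
    recovered = {p.name: p for p in Path(recovered_dir).iterdir() if p.is_file()}
    mismatches = []
    for name in sorted(baseline.keys() | recovered.keys()):
        if name not in baseline or name not in recovered:
            mismatches.append(name)
        elif file_digest(baseline[name]) != file_digest(recovered[name]):
            mismatches.append(name)
    return mismatches
```

An empty list from `compare_extracts("baseline", "recovered")` means every table matched byte for byte. Any name in the list is a table to investigate before declaring the test a success.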
Quick Terminology Cheat Sheet:
- RPO (Recovery Point Objective): The maximum acceptable data loss measured in time.
- RTO (Recovery Time Objective): The maximum acceptable time to get systems back online.
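- PPRC / SRDF: Hardware-level storage replication technologies: IBM Peer-to-Peer Remote Copy (PPRC) and EMC Symmetrix Remote Data Facility (SRDF).
- Consistency Group: A set of volumes that the storage subsystem replicates as one unit, preserving the order of dependent writes across them so the remote copy reflects a single point in time.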