Chapter 10 Recovery

Overview

Recovery is the process of restoring your database to a valid state after some type of failure. There are two types of failure:

Table 10-1: System and media failures and what recovery requires

This type of failure | Occurs when . . . | And requires . . .
System failure | the CPU or the Kernel terminates abnormally | minimal work; most recovery tasks are taken care of automatically by the Kernel.
Media failure | a disk crashes, a file becomes damaged, or a faulty program or user corrupts the data. | backup copies and several steps to accomplish recovery.

Recovery from a System Failure

An example of the need for recovery from a system failure is when you change a record but the Kernel crashes while committing the changes to the database. When the Kernel terminates abnormally, an error message such as the one below is issued:

    USER ERROR( 4006). No active kernel exists for the given kernel name.

Recovery is fairly easy because, before beginning a transaction, the Kernel saves in the transaction log “before” images of the records it is about to change. The system administrator needs only to restart the Kernel with DMSA:

    DMSA> BEGIN KERNEL=DMKA1 PW=kernelpassword

Upon restarting, the Kernel checks the transaction log and restores the original record to the database. You now need to complete the processing of records you were working on when the failure occurred. Consider the following when deciding which transactions to complete:

- If you were performing or your program was performing a FINISH or ABORT when the failure occurred, the FINISH/ABORT processing was completed.
- If you or your program had not reached a FINISH or ABORT command at the time of the failure, the processing was not completed.

FINISH processing consists of compressing deleted record space and deleting index terms and references.
ABORT processing consists of restoring “before” images of deleted or modified records, deleting added records, and deleting added index terms and references.

Essentially, recovery processing involves a setup phase, during which the Kernel is only performing recovery, and an online phase, during which the Kernel is available for user sign-ons.

The Setup Phase

The setup phase establishes the environments for the transactions to be completed. Exclusive locks are set on the databases involved in transactions. If two transactions to be recovered have the same database opened, then one will be queued via the database lock until the other is complete.

Because much of the Kernel I/O is deferred, b-tree structures need to be checked for correctness and mended if necessary. Operations which added or deleted index terms need to be reflected in the b-trees. For all the index terms changed during a transaction, recovery will verify their reference counts.

Reference verification requires reading all of the references for each index term and updating the term’s count. This is done only once for each term no matter how many references were changed. However, this can take a great deal of time if there are a large number of references. Reference count verification takes place only during the setup phase.

Once the environments for the transaction FINISH/ABORT requests have been established, the following message will appear in the Kernel’s output file:

    "I 1559 Recovery setup phase finished."

It will be followed by the messages:

    "I 1515 Kernel Number xx started on Mon dd, yyyy hhmmss"
    "I with xx IPC lines and xx NET lines."

The Kernel is now ready for user sign-ons. The Sort Kernel also is now running.

The Online Phase

In the online phase of recovery, the ABORT/FINISH transaction requests set up by recovery will be processed as special users. Each will be allocated a communication line to indicate its existence.
The DMSA command SHOW/SIGNONS will list these lines as well as the Sort Kernel and any connected users. The DMSA DISCONNECT line command will have no effect on the recovery lines.

Note: The recovery setup phase establishes exclusive locks on the databases involved in recovering transactions. Users who sign on to the Kernel during the online phase and attempt to open those databases will have to wait for the recovery transactions to complete. If a thesaurus database or SYSRES is involved in a recovering transaction and your open database request causes an implicit open of one of these databases, you will have to wait for completion of that transaction.

Forgoing Recovery

The setup phase of recovery can take quite a long time depending on the size and number of transactions involved and on the machine resources available. You can tell the Kernel process is in the recovery setup phase if it is up, is using the CPU, and is issuing I/Os but cannot be connected to.

You may want to consider forgoing recovery. Before doing this, you must be prepared to restore all databases open for update and to replay your journals. Even if the databases were not involved in a transaction at the time the Kernel was destroyed, the database still must be restored if recovery is not run.

To find out what databases need to be restored, the SA can examine the Kernel’s output file from the time the Kernel was destroyed. Near the end of the file, all open databases are listed.
To forgo recovery, the following files should be removed:

NT
    System Log File               %DM_RUNDMK%\SYSLOGn.dml
    Recovery Files                %DM_RUNDMK%\kun000xxx.dmt
    Result Set Files              %DM_RUNDMK%\kun00yyyy.dmt
    Result Set Reference Files    %DM_RUNDMK%\kun00zzzz.dmt

UNIX
    System Log File               $DM_RUNDMK/SYSLOGn.dml
    Recovery Files                $DM_RUNDMK/kun000xxx.dmt
    Result Set Files              $DM_RUNDMK/kun00yyyy.dmt
    Result Set Reference Files    $DM_RUNDMK/kun00zzzz.dmt

VMS
    System Log File               DM$RUNDMK:SYSLOGn.DML
    Recovery Files                DM$RUNDMK:KUn000xxx.DMT
    Result Set Files              DM$RUNDMK:KUn00yyyy.DMT
    Result Set Reference Files    DM$RUNDMK:KUn00zzzz.DMT

where the letters n, xxx, yyyy, and zzzz signify the following values:

    Variable    Equals
    n           the kernel number
    xxx         000:511
    yyyy        1024:1535
    zzzz        512:1023

Result Set Files and Result Set Reference Files are not used in recovery but still should be removed for the Kernel to operate properly once restarted.

Note that “removed” does not necessarily mean “deleted.” You may want to rename the files or move them to another location in case you change your mind about forgoing recovery.

You can also skip recovery for selected databases by simply deleting or renaming one of the database files required for the database. The Kernel will issue an error message that a database file is missing and that the database cannot be recovered. When skipping recovery, you must restore the database from backup and replay the journals (see below).

Recovery from a Media Failure

A media failure means that a disk crashes, a file becomes damaged, or a faulty program or a user corrupts the data.

The first action you should take upon media failure is to back up the Definition Database (DDB) and Record Database (RDB) files and their associated journals, even though they may be damaged. Some of these files may be needed later, and the backup will also allow you to start over if necessary. Further steps to take to recover from the types of media failure above are detailed in the sections that follow.
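The "first action" described above (setting aside copies of the DDB, RDB, and journal files before attempting any repair) might be scripted roughly as follows. This is a minimal sketch in Python rather than a BASIS utility; the function name `preserve_files`, the `pre-recovery-` directory prefix, and the file names in the demo are all assumptions made for illustration. Files that share a base name but live in different directories would overwrite each other in this simple version.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def preserve_files(files, safe_root):
    """Copy each file into a new timestamped directory under safe_root.

    Returns (target_dir, copied, failed). A file that cannot be read is
    recorded in `failed` instead of aborting the run, because some of the
    files may be damaged or missing after a media failure.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(safe_root) / f"pre-recovery-{stamp}"  # hypothetical naming
    target.mkdir(parents=True)
    copied, failed = [], []
    for src in map(Path, files):
        try:
            shutil.copy2(src, target / src.name)
            copied.append(src)
        except OSError:
            failed.append(src)
    return target, copied, failed


if __name__ == "__main__":
    # Demo with throwaway files; the names are placeholders, not real paths.
    with tempfile.TemporaryDirectory() as work:
        ddb = Path(work) / "mydb.ddb"
        ddb.write_bytes(b"definition data")
        missing_rdb = Path(work) / "mydb.rdb"  # simulates an unreadable file
        target, copied, failed = preserve_files(
            [ddb, missing_rdb], Path(work) / "safe"
        )
        print("copied:", [p.name for p in copied])
        print("failed:", [p.name for p in failed])
```

Recording failures instead of stopping reflects the advice that the files should be saved "even though they may be damaged": whatever can still be read is kept, and the rest is listed for follow-up.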
The success of your recovery largely depends upon your backup strategy. For information about backup strategies, see Database Definition and Development, “Providing Backup and Recovery Capability for a Complex SDM.”

Restoring Databases

When restoring databases, you must first consider what was damaged: the DDB, the RDB, or both.

Only the DDB is Damaged

If the DDB is the only file in the database that was damaged, and no changes have been made to the DDB since the last backup, take the following steps:

1. Restore the most recent backup of the DDB.
2. Restore any DDB journals backed up since the DDB backup. Normally, you would have only the current journals if your backup jobs back up all files when journals switch. If not, ensure that you don’t overwrite the current journals with the restored journals.
3. Replay the journals with the use of DMJ. List all of the journal files containing updates to the DDB since the backup on the JOURNAL parameter. DMJ will determine the correct order.

Only the RDB is Damaged

Note: DMQ and FQM/BCL are not available in BASIS on Windows.

If one or more of the RDB files is damaged, but the DDB, DDB journals, and RDB journals are undamaged, and no changes have been made to the DDB since the last backup, take the following steps:

1. Restore the most recent backup of the RDB.
2. Restore any RDB journals created and backed up since the RDB backup. Normally you would have only the current journals (journal A and journal B) if your backup jobs are set up to back up all files when journals switch. If not, ensure that you don’t overwrite the current journals with the restored journals.
3. Reproduce RDB updates:
   - If all updates were done with OPEN/DB (that is, there was no use of DMR ACTION={DROP | CREATE | COMPRESS | DEFRAG}, HVU OPEN/DIRECT, DMQ ACTION=UPDATE, or other FQM/BCL updates with JOURNAL=NO), replay RDB journals with DMJ.
     List all of the journal files containing updates to the RDB since the backup on the JOURNAL parameter. DMJ will determine the correct order.
   - If some updates but not all were done with OPEN/DB processing, a backup of the database should exist from a time after the non-OPEN/DB processing. Double-check to make sure it exists. If it does, locate the appropriate backups and follow steps 1 and 2 above. If not, restoration will not be straightforward, and consistency of data cannot be guaranteed since the mix of OPEN/DB and OPEN/DIRECT operations previously performed must be repeated in exactly the same sequence.

     If the backup of the database does not exist, backup procedures have not been defined or executed properly, and you need to take great care to control manually how the restoration is conducted. The only way this can be done is to replay the journals iteratively up to the sync point where non-OPEN/DB processing occurred, repeat the non-OPEN/DB processing, and then repeat these steps until complete.

     a. Using DMJ, iteratively replay the journals in order of age, beginning with the oldest, and perform non-OPEN/DB processing. Repeat until all updates since the last backup are complete.
        1. Use DMJ to replay the journals up to the first non-OPEN/DB sync point.
        2. Repeat the non-OPEN/DB action with HVU, DMQ, or DMR.
        3. Check results (use FINDs and displays, LOOKs, BROWSEs, or other facilities as necessary) to ensure that the sequence of operations is, to the best of your knowledge, correct.
     b. Create a backup of the whole database and reset all journals.

When complete, if all results are not satisfactory, but what you have is the best data available, you should at least rebuild all your indexes. If the data records are suspect, a full dump (or salvage) and reload may be required.
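The iterative procedure for a database with mixed OPEN/DB and non-OPEN/DB updates is easier to follow when written out as an ordered checklist. The Python sketch below builds one from a chronological list of updates. It does not run DMJ, HVU, DMQ, or DMR; the `Event` record, the function name, and the timestamps in the usage note are hypothetical and exist only to illustrate the ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One update since the last backup (hypothetical record for this sketch)."""
    when: str         # sortable timestamp string
    journaled: bool   # True for OPEN/DB updates, False for non-OPEN/DB actions
    description: str


CHECK = "Check results (FINDs and displays, LOOKs, BROWSEs, or other facilities)"


def plan_restoration(events):
    """Return the manual steps in order: replay journals up to each
    non-OPEN/DB sync point, repeat that action, check, then finish with a
    full backup and journal reset."""
    steps = []
    pending_replay = False
    for ev in sorted(events, key=lambda e: e.when):
        if ev.journaled:
            pending_replay = True   # covered by the next journal replay
            continue
        if pending_replay:
            steps.append(f"Replay journals with DMJ up to the sync point at {ev.when}")
            pending_replay = False
        steps.append(f"Repeat non-OPEN/DB action: {ev.description}")
        steps.append(CHECK)
    if pending_replay:
        steps.append("Replay the remaining journals with DMJ")
        steps.append(CHECK)
    steps.append("Create a backup of the whole database and reset all journals")
    return steps
```

For example, given one journaled update at `"t1"`, a `DMR ACTION=COMPRESS` at `"t2"`, and another journaled update at `"t3"`, the plan is: replay to `t2`, repeat the compress, check, replay the remaining journals, check, and then back up and reset. Getting this order right matters because, as noted above, the original mix of operations must be repeated in exactly the same sequence.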
Both the DDB and RDB are Damaged

If no ADM or SDM restructure changes (that is, RES-N/A or RES changes) were made since the last backup, restore the DDB and replay DDB journals, and then restore the RDB and replay RDB journals, as described above. Otherwise, backup procedures have not been defined or executed properly, and you need to take great care to control manually how the restoration is conducted.

1. Restore the DDB and journals as described above for DDBs.
2. Restore the RDB and journals as described above for RDBs.
3. For each sync point in the DDB journals, do the following:
   a. Replay the DDB journals with DMJ to the next sync point.
   b. Replay the RDB journals with DMJ to the corresponding sync point in the DDB. The same considerations described above for non-OPEN/DB processing apply here as well.
   c. Check results (use FINDs and displays, LOOKs, BROWSEs, or other facilities as necessary) to ensure that the sequence of operations is, to the best of your knowledge, correct.

When complete, if all results are not satisfactory, but what you have is the best data available, you should at least rebuild all your indexes. If the data records are suspect, a full dump (or salvage) and reload may be required.

Backup and Recovery of Different Versions of a Database

Use the DMJ utility’s VERSION parameters to specify a version other than the default (version 0). Be particularly aware of the version suffixes given to the various database file names.

Recovering a Multiple-File Database

1. If the Definition Database has been damaged, restore the most recent copy of it.
2. If the Definition Database has been modified since the backup date and Definition Database journals are available, use DMJ to replay the changes to the Definition Database.
3. Restore the backup copies of the damaged file sets.
4. Restore the necessary journals.
5. Use DMJ to replay the journals against the database. Only the changes related to the damaged files are replayed.
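The conditions in the multiple-file procedure above can be expressed as a small decision function. The Python sketch below only assembles a checklist; it performs no restore or replay. The function and parameter names are hypothetical, and it assumes (this is an interpretation, not something the procedure states) that step 2 applies only when the Definition Database was restored in step 1 and that steps 3 to 5 are skipped when no file sets are damaged.

```python
def multifile_recovery_steps(ddb_damaged, ddb_modified_since_backup,
                             ddb_journals_available, damaged_file_sets):
    """Return the recovery steps that apply, in the order they should run."""
    steps = []
    if ddb_damaged:
        steps.append("Restore the most recent copy of the Definition Database")
        # Assumption: replaying DDB journals only makes sense after a restore.
        if ddb_modified_since_backup and ddb_journals_available:
            steps.append("Use DMJ to replay the changes to the Definition Database")
    if damaged_file_sets:
        names = ", ".join(damaged_file_sets)
        steps.append(f"Restore the backup copies of the damaged file sets: {names}")
        steps.append("Restore the necessary journals")
        steps.append("Use DMJ to replay the journals against the database")
    return steps


if __name__ == "__main__":
    # "FS1" is a placeholder file set name.
    for step in multifile_recovery_steps(True, True, True, ["FS1"]):
        print(step)
```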
Recovering from Data Corruption

On occasion it may be necessary to remove updates from your database, an option available only if you have specified SAVE_BEFORE_IMAGES=YES in the JOURNAL definition. (If you have not specified this in your JOURNAL definition, see “Backing Out Transactions from Journals That Do Not Have ‘Before’ Images” later in this section.) SAVE_BEFORE_IMAGES=YES allows you to use the BACKOUT action with DMJ.

Backing Out to a Sync Point

The simplest procedure for backing out the most recent updates in the current journal is the following:

1. Use DMJ ACTION=SYNCLIST, JOURNAL=file spec for each journal file to find out which journal is currently being used.

    => DMJ ACTION=SYNCLIST,JOURNAL=file spec for journal
    E1MC.OJB
    21-Dec-1988 14:45:00
    21-Dec-1988 14:45:50
    21-Dec-1988 14:47:28
    21-Dec-1988 14:48:25
    21-Dec-1988 14:48:29
    21-Dec-1988 14:50:29
    NORMAL TERMINATION - DMJ

2. Use DMJ ACTION=BACKOUT, JOURNAL=file spec for each journal file to back the transactions out of the database.

    => DMJ UID=uid,UPW=upw,DB=db,ACTION=BACKOUT, +
       JOURNAL=file spec for journal
    All committed images up to the sync point at 21-Dec-1993 14:45:00 were backed out from the database.
    NORMAL TERMINATION - DMJ

The changes you just made to the database will have been removed and the database will be as if the updates had never been entered.

Backing Out to a Specific Time

You can also specify a particular synchronization point to back out to, which is useful when you don’t want to back out all the way to the beginning of the journal.

1. Determine exactly which synchronization point from the SYNCLIST you want to back out to (see step 1 of “Backing Out to a Sync Point”). You may have to estimate which synchronization point represents the update you’re interested in.
2.
Use DMJ ACTION=BACKOUT, JOURNAL=file spec for each journal file and include the TIME and DATE parameters for the sync point.

    => DMJ UID=uid,UPW=upw,DB=db,ACTION=BACKOUT, +
       JOURNAL=file spec for journal,TIME=144500, +
       DATE=21-Dec-93
    DMJ V1 R8 870928 LIB(D630 S108 A160) [E1D]
    All committed images up to the sync point at 21-Dec-1988 14:45:50 were backed out from the database.
    NORMAL TERMINATION - DMJ

Backing Out Transactions from Restored Journals

If you discover that several days’ or weeks’ updates are bad and that the journals have since been backed up and are in backup sets, you can still restore the journal files and back out the transactions to a specific synchronization point.

1. Restore the journals from their backup sets into files.
2. Use DMJ to back out the transactions in these files (see step 2 of “Backing Out to a Specific Time”). Back out the most recent journal file first and then move “backward in time” until you reach the point at which you determine that the bad data has been removed.

Backing Out Transactions from Journals That Do Not Have “Before” Images

If you did not specify that the journals save before-images, you can do an effective backout by restoring an old copy of the database and journals and replaying forward up to the point you would have backed out to.

[Figure 10-1: REPLAY versus BACKOUT. A timeline marked “Data OK up to here” and “Data bad after here”: you can REPLAY "forward" if you did not save before-images, or BACKOUT "backward" if you have before-images.]

Salvage

A time may come when a database is marked as damaged and no backups exist. In this case, one must resort to dumping and reloading the database. However, if the database is damaged, attempts to export data may be blocked with some system error. In this case, you may attempt a SALVAGE operation. SALVAGE is one of the INTENT options on the OPEN command in HVU. If you open the database with this intent, the Kernel will ignore any damaged indicators in the database files, allowing you to attempt an export.
This does not always mean the export will succeed, however. SALVAGE should be used only as a last resort because the export file it produces may contain bad data, depending on the nature of the error causing the damage in the first place. Carefully examine the export file, particularly the records you suspect may have caused a problem.

When exporting data, the Kernel can access records in several ways:

- It can use the index of the primary key
- It can use the index of some other field
- Or it can use the record type index

Each of these paths can be tried in order to get a successful SALVAGE dump.

- To use the primary key index, create the result set with an ORDER BY primary key
- To use another index, create the set using an ORDER BY field
- To use the record type index, create the set with only a FIND view

If SALVAGE cannot export the data and no backups exist, the only option is to reinitialize the database and rebuild.

Backup and Recovery Guidelines and Considerations

For more information about backing up your databases, see Database Definition and Development, “Providing Backup and Recovery Capability for a Complex SDM.”

Use the operating system’s utilities to manage backups. A site should have scheduled backups; don’t depend on the System backups.

Guidelines for backups when you are managing them yourself are as follows:

1. Write a script to execute DMSA to DEACTIVATE the database and DISCONNECT the users. Everyone must be out of the database during backups. Otherwise, the backup files will be damaged or unreadable when the database is restored.

    DMSA> ASSIGN/DB DB=dbname DEACTIVATE=YES
    DMSA> DISCONNECT KERNEL=kernel_name DB=dbname
    DMSA> EXIT

2. Issue the appropriate operating system commands to back up the files. To identify these files, examine the DDL for the relevant database or run DMDDBE on the relevant database with ACTION=ONE and OBJECT_TYPE=FILES_ALL to extract the list of files.
3.
For ease of use, all database files (.DDB and .RDB) in a particular directory should go into the same save set.
4. Journal files should be put in a save set different from that of the database files.
5. Once all the database files and journal files are backed up, run DMJ to reset the journal files:

    DMJ DB=db_name ACTION=RESET JOURNALS=ALL

6. When the backup is complete, reactivate the database:

    DMSA UID=xxxxx UPW=xxxxx AIDS=NO
    ASSIGN/DB DB=db_name DEACTIVATE=NO

How often should you do a backup? The answer depends on how quickly you must recover. The larger the journal file, the longer the recovery process will take to complete.

Journal files should be large enough that they will never fill up before the next backup. If a journal does fill up, the database will be unusable until the journals are reset. If you want to be notified when Journal A fills up, the A_BACKUP_JOB parameter on the SDM JOURNAL statement can be used. Use this parameter to send a message to yourself when the journal needs to be backed up and reset. The same should be done for Journal B using the B_BACKUP_JOB parameter.

Note: A_BACKUP_JOB and B_BACKUP_JOB parameters and DMDBA screen mode are not available on Windows. Journal backups must be performed using Windows operating system utilities.

When a recovery is necessary, restore the most recent backup set of the .DDB and .RDB files, but not the journals. The journal files on the disk are the most current. Use them to replay the journals (via DMJ) to bring the database back into sync. Run DMJ:

    DMJ UID=xxxxx UPW=xxxxx DB=db_name ACTION=REPLAY +
        JOURNALS=db_name.RJA db_name.RJB

The file sets should be backed up whenever changes are made to the DDB that cause the following message to be displayed:

    “The file sets that have changed should be backed up so that media recovery can be done successfully.”

Some changes can be made to a Definition Database that will make the most recent backup unusable if it is needed for recovery.
An example of this would be changing from a numeric field to a larger character field in DMDBA.
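The numbered backup guidelines earlier in this section depend on doing things in the right order: deactivate and disconnect, back up, reset the journals, and only then reactivate. The Python sketch below lays that order out as data so a site script can be checked against it. It runs nothing itself. The DMSA and DMJ command text is taken from the guidelines above; the function name, the `(tool, action)` pair layout, and the wording of the operating system steps are assumptions, since the actual backup commands depend on the platform.

```python
def backup_sequence(db_name, kernel_name):
    """Return (tool, action) pairs in the order given by the backup guidelines."""
    return [
        ("DMSA", f"ASSIGN/DB DB={db_name} DEACTIVATE=YES"),
        ("DMSA", f"DISCONNECT KERNEL={kernel_name} DB={db_name}"),
        ("DMSA", "EXIT"),
        # Placeholders: substitute the site's own operating system commands.
        ("OS", "back up the .DDB and .RDB files into one save set"),
        ("OS", "back up the journal files into a different save set"),
        ("DMJ", f"DMJ DB={db_name} ACTION=RESET JOURNALS=ALL"),
        ("DMSA", f"ASSIGN/DB DB={db_name} DEACTIVATE=NO"),
    ]


if __name__ == "__main__":
    # "mydb" and "DMKA1" are example values only.
    for tool, action in backup_sequence("mydb", "DMKA1"):
        print(f"{tool:5} {action}")
```

Keeping the journal reset after both backup steps and before reactivation mirrors steps 5 and 6 of the guidelines: the journals are cleared only once they are safely in a save set, and users are let back in last.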