DLM Chapter 10: Recovery

Chapter 10
Recovery
Overview
Recovery is the process of restoring your database to a valid state after some type of
failure. There are two types of failure:
Table 10-1: System and media failures and what recovery requires

This type of failure   Occurs when . . .                  And requires . . .
System failure         the CPU or the Kernel terminates   minimal work; most recovery tasks are
                       abnormally                         taken care of automatically by the
                                                          Kernel.
Media failure          a disk crashes, a file becomes     backup copies and several steps to
                       damaged, or a faulty program or    accomplish recovery.
                       user corrupts the data.
Recovery from a System Failure
An example of the need for recovery from a system failure is when you change a record
but the Kernel crashes while committing the changes to the database. When the Kernel
terminates abnormally, an error message such as the one below is issued:
USER ERROR( 4006). No active kernel exists for the given kernel name.
Recovery is fairly easy because, before beginning a transaction, the Kernel saves
“before” images of the records it is about to change in the transaction log. The
system administrator needs only to restart the Kernel with DMSA:
DMSA> BEGIN KERNEL=DMKA1 PW=kernelpassword
Upon restarting, the Kernel checks the transaction log and restores the original record to
the database. You now need to complete the processing of records you were working on
when the failure occurred. Consider the following when deciding which transactions to
complete:

- If you or your program was performing a FINISH or ABORT when the failure
  occurred, the FINISH/ABORT processing was completed.
- If you or your program had not reached a FINISH or ABORT command at the time
  of the failure, the processing was not completed.
FINISH processing consists of compressing deleted record space and deleting index terms
and references. ABORT processing consists of restoring “before” images of deleted or
modified records, deleting added records, and deleting added index terms and references.
Essentially, recovery processing involves a setup phase, during which the Kernel is only
performing recovery, and an online phase, during which the Kernel is available for user
sign-ons.
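The role of the “before” images can be illustrated with a small model. The Python sketch below is not the Kernel's implementation: the dictionary standing in for the database, the TransactionLog class, and every function name are hypothetical. It only shows why saving a record's prior state before changing it is enough to undo an unfinished transaction on restart.

```python
# Toy model of before-image recovery. All names are hypothetical; the real
# Kernel's log format and algorithms are not described in this chapter.

class TransactionLog:
    def __init__(self):
        # Each entry is (key, before_image); before_image is None when the
        # record did not exist before the transaction touched it.
        self.entries = []
        self.finished = False

    def save_before_image(self, db, key):
        self.entries.append((key, db.get(key)))

    def mark_finished(self):
        self.finished = True


def update(db, log, key, value):
    """Save the before image first, then change the record."""
    log.save_before_image(db, key)
    db[key] = value


def recover(db, log):
    """On restart, undo a transaction that never reached FINISH."""
    if log.finished:
        return
    # Undo in reverse order so the oldest before image is applied last.
    for key, before in reversed(log.entries):
        if before is None:
            db.pop(key, None)   # record was added by the transaction
        else:
            db[key] = before    # record was modified by the transaction
    log.entries.clear()


db = {"rec1": "original"}
log = TransactionLog()
update(db, log, "rec1", "changed")
update(db, log, "rec2", "new record")
# Simulated crash here: FINISH was never reached.
recover(db, log)
print(db)
```

After the simulated crash, the modified record is back to its original value and the added record is gone, which mirrors the ABORT processing described above.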
The Setup Phase
The setup phase establishes the environments for the transactions to be completed.
Exclusive locks are set on the databases involved in transactions. If two transactions to
be recovered have the same database opened, then one will be queued via the database
lock until the other is complete.
Because much of the Kernel I/O is deferred, b-tree structures need to be checked for
correctness and mended if necessary. Operations which added or deleted index terms
need to be reflected in the b-trees. For all the index terms changed during a transaction,
recovery will verify their reference counts.
Reference verification requires reading all of the references for each index term and
updating the term’s count. This is done only once for each term no matter how many
references were changed. However, this can take a great deal of time if there are a large
number of references. Reference count verification takes place only during the setup
phase.
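The cost described above can be pictured with a toy model. In the sketch below the index layout (a list of record references per term plus a stored count) is an assumption made purely for illustration, as is the function name. It shows each changed term being recounted once, however many operations touched it, and the work growing with the number of references read.

```python
# Toy model of reference-count verification. The index layout here is an
# assumption for illustration, not the Kernel's b-tree format.

def verify_reference_counts(references, counts, changed_terms):
    """Recount references once per changed term and fix the stored count.

    Returns the number of reference entries read, which is what makes
    this step slow for terms with many references.
    """
    entries_read = 0
    for term in set(changed_terms):          # each term is verified only once
        refs = references.get(term, [])
        entries_read += len(refs)
        counts[term] = len(refs)
    return entries_read


references = {"apple": [1, 4, 9], "pear": [2], "plum": [3, 5]}
counts = {"apple": 2, "pear": 1, "plum": 2}   # "apple" is stale after the crash
# "apple" was changed by three operations in the transaction, "pear" by one.
read = verify_reference_counts(references, counts, ["apple", "apple", "pear", "apple"])
print(counts, read)
```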
Once the environments for the transaction FINISH/ABORT requests have been
established, the following messages will appear in the Kernel’s output file:
"I 1559 Recovery setup phase finished."
This is followed by the messages:
"I 1515 Kernel Number xx started on Mon dd, yyyy hhmmss"
"I      with xx IPC lines and xx NET lines."
The Kernel is now ready for user sign-ons. The Sort Kernel also is now running.
The Online Phase
In the online phase of recovery, the ABORT/FINISH transaction requests set up by
recovery will be processed as special users. Each will be allocated a communication line
to indicate its existence. The DMSA command SHOW/SIGNONS will list these lines as
well as the Sort Kernel and any connected users. The DMSA DISCONNECT line
command will have no effect on the recovery lines.
Note: The recovery setup phase establishes exclusive locks on the databases involved
in recovering transactions. Users who sign on to the Kernel during the online phase and
attempt to open those databases must wait for the recovery transactions to complete.
If a thesaurus database or SYSRES is involved in a recovering transaction and your open
database request causes an implicit open of one of these databases, you will have to wait
for completion of that transaction.
Forgoing Recovery
The setup phase of recovery can take quite a long time depending on the size and number
of transactions involved and on the machine resources available. You can tell the Kernel
process is in the recovery setup phase if it is up, is using the CPU, and is issuing I/Os but
cannot be connected to.
You may want to consider forgoing recovery. Before doing this, you must be prepared to
restore all databases open for update and to replay your journals. Even if the databases
were not involved in a transaction at the time the Kernel was destroyed, the database still
must be restored if recovery is not run.
To find out what databases need to be restored, the SA can examine the Kernel’s output
file at the time the Kernel was destroyed. Near the end of the file, all open databases are
listed.
To forgo recovery, remove the following files:

NT
    System Log File               %DM_RUNDMK%\SYSLOGn.dml
    Recovery Files                %DM_RUNDMK%\kun000xxx.dmt
    Result Set Files              %DM_RUNDMK%\kun00yyyy.dmt
    Result Set Reference Files    %DM_RUNDMK%\kun00zzzz.dmt
UNIX
    System Log File               $DM_RUNDMK/SYSLOGn.dml
    Recovery Files                $DM_RUNDMK/kun000xxx.dmt
    Result Set Files              $DM_RUNDMK/kun00yyyy.dmt
    Result Set Reference Files    $DM_RUNDMK/kun00zzzz.dmt
VMS
    System Log File               DM$RUNDMK:SYSLOGn.DML
    Recovery Files                DM$RUNDMK:KUn000xxx.DMT
    Result Set Files              DM$RUNDMK:KUn00yyyy.DMT
    Result Set Reference Files    DM$RUNDMK:KUn00zzzz.DMT
where the variable parts of the file names signify the following values:

Variable    Equals
n           the kernel number
xxx         000:511
yyyy        1024:1535
zzzz        512:1023
Result Set Files and Result Set Reference Files are not used in recovery but still should be
removed for the Kernel to operate properly once restarted. Note that “removed” does not
necessarily mean “deleted.” You may want to rename the files to another location in case
you change your mind about forgoing recovery.
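Setting the files aside rather than deleting them can be scripted. The following Python sketch is an illustration only. It assumes that n is substituted into the file names as a plain decimal number and that the lower-case UNIX/NT spellings apply; the holding directory and both function names are hypothetical. Check the names it generates against the files actually present in your DM_RUNDMK directory before relying on anything like it.

```python
import shutil
import tempfile
from pathlib import Path


def recovery_file_names(kernel_number):
    """Candidate file names built from the templates listed above.

    Assumption: n is substituted as a plain decimal number and the
    lower-case UNIX/NT spellings are used.
    """
    n = kernel_number
    names = [f"SYSLOG{n}.dml"]
    names += [f"ku{n}000{x:03d}.dmt" for x in range(0, 512)]      # recovery files
    names += [f"ku{n}00{z:04d}.dmt" for z in range(512, 1024)]    # result set reference files
    names += [f"ku{n}00{y:04d}.dmt" for y in range(1024, 1536)]   # result set files
    return names


def set_aside(run_dir, holding_dir, kernel_number):
    """Move any matching files out of run_dir; return the names moved."""
    run_dir, holding_dir = Path(run_dir), Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in recovery_file_names(kernel_number):
        source = run_dir / name
        if source.is_file():
            shutil.move(str(source), str(holding_dir / name))
            moved.append(name)
    return moved


# Demonstration against a scratch directory rather than a real DM_RUNDMK.
with tempfile.TemporaryDirectory() as scratch:
    run_dir = Path(scratch) / "rundmk"
    run_dir.mkdir()
    for name in ["SYSLOG1.dml", "ku1000007.dmt", "ku1001030.dmt", "other.dat"]:
        (run_dir / name).write_text("x")
    moved = set_aside(run_dir, Path(scratch) / "held", kernel_number=1)
    remaining = sorted(p.name for p in run_dir.iterdir())
    print(moved, remaining)
```

Because the files are moved and not deleted, they can be put back if you decide to run recovery after all.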
You can also skip recovery for selected databases by simply deleting or renaming one of
the database files required for the database. The Kernel will issue an error message that a
database file is missing and that the database cannot be recovered. When skipping
recovery, you must restore the database from backup and replay the journals (see below).
Recovery from a Media Failure
A media failure means either

- a disk crashes or a file becomes damaged, or
- a faulty program or a user corrupts the data.
The first action you should take upon media failure is to back up the Definition Database
(DDB) and Record Database (RDB) files and their associated journals, even though they
may be damaged. Some of these files may be needed later, and the backup will also allow
you to start over if necessary.
Further steps to take to recover from the two types of media failure above are detailed in
the sections that follow. The success of your recovery largely depends upon your backup
strategy. For information about backup strategies, see Database Definition and
Development, “Providing Backup and Recovery Capability for a Complex SDM.”
Restoring Databases
When restoring databases, you must first consider what was damaged: the DDB, the
RDB, or both.
Only the DDB is Damaged
If the DDB is the only file in the database that was damaged, and no changes have been
made to the DDB since the last backup, take the following steps:
1. Restore the most recent backup of the DDB.
2. Restore any DDB journals backed up since the DDB backup. Normally, you would
   have only the current journals if your backup jobs back up all files when journals
   switch. If not, ensure that you don’t overwrite the current journals with the
   restored journals.
3. Replay the journals with DMJ. List all of the journal files containing
   updates to the DDB since the backup on the JOURNAL parameter. DMJ will
   determine the correct order.
Only the RDB is Damaged
Note: DMQ and FQM/BCL are not available in BASIS on Windows.
If one or more of the RDB files is damaged, but the DDB, DDB journals, and RDB
journals are undamaged, and no changes have been made to the DDB since the last
backup, take the following steps:
1. Restore the most recent backup of the RDB.
2. Restore any RDB journals created and backed up since the RDB backup. Normally
   you would have only the current journals (journal A and journal B) if your backup
   jobs are set up to back up all files when journals switch. If not, ensure that you don’t
   overwrite the current journals with the restored journals.
3. Reproduce RDB updates:
   - If all updates were done with OPEN/DB (that is, there was no use of DMR
     ACTION={DROP | CREATE | COMPRESS | DEFRAG}, HVU OPEN/DIRECT,
     DMQ ACTION=UPDATE, or other FQM/BCL updates with JOURNAL=NO), replay
     the RDB journals with DMJ. List all of the journal files containing
     updates to the RDB since the backup on the JOURNAL parameter. DMJ will
     determine the correct order.
   - If some updates but not all were done with OPEN/DB processing, a backup
     of the database should exist from a time after the non-OPEN/DB processing.
     Double-check to make sure it exists. If it does, locate the appropriate
     backups and follow steps 1 and 2 above. If not, restoration will not be
     straightforward, and consistency of data cannot be guaranteed, since the mix
     of OPEN/DB and OPEN/DIRECT operations previously performed must be
     repeated in exactly the same sequence.

     If the backup of the database does not exist, backup procedures have not
     been defined or executed properly, and you need to take great care to control
     manually how the restoration is conducted. The only way to do this is
     to replay the journals up to the sync point where non-OPEN/DB
     processing occurred, repeat the non-OPEN/DB processing, and then repeat
     these steps until complete:

     a. Using DMJ, iteratively replay the journals in order of age, beginning
        with the oldest, and perform non-OPEN/DB processing. Repeat until
        all updates since the last backup are complete.
        1. Use DMJ to replay the journals up to the first non-OPEN/DB
           sync point.
        2. Repeat the non-OPEN/DB action with HVU, DMQ, or DMR.
        3. Check results (use FINDs and displays, LOOKs, BROWSEs, or
           other facilities as necessary) to ensure that the sequence of
           operations is, to the best of your knowledge, correct.
     b. Create a backup of the whole database and reset all journals.
When complete, if all results are not satisfactory, but what you have is the best data
available, you should at least rebuild all your indexes. If the data records are suspect, a
full dump (or salvage) and reload may be required.
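When journaled (OPEN/DB) updates and non-OPEN/DB processing are interleaved, it can help to write the sequence down before touching the database. The sketch below is a planning aid under stated assumptions: the event tuples, the timestamps, and the wording of the plan steps are all hypothetical, and the code does not run DMJ, HVU, DMQ, or DMR. It only orders the work the way the iterative procedure above describes.

```python
# Planning aid only: it orders the work and does not invoke any utility.
# The event format and the plan wording are hypothetical.

def build_recovery_plan(events):
    """events: list of (timestamp, kind, description) since the last backup,
    where kind is "journaled" for OPEN/DB updates or "direct" for
    non-OPEN/DB processing. Returns an ordered list of plan steps."""
    plan = []
    pending_journaled = False
    for timestamp, kind, description in sorted(events):
        if kind == "journaled":
            pending_journaled = True
        elif kind == "direct":
            if pending_journaled:
                plan.append(f"replay journals up to sync point {timestamp}")
                pending_journaled = False
            plan.append(f"repeat non-OPEN/DB action: {description}")
            plan.append("check results")
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if pending_journaled:
        plan.append("replay remaining journals")
    plan.append("back up the whole database and reset all journals")
    return plan


# Hypothetical history since the last backup.
events = [
    ("1988-12-21 14:45", "journaled", "online updates"),
    ("1988-12-21 15:10", "direct", "DMR ACTION=COMPRESS"),
    ("1988-12-21 16:00", "journaled", "online updates"),
]
plan = build_recovery_plan(events)
for step in plan:
    print(step)
```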
Both the DDB and RDB are Damaged
If no ADM or SDM restructure changes (that is, RES-N/A or RES changes) were made
since the last backup, restore the DDB and replay DDB journals, and then restore the
RDB and replay RDB journals, as described above. Otherwise, backup procedures have
not been defined or executed properly, and you need to take great care to control
manually how the restoration is conducted.
1. Restore the DDB and journals as described above for DDBs.
2. Restore the RDB and journals as described above for RDBs.
3. For each sync point in the DDB journals, do the following:
   a. Replay the DDB journals with DMJ to the next sync point.
   b. Replay the RDB journals with DMJ to the corresponding sync point in
      the DDB. The same considerations described above for non-OPEN/DB
      processing apply here as well.
   c. Check results (use FINDs and displays, LOOKs, BROWSEs, or other
      facilities as necessary) to ensure that the sequence of operations is, to
      the best of your knowledge, correct.
When complete, if all results are not satisfactory, but what you have is the best data
available, you should at least rebuild all your indexes. If the data records are suspect, a
full dump (or salvage) and reload may be required.
Backup and Recovery of Different Versions of a Database
Use the DMJ utility’s VERSION parameters to specify a version other than the default
(version 0). Be particularly aware of the version suffixes given to the various database
file names.
Recovering a Multiple-File Database
1. If the Definition Database has been damaged, restore the most recent copy of it.
2. If the Definition Database has been modified since the backup date and Definition
   Database journals are available, use DMJ to replay the changes to the Definition
   Database.
3. Restore the backup copies of the damaged file sets.
4. Restore the necessary journals.
5. Use DMJ to replay the journals against the database. Only the changes related to the
   damaged files are replayed.
Recovering from Data Corruption
On occasion it may be necessary to remove updates from your database. This is possible
only if you have specified SAVE_BEFORE_IMAGES=YES in the JOURNAL definition,
which allows you to use the BACKOUT action with DMJ. (If you have not specified this in
your JOURNAL definition, see “Backing Out Transactions from Journals That Do Not
Have ‘Before’ Images” later in this section.)
Backing Out to a Sync Point
The simplest procedure for backing out the most recent updates in the current journal is
the following:
1. Use DMJ ACTION=SYNCLIST, JOURNAL=file spec for each journal file to find
   out which journal is currently being used.
=> DMJ ACTION=SYNCLIST,JOURNAL=file spec for journal
E1MC.OJB
21-Dec-1988 14:45:00
21-Dec-1988 14:45:50
21-Dec-1988 14:47:28
21-Dec-1988 14:48:25
21-Dec-1988 14:48:29
21-Dec-1988 14:50:29
NORMAL TERMINATION - DMJ
2. Use DMJ ACTION=BACKOUT, JOURNAL=file spec for each journal file to back
   the transactions out of the database.
=> DMJ UID=uid,UPW=upw,DB=db,ACTION=BACKOUT, +
JOURNAL=file spec for journal
All committed images up to the sync point at 21-Dec-1988
14:45:00 were backed out from the database.
NORMAL TERMINATION - DMJ
The changes you just made to the database will have been removed and the database will
be as if the updates had never been entered.
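If you keep the SYNCLIST output in a file, the sync points can be pulled out programmatically. The sketch below parses the sample listing shown in step 1. Treating every line that parses as dd-Mon-yyyy hh:mm:ss as a sync point is an assumption about the output layout, and the function name is hypothetical. The month abbreviation is matched using the current locale, so the sketch assumes English month names.

```python
from datetime import datetime

# Sample text copied from the SYNCLIST listing in step 1.
SAMPLE = """\
E1MC.OJB
21-Dec-1988 14:45:00
21-Dec-1988 14:45:50
21-Dec-1988 14:47:28
21-Dec-1988 14:48:25
21-Dec-1988 14:48:29
21-Dec-1988 14:50:29
NORMAL TERMINATION - DMJ
"""


def parse_sync_points(text):
    """Return the sync point timestamps found in SYNCLIST output, oldest first."""
    points = []
    for line in text.splitlines():
        try:
            points.append(datetime.strptime(line.strip(), "%d-%b-%Y %H:%M:%S"))
        except ValueError:
            continue  # not a timestamp line (file name or status message)
    return sorted(points)


sync_points = parse_sync_points(SAMPLE)
print(len(sync_points), sync_points[0], sync_points[-1])
```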
Backing Out to a Specific Time
You can also specify a particular synchronization point to back out to, useful when you
don’t want to back out all the way to the beginning of the journal.
1. Determine exactly which synchronization point from the SYNCLIST you
   want to back out to (see step 1 of “Backing Out to a Sync Point”). You may have to
   estimate which synchronization point represents the update you’re interested in.
2. Use DMJ ACTION=BACKOUT, JOURNAL=file spec for each journal file and
   include the TIME and DATE parameters for the sync point:
=> DMJ UID=uid,UPW=upw,DB=db,ACTION=BACKOUT, +
JOURNAL=file spec for journal,TIME=144500, +
DATE=21-Dec-88
DMJ V1 R8 870928 LIB(D630 S108 A160) [E1D]
All committed images up to the sync point at 21-Dec-1988
14:45:50 were backed out from the database.
NORMAL TERMINATION - DMJ
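Choosing the sync point in step 1 amounts to finding the latest one that is not later than the last moment the data was known to be good. The sketch below shows that selection and formats the result in the hhmmss and dd-Mon-yy style used by the TIME and DATE values in the example above. The function names and the target time are hypothetical, and the formats are inferred from that one example, so confirm them against the DMJ reference before use.

```python
from datetime import datetime


def choose_sync_point(sync_points, last_good):
    """Return the latest sync point at or before last_good, or None."""
    candidates = [p for p in sync_points if p <= last_good]
    return max(candidates) if candidates else None


def backout_parameters(sync_point):
    """Format a sync point the way the example writes TIME and DATE."""
    return {
        "TIME": sync_point.strftime("%H%M%S"),
        "DATE": sync_point.strftime("%d-%b-%y"),
    }


sync_points = [
    datetime(1988, 12, 21, 14, 45, 0),
    datetime(1988, 12, 21, 14, 45, 50),
    datetime(1988, 12, 21, 14, 47, 28),
]
# Hypothetical: the data is believed to be good until 14:46.
target = choose_sync_point(sync_points, datetime(1988, 12, 21, 14, 46, 0))
params = backout_parameters(target)
print(params)
```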
Backing Out Transactions from Restored Journals
If you discover that several days’ or weeks’ updates are bad and that the journals have
since been backed up and are in backup sets, you can still restore the journal files and
back out the transactions to a specific synchronization point.
1. Restore the journals from their backup sets into files.
2. Use DMJ to back out the transactions in these files (see step 2 of “Backing Out to a
   Specific Time”). Back out the most recent journal file first and then move “backward
   in time” until you determine that the bad data has been removed.
Backing Out Transactions from Journals That Do Not Have “Before” Images
If you did not specify that the journals save before-images, you can do an effective
backout by restoring an old copy of the database and journals and replaying forward up to
the point you would have backed out to.
Figure 10-1: REPLAY versus BACKOUT (a timeline on which data is OK up to a point and
bad after it: you can REPLAY “forward” to that point if you did not save before-images,
or BACKOUT “backward” to it if you have before-images)
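That the two directions arrive at the same database state can be checked with a toy journal. In the sketch below each journal entry carries both a before image and an after image. This layout, the time values, and the function names are assumptions for illustration and do not reflect the DMJ journal format.

```python
# Toy journal: each entry records a key, its before image, and its after image.
journal = [
    {"t": 1, "key": "a", "before": None, "after": "a1"},
    {"t": 2, "key": "b", "before": None, "after": "b1"},
    {"t": 3, "key": "a", "before": "a1", "after": "BAD"},
    {"t": 4, "key": "b", "before": "b1", "after": "BAD"},
]


def apply_image(db, key, image):
    """Write an image into the database; None means the record is absent."""
    if image is None:
        db.pop(key, None)
    else:
        db[key] = image


def replay(backup, journal, good_until):
    """Start from an old backup and apply after images up to good_until."""
    db = dict(backup)
    for entry in journal:
        if entry["t"] <= good_until:
            apply_image(db, entry["key"], entry["after"])
    return db


def backout(current, journal, good_until):
    """Start from the current database and restore before images, newest first."""
    db = dict(current)
    for entry in reversed(journal):
        if entry["t"] > good_until:
            apply_image(db, entry["key"], entry["before"])
    return db


backup = {}                               # database as of the last backup
current = {"a": "BAD", "b": "BAD"}        # database after all four updates
print(replay(backup, journal, 2), backout(current, journal, 2))
```

Replaying needs an old backup plus after images, while backing out needs only the current database plus before images. That is why BACKOUT depends on the journals having saved before-images.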
Salvage
A time may come when a database is marked as damaged and no backups exist. In this
case, one must resort to dumping and reloading the database. However, if the database is
damaged, attempts to export data may be blocked with some system error. In this case,
you may attempt a SALVAGE operation.
SALVAGE is one of the INTENT options on the OPEN command in HVU. If you open
the database with this intent, the Kernel will ignore any damaged indicators in the
database files, allowing you to attempt an export. This does not always mean the export
will succeed, however.
SALVAGE should be used only as a last resort because the export file it produces may
contain bad data, depending on the nature of the error causing the damage in the first
place. Carefully examine the export file, particularly the records you suspect may have
caused a problem.
When exporting data, the Kernel can access records in several ways:

- It can use the index of the primary key.
- It can use the index of some other field.
- It can use the record type index.

Each of these paths can be tried in order to get a successful SALVAGE dump:

- To use the primary key index, create the result set with an ORDER BY primary key.
- To use another index, create the set using an ORDER BY field.
- To use the record type index, create the set with only a FIND view.
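Working through the three access paths is a simple try-in-order pattern. The sketch below illustrates only that pattern. The export callables are placeholders for whatever export you run in HVU after building the result set each way, and they are not real product APIs.

```python
# The callables stand in for manual HVU exports; they are placeholders only.

def try_salvage_paths(paths):
    """paths: list of (label, export_callable). Try each in order.

    Returns (label, exported_records) for the first path that succeeds,
    or (None, failures) if every path raised an error."""
    failures = {}
    for label, export in paths:
        try:
            return label, export()
        except Exception as error:      # a damaged index surfaces as an error
            failures[label] = str(error)
    return None, failures


def damaged_export():
    raise RuntimeError("system error while reading index")


def working_export():
    return ["record 1", "record 2"]


label, records = try_salvage_paths([
    ("ORDER BY primary key", damaged_export),
    ("ORDER BY another indexed field", damaged_export),
    ("FIND only (record type index)", working_export),
])
print(label, records)
```

Whichever path succeeds, the exported records still need the careful examination described above, because SALVAGE may export bad data.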
If SALVAGE cannot export the data and no backups exist, the only option is to
reinitialize the database and rebuild.
Backup and Recovery Guidelines and Considerations
For more information about backing up your databases, see Database Definition and
Development, “Providing Backup and Recovery Capability for a Complex SDM.”

- Use the operating system’s utilities to manage backups.
- A site should have scheduled backups; don’t depend on the System backups.
- Guidelines for backups when you are managing them yourself are as follows:
  1. Write a script to execute DMSA to DEACTIVATE the database and DISCONNECT
     the users. Everyone must be out of the database during backups. If not, when the
     database is restored, the backup files will be damaged or unreadable.
         DMSA> ASSIGN/DB DB=dbname DEACTIVATE=YES
         DMSA> DISCONNECT KERNEL=kernel_name DB=dbname
         DMSA> EXIT
  2. Issue the appropriate operating system commands to back up the files. To identify
     these files, examine the DDL for the relevant database or run DMDDBE on the
     relevant database with ACTION=ONE and OBJECT_TYPE=FILES_ALL to extract
     the list of files.
  3. For ease of use, all database files (.DDB and .RDB) in a particular directory should
     go into the same save set.
  4. Journal files should be put in a save set different from that of the database files.
  5. Once all the database files and journal files are backed up, run DMJ to reset the
     journal files:
         DMJ DB=db_name ACTION=RESET JOURNALS=ALL
  6. When the backup is complete, reactivate the database:
         DMSA UID=xxxxx UPW=xxxxx AIDS=NO
         ASSIGN/DB DB=db_name DEACTIVATE=NO
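Steps 3 and 4 describe a grouping rule that is easy to apply to the file list obtained in step 2. The sketch below is one way to express it, assuming the .DDB and .RDB extensions named in step 3 and the .RJA and .RJB journal extensions that appear in the DMJ replay example in this chapter. The paths, the save set labels, and the function name are hypothetical, and a real database may use other file names.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions taken from this chapter's examples; verify them against the
# DDL or the DMDDBE file list for your own database.
DATABASE_SUFFIXES = {".ddb", ".rdb"}
JOURNAL_SUFFIXES = {".rja", ".rjb"}


def plan_save_sets(paths):
    """Group database files by directory and keep journals in their own set."""
    save_sets = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        suffix = path.suffix.lower()
        if suffix in DATABASE_SUFFIXES:
            save_sets[f"db:{path.parent}"].append(path.name)
        elif suffix in JOURNAL_SUFFIXES:
            save_sets["journals"].append(path.name)
    return dict(save_sets)


# Hypothetical file list.
files = [
    "/data/sales/sales.ddb",
    "/data/sales/sales.rdb",
    "/data/hr/hr.rdb",
    "/journals/sales.rja",
    "/journals/sales.rjb",
]
save_sets = plan_save_sets(files)
print(save_sets)
```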

- How often should you do a backup? The answer depends on how quickly you must
  recover. The larger the journal file, the longer the recovery process will take to
  complete.
- Journal files should be large enough that they will never fill up before
  the next backup. If they do fill up, the database will be unusable until the journals
  are reset.
- If you want to be notified when Journal A fills up, use the A_BACKUP_JOB parameter
  on the SDM JOURNAL statement. Use this parameter to send a
  message to yourself when the journal needs to be backed up and reset. The same
  should be done for Journal B using the B_BACKUP_JOB parameter.
  Note: A_BACKUP_JOB and B_BACKUP_JOB parameters and DMDBA screen
  mode are not available on Windows. Journal backups must be performed using
  Windows operating system utilities.

- When a recovery is necessary, restore the most recent backup set of the .DDB and
  .RDB files, but not the journals. The journal files on the disk are the most current.
  Use them to replay the journals (via DMJ) to bring the database back into sync.
- Run DMJ:
      DMJ UID=xxxxx UPW=xxxxx DB=db_name ACTION=REPLAY +
      JOURNALS=db_name.RJA db_name.RJB
- The file sets should be backed up whenever changes are made to the DDB that cause
  the following message to be displayed: “The file sets that have changed should be
  backed up so that media recovery can be done successfully.” Some changes can be
  made to a Definition Database that will make the most recent backup unusable if it is
  needed for recovery. An example of this would be changing from a numeric field to
  a larger character field in DMDBA.
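The guideline that journals must never fill up before the next backup can be turned into a rough estimate. The calculation below is back-of-the-envelope arithmetic, not a product formula: every input is a site-specific estimate, the workload figures are hypothetical, and the safety factor of 2 is an arbitrary illustrative default.

```python
import math


def minimum_journal_size(bytes_per_update, updates_per_hour, hours_between_backups,
                         safety_factor=2.0):
    """Rough lower bound on journal capacity, in bytes.

    All inputs are site-specific estimates; the default safety factor is
    illustrative, not a product recommendation.
    """
    expected = bytes_per_update * updates_per_hour * hours_between_backups
    return math.ceil(expected * safety_factor)


# Hypothetical workload: 2,000 updates per hour, about 1,500 bytes journaled
# per update, backups every 24 hours.
size = minimum_journal_size(1500, 2000, 24)
print(f"{size / 1_000_000:.0f} MB")
```

The same numbers show the trade-off noted above: backing up less often requires larger journals, and larger journals take longer to replay during recovery.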