After Imaging
The DBA’s Best Friend

A Few Words About The Speaker
• Tom Bascom
• Progress® User since 1987
• White Star Software, LLC
• DBAppraise®, LLC
• Consulting Services related to Progress Databases and Application Architecture.
tom@wss.com
tom@dbappraise.com

What is it? and Why Do I Need it?

What is After-Imaging?
• A journal of transaction “notes” that can be replayed against a baseline backup to restore a database to the last completed transaction, a point in time, or a specific transaction number.
• This is the same concept that some other databases refer to as the “redo log”.
• Differs from the before-image file (undo log) in that space is not reused without interaction or scripting.*
* The 10.1B AI Archiver improves this.

Why do I need after-imaging?
• Protection from media loss, such as bad tapes, a crashed disk, a destroyed data center or stolen servers…

I have backups. Do I still need after-imaging?
• With a backup your potential exposure to data loss is the entire time period between backups.
• For example, if you do nightly backups and your disk crashes at 4:45pm you restore from backup and lose an entire day of work. If you have one or more bad tapes your data loss could be much worse.
• With after-imaging you restore the same backup, roll forward your archived ai files and lose only uncommitted transactions.

Why else do I need after-imaging?
• Protection from human errors:
    $ cd /db
    $ rm *

    for each customer:
      delete customer.
    end.

    for each order:
      delivered = yes.
    end.

    $ vi dbname.db
    …
    :x
• Human error is at least as big a risk as hardware problems.

Isn’t AI the same as disk mirroring?
• No, disk mirrors will happily delete both copies of your deleted database.
• Or delete all of your customers on both mirrors.

Or an Audit Log?
• No, an audit log cannot be replayed to reconstruct the missing data.

I have OpenEdge Replication. Do I still need after-imaging?
• OE Replication is a super-set of after-imaging.
  You still must configure and manage after-imaging.
• After-imaging still provides an additional layer of protection, even with OE Replication in place.
• OE Replication is aggressively real-time. You cannot build in a time delay like you can with after-imaging.

Are there downsides to after-imaging?
• It is not automatically enabled.
• You must manage archived logs.
• Recovery is not automated.

What about performance?
• There might be a very small penalty.
• But you can usually only measure it under extremely high loads.

Loss Prevention Strategies

SLA: Days
  Data Loss Strategy: Nightly Backups
    • Simple & Inexpensive
  Hardware Loss Strategy: Service contract
    • Relatively low % of system cost

SLA: Hours
  Data Loss Strategy: Multiple online backups during day
    • More files to keep track of
  Hardware Loss Strategy: Contract with same-day, on-site repair
    • More expensive, a long time to wait

SLA: Many Minutes
  Data Loss Strategy: After Imaging
    • Moderately complex scripting
    • Monitoring becomes more critical
    • Skilled DBA is helpful
  Hardware Loss Strategy: Some redundant HW
    • SAN with RAID
    • Spare parts kept onsite

SLA: A few Minutes
  Data Loss Strategy: After Imaging
    • Complex scripting
    • Monitoring becomes more critical
    • Skilled DBA is important
  Hardware Loss Strategy: Warm spare server
    • Twice the cost of production HW
    • Ideally in a remote facility
    • Additional DB licensing costs

SLA: Seconds
  Data Loss Strategy: OpenEdge Replication
    • Much more complex
    • Skilled DBA is critical
    • Monitoring extremely critical
  Hardware Loss Strategy: Hot spare server & automated fail-over
    • Twice the cost of production HW
    • Ideally in a remote facility
    • Additional DB licensing costs
    • Additional OS & 3rd party SW costs

Balancing Cost vs Lost Data
[Chart: Hypothetical Relative Costs of Different SLAs. Vertical axis: cost, $0 to $1,000. Horizontal axis: Days, Hours, Many Minutes, Few Minutes, Seconds.]

How Does After-Imaging Work?

How does after-imaging work?
[Diagram: database (DB), BI file, and AI log extents .a1 to .a4]
    probkup dbname dbname.pbk
First, make a backup!
[Diagram: shared memory with the BIW and AIW writers, database (DB), BI file, and AI log extents .a1 to .a4]

AI extents: .a1 busy, .a2 empty, .a3 empty, .a4 empty
    rfutil dbname -C aimage begin
Then, enable after-imaging, start the database and start an AI Writer. Extent .a1 will be “busy”.

AI extents: .a1 full, .a2 busy, .a3 empty, .a4 empty
    rfutil dbname -C aimage new
Switch extents. Extent .a1 will be marked “full” and extent .a2 will become “busy”.

AI extents: .a1 full, .a2 full, .a3 busy, .a4 empty
    rfutil dbname -C aimage new
Switch extents again. Extent .a2 will be marked “full” and extent .a3 will become “busy”.

AI extents: .a1 full, .a2 full, .a3 full, .a4 busy
    rfutil dbname -C aimage new
Once more, switch extents. Extent .a3 will be marked “full” and extent .a4 will become “busy”.

AI extents: .a1 full, .a2 full, .a3 full, .a4 busy
    rfutil dbname -C aimage new
Switch… Oops! There are no “empty” extents! All after-image extents are either “full” or “busy”!

AI extents: .a1 full, .a2 full, .a3 full, .a4 busy
Archived logs: .001, .002, .003
Copy full extents… Use the extent sequence number to name them.

AI extents: .a1 empty, .a2 empty, .a3 empty, .a4 busy
Archived logs: .001, .002, .003
    rfutil dbname -C aimage extent empty
Mark the full extents as “empty”.

AI extents: .a1 busy, .a2 empty, .a3 empty, .a4 full
Archived logs: .001, .002, .003
    rfutil dbname -C aimage new

AI extents: .a1 busy, .a2 empty, .a3 empty, .a4 full
Archived logs: .001, .002, .003, .004
    ai.sweep
AI extents: .a1 full, .a2 busy, .a3 empty, .a4 empty
Archived logs: .001, .002, .003, .004, .005
    ai.new
    ai.sweep

AI extents: .a1 empty, .a2 full, .a3 busy, .a4 empty
Archived logs: .001, .002, .003, .004, .005, .006, …
    ai.new
    ai.sweep

How do I use after-imaging to recover?
• Restore from backup. The preferred method is to restore to a dedicated recovery area. DO NOT DESTROY a damaged database without first backing it up.
• Determine where to recover to (point in time, transaction id, last archived ai extent...)
• Obtain the archived ai extents from the backup point through to the recovery point.
• Roll forward the archived extents:
    rfutil dbname -C roll forward [endtime yyyy:mm:dd:hh:mm:ss] -a archiveExtent
    ai.roll dbname startExtent [endExtent]

How do I recover using AI?
[Diagram: database (DB), BI file, AI extents .a1 to .a4, and archived AI logs in /ailogs: .001, .002, .003, .004, .005, .006, …]
    prorest dbname dbname.pbk < backup.list
    rfutil dbname -C roll forward -a /ailogs/dbname.001

    rfutil dbname -C roll forward -a /ailogs/dbname.002

    rfutil dbname -C roll forward -a /ailogs/dbname.003
    …

Post-recovery…
• Remember to enable after-imaging. It is disabled on the roll-forward target!

What is “Log Based Replication”?
• Log Based Replication is a fancy name for using after-image files (“logs”) to maintain a copy of your database.
• Uses for Log Based Replication:
  – Verified Backup: make sure that your archived AI files are valid.
  – Reporting Database: use “norecover” to create a reporting database.
  – Warm Spare: keep a copy of your database (almost) ready to go in failover mode.

How does Log Based Replication work?
[Diagram: target database (DB) and BI file, with a staging directory /stg and an archive directory /arc. Each AI log that arrives in /stg is rolled forward and then moved to /arc.]

    rfutil dbname -C roll forward -a /stg/dbname.001
    mv /stg/dbname.001 /arc/dbname.001

    rfutil dbname -C roll forward -a /stg/dbname.002
    mv /stg/dbname.002 /arc/dbname.002

    rfutil dbname -C roll forward -a /stg/dbname.seq#
    mv /stg/dbname.seq# /arc/dbname.seq#

What about the New! AI Archiver?
• The ai archiver is a daemon that automates extent switching and archiving.
• New startup parameters allow you to start, stop and configure the ai archiver.
• Does not handle off-site archiving, redundant archiving, compression or purging of archived logs.
• Uses a hideous file naming convention.
• Does not handle recovery.
• Does not handle monitoring or alerting.

AI Archiver (and some other loosely related features)

Command, followed by its purpose:
    proutil dbname -C enableaiarchiver
        Enable ai archiver (offline).
    probkup online dbname -enableaiarchiver -aiarcdir dir -aiarcinterval n [-aiarcdircreate]
        Enable ai archiver (online).
    rfutil dbname -C aiarchiver setarcdir <dir-list>
        Set or change archive directory(s).
    rfutil dbname -C aiarchiver setinterval #
        Set or change archive interval (seconds; 120 to 86400).
    prostrct addonline dbname [st-file-name]
        Add extents online.
    probkup online dbname backupFile -enableai
        Enable after-imaging online.

Practical Matters

How often should I switch extents?
• How much data can you afford to lose?
  – Can users re-enter 5 minutes of data? 15? 60?
  – Can you “replay” external transactions? (EDI interfaces and so forth…)
• Is your workload the same 24x7?
  – Do the answers above vary between a “batch window” and “online activity”?
  – How about weekends and holidays?
• I often find hourly switches at night and every 15 minutes during the day to be a good starting point.
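A schedule like the starting point above can be encoded in a small shell test that a cron job runs every 15 minutes. This is a minimal sketch, not the author's script: the function name `switch_due` is hypothetical, and "day" is assumed to mean Monday to Friday, 09:00 to 16:59. Adjust both to your own workload.

```shell
#!/bin/sh
# switch_due HOUR MINUTE DOW
# Hypothetical helper: succeeds (exit 0) when an extent switch is due.
# Assumption: "day" is Mon-Fri 09:00-16:59; everything else is "night".
# DOW follows `date +%u` (1 = Monday ... 7 = Sunday).
switch_due() {
    # Strip one leading zero so values such as "09" compare as decimal.
    hour=${1#0};   hour=${hour:-0}
    minute=${2#0}; minute=${minute:-0}
    dow=$3
    if [ "$dow" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 17 ]; then
        # Daytime: switch every 15 minutes.
        [ $((minute % 15)) -eq 0 ]
    else
        # Nights and weekends: switch on the hour only.
        [ "$minute" -eq 0 ]
    fi
}

# Intended use from a job that cron starts every 15 minutes
# (the switch command itself is site-specific and not shown):
#   switch_due "$(date +%H)" "$(date +%M)" "$(date +%u)" && echo "switch now"
```

Keeping the policy in one function means the cron entry stays simple and the day and night rules can be changed in one place.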
How should I set up after-imaging?
• Add ai extents:
    prostrct add dbname ai.st
    -or-
    prostrct addonline dbname ai.st

    # ai.st
    a /ai
    a /ai
    a /ai
    a /ai
• How many extents?
  – 4 is the absolute minimum:
    • 1 busy, 1 full, 1 empty (plus 1 “locked” if using OE Replication).
  – 8 is my recommended default:
    • The “extras” give you time to react to issues.
  – 16 is my suggested maximum; more is just awkward.

Should I use fixed or variable extents?
• Variable Length
  – More flexible.
  – Simpler scripting.
  – Easier monitoring.
  – More time to correct problems.
• Fixed Length
  – Many legacy implementations still use them.
  – Fixed might be appropriate for very high volume sites.
• Recommendation: Use variable length extents.

How much disk space do I need?
• How much BI space do you use? (How many bi clusters do you close in a period of time?)
• How many archived logs should you keep online?
• Do you keep disk images of backups online?
• What about off-site copies of backups and archived logs?
• Do you plan to recover to dedicated recovery disk space or “on top of” the existing database?

What sort of disks should I use for AI?
• Dedicated disks.
  – The primary job of after-imaging is to protect against media failure.
  – Storing after-image files on the same disks as the data extents nullifies that protection!
• RAID5 (parity) is probably not your best option:
  – After-Imaging is, essentially, write-only.
  – RAID5 disks are performance-challenged when writing.
• RAID10 (mirrored stripes) is probably not beneficial:
  – After-Imaging writes are sequential.
• RAID1 (mirroring) is the best choice.
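The disk-space questions above reduce to simple arithmetic once you know how many bi clusters you close per day and how long you keep archived logs online. The sketch below is only a rough estimate: the function name `archive_space_mb` is hypothetical, and it assumes that AI volume is about the same as the BI clusters closed in the same period. The figures passed in are illustrative.

```shell
#!/bin/sh
# archive_space_mb CLUSTERS_PER_DAY CLUSTER_MB DAYS_ONLINE
# Rough estimate, in MB, of the disk needed for archived AI logs.
# Assumption: AI volume roughly equals the BI clusters closed per day.
archive_space_mb() {
    echo $(( $1 * $2 * $3 ))
}

# Illustrative figures: 50 clusters per day, 16MB clusters, 40 days online.
archive_space_mb 50 16 40    # prints 32000
```

Treat the result as a floor rather than a budget: zipped copies, staging areas and verified copies each need their own space.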
AI Implementation Worksheet

Item                        FileSystem   Description
Extent Switching Schedule                M-F, 9-5: every 15 minutes; hourly otherwise
Number & Type of Extents                 8, variable, dedicated RAID 1 disks
AI Extents                  /ai          8GB (~50 16MB bi clusters per day = 800MB)
Archived Logs               /ailog       32GB (40 days)
                            /aizip       16GB, zipped logs
                            /aistg       8GB, staging area for logs waiting to be verified
                            /aiver       32GB, archive of verified logs
Verified Backup             /aitest      125GB
Backup Strategy             /backup      250GB, backup -norecover from /aitest to disk, then tape
Offsite Archives            /ailog       scp logs to remote server X, 32GB (40 days)
Recovery Strategy           /recover     250GB (current production db size x 2.5)
Warm Spare Strategy                      X is an offsite mirror of prod, apply offsite logs continuously
Report Server               /reports     125GB, restored from /backup nightly

How do I start after-imaging?
• Backup:
  – probkup is simpler because it marks the db as “backed up”.
  – OS backups require an extra manual step:
      rfutil dbname -C mark backedup
• Enable After Imaging:
    rfutil dbname -C aimage begin
• Start an AI Writer (AIW):
    proaiw dbname

How do I manage after-imaging?

Script     AI Archiver   Description
ai.new     Yes           Switches to the next available empty extent.
ai.sweep   Partial       Copies full extents to (multiple, redundant and possibly remote) archive locations. (The AI Archiver only copies archived extents to a single location on the same server.)
ai.roll    No            Rolls forward a set of AI logs against a database. Simplifies roll-forward by grouping files and ignoring “wrong extent” warnings.
ai.purge   No            Purges old archived extents.
ai.warm    No            Applies AI logs that appear in a staging directory to a target database. Used to maintain warm spares and verified backup databases.
ai.ready   No            Checks a warm spare or verified backup database to ensure that AI logs are being properly applied.
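To make the ai.sweep row above more concrete, here is a minimal sketch of a sweep loop. It is not the author's ai.sweep: the function names, the `RFUTIL` override and the single archive directory are assumptions for illustration. It also assumes that `aimage extent full` prints the path of the oldest full extent and prints no usable path once nothing is full, and that the sequence number appears as the sixth field of the last "number" line of `aimage scan` output.

```shell
#!/bin/sh
# Sketch only. RFUTIL can be overridden, e.g. for testing with a stub.
RFUTIL=${RFUTIL:-rfutil}

# archive_name DB SEQ -> e.g. "sports.007" (hypothetical naming helper).
# Names the archived log after the database and extent sequence number.
archive_name() {
    printf '%s.%03d\n' "$(basename "$1")" "$2"
}

# sweep_full_extents DB ARCDIR [SCAN_DB]
# Copies each full extent to ARCDIR, then marks it empty.
# SCAN_DB is the database name passed to "aimage scan"; defaults to DB.
sweep_full_extents() {
    db=$1
    arcdir=$2
    scan_db=${3:-$db}
    # Assumption: no path is printed when there is no full extent left.
    while ext=$("$RFUTIL" "$db" -C aimage extent full 2>/dev/null) &&
          [ -f "$ext" ]
    do
        seq=$("$RFUTIL" "$scan_db" -C aimage scan -a "$ext" |
              grep number | tail -1 | awk '{print $6}')
        [ -n "$seq" ] || return 1
        # Never mark an extent empty unless the copy succeeded.
        cp "$ext" "$arcdir/$(archive_name "$db" "$seq")" || return 1
        "$RFUTIL" "$db" -C aimage extent empty "$ext" || return 1
    done
}
```

A production script would also copy to redundant and remote locations, log what it did, and alert on failure; those parts are deliberately left out here.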
After-Imaging on UNIX
    # crontab (source server)
    #
    1,16,31,46 * * * * ai.new cs608 base callb callr invpr >> /logs/ai.log 2>&1
    2,17,32,47 * * * * ai.sweep cs608 base callb callr invpr >> /logs/ai.log 2>&1
    0 20 * * * ai.purge cs608

    # crontab (target server)
    #
    10,25,40,55 * * * * ai.warm cs608 base > /dev/null
    0 * * * * ai.ready cs608 base callb callr invpr > /tmp/ai.ready.log
    0 20 * * * ai.purge cs608

How should I monitor after-imaging?
• After-imaging should be enabled.
• Busy extents should be 1.
• Full extents should be less than or equal to 2.
• Empty extents should be “most of them”.
• The last messages in the .lg file of a replicated database should be:
    (662)Roll forward completed.
    (334) rfutil -C roll forward session end.
  (with appropriately recent date and time stamps.)

Troubleshooting

Extents Stop Switching
• You may have disabled cron, the cron job or the ai archiver (if you are using it).
• Or you may have introduced a scripting error.
• You may have run out of disk space somewhere.
• With variable extents in use and “large files” enabled disk space becomes the limiting factor. You have more time to detect, respond to and fix the problem.
• With fixed extents the database may stall or crash much sooner.
• If you are out of ideas try a manual extent switch.

Roll Forward Fails
• You may have guessed the wrong extent. This is harmless. Try another. The message in the .lg file tells you which sequence# you need.
• An archived extent might be missing or damaged. Find a valid copy and try again. This is a good reason to make redundant copies of ai logs.
• A more serious error may have occurred. Read the .lg file and check out the error on PSDN if necessary. Use “roll forward retry” after correcting the error.

Opening a Replication Target
• Once you start a server or open a single-user session against a replication target you cannot roll forward any more logs.
• Even if you change no data.
• You can, however, safely start a -RO session.
• If someone opens the database you will need to re-initialize the replication target.

Forgetting to Enable After-Imaging
• Usually happens after a conversion or a recovery/fail-over.
• Add extents online (if necessary).
• probkup and enable ai online.
• Re-initialize your replication targets.

(Re-)Initializing a Replication Target
• Move any accumulated staged ai logs to a temporary directory.
• Obtain a backup of the source database.
• Restore the backup on the target server.
• Transfer the 1st needed ai log and all subsequent logs to the staging directory.
  – An incorrect log will result in a message in the .lg file that identifies the needed sequence#.

Why re-initialize?
• Failing back from fail-over recovery to your warm spare.
• Someone accidentally opened your replication target.
• After-imaging was deliberately disabled for some reason.
• Dump and load.

Disabling After-Imaging
• There are not many good reasons to disable after-imaging. This should be very rare.
• Among the possible reasons:
  – Dumping and loading.
  – Large, write-intensive processes that can be restarted.
• If you must disable after-imaging:
  – Backup and be prepared to restore.
    • Allowing users to have access in this period is often not compatible with being able to restore from backup.
  – Do what needs to be done.
  – Re-enable after-imaging.
  – Re-initialize any replication targets.
• The actual commands are in the documentation.

Tricks!
• Getting the next “full” extent:
    EXTENT=`$DLC/bin/_rfutil ${DB} -C aimage extent full`
• Getting an extent’s sequence number:
    SEQ=`rfutil ${DUMMY} -C aimage scan -a ${EXTENT} | grep number | tail -1 | awk '{print $6}'`
• Using the verification database for backups:
    probkup dbname dbname.pbk -com -norecover < backup.list
• Using the backed up verification database for reporting:
    prorest dbname dbname.pbk < backup.list

Conclusion

After-Imaging Best Practices
• Enable after-imaging on all updateable databases.
• Place after-image extents on separate disks from data extents.
• Use 8 to 16 variable extents with “large files” enabled.
• Run an AIW.
• Switch extents as often as the business needs you to.
• Use the sequence number when naming archived logs.
• Copy archived logs to a remote location ASAP.
• Verify your process by continuously rolling forward.
• Monitor your “empty” and “full” extents.
• Keep at least 30 days of archived after-image logs.
• Establish a dedicated backup and recovery directory.
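The monitoring advice (exactly one busy extent, no more than two full) is easy to automate. The following is a minimal sketch under stated assumptions: the function name `check_extents` is hypothetical, and it expects one extent status per line on standard input (busy, full, empty or locked). How you extract those statuses from your database is site-specific and not shown.

```shell
#!/bin/sh
# check_extents: reads one extent status per line on stdin.
# Hypothetical helper. Prints "OK" and exits 0 when exactly one extent
# is busy and at most two are full; otherwise prints an alert per
# violated rule and exits 1.
check_extents() {
    awk '
        NF { n[tolower($1)]++ }          # count each status, ignoring case
        END {
            if (n["busy"] != 1) { print "ALERT: busy=" n["busy"] + 0; bad = 1 }
            if (n["full"] > 2)  { print "ALERT: full=" n["full"] + 0; bad = 1 }
            if (!bad) print "OK"
            exit bad + 0
        }'
}

# Example with four extents in a healthy state:
printf 'full\nbusy\nempty\nempty\n' | check_extents    # prints OK
```

The non-zero exit status makes it straightforward to hook the check into whatever alerting you already run from cron.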