NoCOUG Fall Conference 2005 High Availability & Disaster Recovery: A 360o View Alok Pareek Agenda Introduction High Availability - 2005 Industry Shift from MTTF to MTTR Understanding transaction failure models Evaluating HA technologies Real-World HA Examples About GoldenGate Questions & Answers Speaker Introduction/Background GoldenGate Software Architect Transactional Data Management and High Availability Solutions 10 years in Oracle kernel development Data Storage Technologies Recovery and High Availability Group High Speed Data Movement Key Features Cross platform/Heterogeneous transportable tablespaces (10g) Multi threaded redo generation (9.2) Implementation of multiple block size caches (9.0) (TSPITR) Tablespace point in time recovery (8.0) Owner of LGWR process (redo generation) algorithms (8i 10.2) High Availability, Disaster Recovery Definition Ratio of system uptime to sum of uptime and downtime Availability = MTTF/(MTTF+MTTR) Challenges Addressing Performance vs. Reliability in computer systems Hardware Faults, Software Bugs, Human errors are realities in any complex system deployment Enterprise applications need to function 24x7 Disasters are no longer a distant threat Inadequate planning to handle outages Shift in industry (and academic research) focus From fault tolerance to reducing MTTR HA/DR – Systematic View Database 1 2 3 Unplanned outage System Failure Data Failure •Node death •Power failure •Physical Media •Logical corruption Active Planned outage System Changes Data Changes •Upgrades •Migrations •Maintenance 1) High Availability Concerns – No Outage Database 1 • Performance • DSS vs. OLTP conflicting requirments • Mixed workload • Data validation • Data Transformation Active 2) Availability Concerns – Unplanned Database Unplanned outage Common Approaches • Database Restore/Recovery System failure • RAID • Shared Disk Clusters Data failure • Standby database Understanding Database Failures Failure points Statement Process Instance Database Site Failure types Physical (Media, corruption, inconsistency amongst redundant copies) Logical (Incorrect DML, out-of-synch, accidental table drop) Failure Handling Automatic Manual 3) Availability Concerns – Planned Outage Database Planned Outage • Selected windows of downtime System Changes • Phased approach to maintenance Data Changes Key HA Evaluation Criteria Availability Is the failover/DR solution available for real use? MTTR (RTO) In the event of a failure, how soon can the data be recovered? Performance Speed and support for high transaction volumes Data Loss (RPO) What is the impact of an unplanned outage in terms of lost data? Manageability Configuration, Install, Monitoring Impact on Deployed Systems How intrusive? What is the impact on data itself? Cost Licensing, Maintenance, Deployment Differentiating HA Technologies Conventional Backup/Recovery Block Level Database Replication RAID multiple hard disks behaving as a single large fast drive Change Level Database Replication (mirrors/stripes/duplexing/parity) Snapshots Roll Forward / File Protection Remote Mirroring Solutions Transactional Data Management High Availability and Disaster Recovery HA Technologies & Tradeoffs Block based database replication Standby kept in constant recovery (inactive) mode Useful for strict disaster recovery only, not HA Cannot be used for reporting in recovery mode No write access for distributed load balancing Application response times suffer after failover Cannot address availability across heterogeneous systems Change based database replication Trigger or log based Not optimized for real time performance Intrusive, Complex Cannot address availability across heterogeneous systems HA Technologies & Tradeoffs Remote mirroring solutions Volume managers maintain mirrors of local writes on a set of remote volumes Useful for file protection Physical distance to remote volumes is a critical limitation No protection from logical corruption, or storage stack corruption Message based logical writes sent by primary host over IP to remote hosts (synchronously/asynchronously) Write ordering must be maintained by primary host Remote volumes are standby-only, applications cannot access them No protection from logical corruption Hardware based Storage arrays propagate IOs to storage arrays at a secondary site Secondary arrays are inaccessible during replication No protection from logical corruption Only useful for block availability during DR HA Technologies & Tradeoffs Transactional Data Management Addresses low-latency in HA hybrid computing environments (built on 1 Safe) Management of transactional streams -- captures, transforms, routes, delivers and verifies data transactions in real time Real time, heterogeneous, data integrity, low impact Use cases for HA, DR, data integration, distributed computing Not for file-level replication Log-based data capture Source (can be selective) Transform (if needed) Route Apply Target Sub-second latency HA/DR – Solution Examples Database 1 2 3 Unplanned outage System Failure Data Failure •Node death •Power failure •Physical Media •Logical corruption Active Planned outage System Changes Data Changes •Upgrades •Migrations •Maintenance Real-World HA Configuration (Active) Primary “Master” Active • Uni/Bi-directional configuration – if one system fails, the other is immediately available and ready to take over. • For… Secondary “Master” Live/Active • Highest Availability • Maximized ROI on hardware (transaction balancing) •Data Centers with Active/Passive • Example areas: • 24x7 (ATMs, Online Banking) • Online Retail •Real-time Analytics, Warehousing Real-World HA Configuration (Transaction Off-Loading) Processing Commit Transactions Active • Improve availability and performance of transaction processing by offloading query load to lower-cost DBs/platforms • For… • Horizontal scalability • Improved performance Live/Active • Example areas: • Online Reservations • Online Lookups Lookup Query Handling Real-World DR Configuration (Failover) Database Active • An HA implementation that captures and applies data to a failover system in real time. Unplanned outage • For… • Fast failover (No restore) • Do root-cause analysis later! • Surgical Repair (Dynamic, Selective undo) System failure •Example areas: • 24x7/mission-critical applications • Strict SLA requirements Data failure Failover Database Real-World Configuration (Switchover) Current Database • Zero-Downtime Migrations • Rolling Upgrades • Zero-Downtime Maintenance Planned Outage • For… • 24x7 availability • Reduced windows for system maintenance • Example areas: • Can’t afford downtime to do in-place upgrade System Changes Data Changes “Switchover” Database About GoldenGate Software GoldenGate Software is a privately held software company that offers Transactional Data Management solutions. 250 customers... 1500+ solutions implemented… in 35 countries Established, Loyal Customer Base Leading Industry Solutions 18,000 Node ATM Network with 24/7 Availability Achieving paperless enterprise for this visionary healthcare provider Saving $ millions with real-time DW and zero downtime migrations. Database tiering handles average of 300,000 updates/hour, peaks at 800,000/hour GoldenGate & Transactional Data Management Provides guaranteed capture, routing, transformation, delivery, and verification of data transactions across heterogeneous environments in real time. TDM must be: Real time Moves with sub-second latency Heterogeneous Moves transactions across different databases and platforms Transactional Maintains transaction integrity GoldenGate differentiates on: Performance Handles thousands of transactions per second with very low impact on IT systems Extensibility & Flexibility Open architecture to meet demanding customer needs and data environments Reliability Supports continuous operations and availability How GoldenGate Works Capture: Committed changes are captured as they occur by reading the transaction logs. Trail files: Stages and queues data for routing. Route: Data is compressed, encrypted for routing to targets. Delivery: Applies transactional data with guaranteed integrity. GoldenGate Veridata™: Reports on data discrepancies between in-use DBs TDM & HA Evaluation Criteria Availability Standby available for reporting, logical repair, load balancing, failover with warm cache MTTR No restore required Proportional to runtime cost of transactions in last log write (~ seconds) Performance Near zero time latency Ship only committed transactions Data Protection / Loss Redo validation using SQL Apply No Loss (read access to last IO in current log) Manageability Impact Director GUI, CLI, STATS Low impact on deployed systems Metadata outside the database Questions & Answers Contact Information: Alok Pareek www.goldengate.com (415) 777-0200 apareek@goldengate.com 301 Howard Street, Suite 2100, San Francisco CA 94105 Lag points address Logical failures One-to-Many One-to-One Source Source Target Target Target Target (current) (1 hour lag) (1 day lag)