Company Overview

NoCOUG Fall Conference 2005
High Availability & Disaster Recovery: A 360° View
Alok Pareek
Agenda
Introduction
High Availability - 2005
Industry Shift from MTTF to MTTR
Understanding transaction failure models
Evaluating HA technologies
Real-World HA Examples
About GoldenGate
Questions & Answers
Speaker Introduction/Background
GoldenGate Software Architect
Transactional Data Management and High Availability Solutions
10 years in Oracle kernel development
Data Storage Technologies
Recovery and High Availability Group
High Speed Data Movement
Key Features
Cross platform/Heterogeneous transportable tablespaces (10g)
Multi threaded redo generation (9.2)
Implementation of multiple block size caches (9.0)
(TSPITR) Tablespace point in time recovery (8.0)
Owner of LGWR process (redo generation) algorithms (8i to 10.2)
High Availability, Disaster Recovery
Definition
Ratio of system uptime to sum of uptime and downtime
Availability = MTTF/(MTTF+MTTR)
Challenges
Addressing Performance vs. Reliability in computer systems
Hardware faults, software bugs, and human errors are realities in any complex system deployment
Enterprise applications need to function 24x7
Disasters are no longer a distant threat
Inadequate planning to handle outages
Shift in industry (and academic research) focus
From fault tolerance to reducing MTTR
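The availability formula above can be worked through numerically. The following is a minimal sketch; the MTTF and MTTR figures are hypothetical and chosen only to show why shortening MTTR can matter more than lengthening MTTF.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return MTTF / (MTTF + MTTR) as a fraction between 0 and 1."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# Hypothetical figures, chosen only to illustrate the trade-off.
baseline = availability(mttf_hours=1000, mttr_hours=4)
doubled_mttf = availability(mttf_hours=2000, mttr_hours=4)
tenth_mttr = availability(mttf_hours=1000, mttr_hours=0.4)

for label, a in [("baseline", baseline),
                 ("2x MTTF", doubled_mttf),
                 ("MTTR / 10", tenth_mttr)]:
    print(f"{label:10s} availability={a:.5f} "
          f"downtime/year={annual_downtime_hours(a):.1f} h")
```

With these assumed inputs, cutting MTTR by a factor of ten yields higher availability than doubling MTTF, which is the intuition behind the shift in focus noted above.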
HA/DR – Systematic View
Database
1) Active (no outage)
2) Unplanned outage
System Failure
•Node death
•Power failure
Data Failure
•Physical Media
•Logical corruption
3) Planned outage
System Changes
Data Changes
•Upgrades
•Migrations
•Maintenance
1) High Availability Concerns – No Outage
Database (Active)
• Performance
• DSS vs. OLTP conflicting requirements
• Mixed workload
• Data validation
• Data Transformation
2) Availability Concerns – Unplanned
Database
Unplanned outage
Common Approaches
• Database Restore/Recovery
System failure
• RAID
• Shared Disk Clusters
Data failure
• Standby database
Understanding Database Failures
Failure points
Statement
Process
Instance
Database
Site
Failure types
Physical (Media, corruption, inconsistency amongst redundant copies)
Logical (Incorrect DML, out-of-sync, accidental table drop)
Failure Handling
Automatic
Manual
3) Availability Concerns – Planned Outage
Database
Planned Outage
• Selected windows of downtime
System Changes
• Phased approach to maintenance
Data Changes
Key HA Evaluation Criteria
Availability: Is the failover/DR solution available for real use?
MTTR (RTO): In the event of a failure, how soon can the data be recovered?
Performance: Speed and support for high transaction volumes
Data Loss (RPO): What is the impact of an unplanned outage in terms of lost data?
Manageability: Configuration, Install, Monitoring
Impact on Deployed Systems: How intrusive? What is the impact on data itself?
Cost: Licensing, Maintenance, Deployment
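The MTTR (RTO) and Data Loss (RPO) criteria above can be measured for a single outage. The sketch below assumes three timestamps are known after the fact; the function names and the sample times are hypothetical and only illustrate the two definitions.

```python
from datetime import datetime, timedelta


def achieved_rpo(failure_at: datetime,
                 last_replicated_commit_at: datetime) -> timedelta:
    """Data-loss window: work committed after the last replicated commit."""
    return max(failure_at - last_replicated_commit_at, timedelta(0))


def achieved_rto(failure_at: datetime,
                 service_restored_at: datetime) -> timedelta:
    """Recovery time: how long the service was unavailable."""
    if service_restored_at < failure_at:
        raise ValueError("service cannot be restored before it failed")
    return service_restored_at - failure_at


# Hypothetical outage timeline.
failure = datetime(2005, 11, 10, 2, 0, 0)
last_commit_on_standby = datetime(2005, 11, 10, 1, 59, 58)
restored = datetime(2005, 11, 10, 2, 0, 45)

rpo = achieved_rpo(failure, last_commit_on_standby)
rto = achieved_rto(failure, restored)
print(f"RPO achieved: {rpo.total_seconds():.0f} s, "
      f"RTO achieved: {rto.total_seconds():.0f} s")
```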
Differentiating HA Technologies
Conventional Backup/Recovery
Block Level Database Replication
RAID: multiple hard disks behaving as a single large fast drive (mirrors/stripes/duplexing/parity)
Change Level Database Replication
Snapshots
Roll Forward / File Protection
Remote Mirroring Solutions
Transactional Data Management
High Availability and Disaster Recovery
HA Technologies & Tradeoffs
Block based database replication
Standby kept in constant recovery (inactive) mode
Useful for strict disaster recovery only, not HA
Cannot be used for reporting in recovery mode
No write access for distributed load balancing
Application response times suffer after failover
Cannot address availability across heterogeneous systems
Change based database replication
Trigger or log based
Not optimized for real time performance
Intrusive, Complex
Cannot address availability across heterogeneous systems
HA Technologies & Tradeoffs
Remote mirroring solutions
Volume managers maintain mirrors of local writes on a set of remote volumes
Useful for file protection
Physical distance to remote volumes is a critical limitation
No protection from logical corruption, or storage stack corruption
Message based logical writes sent by primary host over IP to remote hosts (synchronously/asynchronously)
Write ordering must be maintained by primary host
Remote volumes are standby-only, applications cannot access them
No protection from logical corruption
Hardware based
Storage arrays propagate IOs to storage arrays at a secondary site
Secondary arrays are inaccessible during replication
No protection from logical corruption
Only useful for block availability during DR
HA Technologies & Tradeoffs
Transactional Data Management
Addresses low latency in HA hybrid computing environments (built on 1-safe)
Management of transactional streams: captures, transforms, routes, delivers and verifies data transactions in real time
Real time, heterogeneous, data integrity, low impact
Use cases for HA, DR, data integration, distributed computing
Not for file-level replication
Source → Log-based data capture (can be selective) → Transform (if needed) → Route → Apply → Target
Sub-second latency
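The capture, transform, route, and apply stages can be sketched in a few lines. This is an illustrative toy under stated assumptions, not GoldenGate's implementation: the log is an in-memory list, targets are plain dictionaries, and every name (`LogRecord`, `capture`, `transform`, `route`, `apply_txn`) is hypothetical. It shows two properties named on the slide: capture can be selective, and only committed transactions are shipped.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Set


@dataclass(frozen=True)
class LogRecord:
    txn_id: int
    op: str                      # "insert", "update", "delete", "commit", "rollback"
    table: str = ""
    key: Optional[int] = None
    value: Optional[str] = None


Target = Dict[str, Dict[int, Optional[str]]]   # table name -> {key: value}


def capture(log: Iterable[LogRecord],
            tables: Optional[Set[str]] = None) -> Iterator[List[LogRecord]]:
    """Yield the changes of each committed transaction, in commit order."""
    pending: Dict[int, List[LogRecord]] = defaultdict(list)
    for rec in log:
        if rec.op == "commit":
            changes = pending.pop(rec.txn_id, [])
            if changes:                      # skip transactions filtered to nothing
                yield changes
        elif rec.op == "rollback":
            pending.pop(rec.txn_id, None)    # rolled-back work is never shipped
        elif tables is None or rec.table in tables:
            pending[rec.txn_id].append(rec)  # selective capture


def transform(txn: List[LogRecord],
              fn: Callable[[LogRecord], LogRecord]) -> List[LogRecord]:
    return [fn(rec) for rec in txn]


def route(targets: Dict[str, Target], names: Iterable[str]) -> List[Target]:
    """Pick the targets a transaction is delivered to (simple fan-out)."""
    return [targets[name] for name in names]


def apply_txn(txn: List[LogRecord], target: Target) -> None:
    """Apply one whole transaction to a target."""
    for rec in txn:
        rows = target.setdefault(rec.table, {})
        if rec.op == "delete":
            rows.pop(rec.key, None)
        else:
            rows[rec.key] = rec.value


# Hypothetical source log: txn 2 rolls back, txn 3 touches a filtered table.
log = [
    LogRecord(1, "insert", "accounts", 1, "alice"),
    LogRecord(2, "insert", "accounts", 2, "bob"),
    LogRecord(2, "rollback"),
    LogRecord(1, "update", "accounts", 1, "alice-v2"),
    LogRecord(3, "insert", "audit", 9, "noise"),
    LogRecord(3, "commit"),
    LogRecord(1, "commit"),
]


def upper(rec: LogRecord) -> LogRecord:
    return replace(rec, value=rec.value.upper() if rec.value else rec.value)


targets: Dict[str, Target] = {"standby": {}, "reporting": {}}
for txn in capture(log, tables={"accounts"}):
    for target in route(targets, ["standby", "reporting"]):
        apply_txn(transform(txn, upper), target)
```

Buffering changes per transaction until the commit record is seen is what keeps uncommitted or rolled-back work off the target in this sketch.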
HA/DR – Solution Examples
Database
1) Active (no outage)
2) Unplanned outage
System Failure
•Node death
•Power failure
Data Failure
•Physical Media
•Logical corruption
3) Planned outage
System Changes
Data Changes
•Upgrades
•Migrations
•Maintenance
Real-World HA Configuration (Active)
Primary “Master” (Active)
Secondary “Master” (Live/Active)
• Uni/Bi-directional configuration – if one system fails, the other is immediately available and ready to take over.
• For…
• Highest Availability
• Maximized ROI on hardware (transaction balancing)
• Data Centers with Active/Passive
• Example areas:
• 24x7 (ATMs, Online Banking)
• Online Retail
• Real-time Analytics, Warehousing
Real-World HA Configuration (Transaction Off-Loading)
Processing Commit Transactions (Active)
Lookup Query Handling (Live/Active)
• Improve availability and performance of transaction processing by offloading query load to lower-cost DBs/platforms
• For…
• Horizontal scalability
• Improved performance
• Example areas:
• Online Reservations
• Online Lookups
Real-World DR Configuration (Failover)
Database (Active) → Failover Database
Unplanned outage: System failure, Data failure
• An HA implementation that captures and applies data to a failover system in real time.
• For…
• Fast failover (No restore)
• Do root-cause analysis later!
• Surgical Repair (Dynamic, Selective undo)
• Example areas:
• 24x7/mission-critical applications
• Strict SLA requirements
Real-World Configuration (Switchover)
Current Database → “Switchover” Database
Planned Outage: System Changes, Data Changes
• Zero-Downtime Migrations
• Rolling Upgrades
• Zero-Downtime Maintenance
• For…
• 24x7 availability
• Reduced windows for system maintenance
• Example areas:
• Can’t afford downtime to do in-place upgrade
About GoldenGate Software
GoldenGate Software is a privately held software company that offers Transactional Data Management solutions.
250 customers... 1500+ solutions implemented… in 35 countries
Established, Loyal Customer Base
Leading Industry Solutions
18,000 Node ATM Network with 24/7 Availability
Achieving paperless enterprise for this visionary healthcare provider
Saving $ millions with real-time DW and zero downtime migrations.
Database tiering handles average of 300,000 updates/hour, peaks at 800,000/hour
GoldenGate & Transactional Data Management
Provides guaranteed capture, routing, transformation, delivery, and verification of data transactions across heterogeneous environments in real time.
TDM must be:
Real time: Moves with sub-second latency
Heterogeneous: Moves transactions across different databases and platforms
Transactional: Maintains transaction integrity
GoldenGate differentiates on:
Performance: Handles thousands of transactions per second with very low impact on IT systems
Extensibility & Flexibility: Open architecture to meet demanding customer needs and data environments
Reliability: Supports continuous operations and availability
How GoldenGate Works
Capture: Committed changes are captured as they occur by reading the transaction logs.
Trail files: Stages and queues data for routing.
Route: Data is compressed, encrypted for routing to targets.
Delivery: Applies transactional data with guaranteed integrity.
GoldenGate Veridata™: Reports on data discrepancies between in-use DBs
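The idea of reporting discrepancies between two in-use databases can be illustrated with a row-digest comparison. This is a generic sketch and makes no claim about how Veridata works internally; tables are modeled as hypothetical in-memory dictionaries keyed by primary key, and `row_digest` and `compare_tables` are illustrative names.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Tuple[object, ...]          # column values of one row
Table = Dict[int, Row]            # primary key -> row


def row_digest(row: Row) -> str:
    """Digest of a row, so only short hashes need to be compared."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def compare_tables(source: Table, target: Table) -> Dict[str, List[int]]:
    """Report keys that are missing, extra, or different on the target."""
    src = {key: row_digest(row) for key, row in source.items()}
    tgt = {key: row_digest(row) for key, row in target.items()}
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "extra_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys()
                             if src[k] != tgt[k]),
    }


# Hypothetical sample data.
source_rows: Table = {1: ("alice", 100), 2: ("bob", 250), 3: ("carol", 75)}
target_rows: Table = {1: ("alice", 100), 2: ("bob", 200), 4: ("dave", 10)}

report = compare_tables(source_rows, target_rows)
print(report)
```

Because both systems stay in use while they are compared, a real tool would also need to distinguish rows that are merely in flight from rows that are persistently out of sync; this sketch does not attempt that.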
TDM & HA Evaluation Criteria
Availability: Standby available for reporting, logical repair, load balancing, failover with warm cache
MTTR: No restore required; proportional to runtime cost of transactions in last log write (~ seconds)
Performance: Near zero time latency; ship only committed transactions
Data Protection / Loss: Redo validation using SQL Apply; No Loss (read access to last IO in current log)
Manageability: Director GUI, CLI, STATS
Impact: Low impact on deployed systems; metadata outside the database
Questions & Answers
Contact Information:
Alok Pareek
www.goldengate.com
(415) 777-0200
apareek@goldengate.com
301 Howard Street, Suite 2100, San Francisco CA 94105
Lag points address Logical failures
One-to-One: Source → Target
One-to-Many: Source → Target (current), Target (1 hour lag), Target (1 day lag)
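Why lagged targets help with logical failures can be shown with a small simulation. The sketch below is an assumption-laden toy: `LaggedTarget`, the change stream, and the timestamps are all hypothetical. Each target applies a change only once it is older than that target's lag, so an accidental delete reaches the current copy immediately but has not yet reached the delayed copy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# (commit_time, key, value); value None means the row was deleted.
Change = Tuple[datetime, str, Optional[str]]


@dataclass
class LaggedTarget:
    lag: timedelta
    rows: Dict[str, str] = field(default_factory=dict)
    applied: int = 0              # index of the next change to apply

    def catch_up(self, stream: List[Change], now: datetime) -> None:
        """Apply, in order, every change committed at or before now - lag."""
        cutoff = now - self.lag
        while self.applied < len(stream) and stream[self.applied][0] <= cutoff:
            _, key, value = stream[self.applied]
            if value is None:
                self.rows.pop(key, None)
            else:
                self.rows[key] = value
            self.applied += 1


# Hypothetical timeline: a row is written, then deleted by mistake 2 hours later.
t0 = datetime(2005, 11, 10, 9, 0)
stream: List[Change] = [
    (t0, "order-1", "paid"),
    (t0 + timedelta(hours=2), "order-1", None),   # the logical error
]

current = LaggedTarget(lag=timedelta(0))
hour_lag = LaggedTarget(lag=timedelta(hours=1))
day_lag = LaggedTarget(lag=timedelta(days=1))

now = t0 + timedelta(hours=2, minutes=30)         # error noticed 30 minutes later
for target in (current, hour_lag, day_lag):
    target.catch_up(stream, now)
```

In this run the current target has already lost the row, while the one-hour target still holds it and can be used for repair as long as the error is noticed within the lag window.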