The Approach (HadoopDB)

HadoopDB
An Architectural Hybrid of Map Reduce and
DBMS Technologies for Analytical Workloads
Presented By: Wen Zhang and Shawn Holbrook
Roadmap
• The Problem (introduction/background)
• Map Reduce
• Parallel DBMS
• HadoopDB
• The Approach (HadoopDB)
• System Architecture
• Performance
• Efficiency
• Fault tolerance
• Benchmarks
• Conclusion
• Questions
The Problem
• The amount of structured data that needs to be analyzed is exploding
• requiring hundreds to thousands of machines to work in parallel to perform
the analysis.
• Two Major Approaches
• Parallel DBMS
• Strengths
• Weaknesses
• Map Reduce
• Strengths
• Weaknesses
The Problem
Parallel DBMS
• The amount of structured data that needs to be analyzed is exploding
• requiring hundreds to thousands of machines to work in parallel to perform the analysis.
• Two Major Approaches
• Parallel DBMS
• Strong emphasis on performance and efficiency
• Large scan operations (e.g. multidimensional aggregations and joins) are easy to parallelize across nodes in a shared-nothing network
• Parallel databases have been proven to scale well into the tens of nodes
• Few known deployments consisting of more than one hundred nodes (no 1000+ systems)
• Parallel databases tend to be designed with the assumption that failures are a rare event.
• Failures become increasingly common as one adds more nodes to a system
• Generally assume a homogeneous array of machines (nearly impossible to achieve)
• Map Reduce
The Problem
Map Reduce
• The amount of structured data that needs to be analyzed is exploding
• requiring hundreds to thousands of machines to work in parallel to perform the analysis.
• Two Major Approaches
• Parallel DBMS
• Map Reduce
• Well suited for performing analysis at this scale
• 1000+ nodes in a shared-nothing architecture
• Originally designed for a largely different application (unstructured text data processing)
• Unfortunately, Map Reduce was not originally designed to perform structured data analysis
• Lacks invaluable DBMS features for structured data analysis workloads
• Lacks the benefits of modeling and loading data before processing
• Results in performance an order of magnitude slower than parallel databases
The Solution
HadoopDB
• Ideally there should exist a combined solution
• Scalability of MapReduce
• Performance and efficiency of parallel DBMS
• This paper presents such a hybrid system
• HadoopDB
• The basic idea
• Use MapReduce as the communication layer above multiple nodes running single-node
DBMS instances
• Queries are expressed in SQL
• Using HiveQL, queries are translated into MapReduce jobs
• As much work as possible is pushed into the higher-performing single-node databases
Approach
[Diagram, built up over several slides: SQL is converted into Map Reduce jobs, which run on node 1, node 2, . . . , node N]
SMS Planner - HiveQL
• Extends Apache Hive
• Apache Hive
• Converts SQL to Map Reduce
• Creates SQL query plan
• Specifically for Hadoop
• Not aware of Parallel DBMS
• SMS
• Extends Hive to take advantage of Parallel DBMS
• Optimizes Hive Query Plan
SMS Planner – HiveQL
(continued)
• EXAMPLE
SELECT YEAR(saleDate), SUM(revenue)
FROM sales
GROUP BY YEAR(saleDate);
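The following is a minimal sketch, not HadoopDB's actual planner output, of how a query like the one above could be split: the aggregation SQL runs inside each node's local database (the map side), and a reduce step merges the per-node partial sums. It assumes the sales table is not partitioned by year across nodes, so a merge step is needed. In-memory SQLite databases stand in for the single-node DBMS instances, and the names `make_node`, `map_phase`, and `reduce_phase` are hypothetical.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to every single-node database (the "map" side).
PUSHED_DOWN_SQL = """
    SELECT YEAR(saleDate), SUM(revenue)
    FROM sales
    GROUP BY YEAR(saleDate)
"""

def make_node(rows):
    """Create one in-memory SQLite database standing in for a node-local DBMS."""
    conn = sqlite3.connect(":memory:")
    # SQLite has no built-in YEAR(); register one for ISO 'YYYY-MM-DD' strings.
    conn.create_function("YEAR", 1, lambda d: int(d[:4]))
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def map_phase(nodes):
    """Run the pushed-down query on each node and emit (year, partial_sum) pairs."""
    for conn in nodes:
        yield from conn.execute(PUSHED_DOWN_SQL)

def reduce_phase(pairs):
    """Merge the per-node partial sums into one total per year."""
    totals = defaultdict(float)
    for year, partial in pairs:
        totals[year] += partial
    return dict(totals)

# Two illustrative nodes, each holding a different slice of the sales table.
nodes = [
    make_node([("2008-03-01", 10.0), ("2009-07-15", 5.0)]),
    make_node([("2008-11-30", 2.5), ("2009-01-02", 4.0), ("2009-12-31", 1.0)]),
]
result = reduce_phase(map_phase(nodes))
print(result)
```

Because each node returns at most one row per year, only the small partial aggregates cross the network; the scan and grouping work stays inside the databases.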
Performance and Stability Benchmarks
• Vertica
• parallel database system
• column-store system
• DBMS-X
• parallel database system
• row-oriented system
• Hadoop
• Map Reduce
• HadoopDB
• Hybrid
Performance and Stability Benchmarks
• HadoopDB slightly outperforms
Hadoop
• However, both systems are outperformed by the parallel database systems.
• Vertica and DBMS-X compress
their data, which significantly
reduces I/O
Performance and Stability Benchmarks
• Benefits from the optimizers present in database systems
• HadoopDB outperforms Hadoop
• This query is well-suited for
column-oriented storage
• Vertica significantly outperforms
the other systems
Performance and Stability Benchmarks
• Hadoop
• Performance is limited by
completely scanning the dataset
on each node in order to evaluate
the selection predicate.
• HadoopDB, DBMS-X, and Vertica
all achieve higher performance
• Take advantage of DBMS index to
accelerate the selection predicate
• Native support for joins.
Fault Tolerance
• Vertica
• Shared-Nothing
• Parallel DBMS
• Hadoop (with Hive)
• Map Reduce only
• HadoopDB
• Hybrid
• Map Reduce
• Parallel DBMS
[Charts: Node Failure, Node Slowdown]
Fault Tolerance (node failure)
• Vertica
• Increase in total execution time
• Overhead for query abortion and
complete restart
• Hadoop (with Hive)
• Tasks of the failed node are
distributed over free nodes that
contain replicas of the data
• HadoopDB
• Tasks of the failed node are
distributed over free nodes that
contain replicas of the data
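The rescheduling behaviour described for Hadoop and HadoopDB can be sketched roughly as below. This is an illustration only, not Hadoop's scheduler: the `reschedule` function, the chunk and node names, and the "least-loaded replica holder" tie-break are all assumptions made for the example.

```python
def reschedule(assignment, replicas, failed_node):
    """Move tasks off failed_node to a live node that holds a replica.

    assignment: dict mapping chunk id -> node currently processing it
    replicas:   dict mapping chunk id -> set of nodes storing a copy
    The least-loaded candidate is chosen (an assumption for this sketch).
    """
    new_assignment = dict(assignment)

    # Count how many tasks each surviving node already has.
    load = {}
    for chunk, node in assignment.items():
        if node != failed_node:
            load[node] = load.get(node, 0) + 1

    for chunk, node in sorted(assignment.items()):
        if node != failed_node:
            continue
        # Only nodes with a replica of the chunk can take over its task.
        candidates = sorted(n for n in replicas[chunk] if n != failed_node)
        if not candidates:
            raise RuntimeError(f"no live replica for chunk {chunk}")
        target = min(candidates, key=lambda n: load.get(n, 0))
        new_assignment[chunk] = target
        load[target] = load.get(target, 0) + 1
    return new_assignment

# Hypothetical three-node cluster with two copies of every chunk.
assignment = {"c1": "node1", "c2": "node2", "c3": "node3"}
replicas = {
    "c1": {"node1", "node2"},
    "c2": {"node2", "node3"},
    "c3": {"node3", "node1"},
}
after_failure = reschedule(assignment, replicas, failed_node="node2")
print(after_failure)
```

Only the failed node's tasks are redone, which is why the query does not have to be aborted and restarted from scratch as in the Vertica case.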
Fault Tolerance (node slowdown)
• Vertica
• Performance determined by time
it takes for the slowest node to
complete
• Waits for the straggler to complete
• Hadoop (with Hive)
• Run redundant tasks on free nodes
• HadoopDB
• Run redundant tasks on free nodes
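Running a redundant copy of a straggler's task can be illustrated with the small sketch below. It is a simplification under stated assumptions: both copies are started at the same time (a real scheduler would launch the backup only after detecting a slow node), threads with `time.sleep` stand in for nodes, and `run_task` and `run_with_backup` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task(node, delay, chunk):
    """Pretend to process a chunk on a node; `delay` models the node's speed."""
    time.sleep(delay)
    return node, sum(chunk)

def run_with_backup(chunk, primary, backup):
    """Run the same task on two nodes and keep whichever result arrives first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_task, name, delay, chunk)
                   for name, delay in (primary, backup)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # The slower copy is simply ignored here; a real system would kill it.
        return next(iter(done)).result()

winner, total = run_with_backup(
    [1, 2, 3],
    primary=("slow-node", 0.5),   # the straggler
    backup=("free-node", 0.01),   # a free node running the redundant copy
)
print(winner, total)
```

The job's completion time is then bounded by the faster copy rather than by the slowest node, in contrast to the Vertica behaviour listed above.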
Conclusions
• HadoopDB is able to approach the performance of parallel database systems
• PostgreSQL is not a column-store
• Data compression was not used in PostgreSQL
• Hadoop and Hive are relatively young open-source projects
• We expect future releases to enhance performance
• HadoopDB achieves fault tolerance similar to that of Hadoop (Map Reduce)
• Achieved a hybrid of the parallel DBMS and Map Reduce
• HadoopDB operates successfully in heterogeneous environments
• HadoopDB achieves low cost due to open-source Hadoop
Questions???