HadoopDB
An Architectural Hybrid of Map Reduce and DBMS Technologies for Analytical Workloads
Presented by: Wen Zhang and Shawn Holbrook

Roadmap
• The Problem (introduction/background)
• Map Reduce
• Parallel DBMS
• HadoopDB
• The Approach (HadoopDB)
• System Architecture
• Performance
• Efficiency
• Fault tolerance
• Benchmarks
• Conclusion
• Questions

The Problem
• The amount of STRUCTURED data that needs to be analyzed is exploding
  • Requiring hundreds to thousands of machines to work in parallel to perform the analysis
• Two major approaches
  • Parallel DBMS
    • Strengths
    • Weaknesses
  • Map Reduce
    • Strengths
    • Weaknesses

The Problem: Parallel DBMS
• Strong emphasis on performance and efficiency
• Large scan operations (e.g., multidimensional aggregations and joins) are easy to parallelize across nodes in a shared-nothing network
• Parallel databases have been proven to scale really well into the tens of nodes
• Few known deployments consist of more than one hundred nodes (no 1000+ systems)
• Parallel databases tend to be designed with the assumption that failures are a rare event
  • Failures become increasingly common as one adds more nodes to a system
• Generally assume a homogeneous array of machines (nearly impossible to achieve)

The Problem: Map Reduce
• Two major approaches
  • Parallel DBMS
  • Map Reduce
    • Well suited for performing analysis at this scale
    • 1000+ nodes in a shared-nothing architecture
    • Originally designed for a largely different application (unstructured text data processing)
    • Not originally designed to perform structured data analysis
    • Lacks invaluable DBMS features for structured data analysis workloads
    • Lacks the benefits of modeling and loading data before processing
      • Causes an order of magnitude slower performance than parallel databases

The Solution: HadoopDB
• Ideally there should exist a combined solution
  • Scalability of MapReduce
  • Performance and efficiency of parallel DBMS
• This paper presents such a hybrid system
  • HadoopDB
• The basic idea
  • Use MapReduce as the communication layer above multiple nodes running single-node DBMS instances
  • Queries are expressed in SQL
  • Using HiveQL, queries are translated into MapReduce jobs
  • As much work as possible is pushed into the higher-performing single-node databases

Approach
[Diagram: SQL, Map Reduce Job, node 1, node 2, ..., node N]
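
Approach: a minimal sketch
The basic idea above (a MapReduce-style layer coordinating independent single-node databases, with as much work as possible pushed into SQL on each node) can be illustrated in a few lines of Python. This is only a sketch under stated assumptions, not HadoopDB's implementation: in-memory SQLite databases stand in for the single-node DBMS instances, plain Python functions stand in for map and reduce tasks, and the `sales` table, its rows, and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (saleDate, revenue)
SALES = [
    ("2008-01-15", 100.0),
    ("2008-06-30", 50.0),
    ("2009-02-01", 70.0),
    ("2009-11-11", 30.0),
    ("2010-03-03", 20.0),
]


def make_node(rows):
    """Create one single-node database (assumption: in-memory SQLite) holding a partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Push the aggregation into the node's database; return (year, partial_sum) pairs."""
    sql = (
        "SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS yr, SUM(revenue) "
        "FROM sales GROUP BY yr"
    )
    return conn.execute(sql).fetchall()


def reduce_task(partials):
    """Merge the per-node partial sums into one total per year."""
    totals = defaultdict(float)
    for year, subtotal in partials:
        totals[year] += subtotal
    return dict(totals)


def run_query(rows, num_nodes=3):
    """Partition rows round-robin, run the map task on every node, then reduce."""
    partitions = [rows[i::num_nodes] for i in range(num_nodes)]
    nodes = [make_node(p) for p in partitions]
    partials = [pair for node in nodes for pair in map_task(node)]
    return reduce_task(partials)


result = run_query(SALES)
print(result)
```

Each node returns only one row per year it holds, so the coordinating layer merges small partial results instead of shipping raw rows between machines.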
SMS Planner - HiveQL
• Extends Apache Hive
• Apache Hive
  • Converts SQL to Map Reduce
  • Creates SQL query plan
  • Specifically for Hadoop
  • Not aware of parallel DBMS
• SMS
  • Extends Hive to take advantage of parallel DBMS
  • Optimizes Hive query plan

SMS Planner - HiveQL (continued)
• EXAMPLE
    SELECT YEAR(saleDate), SUM(revenue)
    FROM sales
    GROUP BY YEAR(saleDate);

Performance and Stability Benchmarks
• Vertica
  • Parallel database system
  • Column-store system
• DBMS-X
  • Parallel database system
  • Row-oriented system
• Hadoop
  • Map Reduce
• HadoopDB
  • Hybrid

Performance and Stability Benchmarks
• HadoopDB slightly outperforms Hadoop
• However, both systems are outperformed by the parallel database systems
  • Vertica and DBMS-X compress their data, which significantly reduces I/O

Performance and Stability Benchmarks
• Benefits from the optimizers present in database systems
  • HadoopDB outperforms Hadoop
• This query is well suited for column-oriented storage
  • Vertica significantly outperforms the other systems

Performance and Stability Benchmarks
• Hadoop
  • Performance is limited by completely scanning the dataset on each node in order to evaluate the selection predicate
• HadoopDB, DBMS-X, and Vertica all achieve higher performance
  • Take advantage of DBMS indexes to accelerate the selection predicate
  • Native support for joins

Fault Tolerance
• Vertica
  • Shared-nothing
  • Parallel DBMS
• Hadoop (with Hive)
  • Map Reduce only
• HadoopDB
  • Hybrid
    • Map Reduce
    • Parallel DBMS
[Figure panels: Node Failure, Node Slowdown]

Fault Tolerance (node failure)
• Vertica
  • Increase in total execution time
  • Overhead for query abortion and complete restart
• Hadoop (with Hive)
  • Tasks of the failed node are distributed over free nodes that contain replicas of the data
• HadoopDB
  • Tasks of the failed node are distributed over free nodes that contain replicas of the data

Fault Tolerance (node slowdown)
• Vertica
  • Performance is determined by the time it takes the slowest node to complete
  • Waits for the straggler to complete
• Hadoop (with Hive)
  • Runs redundant tasks on free nodes
• HadoopDB
  • Runs redundant tasks on free nodes

Conclusions
• HadoopDB is able to approach the performance of parallel database systems
  • PostgreSQL is not a column-store
  • Did not use data compression in PostgreSQL
  • Hadoop and Hive are relatively young open-source projects
  • We expect future releases to enhance performance
• HadoopDB achieves fault tolerance similar to that of Hadoop (Map Reduce)
• Achieved a hybrid of the parallel DBMS and Map Reduce
• HadoopDB operates successfully in heterogeneous environments
• HadoopDB achieves low cost because Hadoop is open source

Questions?