PARALLEL DBMS VS MAP REDUCE
“MapReduce and parallel DBMSs: friends or foes?”
Michael Stonebraker, Daniel Abadi, David J. DeWitt et al.

MAP REDUCE AND PARALLEL DBMS ARE COMPLEMENTARY
• By 2010, MapReduce (MR) had been hailed as a revolutionary new platform for large-scale, massively parallel data access
• Some proponents claimed that the extreme scalability of MR would relegate relational database management systems (DBMSs) to the status of legacy technology
• It was later found that using MR systems for tasks best suited to DBMSs yields less than satisfactory results
• As such, MR complements DBMS technology rather than competing with it
• Parallel DBMSs first became available nearly two decades earlier
• As robust, high-performance platforms, they provide a high-level programming environment that is inherently parallelizable
• Almost any parallel processing task can be written as either a set of database queries or a set of MR jobs

THE SHARED-NOTHING ARCHITECTURE OF PARALLEL DBMS
• The first parallel DBMSs used a shared-nothing architecture (independent nodes, each with its own memory and disk) with horizontal partitioning of relational tables
• Horizontal partitioning is critical to obtaining scalable performance for SQL queries
• It leads to partitioned execution of SQL operators such as selection, aggregation, and join
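
The partitioned execution of selection and aggregation described above can be sketched in a few lines of Python. This is only an illustrative simulation on a single machine, not how any particular parallel DBMS is implemented. The rows, the node count, and every function name are assumptions made up for the example: each "node" is just a list, rows are dealt out round-robin, every node filters its own fragment, and rows are then repartitioned by hashing on the customer id so that each customer's rows meet on one node before being summed.

```python
from collections import defaultdict
from datetime import date

NUM_NODES = 3  # assumed cluster size, for illustration only

# Hypothetical Sales rows: (cust_id, amount, sale_date)
SALES = [
    (1, 10.0, date(2009, 12, 2)),
    (2, 5.0, date(2009, 12, 3)),
    (1, 7.5, date(2009, 11, 30)),   # outside the date range
    (3, 20.0, date(2009, 12, 24)),
    (2, 2.5, date(2009, 12, 25)),
    (1, 1.0, date(2009, 12, 26)),   # outside the date range
    (3, 4.0, date(2009, 12, 1)),
]


def round_robin_partition(rows, num_nodes):
    """Horizontal partitioning: deal the rows out to the nodes in turn."""
    fragments = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        fragments[i % num_nodes].append(row)
    return fragments


def local_select(fragment, start, end):
    """Selection: each node filters only its own fragment."""
    return [row for row in fragment if start <= row[2] <= end]


def shuffle_by_customer(selected_fragments, num_nodes):
    """Repartition so that all rows of one customer land on the same node."""
    buckets = [[] for _ in range(num_nodes)]
    for fragment in selected_fragments:
        for row in fragment:
            buckets[hash(row[0]) % num_nodes].append(row)
    return buckets


def local_aggregate(bucket):
    """Aggregation: each node sums the amounts of the customers it received."""
    totals = defaultdict(float)
    for cust_id, amount, _ in bucket:
        totals[cust_id] += amount
    return dict(totals)


def partitioned_totals(rows, start, end, num_nodes=NUM_NODES):
    fragments = round_robin_partition(rows, num_nodes)
    selected = [local_select(f, start, end) for f in fragments]
    buckets = shuffle_by_customer(selected, num_nodes)
    result = {}
    for bucket in buckets:
        # After the shuffle, customer ids are disjoint across nodes,
        # so the per-node results can simply be merged.
        result.update(local_aggregate(bucket))
    return result


totals = partitioned_totals(SALES, date(2009, 12, 1), date(2009, 12, 25))
print(totals)
```

Because the shuffle sends every row for a given customer to exactly one node, the per-node sums are already final and need no further combining step. Changing `num_nodes` changes how the work is spread but not the totals.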
HORIZONTAL PARTITIONING
• The idea behind horizontal partitioning is to distribute the rows of a relational table across the nodes of a cluster so that they can be processed in parallel

MAP REDUCE EXAMPLE IN PARALLEL DBMS
• SELECT custId, SUM(amount) FROM Sales WHERE date BETWEEN '12/01/2009' AND '12/25/2009' GROUP BY custId
• The Sales table is round-robin partitioned across the nodes in the cluster
• A SELECT operator on each node scans the fragment of the Sales table stored there
• Rows satisfying the date predicate are passed to a SHUFFLE operator that dynamically repartitions them by hashing on custId
• Each node then aggregates the rows it receives to compute the final total for each of its customers

MAP-REDUCE ADVANTAGES
• MR is advantageous for ETL and read-once data sets: a DBMS must parse and verify each datum in the tuples before loading, while MR does not
• The distributed infrastructure used to run MR is cheap
• Horizontal scalability of MR is better than that of parallel DBMSs
• Open source MR implementations such as Hadoop are available with detailed documentation
• There is no popular open source parallel DBMS; the popular ones all come from commercial vendors

Comparison - Parallel DBMS over MapReduce
Experimental setup
• Compared a popular MR implementation (Hadoop) with two parallel DBMSs (DBMS-X and Vertica)
• Reported results are those obtained after tuning each system for best performance
1. MR Grep task - Each system must scan through a data set of 100-byte records looking for a three-character pattern.
2. Web log task - Conventional SQL aggregation with a GROUP BY clause on a table of user visits in a Web server log.
3. Join task - Fairly complex join operation over two tables requiring an additional aggregation and filtering operation.

Task name      Hadoop   DBMS-X   Vertica   Hadoop/DBMS-X   Hadoop/Vertica
MR Grep task   284s     194s     108s      1.5x            2.6x
Web log task   1146s    740s     268s      1.6x            4.3x
Join task      1158s    32s      55s       36.3x           21.0x

Reasons why parallel DBMSs outperform MapReduce in the experiment
1. Repetitive record parsing - The default configuration of Hadoop stores data in the accompanying distributed file system (HDFS) in the same textual format in which the data was generated. Every Map and Reduce task must therefore parse and convert the string fields again, whereas a DBMS parses each record once, at load time.
2. Compression - Enabling data compression in the DBMSs delivered a much more significant performance gain than it did in MR. The reason is unclear.
3. Pipelining - Writing data structures to disk gives Hadoop a convenient way to checkpoint the output of intermediate map jobs, which improves fault tolerance but adds significant performance overhead. A parallel DBMS instead streams data from one operator to the next without writing intermediate results to disk.
4. Scheduling - In a parallel DBMS, each node knows exactly what it must do and when it must do it according to the distributed query plan. In an MR system, each task is scheduled onto processing nodes at runtime, one storage block at a time, which costs more than scheduling fixed in advance by a query plan.
5. Column-oriented storage - In a column store-based database (such as Vertica), the system reads only the attributes necessary for solving the user query, whereas a traditional row store reads every attribute off the disk.

Conclusion
• MR has some good qualities: a good out-of-the-box experience and the ability to work directly on data stored in the file system, which most database systems cannot do
• DBMSs have some good qualities: technologies and techniques for efficient parallel query execution, and the use of higher-level languages
• Parallel DBMSs excel at efficient querying of large data sets
• MR-style systems excel at complex analytics and ETL tasks
• Neither is good at what the other does well; hence, the two technologies are complementary
• An ideal system would therefore be a “HYBRID” system
• HadoopDB, Hive, Aster, Greenplum, Cloudera, and Vertica all have commercially available products or prototypes in this “hybrid” category