parallel dbms vs map reduce

PARALLEL DBMS VS MAP REDUCE
“MapReduce and parallel DBMSs: friends or foes?” Michael Stonebraker, Daniel Abadi, David J. DeWitt, et al.
MAP REDUCE AND PARALLEL DBMS ARE COMPLEMENTARY
• By 2010, MapReduce (MR) had been hailed as a revolutionary new platform for large-scale, massively parallel data access
• At the time, some proponents claimed the extreme scalability of MR would relegate relational database management systems (DBMSs) to the status of legacy technology
• It was later found that using MR systems for tasks best suited to DBMSs yields less than satisfactory results
• As such, MR complements DBMS technology rather than competing with it
• Parallel DBMSs first became available nearly two decades earlier
• They are robust, high-performance platforms that provide a high-level programming environment that is inherently parallelizable
• Almost any parallel processing task can be written either as a set of database queries or as a set of MR jobs
THE SHARED-NOTHING ARCHITECTURE OF PARALLEL DBMS
• The initial parallel DBMSs used a shared-nothing architecture with horizontal partitioning of relational tables
• Horizontal partitioning is critical to obtaining scalable performance for SQL queries
• It leads to the concept of partitioned execution of SQL operators such as selection, aggregation, and join
HORIZONTAL PARTITIONING
• The idea behind horizontal partitioning is to distribute the rows of a relational table across the nodes of a cluster so that they can be processed in parallel
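As a minimal sketch of the idea, the snippet below distributes the rows of a small table across nodes in two common ways: round-robin (row i goes to node i mod N) and hash partitioning on a key. The function names, the toy `sales` rows, and the use of Python's built-in `hash` on integer keys are assumptions made for illustration and do not come from any particular DBMS.

```python
def round_robin_partition(rows, num_nodes):
    """Assign row i to node i % num_nodes, giving near-equal fragment sizes."""
    fragments = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        fragments[i % num_nodes].append(row)
    return fragments


def hash_partition(rows, num_nodes, key):
    """Assign each row to a node chosen by hashing one attribute.

    All rows sharing the same key value land on the same node.
    Integer keys are used here so that hash() is stable across runs.
    """
    fragments = [[] for _ in range(num_nodes)]
    for row in rows:
        fragments[hash(row[key]) % num_nodes].append(row)
    return fragments


# Hypothetical table: 8 sales rows spread over 4 customers.
sales = [{"custId": i % 4, "amount": 10.0 * i} for i in range(8)]

rr = round_robin_partition(sales, 3)       # balanced fragments
hp = hash_partition(sales, 3, "custId")    # rows grouped by customer

for node, fragment in enumerate(rr):
    print("round-robin node", node, "holds", len(fragment), "rows")
for node, fragment in enumerate(hp):
    print("hash node", node, "holds customers",
          sorted({r["custId"] for r in fragment}))
```

Round-robin balances fragment sizes but scatters each customer's rows over several nodes; hash partitioning keeps each customer on one node, which matters when an operator has to group or join on that key.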
MAP REDUCE EXAMPLE IN PARALLEL DBMS
• SELECT custId, SUM(amount) FROM Sales WHERE date BETWEEN '12/01/2009' AND '12/25/2009' GROUP BY custId
• The Sales table is round-robin partitioned across the nodes in the cluster
• Each SELECT operator scans the fragment of the Sales table stored at its node
• Rows satisfying the date predicate are passed to a SHUFFLE operator that dynamically repartitions them
• The repartitioning is done by hashing on custId, so all rows for a given customer arrive at the same node
• Rows are then aggregated at each node to find the final total for each customer
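The scan, filter, shuffle, and aggregate steps above can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the sample rows, the node count, and the function `run_partitioned_query` are hypothetical, and a real parallel DBMS would run each phase concurrently on separate machines.

```python
from collections import defaultdict
from datetime import date

# Hypothetical Sales rows; two of them fall outside the date range.
SALES = [
    {"custId": 1, "amount": 10.0, "date": date(2009, 12, 2)},
    {"custId": 2, "amount": 5.0, "date": date(2009, 12, 3)},
    {"custId": 1, "amount": 7.5, "date": date(2009, 12, 20)},
    {"custId": 3, "amount": 4.0, "date": date(2009, 11, 30)},
    {"custId": 2, "amount": 1.0, "date": date(2009, 12, 26)},
    {"custId": 3, "amount": 2.0, "date": date(2009, 12, 25)},
]


def run_partitioned_query(rows, num_nodes, start, end):
    # 1. Round-robin partitioning: node i stores rows i, i + N, i + 2N, ...
    fragments = [rows[i::num_nodes] for i in range(num_nodes)]

    # 2. Local SELECT: each node filters only its own fragment.
    selected = [[r for r in frag if start <= r["date"] <= end]
                for frag in fragments]

    # 3. SHUFFLE: repartition surviving rows by hashing on custId,
    #    so every row for a given customer reaches the same node.
    shuffled = [[] for _ in range(num_nodes)]
    for frag in selected:
        for r in frag:
            shuffled[hash(r["custId"]) % num_nodes].append(r)

    # 4. Local aggregation: each node sums amounts per customer.
    #    No customer appears on two nodes, so results merge by union.
    totals = {}
    for frag in shuffled:
        local = defaultdict(float)
        for r in frag:
            local[r["custId"]] += r["amount"]
        totals.update(local)
    return totals


totals = run_partitioned_query(SALES, 3, date(2009, 12, 1), date(2009, 12, 25))
print(totals)
```

The filter phase plays the role of a Map function and the per-customer aggregation plays the role of a Reduce function, which is why the same computation can be expressed in either system.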
MAP-REDUCE ADVANTAGES
• MR is advantageous for ETL and read-once data sets: a DBMS must parse and verify each datum in the tuples before loading, while MR does not
• The distributed infrastructure used to implement MR is cheap
• MR scales horizontally better than parallel DBMSs
• Open source MR implementations such as Hadoop come with detailed documentation
• There is no popular open source parallel DBMS; all the popular ones come from commercial vendors
Comparison - Parallel DBMS over MapReduce
Experimental setup
• Used the most popular implementations of MR and parallel DBMS
• Results presented are those achieved after best tuning
1. MR Grep task - Each system must scan through a data set of 100-byte records looking for a three-character pattern.
2. Web log task - Conventional SQL aggregation with a GROUP BY clause on a table of user visits in a Web server log.
3. Join task - Fairly complex join operation over two tables requiring an additional aggregation and filtering operation.
Task name       Hadoop   DBMS-X   Vertica   Hadoop/DBMS-X   Hadoop/Vertica
MR Grep task    284s     194s     108s      1.5x            2.6x
Web log task    1146s    740s     268s      1.6x            4.3x
Join task       1158s    32s      55s       36.3x           21.0x
Reasons why PDBMS outperforms MapReduce in experiment
1. Repetitive record parsing - The default configuration of Hadoop stores data in the accompanying distributed file system (HDFS) in the same textual format in which the data was generated, so records must be parsed again every time they are read. A DBMS parses each record once, at load time.
2. Compression - Enabling data compression in the DBMSs delivered a much more significant performance gain than it did in MR. The reason is unknown.
3. Pipelining - Writing data structures to disk gives Hadoop a convenient way to checkpoint the output of intermediate map jobs, thereby improving fault tolerance, but it adds significant performance overhead.
4. Scheduling - In a parallel DBMS, each node knows exactly what it must do and when it must do it according to the distributed query plan. In an MR system, each task is scheduled on processing nodes one storage block at a time, which adds runtime overhead.
5. Column-oriented storage - In a column store-based database (such as Vertica), the system reads only the attributes necessary for solving the user query.
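To illustrate the last point, the sketch below compares how many attribute values a query touches under a row layout and under a column layout. It is a toy in-memory model, not how Vertica or any other product is implemented; the table contents and the attribute names `sourceIP`, `destURL`, and `adRevenue` are assumptions chosen for the example.

```python
# Hypothetical user-visit records, stored row by row.
rows = [
    {"sourceIP": "10.0.0.1", "destURL": "a.html", "adRevenue": 1.5},
    {"sourceIP": "10.0.0.2", "destURL": "b.html", "adRevenue": 2.0},
    {"sourceIP": "10.0.0.1", "destURL": "c.html", "adRevenue": 0.5},
]

# The same data stored column by column: one list per attribute.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def row_store_sum(rows, attr):
    """Sum one attribute in a row layout; every full record is read."""
    total, values_read = 0.0, 0
    for record in rows:
        values_read += len(record)   # all attributes of the record are touched
        total += record[attr]
    return total, values_read


def column_store_sum(columns, attr):
    """Sum one attribute in a column layout; only that column is read."""
    column = columns[attr]
    return sum(column), len(column)


row_total, row_reads = row_store_sum(rows, "adRevenue")
col_total, col_reads = column_store_sum(columns, "adRevenue")
print("row store:   ", row_total, "from", row_reads, "values read")
print("column store:", col_total, "from", col_reads, "values read")
```

Both layouts return the same answer, but the column layout reads one value per record instead of the whole record, and the gap widens as tables gain more attributes.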
Conclusion
MR has some good qualities: a good out-of-the-box experience, and the ability to work on data stored directly in the file system, which most database systems cannot do.
DBMSs have some good qualities: technologies and techniques for efficient parallel query execution, and the use of higher-level languages.
Parallel DBMSs excel at efficient querying of large data sets.
MR-style systems excel at complex analytics and ETL tasks.
Neither is good at what the other does well. Hence, the two technologies are complementary.
An ideal system would therefore be a “HYBRID” system.
HadoopDB, Hive, Aster, Greenplum, Cloudera, and Vertica all have commercially available products or prototypes in this “hybrid” category.