Presentation - UTPA Faculty Web

Presenter: Ran Ding





1. Introduction
2. Where the MR wins
3. DBMS “sweet spot” tests
4. Why the Parallel DBMS wins
5. Conclusion


The MapReduce (MR) paradigm has been hailed as
a revolutionary new platform for large-scale,
massively parallel data access.
Hadoop is one example.

Parallel DBMSs appeared in the mid-1980s. The
Teradata and Gamma projects pioneered a
new architectural paradigm based on a
cluster of commodity computers.

The rows of a relational table are distributed
across the nodes of the cluster so they can
be processed in parallel.


One benefit is that the system automatically manages
the various alternative partitioning strategies
for the tables involved in the query,
such as hash, range, and round-robin.
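The three partitioning strategies named above can be sketched in a few lines. This is only an illustration, not how any particular parallel DBMS implements them; the function names, the `id` key, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right


def hash_partition(rows, key, n_nodes):
    """Send each row to the node chosen by a stable hash of its key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # CRC32 is used here only because it is stable across runs.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[h % n_nodes].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Send each row to the node whose key range contains it.

    `boundaries` must be sorted; node 0 holds keys below boundaries[0],
    the last node holds keys at or above boundaries[-1].
    """
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical table of ten rows.
rows = [{"id": i, "name": f"user{i}"} for i in range(10)]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "id", [3, 7])
by_turn = round_robin_partition(rows, 3)
```

Hash partitioning keeps equal keys on the same node, which helps joins and grouping; range partitioning keeps neighbouring keys together; round-robin only balances load.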
It is not easy!
UDFs (user-defined functions) help.
Like GROUP BY in SQL.
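The slide does not say what is being compared to GROUP BY. One common reading is that the Reduce step of an MR job plays the role of a GROUP BY with an aggregate. The sketch below works under that assumption and computes the same per-key sum both ways. The `visits` data, `map_fn`, `reduce_fn`, and the table layout are hypothetical, and the in-process "shuffle" stands in for what a real MR framework does across machines.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input records: (url, revenue).
visits = [("a.com", 3), ("b.com", 5), ("a.com", 4), ("c.com", 1), ("b.com", 2)]


def map_fn(record):
    """Emit one (key, value) pair per input record."""
    url, revenue = record
    yield url, revenue


def reduce_fn(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)


def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return sorted(reduce_fn(k, vs) for k, vs in groups.items())


def run_sql(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT url, SUM(revenue) FROM visits GROUP BY url ORDER BY url"
    ).fetchall()
    conn.close()
    return rows


mr_result = run_mapreduce(visits)
sql_result = run_sql(visits)
```

Both paths group records by `url` and sum `revenue`; the difference is that the SQL version declares the grouping while the MR version spells it out in code.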





1. ETL and “read once” data sets
2. Complex analytics
3. Semi-structured data
4. Quick-and-dirty analyses
5. Limited-budget operations



ETL: an extract-transform-load system
An MR system can be considered a general-purpose parallel ETL system.
DBMSs may perform the ETL


Complex analytics often cannot be structured as a single
SQL aggregate query.
MR is a good candidate for this kind of work.



MR systems are good at processing data that
is being prepared for loading into a back-end
system.
In a DBMS, semi-structured data requires wide
tables with many attributes.
Plus, MR-style systems easily store and
process semi-structured data.


A DBMS needs the programmer to write the schema
and then load the data.
With MR, you just copy the data!

MR systems are mostly open source and free.

Parallel DBMSs: huge cost





1. Repetitive record parsing
2. Compression
3. Pipelining
4. Scheduling
5. Column-oriented storage


In MR, each Map and Reduce task must
repeatedly parse and convert string
fields into the appropriate type.
A DBMS parses records once, when the data
is initially loaded.
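The cost difference can be made concrete by counting parse calls. The sketch below is a toy model, not a benchmark: the record layout, `make_counting_parser`, and the two `*_style` functions are hypothetical names chosen for the example.

```python
# Hypothetical text records: id, date, amount.
raw_lines = ["1,2009-01-01,10.5", "2,2009-01-02,7.25", "3,2009-01-03,3.0"]


def make_counting_parser():
    """Return a parse function plus a dict that counts how often it ran."""
    stats = {"calls": 0}

    def parse(line):
        stats["calls"] += 1
        uid, date, amount = line.split(",")
        return int(uid), date, float(amount)  # string fields converted to types

    return parse, stats


def total_amount(records):
    return sum(amount for _, _, amount in records)


def mr_style(lines, n_jobs):
    """Every job starts from the raw text and parses it again."""
    parse, stats = make_counting_parser()
    results = [total_amount(parse(line) for line in lines) for _ in range(n_jobs)]
    return results, stats["calls"]


def dbms_style(lines, n_jobs):
    """Records are parsed once at load time; queries reuse the typed rows."""
    parse, stats = make_counting_parser()
    table = [parse(line) for line in lines]
    results = [total_amount(table) for _ in range(n_jobs)]
    return results, stats["calls"]


mr_results, mr_parses = mr_style(raw_lines, n_jobs=4)
db_results, db_parses = dbms_style(raw_lines, n_jobs=4)
```

With three records and four jobs, the MR-style path parses every record once per job, while the load-once path parses each record a single time.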


It is hard to say.
Commercial DBMSs may use carefully tuned
compression algorithms.



In a parallel DBMS, data is streamed from
producer to consumer;
the intermediate data is never written to disk.
In an MR system, the producer writes its result to local data
structures, and consumers read from them.
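The two data-flow styles can be contrasted in miniature. In this sketch a Python generator stands in for streaming between operators, and a temporary JSON-lines file stands in for the materialized intermediate result. The function names and file name are assumptions for the example, and neither path reflects the internals of a real system.

```python
import json
import os
import tempfile


def producer(n):
    """Produce records one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * i}


def consume(records):
    return sum(r["value"] for r in records)


def pipelined(n):
    """Streamed: the consumer pulls each record straight from the producer."""
    return consume(producer(n))


def materialized(n, directory):
    """Write-then-read: the producer's whole output hits storage first."""
    path = os.path.join(directory, "intermediate.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in producer(n):
            f.write(json.dumps(record) + "\n")
    with open(path, encoding="utf-8") as f:
        total = consume(json.loads(line) for line in f)
    return total, os.path.getsize(path)


streamed_total = pipelined(5)
with tempfile.TemporaryDirectory() as tmp:
    stored_total, bytes_written = materialized(5, tmp)
```

Both paths give the same answer; the materialized one pays for writing and re-reading the intermediate data, which in exchange makes it possible to restart a failed consumer without rerunning the producer.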


In a parallel DBMS, every node knows what it
should do.
In an MR system, tasks are scheduled on processing nodes
one storage block at a time.



Vertica is a column store: it
reads only the attributes necessary for
solving the user query.
DBMS-X and Hadoop are both row stores.
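Why a column store reads less can be shown with a toy table. The sketch below counts how many attribute values each layout touches to sum one attribute. The table contents and function names are hypothetical, and real column stores such as Vertica add compression and other techniques not modelled here.

```python
# Hypothetical table with four attributes per row.
rows = [
    {"id": 1, "url": "a.com", "revenue": 3.0, "agent": "x"},
    {"id": 2, "url": "b.com", "revenue": 5.0, "agent": "y"},
    {"id": 3, "url": "a.com", "revenue": 4.5, "agent": "z"},
]


def to_columns(table):
    """Re-lay the table out as one list of values per attribute."""
    columns = {}
    for row in table:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_store_sum(table, attr):
    """A row store reads every attribute of every row to reach one of them."""
    total, values_read = 0, 0
    for row in table:
        values_read += len(row)
        total += row[attr]
    return total, values_read


def column_store_sum(columns, attr):
    """A column store reads only the column the query asks for."""
    column = columns[attr]
    return sum(column), len(column)


row_total, row_reads = row_store_sum(rows, "revenue")
col_total, col_reads = column_store_sum(to_columns(rows), "revenue")
```

For three rows of four attributes, the row layout touches twelve values and the column layout touches three, and the gap widens as tables get wider.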

MR advocates should learn from parallel
DBMSs the technologies and techniques for
efficient parallel query execution.

MR systems are powerful tools for ETL-style
applications and for complex analytics. If the
application is query-intensive, whether
semi-structured or rigidly structured, then a DBMS
is probably the better choice.