Automatic Optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Re
August 2011
MapReduce is victorious
• Google statistics:
                          Aug 04    Mar 06    Sept 07    May 10
Number of jobs            29K       171K      2127K      4474K
Machine years used        217       2002      11081      39121
Input Data (TB)           3,288     52,254    403,152    946,460
Output Data (TB)          193       2,970     14,018     45,720
Average worker machines   157       268       394        368
• Hadoop statistics:
7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters¹
1. Omer Trajman, Cloudera VP, http://www.dbms2.com/
MapReduce in relational land
• Designers' original intention: free-form data
o Web-scale indexing/log processing
• But many relational workloads¹
o Complex queries/data analysis
• Caveat: MR performance lags RDBMS performance
1. Karmasphere corporation: A study of hadoop developers, http://karmasphere.com, 2010
Selection is Slower with MapReduce
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
Join is Even Slower
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
MR Lags in Relational Land
• Stonebraker, DeWitt:
“MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.”¹
• Query processing tasks
o No metadata, semantics, indices
o Free-form input is a double-edged sword
1. MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008
Manimal
• Manimal is a hybrid system, combining the MapReduce programming model and well-known execution techniques
• Techniques today only found in RDBMS, but should be in MapReduce, too.
Manimal Approach
[Figure: Manimal architecture. Components: user program (bytecode, *.class), Static Analyzer, optimization opportunities, Optimizer, execution path, Execution Framework, MR Engine.]
Example program logic:
void map(Text key, WebPage w) {
  if(w.rank > 10)
    emit(w.url,w.rank);
}
Example execution path: SELECTION from B+Tree index on W.RANK
• Challenges:
o Safely detect optimizations from query semantics
o How much performance gain?
Manimal Contributions
• Our Manimal system:
o Detect safe relational optimizations in users’
compiled MapReduce programs
• Our results:
o Runs with unmodified MapReduce code
o Runs up to 11x faster on same code
o Provides framework for more optimizations
Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
o Analyzer recall
o Performance gain
• Related Work and Conclusion
Execution framework
public void map(Text key, WebPage w,
OutputCollector<Text, LongWritable> out) {
if(w.rank > 10)
emit(w.url, w.rank);
}
Execution Framework
[Figure: three-stage pipeline: Analyzer → Optimizer → Execution. The Analyzer reads the compiled bytecode of map(), e.g.: varload ‘value’, invokevirtual, astore ‘text’, …, ifeq …]
Execution Framework
[Figure: pipeline Analyzer → Optimizer → Execution, Analyzer stage highlighted]
Analyzer in: user program
Analyzer out: optimization descriptor and index-generation program
Optimization descriptor: (SELECT f, w.rank>10)
Index-generation program:
void map(k, w) {
  out.set(indexedOutputFormat);
  emit(w.rank, (k,w));
}
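The index-generation idea above can be pictured with a small in-memory sketch. This is not Manimal's code: the class `RankIndexSketch`, the simplified `WebPage` type, and the use of `java.util.TreeMap` as a stand-in for an on-disk B+Tree are all assumptions made for illustration. It re-keys records by the selection field, then answers `w.rank > 10` with a range scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for the on-disk B+Tree index.
final class RankIndexSketch {
    // Simplified stand-in for the WebPage schema shown on the slides.
    static final class WebPage {
        final String url;
        final int rank;
        WebPage(String url, int rank) { this.url = url; this.rank = rank; }
    }

    // "Index-generation program": re-key every record by the selection field.
    static TreeMap<Integer, List<WebPage>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Answer "w.rank > threshold" with a range scan instead of a full scan.
    static List<String> selectRankGreaterThan(
            TreeMap<Integer, List<WebPage>> index, int threshold) {
        List<String> urls = new ArrayList<>();
        // tailMap(threshold, false) visits only keys strictly above the threshold.
        for (List<WebPage> bucket : index.tailMap(threshold, false).values()) {
            for (WebPage w : bucket) {
                urls.add(w.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<WebPage> pages = new ArrayList<>();
        pages.add(new WebPage("a.example", 5));
        pages.add(new WebPage("b.example", 42));
        System.out.println(selectRankGreaterThan(buildIndex(pages), 10));
    }
}
```

Records with rank at or below the threshold are never touched, which is where a selection speedup would come from when the predicate is selective.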
Execution Framework
[Figure: pipeline Analyzer → Optimizer → Execution, Optimizer stage highlighted]
Optimizer in: optimization descriptor and catalog
Optimization descriptor: (SELECT f, w.rank>10)
Catalog:
/logs/log.1   /logs/log.1.idx   select src…
/logs/log.2   /logs/log.2.idx   select src…
Optimizer out: execution descriptor
Execution descriptor: (SELECT,“log.1.idx”, w.rank>10)
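One way to picture the optimizer step is a catalog lookup that turns an optimization descriptor into an execution descriptor. The sketch below is hypothetical throughout: the class `OptimizerSketch`, the catalog key layout, and the `(SCAN, …)` fallback descriptor are illustrative assumptions and are not taken from Manimal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the optimizer step: descriptor + catalog -> execution descriptor.
final class OptimizerSketch {
    // Assumed catalog layout: "inputPath#field" -> path of an index built on that field.
    private final Map<String, String> catalog = new HashMap<>();

    void registerIndex(String inputPath, String field, String indexPath) {
        catalog.put(inputPath + "#" + field, indexPath);
    }

    // Plan a SELECT on `field` with `predicate` over one input file.
    String plan(String inputPath, String field, String predicate) {
        String indexPath = catalog.get(inputPath + "#" + field);
        if (indexPath == null) {
            // No usable index: fall back to running the unmodified program.
            return "(SCAN,\"" + inputPath + "\")";
        }
        return "(SELECT,\"" + indexPath + "\"," + predicate + ")";
    }

    public static void main(String[] args) {
        OptimizerSketch optimizer = new OptimizerSketch();
        optimizer.registerIndex("/logs/log.1", "w.rank", "/logs/log.1.idx");
        System.out.println(optimizer.plan("/logs/log.1", "w.rank", "w.rank>10"));
        System.out.println(optimizer.plan("/logs/log.3", "w.rank", "w.rank>10"));
    }
}
```

The fallback branch reflects the safety requirement stated elsewhere in the talk: when no matching index exists, the program should still produce the same output, only without the speedup.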
Execution Framework
[Figure: pipeline Analyzer → Optimizer → Execution, Execution stage highlighted]
Execution in: execution descriptor and user program
Execution descriptor: (SELECT,“log.1.idx”, w.rank>10)
Execution out: program output, e.g. numwords 19519
An Optimization Example
//webpage.java: SCHEMA!
class WebPage { String url; int rank; String content; }
//mapper.java
void map(Text key, WebPage w) {
  if (w.url.equals("teaparty.fr"))
    emit(w.url, 1);
}
PROJECTED view: (url,null,null)
DIRECT-OP on compressed WebPage
• Data-centric programming idioms == relational ops
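The projected view and direct operation on compressed data can be sketched together. In this illustration, only the `url` column is stored (the projection), and it is dictionary-encoded so the equality test compares integer codes without decompressing anything. The class `DirectOpSketch` and the choice of dictionary encoding are assumptions; the slides do not say which compression scheme Manimal uses for this case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: projected url column, dictionary-encoded, queried without decoding.
final class DirectOpSketch {
    private final Map<String, Integer> codes = new HashMap<>();   // url -> small integer code
    private final List<Integer> encodedUrls = new ArrayList<>();  // projected view: url column only

    // Store one record's url; rank and content are dropped by the projection.
    void add(String url) {
        Integer code = codes.get(url);
        if (code == null) {
            code = codes.size();
            codes.put(url, code);
        }
        encodedUrls.add(code);
    }

    // Count records with url equal to target by comparing codes, never strings.
    int countEquals(String target) {
        Integer code = codes.get(target);
        if (code == null) {
            return 0; // target never seen, so no record can match
        }
        int matches = 0;
        for (int c : encodedUrls) {
            if (c == code) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        DirectOpSketch view = new DirectOpSketch();
        view.add("teaparty.fr");
        view.add("example.org");
        view.add("teaparty.fr");
        System.out.println(view.countEquals("teaparty.fr"));
    }
}
```

The target string is encoded once, after which the scan compares integers only. This works for equality tests because the encoding maps equal strings to equal codes.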
Semantic Extraction
• Query semantics are obvious to human readers, but not explicit in the code for the framework
• EXTRACT IT!
o Static code analysis
o Control-flow graph and data-flow graph
o Find opportunities: selection, projection, direct op
o Safe optimizations: same output
Analyzer: An Example
//webpage.java
class WebPage { String url; int rank; String content; }
//mapper.java
void map(Text key, WebPage w) {
  if (w.rank > 10)
    emit(w.url,w.rank);
}
[Figure: control-flow graph produced by the Analyzer: Fn Entry → w.rank > 10 → emit(url,rank) → Fn Exit]
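The detection step in this example can be approximated with a toy analysis. Manimal works on compiled bytecode and its control-flow and data-flow graphs; the sketch below instead uses a hypothetical, hand-built statement tree (`Emit`, `If`, `Other`) and checks only one thing: whether every emit sits under a single top-level guard. All names are illustrative assumptions, and a real analyzer must also rule out side effects.

```java
import java.util.Arrays;
import java.util.List;

// Toy stand-in for the analyzer: finds a guard that every emit depends on.
final class AnalyzerSketch {
    interface Stmt {}
    static final class Emit implements Stmt {}
    static final class Other implements Stmt {} // assumed side-effect free, no emit
    static final class If implements Stmt {
        final String cond;
        final List<Stmt> body;
        If(String cond, Stmt... body) {
            this.cond = cond;
            this.body = Arrays.asList(body);
        }
    }

    // Returns the guard condition if all emits sit under exactly one
    // top-level if; returns null when no safe selection is detected.
    static String selectionPredicate(List<Stmt> mapBody) {
        String found = null;
        for (Stmt s : mapBody) {
            if (s instanceof Emit) {
                return null; // unguarded emit: every record may produce output
            }
            if (s instanceof If) {
                If branch = (If) s;
                if (!containsEmit(branch.body)) {
                    continue;
                }
                if (found != null) {
                    return null; // two emitting branches: give up conservatively
                }
                found = branch.cond;
            }
        }
        return found;
    }

    static boolean containsEmit(List<Stmt> stmts) {
        for (Stmt s : stmts) {
            if (s instanceof Emit) return true;
            if (s instanceof If && containsEmit(((If) s).body)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Stmt> body = Arrays.<Stmt>asList(new Other(), new If("w.rank > 10", new Emit()));
        System.out.println(selectionPredicate(body));
    }
}
```

Returning null on anything unusual mirrors the "safe optimizations: same output" requirement: a missed opportunity costs only performance, while a wrong detection would change results.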
Current Optimizations
• B+-Tree for selections
• Projected views
• Delta compression on numerics
• Direct operation on compressed data
• Hadoop compression is not semantics-aware
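As an illustration of delta compression on numerics, the sketch below stores each value as its difference from the previous one. The class name `DeltaSketch` and the `long[]` representation are assumptions; the slides do not describe Manimal's actual encoding. Note that the space saving only materializes if the small deltas are then written with a variable-length integer encoding, which is omitted here.

```java
import java.util.Arrays;

// Hypothetical sketch of delta encoding for a numeric column.
final class DeltaSketch {
    // Replace each value with its difference from the previous value.
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum over the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] timestamps = {1000, 1003, 1004, 1010};
        long[] deltas = encode(timestamps);
        System.out.println(Arrays.toString(deltas));
        System.out.println(Arrays.toString(decode(deltas)));
    }
}
```

This kind of encoding depends on knowing that a field is numeric, which is the sense in which generic byte-level compression is not semantics-aware.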
Experiments: Analyzer
• Test MapReduce programs from Pavlo, SIGMOD ‘09:
• Detected 5 out of 8 opportunities:
o Two misses due to custom serialization class
o Another miss requires knowledge of
java.util.Hashtable semantics
Experiments: Performance
• Optimize four Web page handling tasks:
o Selection (filtering)
o Projection (aggregation on subfield of page)
o Join (pages to user visits)
o User Defined Functions (aggregation)
• 5 cluster nodes, 123GB of data
Experiments: Performance
Description   Hadoop    Manimal   Speedup   Space Overhead
Selection     430 s     38 s      11.2      0.1%
Projection    5496 s    1856 s    2.96      20%
Join          6078 s    904 s     6.73      11.7%
• Up to 11x speedup over original Hadoop
• Performance comparable to DBMS-X from Pavlo
• UDF not detected: running time identical
Related Work
• Lots of recent MapReduce activity
o Quincy: Task scheduling (Isard et al, SOSP, 2009)
o HadoopDB (Abouzeid et al, PVLDB 2009)
o Hadoop++ (Dittrich et al, PVLDB 2010)
o HaLoop (Bu et al, PVLDB 2010)
o Twister (Ekanayake et al, HPDC 2010)
o Starfish (Herodotou et al, CIDR 2011)
• Manimal does not introduce new optimizations. It detects and applies existing optimizations to code
Lessons Learned
• The Good: We can recognize data-processing idioms in real code. Relational operations still exist even in the NoSQL world
• The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin)
Conclusion
• Manimal provides a framework for applying well-known optimization techniques to MapReduce
o Automatic optimization of user code
o Up to 11x speed increase
o Provides framework for more optimizations