Automatic Optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Re
August 2011

MapReduce is victorious
• Google statistics:

                              Aug 04    Mar 06    Sept 07    May 10
    Number of jobs            29K       171K      2127K      4474K
    Machine years used        217       2002      11081      39121
    Input data (TB)           3,288     52,254    403,152    946,460
    Output data (TB)          193       2,970     14,018     45,720
    Average worker machines   157       268       394        368

• Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]
[1] Omer Trajman, Cloudera VP, http://www.dbms2.com/

MapReduce in relational land
• Designers' original intention: free-form data
  o Web-scale indexing/log processing
• But many workloads are relational [1]
  o Complex queries/data analysis
• Caveat: MR performance lags RDBMS performance
[1] Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010

Selection is Slower with MapReduce
[Figure] Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

Join is Even Slower
[Figure] Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

MR Lags in Relational Land
• Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
• Query processing tasks
  o No metadata, semantics, indices
  o Free-form input is a double-edged sword
[1] MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008

Manimal
• Manimal is a hybrid system that combines the MapReduce programming model with well-known execution techniques
• These techniques are found only in RDBMSs today, but should be in MapReduce, too

Manimal Approach
[Diagram labels: MR Engine; bytecode (*.class); Static Analyzer; optimization opportunities; Optimizer logic; execution path; Execution Framework]
• Example user program:
    void map(Text key, WebPage w) {
        if (w.rank > 10) emit(w.url, w.rank);
    }
• Resulting execution path: SELECTION from B+Tree index on W.RANK
• Challenges:
  o Safely detect query-semantic optimizations
  o How much performance gain?
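
The Manimal Approach slide replaces a full scan with a SELECTION from a B+Tree index on W.RANK. The following is a minimal sketch of why that is safe, not Manimal's actual implementation: it assumes a java.util.TreeMap as an in-memory stand-in for the B+Tree, and the names WebPage (as a two-field class), SelectionIndexSketch, scan, buildIndex, and indexedSelect are hypothetical. Both paths should emit the same records; only the order and the amount of data touched differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for the deck's WebPage record (content field omitted).
final class WebPage {
    final String url;
    final int rank;

    WebPage(String url, int rank) {
        this.url = url;
        this.rank = rank;
    }
}

final class SelectionIndexSketch {
    // Baseline: what the unmodified map() does, visiting every record.
    static List<String> scan(List<WebPage> pages, int minRank) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > minRank) {
                out.add(w.url + "\t" + w.rank);
            }
        }
        return out;
    }

    // Assumption: a sorted map keyed on rank plays the role of the B+Tree index.
    static TreeMap<Integer, List<WebPage>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Indexed path: only entries with rank strictly greater than minRank are visited.
    static List<String> indexedSelect(TreeMap<Integer, List<WebPage>> index, int minRank) {
        List<String> out = new ArrayList<>();
        for (List<WebPage> bucket : index.tailMap(minRank, false).values()) {
            for (WebPage w : bucket) {
                out.add(w.url + "\t" + w.rank);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<WebPage> pages = Arrays.asList(
                new WebPage("a.example", 5),
                new WebPage("b.example", 11),
                new WebPage("c.example", 42));
        System.out.println(scan(pages, 10));
        System.out.println(indexedSelect(buildIndex(pages), 10));
    }
}
```

The exclusive lower bound in tailMap(minRank, false) mirrors the strict comparison w.rank > 10; an inclusive bound would change the program's output and so would not be a safe optimization.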
Manimal Contributions
• Our Manimal system:
  o Detects safe relational optimizations in users' compiled MapReduce programs
• Our results:
  o Runs with unmodified MapReduce code
  o Runs up to 11x faster on the same code
  o Provides a framework for more optimizations

Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
  o Analyzer recall
  o Performance gain
• Related Work and Conclusion

Execution Framework
    public void map(Text key, WebPage w, OutputCollector<Text, LongWritable> out) {
        if (w.rank > 10) emit(w.url, w.rank);
    }

Execution Framework
[Diagram: Analyzer -> Optimizer -> Execution. The Analyzer reads the compiled program: varload 'value', invokevirtual, astore 'text', …, ifeq, …]

Execution Framework
Analyzer in: user program
Analyzer out: optimization descriptor, index-generation program
• Optimization descriptor: (SELECT f, w.rank>10)
• Index-generation program:
    void map(k, w) {
        out.set(indexedOutputFormat);
        emit(w.rank, (k, w));
    }

Execution Framework
Optimizer in: optimization descriptor, catalog
Optimizer out: execution descriptor
• Catalog:
    /logs/log.1    /logs/log.1.idx    select src…
    /logs/log.2    /logs/log.2.idx    select src…
• (SELECT f, w.rank>10) -> Optimizer -> (SELECT, "log.1.idx", w.rank>10)

Execution Framework
Execution in: execution descriptor, user program
Execution out: program output
• (SELECT, "log.1.idx", w.rank>10) -> Execution -> numwords 19519

An Optimization Example
    // webpage.java: SCHEMA!
    class WebPage { String url; int rank; String content; }

    // mapper.java
    void map(Text key, WebPage w) {
        if (w.url.equals("teaparty.fr")) emit(w.url, 1);
    }
• PROJECTED view: (url, null, null)
• DIRECT-OP on compressed WebPage
• Data-centric programming idioms == relational ops

Semantic Extraction
• Query semantics are obvious to human readers, but not explicit in the code for the framework
• EXTRACT IT!
  o Static code analysis
  o Control-flow graph and data-flow graph
  o Find opportunities: selection, projection, direct op
  o Safe optimizations: same output

Analyzer: An Example
    // webpage.java
    class WebPage { String url; int rank; String content; }

    // mapper.java
    void map(Text key, WebPage w) {
        if (w.rank > 10) emit(w.url, w.rank);
    }
[Control-flow graph built by the Analyzer: Fn Entry -> w.rank > 10 -> emit(url, rank) -> Fn Exit]

Current Optimizations
• B+Tree for selections
• Projected views
• Delta compression on numerics
• Direct operation on compressed data
• Hadoop compression is not semantics-aware

Experiments: Analyzer
• Tested MapReduce programs from Pavlo et al., SIGMOD '09
• Detected 5 out of 8 opportunities:
  o Two misses due to a custom serialization class
  o Another miss requires knowledge of java.util.Hashtable semantics

Experiments: Performance
• Optimize four Web page handling tasks:
  o Selection (filtering)
  o Projection (aggregation on subfield of page)
  o Join (pages to user visits)
  o User Defined Functions (aggregation)
• 5 cluster nodes, 123 GB of data

Experiments: Performance

    Description   Hadoop    Manimal   Speedup   Space Overhead
    Selection     430 s     38 s      11.2      0.1%
    Projection    5496 s    1856 s    2.96      20%
    Join          6078 s    904 s     6.73      11.7%

• Up to 11x speedup over original Hadoop
• Performance comparable to DBMS-X from Pavlo et al.
• UDF not detected: running time identical

Related Work
• Lots of recent MapReduce activity
  o Quincy: task scheduling (Isard et al., SOSP 2009)
  o HadoopDB (Abouzeid et al., PVLDB 2009)
  o Hadoop++ (Dittrich et al., PVLDB 2010)
  o HaLoop (Bu et al., PVLDB 2010)
  o Twister (Ekanayake et al., HPDC 2010)
  o Starfish (Herodotou et al., CIDR 2011)
• Manimal does not introduce new optimizations. It detects existing optimizations and applies them to user code

Lessons Learned
• The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world
• The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin)

Conclusion
• Manimal provides a framework for applying well-known optimization techniques to MapReduce
  o Automatic optimization of user code
  o Up to 11x speed increase
  o Provides a framework for more optimizations
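
Backup: delta compression on numerics
The Current Optimizations slide lists delta compression on numerics without showing it. A minimal sketch of the general technique follows, assuming each value in a numeric column is stored as its difference from the previous value; the class and method names are hypothetical, and the deck does not say how Manimal lays out its deltas on disk.

```java
final class DeltaCompressionSketch {
    // Replace each value with its difference from the previous one.
    // The first value is kept as a difference from 0, i.e. unchanged.
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum over the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            values[i] = prev;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] ranks = {100, 103, 107, 107, 120};
        System.out.println(java.util.Arrays.toString(encode(ranks)));
        System.out.println(java.util.Arrays.toString(decode(encode(ranks))));
    }
}
```

The sketch only shows the transform. Any space saving would come from a later step, not shown here, that stores the small deltas in fewer bytes than the original values.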