Advanced Topics on MapReduce with Hadoop
Jiaheng Lu, Department of Computer Science, Renmin University of China
www.jiahenglu.net

Outline
- Brief review
- Chaining MapReduce jobs
- Joins in MapReduce
- Bloom filters

Brief Review
MapReduce is a parallel programming framework built on divide and merge: the input data is divided into splits (split0, split1, split2, ...), each split is processed by a map task, the map output is shuffled to the reducers, and each reduce task writes one output file (output0, output1, ...).

Chaining MapReduce Jobs
- Chaining jobs in a sequence
- Chaining jobs with complex dependencies
- Chaining preprocessing and postprocessing steps

Chaining in a sequence
Simple and straightforward: the jobs run as [MAP | REDUCE]+, and a single job can take the form MAP+ | REDUCE | MAP*. The output of each job is the input to the next, similar to Unix pipes: Job1 | Job2 | Job3.

```java
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

// Add the first mapper in the chain, with its own local configuration.
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class, // input key/value classes
    Text.class, Text.class,         // output key/value classes
    true, map1Conf);                // pass by value, mapper-local conf
```

Chaining with complex dependency
When jobs are not chained in a linear fashion, use the addDependingJob() method to add dependency information: x.addDependingJob(y) means job x will not start until job y has finished.

Chaining preprocessing and postprocessing steps
Example: removing stop words in information retrieval.
Approaches:
- Run each step as a separate job: inefficient.
- Chain the steps into a single job with ChainMapper.addMapper() and ChainReducer.setReducer(), giving a job of the form MAP+ | REDUCE | MAP*.

Joins in MapReduce
- Reduce-side join
- Broadcast join
- Map-side filtering combined with a reduce-side join, where map-side records are filtered by a given key, by a key range from a broadcast dataset, or by a Bloom filter

Reduce-side join
- map(): the output key is the join key; the output value is the record, tagged with its data source.
- reduce(): for each join key, compute the full cross-product of the values from the different sources and output the combined records.

Example: join table X(a, b) with table Y(a, c) on column a.

Table X:
| a | b  |
|---|----|
| 1 | ab |
| 1 | cd |
| 4 | ef |

Table Y:
| a | c |
|---|---|
| 1 | b |
| 2 | d |
| 4 | c |

map() emits each row keyed by the join column and tagged with its source: (1, (x, ab)), (1, (x, cd)), (4, (x, ef)) from X, and (1, (y, b)), (2, (y, d)), (4, (y, c)) from Y. After the shuffle, reduce() sees the combined value list for each key, e.g. key 1 -> [(x, ab), (x, cd), (y, b)], and emits the cross-product of the x-tagged and y-tagged values.
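The reducer's cross-product step can be sketched outside of Hadoop for clarity. This is a minimal illustration, not Hadoop API: the names `Tagged` and `joinOneKey` are hypothetical, and the reducer here is modeled as a plain method over the value list one key receives after the shuffle.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a reduce-side join's core step: the reducer receives all
// values sharing one join key, each tagged with its source table, and
// emits the cross-product of the x-rows and y-rows for that key.
class ReduceSideJoin {
    // A value tagged with its data source ("x" or "y"), as the mapper emits it.
    record Tagged(String source, String value) {}

    // Cross-product of the values from table X with the values from table Y
    // for a single join key. A key that appears in only one table produces
    // no output, giving inner-join semantics.
    static List<String> joinOneKey(int key, List<Tagged> values) {
        List<String> fromX = new ArrayList<>();
        List<String> fromY = new ArrayList<>();
        for (Tagged t : values) {
            if (t.source().equals("x")) fromX.add(t.value());
            else fromY.add(t.value());
        }
        List<String> out = new ArrayList<>();
        for (String b : fromX)
            for (String c : fromY)
                out.add(key + "," + b + "," + c);
        return out;
    }
}
```

In a real job the tagging happens in map() (for example by prefixing the value with the source table's name), and the cross-product runs inside Reducer.reduce() over the iterator of values for each key.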
Joined output of the example (X and Y joined on column a):

| a | b  | c |
|---|----|---|
| 1 | ab | b |
| 1 | cd | b |
| 4 | ef | c |

Key 2 appears only in table Y, so it produces no output row (inner-join semantics).

Broadcast join (replicated join)
- Broadcast the smaller table to every mapper and perform the join in map().
- Use the distributed cache: DistributedCache.addCacheFile().

Map-side filtering and reduce-side join
- Example: the join key is a set of student IDs drawn from an info table.
- Generate an IDs file from the info table, broadcast it, and filter records in map() before the reduce-side join.
- What if the IDs file cannot be stored in memory? Use a Bloom filter.

Bloom Filters
- Introduction
- Implementation of a Bloom filter
- Use in a MapReduce join

Introduction to Bloom filters
A Bloom filter is a space-efficient, constant-size data structure for testing set membership. It supports two operations, add() and contains(). It produces no false negatives and has a small probability of false positives.

Implementation of a Bloom filter
Use a bit array and k hash functions.
- To add an element: generate k indexes from the element and set those k bits to 1.
- To test an element: generate the k indexes; if all k bits are 1, return true; if any bit is 0, return false.

Example (a 10-bit array; each row shows the array after the operation):

| Step | Operation                    | Bit array (indexes 0-9) | Result |
|------|------------------------------|-------------------------|--------|
| 1    | initial state                | 0000000000              |        |
| 2    | add(x), indexes 0, 2, 6      | 1010001000              |        |
| 3    | add(y), indexes 0, 3, 9      | 1011001001              |        |
| 4    | contains(m), indexes 1, 3, 9 | 1011001001              | bit 1 is 0 -> false |
| 5    | contains(n), indexes 0, 2, 9 | 1011001001              | all three bits are 1 -> true, a false positive: n was never added |

Use in a MapReduce join
- Run a separate sub-job to build a Bloom filter over the join keys of the smaller dataset.
- Broadcast the Bloom filter and test it in the map() of the join job, dropping records whose key cannot match.
- Perform the actual join in reduce().

References
- Chuck Lam, "Hadoop in Action"
- Jairam Chandar, "Join Algorithms using Map/Reduce"

THANK YOU
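Appendix: the add()/contains() behavior of a Bloom filter can be sketched as a minimal, self-contained class. This is an illustrative sketch only, not Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter; the class name SimpleBloomFilter and the double-hashing scheme for deriving the k indexes are assumptions made for the example.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: an m-bit array plus k hash functions
// derived from the element's hashCode via double hashing.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // i-th index for an element: h1 + i*h2 mod m, with h2 forced odd
    // so the k probes do not all collapse onto one bit.
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16 | h1 << 16) | 1;
        return Math.floorMod(h1 + i * h2, m);
    }

    // add(): generate k indexes and set those k bits to 1.
    void add(String element) {
        for (int i = 0; i < k; i++) bits.set(index(element, i));
    }

    // contains(): true only if all k bits are 1. May return true for an
    // element never added (false positive), but never returns false for
    // an element that was added (no false negatives).
    boolean contains(String element) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(element, i))) return false;
        return true;
    }
}
```

In the join described above, the sub-job would add() every join key of the smaller dataset, the filter's bits would be broadcast via the distributed cache, and each mapper of the join job would call contains() to drop records early; false positives only cost some extra shuffled records, which the reducer discards anyway.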