Beyond map/reduce functions:
partitioner, combiner and parameter configuration

Gang Luo
Sept. 9, 2010

Partitioner
• Determines which reducer/partition a record should go to
• Given the key, the value and the number of partitions, it returns an integer
  – Partition: (K2, V2, #Partitions) → integer

Partitioner Interface:

  public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
  }

Implementation:

  public class MyPartitioner<K2, V2> implements Partitioner<K2, V2> {
    public void configure(JobConf job) {}  // required by JobConfigurable
    public int getPartition(K2 key, V2 value, int numPartitions) {
      // your logic!
    }
  }

Partitioner Example:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class MyPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}  // no setup needed here
    public int getPartition(Text key, Text value, int numPartitions) {
      // mask the sign bit so the index is never negative
      int hashCode = key.hashCode() & Integer.MAX_VALUE;
      int partitionIndex = hashCode % numPartitions;
      return partitionIndex;
    }
  }

(a driver that registers this partitioner is sketched at the end of these notes)

Combiner
• Reduces the amount of intermediate data before it is sent to the reducers
• Pre-aggregation
• The interface is exactly the same as the reducer's

Combiner Example:

  public class MyCombiner extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // your logic
    }
  }

• The input key/value types should be the same as the map output key/value types
• The output key/value types should be the same as the reducer input key/value types

Parameter
• Cluster-level parameters (e.g. the HDFS block size)
• Job-specific parameters (e.g. the number of reducers, the map output buffer size)
  – Configurable; important for job performance
• User-defined parameters
  – Used to pass information from the driver to the mapper/reducer
  – Help make your mapper/reducer more generic

Parameter

  JobConf conf = new JobConf(Driver.class);

  conf.setNumReduceTasks(10);      // set the number of reducers with a
                                   // built-in setter
  conf.set("io.sort.mb", "200");   // set the size of the map output buffer
                                   // by the name of that parameter
  conf.set("delimiter", "\t");     // set a user-defined parameter

  int numReducers = conf.getNumReduceTasks();       // get the value of a parameter
                                                    // with a built-in getter
  String buffSize = conf.get("io.sort.mb", "200");  // get the value of a parameter by
                                                    // its name (the second argument is
                                                    // the default if it is not set)
  String delimiter = conf.get("delimiter", "\t");   // get the value of a user-defined
                                                    // parameter, again with a default

(how a task reads such a user-defined parameter is sketched at the end of these notes)

Parameter
• There are some built-in parameters managed by Hadoop. We are not supposed to change them, but we can read them:
  – String inputFile = jobConf.get("map.input.file");
  – Gets the path of the file the current mapper is reading
  – Useful when joining datasets
• For the new API, read the path from the input split instead (see the sketch at the end of these notes)

More about Hadoop
• Identity Mapper/Reducer
  – Output == input; no modification
• Why would we need map/reduce functions without any logic in them?
  – Sorting!
  – More generally, when you only want the basic functionality Hadoop already provides (e.g. sorting/grouping); a sorting job built entirely from identity classes is sketched at the end of these notes

More about Hadoop
• How is the number of splits determined?
  – If a file is large enough and splittable, it is split into multiple pieces (split size = block size)
  – If a file is not splittable, there is only one split
  – If a file is small (smaller than one block), there is one split per file, unless…

More about Hadoop
• CombineFileInputFormat
  – Merges multiple small files into one split, which is then processed by one mapper
  – Saves mapper slots and reduces overhead
• Other options for handling small files?
  – hadoop fs -getmerge src dest (concatenates the files under src into a single local file)
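The sketches below pull the pieces above together. First, a driver: a minimal sketch (all class names here are placeholders, not Hadoop APIs) that registers the MyPartitioner and MyCombiner templates from the earlier slides, assumes a hypothetical MyMapper emitting <Text, Text> pairs (sketched just after this), and sets the three kinds of parameters discussed.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class Driver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(Driver.class);

      conf.setMapperClass(MyMapper.class);            // hypothetical mapper, sketched below
      conf.setCombinerClass(MyCombiner.class);        // the combiner template from the slides
      conf.setReducerClass(MyCombiner.class);         // a combiner has the reducer interface,
                                                      // so it can serve as the reducer too
      conf.setPartitionerClass(MyPartitioner.class);  // the partitioner from the slides

      conf.setOutputKeyClass(Text.class);             // also the map output key type here
      conf.setOutputValueClass(Text.class);           // also the map output value type here

      conf.setNumReduceTasks(10);                     // job-specific, built-in setter
      conf.set("io.sort.mb", "200");                  // job-specific, by parameter name
      conf.set("delimiter", "\t");                    // user-defined parameter

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

Reusing one class as both combiner and reducer is only safe when the reduce logic is associative and commutative (e.g. sums or counts); otherwise register a separate reducer.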
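The driver sets "delimiter" in the job configuration, but the value is consumed on the task side. A minimal sketch of that mapper, assuming the old API: the JobConf is handed to each task through configure(), which is the usual place to cache user-defined parameters. MyMapper and its field names are hypothetical.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private String delimiter;

    @Override
    public void configure(JobConf conf) {
      // read the user-defined parameter set in the driver;
      // fall back to a tab if it was never set
      delimiter = conf.get("delimiter", "\t");
    }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // split each line at the first occurrence of the delimiter
      // (note: the delimiter is treated as a regular expression)
      String[] fields = value.toString().split(delimiter, 2);
      if (fields.length == 2) {
        output.collect(new Text(fields[0]), new Text(fields[1]));
      }
    }
  }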
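For the built-in "map.input.file" parameter, the new (org.apache.hadoop.mapreduce) API reads the path from the input split rather than from the configuration. A sketch, assuming the job uses a FileInputFormat so the split can be cast to FileSplit; TaggingMapper is a hypothetical name.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String inputFile;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // the input split carries the path of the file this mapper is reading
      inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // tag every record with the file it came from, e.g. for a reduce-side join
      context.write(new Text(inputFile), value);
    }
  }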
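Finally, the sorting use case for identity classes: a driver sketch, assuming tab-separated text input, with no user logic at all; the shuffle's sort-by-key does the work. SortDriver is a hypothetical name; IdentityMapper, IdentityReducer and KeyValueTextInputFormat ship with Hadoop's old API.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.KeyValueTextInputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class SortDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SortDriver.class);

      // everything before the first tab is the key, the rest is the value
      conf.setInputFormat(KeyValueTextInputFormat.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);

      // no logic of our own: records pass through unchanged,
      // and the framework sorts them by key in the shuffle
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);

      // with one reducer the output is a single globally sorted file
      conf.setNumReduceTasks(1);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

With several reducers each output file is sorted but their key ranges overlap under hash partitioning; a range partitioner such as TotalOrderPartitioner is needed for a globally sorted result across files.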