Beyond map/reduce functions
partitioner, combiner and parameter configuration
Gang Luo
Sept. 9, 2010
Partitioner
• Determines which reducer/partition a
record should go to
• Given the key, the value and the number of
partitions, it returns an integer
– Partition: (K2, V2, #Partitions) → integer
Partitioner
Interface:
public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numPartitions);
}
Implementation:
public class MyPartitioner<K2, V2> implements Partitioner<K2, V2> {
  public void configure(JobConf job) {
    // inherited from JobConfigurable; read job settings here if needed
  }

  public int getPartition(K2 key, V2 value, int numPartitions) {
    // your logic! must return a value in [0, numPartitions)
    return 0;
  }
}
Partitioner
Example:
public class MyPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) { }

  public int getPartition(Text key, Text value, int numPartitions) {
    // mask the sign bit so a negative hashCode cannot yield a negative index
    int hashCode = key.hashCode() & Integer.MAX_VALUE;
    return hashCode % numPartitions;
  }
}
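To activate a custom partitioner, register it on the JobConf in the driver. A minimal sketch (if no partitioner is set, Hadoop falls back to the default HashPartitioner):

JobConf conf = new JobConf(Driver.class);
conf.setPartitionerClass(MyPartitioner.class); // route map output with our logic
conf.setNumReduceTasks(4); // becomes the numPartitions passed to getPartition()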
Combiner
• Reduces the amount of intermediate data
before it is sent to the reducers.
• Pre-aggregation
• The interface is exactly the same as the
reducer's.
Combiner
Example:
public class MyCombiner extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
  // input key/value types must be the same as the map output key/value
  // types; output key/value types must be the same as the reducer input
  // key/value types
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // your logic
  }
}
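The combiner is also registered in the driver. A minimal sketch (note that Hadoop may apply the combiner zero, one, or several times to a key's values, so the logic must be safe to apply repeatedly, e.g. sum, max, or min):

JobConf conf = new JobConf(Driver.class);
conf.setCombinerClass(MyCombiner.class); // pre-aggregate map output before the shuffle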
Parameter
• Cluster-level parameters (e.g. HDFS block
size)
• Job-specific parameters (e.g. number of
reducers, map output buffer size)
– Configurable. Important for job performance
• User-defined parameters
– Used to pass information from the driver to
the mapper/reducer.
– Help make your mapper/reducer more
generic
Parameter
JobConf conf = new JobConf(Driver.class);
conf.setNumReduceTasks(10); // set the number of reducers via a
                            // built-in setter
conf.set("io.sort.mb", "200"); // set the size of the map output buffer by the
                               // name of that parameter
conf.set("delimiter", "\t"); // set a user-defined parameter

int numReducers = conf.getNumReduceTasks(); // get the value of a parameter via a
                                            // built-in getter
String buffSize = conf.get("io.sort.mb", "200"); // get the value of a parameter by its
                                                 // name (second argument is the default)
String delimiter = conf.get("delimiter", "\t"); // get the value of a user-defined
                                                // parameter (with a default)
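On the task side, a mapper or reducer can read these values in configure(). A minimal sketch (the field name and the splitting logic are illustrative):

public class MyMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
  private String delimiter;

  public void configure(JobConf job) {
    // read the user-defined parameter set by the driver
    delimiter = job.get("delimiter", "\t");
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = value.toString().split(delimiter);
    output.collect(new Text(fields[0]), new Text(fields[1]));
  }
}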
Parameter
• There are some built-in parameters
managed by Hadoop. We are not
supposed to change them, but we can read
them
– String inputFile =
jobConf.get("map.input.file");
– Gets the path of the file the current map
task is reading
– Useful when joining datasets (see the
sketch below)
*For the new API, get the path from the task's input split instead:
((FileSplit) context.getInputSplit()).getPath()
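For example, in a reduce-side join the mapper can tag each record with its source dataset. A rough sketch (old API; the file-name test and field layout are illustrative):

public class JoinMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
  private String tag;

  public void configure(JobConf job) {
    // decide which dataset this map task is reading
    String inputFile = job.get("map.input.file");
    tag = inputFile.contains("orders") ? "O" : "C";
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = value.toString().split("\t");
    // emit the join key; the reducer separates values by their tag
    output.collect(new Text(fields[0]), new Text(tag + ":" + fields[1]));
  }
}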
More about Hadoop
• Identity Mapper/Reducer
– Output == input. No modification
• Why do we need map/reduce functions
without any logic in them?
– Sorting! (see the driver sketch below)
– More generally, when you only want to use
the basic functionality provided by Hadoop
(e.g. sorting/grouping)
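A minimal driver sketch for such a sort job (old API; SortDriver and the paths are placeholders):

JobConf conf = new JobConf(SortDriver.class);
conf.setInputFormat(KeyValueTextInputFormat.class); // Text/Text input pairs
conf.setMapperClass(IdentityMapper.class);   // pass records through unchanged
conf.setReducerClass(IdentityReducer.class); // the shuffle/sort does the real work
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("sorted"));
JobClient.runJob(conf);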
More about Hadoop
• How to determine the number of splits?
– If a file is large enough and splittable, it will be
split into multiple pieces (split size = block
size)
– If a file is non-splittable, only one split (see the
sketch below).
– If a file is smaller than a block, one split per
file, unless...
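An input format can also force the second case by declaring its files non-splittable, giving exactly one split (and one mapper) per file. A minimal sketch (old API; MyInputFormat is illustrative):

public class MyInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false; // each file becomes exactly one split, regardless of size
  }
}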
More about Hadoop
• CombineFileInputFormat
– Merges multiple small files into one split, which
will be processed by one mapper (see the
sketch below)
– Saves mapper slots and reduces per-task
overhead
• Other options to handle small files?
– hadoop fs -getmerge src dest (concatenates
the files under src into one local file dest)
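In the old API, CombineFileInputFormat is abstract: you must supply a RecordReader that walks the files packed into each combined split. A rough sketch under that assumption (LineReaderWrapper is a hypothetical per-file reader that delegates to LineRecordReader):

public class MyCombineInputFormat extends CombineFileInputFormat<LongWritable, Text> {
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // run one LineReaderWrapper per file packed into the combined split
    return new CombineFileRecordReader(
        job, (CombineFileSplit) split, reporter, LineReaderWrapper.class);
  }
}

public class LineReaderWrapper implements RecordReader<LongWritable, Text> {
  private final LineRecordReader reader;

  // CombineFileRecordReader instantiates this once per file, passing the
  // file's index within the combined split
  public LineReaderWrapper(CombineFileSplit split, Configuration conf,
      Reporter reporter, Integer index) throws IOException {
    reader = new LineRecordReader(conf, new FileSplit(split.getPath(index),
        split.getOffset(index), split.getLength(index), split.getLocations()));
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    return reader.next(key, value);
  }
  public LongWritable createKey() { return reader.createKey(); }
  public Text createValue() { return reader.createValue(); }
  public long getPos() throws IOException { return reader.getPos(); }
  public float getProgress() throws IOException { return reader.getProgress(); }
  public void close() throws IOException { reader.close(); }
}

The driver then selects it with conf.setInputFormat(MyCombineInputFormat.class).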