Anatomy of MapReduce - Nagarjuna K (Excel Online Classes, http://www.excelonlineclasses.co.nr/, excel.onlineclasses@gmail.com)

Agenda
• MR work flow
• Hadoop data types
• Mapper
• Reducer
• Partitioner
• Combiner
• Input Split vs Block Size
• Partitioning
• Shuffling

MR work flow
[Figure: input data is divided among nodes; on each node a Map task produces interim data, a Reduce task consumes it, and the result is written to an output storage node.]

Hadoop data types
MR has defined types for keys and values so they can move across the cluster:
• Values must be Writable
• Keys must be WritableComparable<T>
• WritableComparable = Writable + Comparable<T>

Hadoop type       Wrapper for Java type
BooleanWritable   Boolean
ByteWritable      Byte
DoubleWritable    Double
IntWritable       Integer
LongWritable      Long
Text              String
NullWritable      Placeholder when a key/value is not needed

For any class to be a value, it has to implement org.apache.hadoop.io.Writable:
▪ write(DataOutput out)
▪ readFields(DataInput in)
For any class to be a key, it has to implement org.apache.hadoop.io.WritableComparable<T>, i.e. the above plus:
▪ compareTo(T o)
Check out a few of the Writables and WritableComparables, then it is time to write your own (a sketch follows at the end of this section).

Two libraries in Hadoop
• org.apache.hadoop.mapred.* (the old API, used in these notes)
• org.apache.hadoop.mapreduce.* (the new API)

Mapper
Should implement org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>:
▪ void configure(JobConf job)
  ▪ All the parameters specified in the XML files are available here.
  ▪ Any parameters set explicitly are also available.
  ▪ Called before data processing starts.
▪ void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
  ▪ Where data processing happens.
▪ void close()
  ▪ Should close any files, DB connections, etc.
▪ Reporter provides extra information from the mapper to the TaskTracker (TT).

Predefined mappers
Mapper            Functionality
IdentityMapper    Implements Mapper<K,V,K,V> • passes its input straight through to the output
InverseMapper     Implements Mapper<K,V,V,K> • swaps the key and value from input to output
TokenCountMapper  Implements Mapper<K,Text,Text,LongWritable> • tokenizes the input value and generates (token, 1) pairs

Reducer
Should implement org.apache.hadoop.mapred.Reducer.
The framework sorts the incoming data by key and groups together all the values for a key; the reduce function is called for every key in sorted order:
▪ void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
Reporter provides extra information from the reducer to the TaskTracker.

Predefined reducers
Reducer               Functionality
IdentityReducer<K,V>  Implements Reducer<K,V,K,V> and maps inputs directly to outputs
LongSumReducer<K>     Implements Reducer<K,LongWritable,K,LongWritable> and determines the sum of all values corresponding to the given key

Partitioner
Implements Partitioner<K,V>:
▪ configure()
▪ int getPartition(...)
  ▪ 0 <= return value < number of reducers
Generally, implement a Partitioner so that the same keys always go to one reducer (see the sketch below).

File types in Hadoop
Generally two kinds of files:
• Text (plain, XML, HTML, ...)
• Binary (sequence files)
  ▪ A Hadoop-specific compressed binary file format.
  ▪ Optimized to transfer output from one MR job to another MR job.
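Putting "write your own Writables" into practice, below is a minimal sketch of a custom key type. The class name StockKey and its two fields are invented for illustration; the three overridden interface methods are the part that matters.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: (symbol, timestamp). Illustrative only.
    public class StockKey implements WritableComparable<StockKey> {
        private String symbol = "";
        private long timestamp;

        public StockKey() {}  // no-arg constructor required: the framework uses reflection

        public StockKey(String symbol, long timestamp) {
            this.symbol = symbol;
            this.timestamp = timestamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {  // serialize fields in order
            out.writeUTF(symbol);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {  // deserialize in the same order
            symbol = in.readUTF();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(StockKey other) {  // drives the sort phase
            int cmp = symbol.compareTo(other.symbol);
            return (cmp != 0) ? cmp : Long.compare(timestamp, other.timestamp);
        }

        @Override
        public int hashCode() {  // used by the default HashPartitioner
            return symbol.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof StockKey)) return false;
            StockKey k = (StockKey) o;
            return symbol.equals(k.symbol) && timestamp == k.timestamp;
        }
    }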
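A minimal mapper sketch on the old API, assuming a word-count style job. The class name and the wc.tolower property read in configure() are made up, only to show how configured parameters reach the mapper.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // MapReduceBase supplies empty configure()/close(); we override only what we need.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private final Text word = new Text();
        private boolean toLower;  // illustrative parameter read in configure()

        @Override
        public void configure(JobConf job) {               // called once before map()
            toLower = job.getBoolean("wc.tolower", true);  // "wc.tolower" is a made-up property
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            String line = toLower ? value.toString().toLowerCase() : value.toString();
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                output.collect(word, ONE);  // emit (token, 1), like TokenCountMapper
            }
        }
    }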
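A matching reducer sketch; it simply sums the grouped values per key, which is the same behaviour LongSumReducer provides out of the box.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // By the time reduce() is called, the framework has already sorted the
    // data by key and grouped all values for that key into the iterator.
    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new LongWritable(sum));
        }
    }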
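A partitioner sketch that mirrors the logic of the default HashPartitioner; it is shown only to make the getPartition() contract concrete: the same key always hashes to the same reducer, and the return value stays in the required range.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HashLikePartitioner implements Partitioner<Text, LongWritable> {

        @Override
        public void configure(JobConf job) {
            // nothing to read from the configuration in this sketch
        }

        @Override
        public int getPartition(Text key, LongWritable value, int numReduceTasks) {
            // mask the sign bit so the modulo result is never negative;
            // result satisfies 0 <= partition < numReduceTasks
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }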
HDFS blocks vs input splits
We can customize the HDFS block size.
[Figure: a LARGE FILE is divided into multiple blocks (BLOCK 1..4) and stored in HDFS.]
A block is a PHYSICAL division of data:
• dfs.block.size (64 MB default)
An input split is a LOGICAL DIVISION:
• A chunk of data processed by a single mapper.
• Further divided into records; map processes these records.
  ▪ record = key + value
• To correlate with a DB table:
  ▪ group of rows -> split
  ▪ row -> record (e.g. rows R1, R2, R3 each become one key/value record)

public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}

• An InputSplit doesn't contain the data, only the locations where the data is present.
• The locations help the JobTracker arrange TaskTrackers (data locality).
• getLength(): splits with greater length are executed first.

How we get data to the mapper
Input splits, and how the splits are divided into records, are taken care of by the InputFormat:

public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}

On the mapper side, getRecordReader() is called to get a RecordReader. Once the record reader is obtained, the map method is called repeatedly until the end of the split:

K key = reader.createKey();
V value = reader.createValue();
while (reader.next(key, value)) {
  mapper.map(key, value, output, reporter);
}

JobClient running the job
• Gets input splits by calling getSplits() on the InputFormat.
• Determines data locations for the splits.
• Sends these locations to the JobTracker.
• The JobTracker assigns mappers appropriately (data locality).

FileInputFormat
Base class for all implementations of InputFormat which use files as input. Defines:
• which files to include for the job
• the implementation for generating splits (a set of files is converted into a number of splits)
It splits only large files... HOW LARGE? Larger than the block size.
Can we control it?

Property                Description                                          Default value
mapred.min.split.size   The smallest valid size in bytes for a file split.   1
mapred.max.split.size   The largest valid size in bytes for a file split.    Long.MAX_VALUE
dfs.block.size          The size of a block in HDFS, in bytes.               64 MB

mapred.min.split.size   mapred.max.split.size   dfs.block.size   split size
1                       Long.MAX_VALUE          64 MB            64 MB
1                       Long.MAX_VALUE          128 MB           128 MB
128 MB                  Long.MAX_VALUE          64 MB            128 MB
1                       32 MB                   64 MB            32 MB

• An application may impose a minimum split size greater than the block size.
• There is rarely a good reason to do that: data locality is lost.
Min split size: we might set it larger than the block size, but the benefit of data locality may then be lost to some extent.

Split size is calculated by the formula (see the sketch below):
  max(minimumSize, min(maximumSize, blockSize))
By default: minimumSize < blockSize < maximumSize.

In configure(JobConf job), per-split properties are available to the mapper:
Property name      Description
map.input.file     The path of the input file being processed
map.input.start    The byte offset of the start of the split
map.input.length   The length of the split in bytes

TextInputFormat
The default FileInputFormat:
• each line is a value
• the byte offset of the line is the key
Example: run the identity mapper program to see the (byte offset, line) pairs.

Logical records and HDFS blocks
Logical records as defined by FileInputFormat don't usually fit exactly into HDFS blocks. Every file is written as a sequence of bytes; when 64 MB is reached, a new block is started. At that point a logical record may be only half written, so the other half of the record goes into the next HDFS block. Even with data locality, therefore, some remote reading is done: a slight overhead.
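To make the split-size formula concrete, here it is restated as plain Java (this is just the rule, not Hadoop's actual source); the comments in main() reproduce the four rows of the table above.

    public class SplitSizeRule {
        static long mb(long n) { return n * 1024 * 1024; }

        // The FileInputFormat split-size rule.
        static long splitSize(long minSize, long maxSize, long blockSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            System.out.println(splitSize(1, Long.MAX_VALUE, mb(64)));        // 64 MB  (defaults)
            System.out.println(splitSize(1, Long.MAX_VALUE, mb(128)));       // 128 MB (bigger block)
            System.out.println(splitSize(mb(128), Long.MAX_VALUE, mb(64)));  // 128 MB (min > block: locality lost)
            System.out.println(splitSize(1, mb(32), mb(64)));                // 32 MB  (max caps below block)
        }
    }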
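A sketch of an old-API driver that raises the minimum split size, as discussed above. The job name and paths are invented, and the mapper and reducer classes are the sketches from the previous section.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SplitTuningDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SplitTuningDriver.class);
            conf.setJobName("split-tuning-demo");

            conf.setInputFormat(TextInputFormat.class);  // default: byte offset key, line value
            // Raise the minimum split to 128 MB: fewer, larger splits,
            // at the cost of some data locality (see the note above).
            conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            FileInputFormat.setInputPaths(conf, new Path("/demo/in"));    // hypothetical paths
            FileOutputFormat.setOutputPath(conf, new Path("/demo/out"));
            JobClient.runJob(conf);
        }
    }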
In short:
• Splits give logical record boundaries.
• Blocks give physical boundaries (size).

Small files
Files which are very small are inefficient in the mapper phase. Imagine 1 GB of input:
• as 64 MB files: 16 files -> 16 mappers
• as 100 KB files: ~10,000 files -> ~10,000 mappers
CombineFileInputFormat addresses this:
• Packs many files into a single split.
• Data locality is taken into consideration.
• MR performs best when operating at disk transfer rate, not at seek rate.
• This helps in processing large files as well.

NLineInputFormat
• Same as TextInputFormat, except each split is guaranteed to have exactly N lines.
• N is controlled by mapred.line.input.format.linespermap.

KeyValueTextInputFormat
• Each line in the text file is a record.
• The first separator character divides the key and the value; the default separator is '\t'.
• Controlled by the property key.value.separator.in.input.line.

SequenceFileInputFormat
• InputFormat for reading sequence files: user-defined key K, user-defined value V.
• Sequence files are splittable, so they are well suited for MR.
• They support compression and can store arbitrary types.

TextOutputFormat
• The counterpart of KeyValueTextInputFormat.
• Keys and values are written separated by '\t' by default; the mapred.textoutputformat.separator parameter changes this.
• Either key or value can be suppressed by using NullWritable (see the sketch below).
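A driver sketch tying KeyValueTextInputFormat and TextOutputFormat together with their separator properties. The paths and job name are invented; IdentityMapper and IdentityReducer just pass records through.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class KeyValueDemo {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(KeyValueDemo.class);
            conf.setJobName("kv-demo");

            conf.setInputFormat(KeyValueTextInputFormat.class);
            conf.set("key.value.separator.in.input.line", "\t");  // the default, shown explicitly

            conf.setOutputFormat(TextOutputFormat.class);
            conf.set("mapred.textoutputformat.separator", ",");   // write key,value instead of key\tvalue

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);

            FileInputFormat.setInputPaths(conf, new Path("/demo/kv-in"));   // hypothetical paths
            FileOutputFormat.setOutputPath(conf, new Path("/demo/kv-out"));
            JobClient.runJob(conf);
        }
    }

To suppress the key in the output entirely, set the output key class to NullWritable and emit NullWritable.get() as the key; TextOutputFormat then writes the value alone, with no separator.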
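Since SequenceFileInputFormat needs a sequence file to read, below is a small sketch that writes one; the path and record contents are invented. A job would then consume it by setting conf.setInputFormat(SequenceFileInputFormat.class).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/demo/data.seq");  // hypothetical path

            // Key and value classes are recorded in the file header,
            // so arbitrary Writable types can be stored.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, path, IntWritable.class, Text.class);
            try {
                for (int i = 0; i < 5; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            } finally {
                writer.close();  // flushes and finalizes the file
            }
        }
    }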