4. HDFS & IO - Hadoop

- Nagarjuna K
Anatomy of MapReduce

• MR work flow
• Hadoop data types
• Mapper
• Reducer
• Partitioner
• Combiner
• Input Split vs Block Size
• Partitioning
• Shuffling
[Figure: MR work flow. Input data is distributed across nodes; on each node a Map task produces interim data, which feeds a Reduce task whose output is written to a storage node.]

MR has a defined set of key and value types so they can move across the cluster:

 Values → Writable
 Keys → WritableComparable<T>
 WritableComparable = Writable + Comparable<T>
Hadoop type      Wrapper for Java type
BooleanWritable  Boolean
ByteWritable     Byte
DoubleWritable   Double
IntWritable      Integer
LongWritable     Long
Text             String
NullWritable     Placeholder when a key/value is not needed

For any class to be a value, it has to implement
org.apache.hadoop.io.Writable
 write(DataOutput out)
 readFields(DataInput in)

For any class to be a key, it has to implement
org.apache.hadoop.io.WritableComparable<T>
 the two Writable methods above
 + compareTo(T o)

Check out a few of the built-in Writables and WritableComparables.

Time to write your own Writables.
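A minimal sketch of a custom WritableComparable, assuming a hypothetical key that pairs a user id with a timestamp (the class name and fields are illustrative, not from the slides):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class UserTimeKey implements WritableComparable<UserTimeKey> {
  private int userId;
  private long timestamp;

  public UserTimeKey() {}  // Hadoop needs a no-arg constructor for deserialization

  public void write(DataOutput out) throws IOException {
    out.writeInt(userId);      // serialize the fields in a fixed order
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    userId = in.readInt();     // deserialize in exactly the same order
    timestamp = in.readLong();
  }

  public int compareTo(UserTimeKey o) {
    // sort by user id first, then by timestamp
    if (userId != o.userId) return userId < o.userId ? -1 : 1;
    return Long.compare(timestamp, o.timestamp);
  }
}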

Two API libraries in Hadoop
 org.apache.hadoop.mapred.* (the old API, used in these slides)
 org.apache.hadoop.mapreduce.* (the new API)

A mapper should implement
org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>
▪ void configure(JobConf job)
▪ All the parameters specified in the XML config files are available here
▪ Any parameters set explicitly are also available
▪ Called before data processing starts
▪ void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
▪ This is where data processing happens
▪ void close()
▪ Should close any files, DB connections, etc.
▪ The Reporter provides extra information about the mapper to the TaskTracker (TT)
Mapper            Functionality
IdentityMapper    Implements Mapper<K,V,K,V>
                  • Passes whatever input it gets straight through to the output
InverseMapper     Implements Mapper<K,V,V,K>
                  • Swaps the key and value between input and output
TokenCountMapper  Implements Mapper<K,Text,Text,LongWritable>
                  • Tokenizes the input value and generates (token, 1) pairs
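As a sketch, a hand-written mapper in the spirit of TokenCountMapper could look like this (the class name is illustrative; MapReduceBase supplies empty configure() and close() implementations):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    // tokenize the line and emit (token, 1) for every token
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE);
    }
  }
}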

A reducer should implement
org.apache.hadoop.mapred.Reducer
 The framework sorts the incoming data by key and groups together all the values for a key
 The reduce function is called for every key in sorted order
▪ void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
 The Reporter provides extra information about the reducer to the TaskTracker (TT)
Reducer              Functionality
IdentityReducer<K,V> Implements Reducer<K,V,K,V> and maps inputs
                     directly to outputs
LongSumReducer<K>    Implements Reducer<K,LongWritable,K,LongWritable>
                     and determines the sum of all values corresponding
                     to the given key
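A matching reducer sketch in the style of LongSumReducer, pairing with the WordMapper sketch above (the class name is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // accumulate every value for this key
    }
    output.collect(key, new LongWritable(sum));
  }
}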

A partitioner implements Partitioner<K,V>
 configure()
 int getPartition( … )
▪ the return value must satisfy 0 ≤ return < number of reducers

Generally, implement Partitioner so that the same keys all go to one reducer, as in the sketch below.
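A minimal sketch of such a partitioner, assuming Text keys and LongWritable values (the class name is illustrative):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class KeyHashPartitioner implements Partitioner<Text, LongWritable> {

  public void configure(JobConf job) {
    // read any job parameters needed for partitioning here
  }

  public int getPartition(Text key, LongWritable value, int numPartitions) {
    // mask off the sign bit so the result stays in [0, numPartitions)
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}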

Generally two kinds of files in Hadoop
 Text (plain, XML, HTML, …)
 Binary (sequence files)
▪ A Hadoop-specific compressed binary file format
▪ Optimized for transferring the output of one MR job into another MR job
 We can customize these

HDFS block size

Input splits
A big file is divided into multiple blocks and stored in HDFS. This is a physical division of data, controlled by dfs.block.size (64 MB default size).

[Figure: a LARGE FILE divided into BLOCK 1 … BLOCK 4 in HDFS.]

Input split
LOGICAL DIVISION
 A chunk of data processed by a mapper
 Further divided into records
 The map function processes these records one at a time
▪ Record = key + value
 To correlate with a DB table:
▪ Group of rows → split
▪ Row → record

[Figure: a split divided into records R1, R2, R3, each a (key, value) pair.]
public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}

An InputSplit doesn't contain the data itself
 Only the locations where the data is present
 This helps the JobTracker arrange TaskTrackers (data locality)

 getLength() lets splits with greater length be executed first

How do we get the data to the mapper?
 InputSplits, and how the splits are divided into records, are taken care of by the InputFormat.

public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                     Reporter reporter) throws IOException;
}

In the mapper
 getRecordReader() is called to get a RecordReader
 Once the record reader is obtained,
▪ the map method is called repeatedly until the end of the split:

K key = reader.createKey();
V value = reader.createValue();
// hand every record in the split to the mapper, one at a time
while (reader.next(key, value)) {
  mapper.map(key, value, output, reporter);
}

The JobClient running the job
 Gets InputSplits by calling getSplits() on the InputFormat
 Determines the data locations for the splits
 Sends these locations to the JobTracker
 The JobTracker then assigns mappers appropriately
▪ Data locality

FileInputFormat is the base class for all implementations of InputFormat which use files as input.

Defines
 Which files to include for the job
 An implementation for generating splits

Converts a set of files into a number of splits
 Splits only large files… HOW LARGE?
 Larger than the block size
Can we control it?

Property               Description                              Default value
mapred.min.split.size  The smallest valid size in bytes for a  1
                       file split.
mapred.max.split.size  The largest valid size in bytes for a   Long.MAX_VALUE
                       file split.
dfs.block.size         The size of a block in HDFS in bytes.   64 MB

mapred.min.split.size  mapred.max.split.size  dfs.block.size  split size
1                      Long.MAX_VALUE         64 MB           64 MB
1                      Long.MAX_VALUE         128 MB          128 MB
128 MB                 Long.MAX_VALUE         64 MB           128 MB
1                      32 MB                  64 MB           32 MB

• An application may impose a minimum split size greater than the block size.
• There is rarely a good reason to do that:
• data locality is lost.

Min split size
 We might set it larger than the block size
 But the benefit of data locality may then be lost to some extent

Split size is calculated by the formula
 max(minimumSize, min(maximumSize, blockSize))
 By default
▪ minimumSize < blockSize < maximumSize, so the split size is the block size
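As a sketch, the rows of the table above fall straight out of this formula:

// Mirrors the split-size formula used by FileInputFormat.
static long splitSize(long minSize, long maxSize, long blockSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

// splitSize(1, Long.MAX_VALUE, 64L << 20)           -> 64 MB
// splitSize(1, Long.MAX_VALUE, 128L << 20)          -> 128 MB
// splitSize(128L << 20, Long.MAX_VALUE, 64L << 20)  -> 128 MB
// splitSize(1, 32L << 20, 64L << 20)                -> 32 MB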

configure(JobConf job) gives the mapper access to split properties:

Property name       Description
mapred.input.file   The path of the input file being processed
mapred.input.start  The byte offset of the start of the split
map.input.length    The length of the split in bytes

TextInputFormat: the default FileInputFormat
 Each line is a value
 The byte offset of the line is the key

Example
 Run an identity mapper program, e.g. with a driver like the sketch below
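A minimal driver sketch for such a run, using the old API (input and output paths come from the command line; the class name is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityJob.class);
    conf.setJobName("identity");

    conf.setInputFormat(TextInputFormat.class);  // (byte offset, line) records
    conf.setMapperClass(IdentityMapper.class);   // input passed straight through
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}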

Logical records as defined by FileInputFormat don't usually fit neatly into HDFS blocks.
 Every file is written as a sequence of bytes
 64 MB reached? Then start a new block
 When 64 MB is reached, a logical record may be only half written
 So the other half of the logical record goes into the next HDFS block

So even with data locality, some remote reading is done… a slight overhead.
 Splits give logical record boundaries
 Blocks give physical boundaries (size)


Files which are very small are inefficient in the mapper phase.
Imagine 1 GB of input:
 In 64 MB files – 16 files – 16 mappers
 In 100 KB files – ~10,000 files – ~10,000 mappers

CombineFileInputFormat packs many files into a single split
 Data locality is taken into consideration

MR performs best when operating at the disk transfer rate, not at the seek rate.

This also helps when processing large files.

NLineInputFormat: same as TextInputFormat, except that
 Each split is guaranteed to have exactly N lines
 Controlled by mapred.line.input.format.linespermap, as in the snippet below
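A minimal configuration sketch (the value 100 is arbitrary):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// inside the job driver: give every mapper exactly 100 lines
JobConf conf = new JobConf();
conf.setInputFormat(NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 100);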


KeyValueTextInputFormat
 Each line in the text file is a record
 The first separator character divides the key and the value
▪ Default is '\t'
 Controlling property
▪ key.value.separator.in.input.line (see the snippet below)
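A minimal configuration sketch, assuming a comma-separated input file:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

// inside the job driver: split each line at the first ',' instead of '\t'
JobConf conf = new JobConf();
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.set("key.value.separator.in.input.line", ",");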

SequenceFileInputFormat: the InputFormat for reading sequence files
 User-defined key K
 User-defined value V

Sequence files are splittable
 Well suited for MR
 They support compression
 They can store arbitrary types

TextOutputFormat
 Keys and values are stored '\t'-separated by default
▪ mapred.textoutputformat.separator is the controlling parameter
 The counterpart of KeyValueTextInputFormat

Can suppress the key or value by using NullWritable, as in the sketch below
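For example, a reducer can emit NullWritable as the key so that TextOutputFormat writes the values only (a sketch, not from the slides; output is the reducer's OutputCollector<NullWritable, Text> and value a Text):

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// inside reduce(): a NullWritable key makes TextOutputFormat omit
// the key and the separator, writing just the value on each line
output.collect(NullWritable.get(), value);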