Personal_3.MapReduce An Introduction - hadoop

• Understanding MapReduce
• MapReduce - An Introduction
• Word count – default
• Word count – custom

Programming model to process large datasets

Supported languages for MR
• Java
• Ruby
• Python
• C++
MapReduce programs are inherently parallel.
• More data → more machines to analyze it.
• No need to change anything in the code.

Start with the WORDCOUNT example
• “Do as I say, not as I do”

Word    Count
As      2
Do      2
I       2
Not     1
Say     1
define wordCount as Map<String,long>;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
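The pseudocode above can be tried out as a minimal Python sketch. The `tokenize` rule (lowercase, letters and apostrophes only) is an assumption made for this example; the slides do not define it.

```python
import re
from collections import Counter

def tokenize(document):
    """Split a document into lowercase word tokens (assumed rule, for illustration)."""
    return re.findall(r"[a-z']+", document.lower())

def word_count(document_set):
    """Count every token across all documents, in memory, on one machine."""
    counts = Counter()
    for document in document_set:
        for token in tokenize(document):
            counts[token] += 1
    return counts

counts = word_count(["Do as I say, not as I do"])
for word in sorted(counts):
    print(word, counts[word])
```

Everything lives in one in-memory table on one machine, which is exactly the limitation the next slides address.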

This works as long as the number of documents to process is not very large

Spam filter
• Millions of emails
• Word count for analysis

Working from a single computer is time-consuming

Rewrite the program to count across multiple machines

How do we attain parallel computing?
1. Each machine processes a fraction of the documents
2. Combine the results from all the machines
STAGE 1
define wordCount as Map<String,long>;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
STAGE 2
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
display(totalWordCount);
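A minimal Python sketch of the two stages, run in a single process. The round-robin split into four subsets and the `tokenize` rule are assumptions for illustration; on real machines each `stage1` call would run on a different computer.

```python
import re
from collections import Counter

def tokenize(document):
    """Assumed tokenizer: lowercase words only."""
    return re.findall(r"[a-z']+", document.lower())

def stage1(document_subset):
    """Runs on each worker: count words in that worker's share of the documents."""
    counts = Counter()
    for document in document_subset:
        counts.update(tokenize(document))
    return counts

def stage2(partial_counts):
    """Runs on one machine: add up the per-worker counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return total

documents = ["do as I say", "not as I do", "say it again", "do it"]
# Hypothetical split: four workers, documents dealt out round-robin.
subsets = [documents[i::4] for i in range(4)]
partials = [stage1(subset) for subset in subsets]
total = stage2(partials)
print(total["do"], total["say"])
```

The merged result matches what a single machine would compute over all documents, which is the property the two-stage design relies on.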
[Diagram: Master, Documents, and machines Comp-1 to Comp-4]
Problems: STAGE 1
• Document segregation has to be well defined
• Bottleneck in network transfer
  • The processing is data-intensive, not computation-intensive
  • So it is better to store the files on the processing machines
• BIGGEST FLAW
  • The words and counts are stored in memory
  • A disk-based hash-table implementation is needed
Problems: STAGE 2
• Phase 2 has only one machine
  • Bottleneck
  • Phase 1 is highly distributed, though
• Make phase 2 distributed as well
• This needs changes in phase 1
  • Partition the phase-1 output (say, based on the first character of the word)
  • We then have 26 machines in phase 2
  • The single disk-based hash table now becomes 26 disk-based hash tables
  • wordcount-a, wordcount-b, wordcount-c
[Diagram: Master and Documents feed phase-1 machines Comp-1 to Comp-4; each produces per-letter word counts (A, B, C, D, E, ...), which go to phase-2 machines Comp-10, Comp-20, Comp-30, Comp-40]

After phase 1
• From Comp-1
  ▪ WordCount-A → Comp-10
  ▪ WordCount-B → Comp-20
  ▪ ...
• Each machine in phase 1 will shuffle its output to different machines in phase 2
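The partition-and-shuffle step can be sketched in Python as follows. All function names are hypothetical, and the "other" bucket for words that do not start with a-z is an assumption the slides do not cover.

```python
from collections import Counter, defaultdict

def partition(word):
    """Pick a phase-2 bucket from the first character of the word."""
    first = word[0].lower()
    return first if "a" <= first <= "z" else "other"  # "other" is an assumption

def split_by_partition(word_counts):
    """Phase 1, per machine: split one word-count table into per-letter tables."""
    buckets = defaultdict(Counter)
    for word, count in word_counts.items():
        buckets[partition(word)][word] += count
    return buckets

def shuffle(all_phase1_outputs):
    """Route every bucket to the phase-2 machine that owns its letter."""
    inbox = defaultdict(list)
    for buckets in all_phase1_outputs:
        for letter, counts in buckets.items():
            inbox[letter].append(counts)
    return inbox

def phase2(inbox):
    """Phase 2, per letter: merge the tables received from all phase-1 machines."""
    return {letter: sum(tables, Counter()) for letter, tables in inbox.items()}

comp1 = Counter({"as": 2, "do": 1, "say": 1})
comp2 = Counter({"as": 1, "do": 3, "bat": 4})
result = phase2(shuffle([split_by_partition(comp1), split_by_partition(comp2)]))
print(result["a"]["as"], result["d"]["do"])
```

Because every occurrence of a given word lands in the same bucket, each phase-2 machine can finish its letter without talking to the others.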

This is getting complicated
• Store files where they are being processed
• Write a disk-based hash table to get around RAM limitations
• Partition the phase-1 output
• Shuffle the phase-1 output and send it to the appropriate reducer

This is a lot of work for a word count

We haven’t even touched fault tolerance
• What if Comp-1 or Comp-10 fails?

So we need a framework to take care of all these things
• We concentrate only on the business logic
[Diagram: Documents in HDFS → MAPPER machines (Comp-1 to Comp-4) → interim output (per-letter counts A, B, C, D, E, ...) → Partitioning → Shuffling → REDUCER machines (Comp-10, Comp-20, Comp-30, Comp-40), with a Master]


Mapper and Reducer
• The Mapper filters and transforms the input.
• The Reducer collects the Mapper output and aggregates it.
• Extensive research was done to arrive at this two-phase strategy.

Mapper, Reducer, Partitioner, Shuffling
• Work together → a common structure for data processing

           Input            Output
Mapper     <K1,V1>          List<K2,V2>
Reducer    <K2,list(V2)>    List<K3,V3>
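For word count, the two signatures in the table can be sketched in plain Python. This is an illustration under assumptions: `run_job` is a hypothetical stand-in for the framework's sort-and-group step, and K1 is taken to be a line offset that the mapper ignores.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, line):
    """<K1,V1> -> List<K2,V2>: emit <word, 1> for every word in the line."""
    return [(word, 1) for word in line.split()]

def reducer(word, ones):
    """<K2,list(V2)> -> List<K3,V3>: emit <word, count>."""
    return [(word, sum(ones))]

def run_job(records):
    """Stand-in for the framework: map, sort and group by key, then reduce."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))
    output = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(word, [one for _, one in group]))
    return output

records = [(0, "do as i say"), (12, "not as i do")]
print(run_job(records))
```

Only `mapper` and `reducer` carry the word-count logic; the grouping in between is generic, which is the part a framework can take over.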

Word count types
• Mapper
  • Input: <key, words_per_line>
  • Output: <word, 1>
• Reducer
  • Input: <word, list(1)>
  • Output: <word, count(list(1))>

As said, don’t store the data in memory
• So keys and values regularly have to be written to disk.
• They must be serialized.
• Hadoop provides its own serialization mechanism.
• Any class used as a key or value has to implement the Writable interface.

Java Type    Hadoop Serialized Type
String       Text
Integer      IntWritable
Long         LongWritable
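To make the idea of writable keys and values concrete, here is a small Python analogy. `IntBox` and `TextBox` are hypothetical classes, not Hadoop types, and the byte layout (4-byte big-endian integers, length-prefixed UTF-8 strings) is an assumption chosen for illustration rather than Hadoop's actual wire format. The `write` and `read_fields` methods loosely mirror the `write` and `readFields` methods of Hadoop's Writable interface.

```python
import io
import struct

class IntBox:
    """Hypothetical stand-in for a writable int: 4 bytes, big-endian."""
    def __init__(self, value=0):
        self.value = value

    def write(self, out):
        out.write(struct.pack(">i", self.value))

    def read_fields(self, inp):
        (self.value,) = struct.unpack(">i", inp.read(4))

class TextBox:
    """Hypothetical stand-in for a writable string: length prefix, then UTF-8 bytes."""
    def __init__(self, value=""):
        self.value = value

    def write(self, out):
        data = self.value.encode("utf-8")
        out.write(struct.pack(">i", len(data)))
        out.write(data)

    def read_fields(self, inp):
        (length,) = struct.unpack(">i", inp.read(4))
        self.value = inp.read(length).decode("utf-8")

# Write a <word, count> pair to a byte stream, then read it back.
buffer = io.BytesIO()
TextBox("say").write(buffer)
IntBox(1).write(buffer)

buffer.seek(0)
word, count = TextBox(), IntBox()
word.read_fields(buffer)
count.read_fields(buffer)
print(word.value, count.value)
```

Objects that know how to write and re-read themselves can be spilled to disk or sent over the network between the map and reduce phases.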

Let’s try to execute the following commands
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>

What does this code do?

Switch to Eclipse