
Advanced topics on MapReduce with Hadoop
Jiaheng Lu
Department of Computer Science
Renmin University of China
www.jiahenglu.net
Outline

- Brief Review
- Chaining MapReduce Jobs
- Join in MapReduce
- Bloom Filter
Brief Review

- A parallel programming framework
- Divide and merge

[Figure: MapReduce data flow - the input data is divided into splits (split0, split1, split2), each handled by a map task; the shuffle phase routes intermediate data to the reduce tasks, which write output0 and output1.]
Chaining MapReduce jobs

- Chaining in a sequence
- Chaining with complex dependency
- Chaining preprocessing and postprocessing steps
Chaining in a sequence

- Simple and straightforward
- [MAP | REDUCE]+; MAP+ | REDUCE | MAP*
- The output of one job is the input of the next
- Similar to Unix pipes (see the sketch below)

[Figure: Job1 -> Job2 -> Job3 run one after another.]
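A minimal sketch of running two jobs in sequence with the old JobConf API; conf1, conf2 and the intermediate path are assumptions, not code from the slides, and exception handling is omitted:

// Job1 writes to an intermediate directory; Job2 reads it as its input.
Path tmp = new Path("/tmp/job1-output");          // hypothetical intermediate path
FileOutputFormat.setOutputPath(conf1, tmp);
JobClient.runJob(conf1);                          // blocks until Job1 finishes

FileInputFormat.setInputPaths(conf2, tmp);
FileOutputFormat.setOutputPath(conf2, new Path("/tmp/final-output"));
JobClient.runJob(conf2);                          // Job2 starts only after Job1 succeeds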
// Configuring a single chained job ("ChainJob") and adding its first mapper:
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class,   // Map1 input key/value types
    Text.class, Text.class,           // Map1 output key/value types
    true, map1Conf);                  // true = pass key/value pairs by value
Chaining with complex dependency

- Jobs are not chained in a linear fashion

[Figure: Job1, Job2 and Job3 with non-linear dependencies between them.]

- Use the addDependingJob() method to add dependency information:
  x.addDependingJob(y)
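A minimal sketch using Hadoop's JobControl, assuming Job3 depends on both Job1 and Job2 and that job1Conf, job2Conf and job3Conf are JobConf objects configured elsewhere (exception handling omitted):

import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

Job job1 = new Job(job1Conf);
Job job2 = new Job(job2Conf);
Job job3 = new Job(job3Conf);

// job3 will not start until job1 and job2 have finished
job3.addDependingJob(job1);
job3.addDependingJob(job2);

JobControl control = new JobControl("dependent-jobs");
control.addJob(job1);
control.addJob(job2);
control.addJob(job3);

// JobControl is a Runnable; run it in its own thread and poll for completion
Thread runner = new Thread(control);
runner.start();
while (!control.allFinished()) {
    try { Thread.sleep(500); } catch (InterruptedException e) { break; }
}
control.stop();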
Chaining preprocessing and postprocessing steps

- Example: removing stop words in IR
- Approaches:
  - separate jobs: inefficient
  - chaining the steps into a single job
- Use ChainMapper.addMapper() and ChainReducer.setReducer()
- MAP+ | REDUCE | MAP* (continued in the sketch below)
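Continuing the ChainJob configuration above, a hedged sketch of the MAP+ | REDUCE | MAP* pattern; Map2, Reduce and Map3 are hypothetical classes and their key/value types are assumptions:

// Second preprocessing mapper (runs after Map1, before the reducer)
JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class,
    Text.class, Text.class, LongWritable.class, Text.class,
    true, map2Conf);

// The single reducer of the chain
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reduce.class,
    LongWritable.class, Text.class, Text.class, Text.class,
    true, reduceConf);

// Postprocessing mapper that consumes the reducer's output
JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class,
    Text.class, Text.class, LongWritable.class, Text.class,
    true, map3Conf);

JobClient.runJob(job);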
Join in MapReduce

- Reduce-side join
- Broadcast join
- Map-side filtering and Reduce-side join; the filter can be
  - a given key
  - a range from a dataset (broadcast)
  - a Bloom filter
Reduce-side join

- Map
  - output <key, value> pairs
  - key = the join key, value = the record tagged with its data source
- Reduce
  - do a full cross-product of the values from the different sources
  - output the combined results
Example

Table x (columns a, b):      Table y (columns a, c):
  1  ab                        1  b
  1  cd                        2  d
  4  ef                        4  c

map(): emit <join key, tagged value>
  from x: <1, "x ab">  <1, "x cd">  <4, "x ef">
  from y: <1, "y b">   <2, "y d">   <4, "y c">

shuffle(): group the tagged values by join key
  <1, ["x ab", "x cd", "y b"]>
  <2, ["y d"]>
  <4, ["x ef", "y c"]>

reduce(): cross-product of the x-tagged and y-tagged values per key

Output (a, b, c):
  1  ab  b
  1  cd  b
  4  ef  c
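A minimal Java sketch of this reduce-side join; the comma-separated record layout, the tagging by input file name and the class names are assumptions for illustration, not the lecture's code:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Tag each record with the table it came from (here: by input file name)
        String file = ((FileSplit) reporter.getInputSplit()).getPath().getName();
        String tag = file.startsWith("x") ? "x" : "y";
        String[] fields = value.toString().split(",");
        out.collect(new Text(fields[0]), new Text(tag + " " + fields[1]));
    }
}

public static class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        List<String> fromX = new ArrayList<String>();
        List<String> fromY = new ArrayList<String>();
        while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("x ")) fromX.add(v.substring(2));
            else                    fromY.add(v.substring(2));
        }
        // Full cross-product of the two sources for this join key
        for (String x : fromX)
            for (String y : fromY)
                out.collect(key, new Text(x + "," + y));
    }
}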
Broadcast join (replicated join)

- Broadcast the smaller table to every mapper
- Do the join in map()
- Use the distributed cache:
  DistributedCache.addCacheFile() (sketch below)
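A hedged sketch of a broadcast join with the distributed cache; the file name small_table.txt, the comma-separated layout and the class name are assumptions:

import java.io.*;
import java.net.URI;
import java.util.*;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// In the driver: ship the small table to every task
DistributedCache.addCacheFile(new URI("/meta/small_table.txt"), job);

// In the mapper: load the small table once, then join record by record
public static class BroadcastJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private Map<String, String> smallTable = new HashMap<String, String>();

    public void configure(JobConf conf) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                smallTable.put(parts[0], parts[1]);   // join key -> value
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(",");
        String match = smallTable.get(fields[0]);
        if (match != null) {                          // inner join: drop non-matching records
            out.collect(new Text(fields[0]), new Text(fields[1] + "," + match));
        }
    }
}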

Map-side filtering and Reduce-side join

- Join key: student IDs from the info dataset
  - generate an IDs file from info
  - broadcast it
  - join
- What if the IDs file can't fit in memory?
  - use a Bloom filter
A Bloom Filter

- Introduction
- Implementation of a Bloom filter
- Use in a MapReduce join
Introduction to Bloom Filter

- A space-efficient, constant-size data structure for testing set membership: add(), contains()
- No false negatives and a small probability of false positives
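For reference, a standard result not stated on the slide: with an m-bit array, k hash functions and n inserted keys, the false-positive probability is approximately p ≈ (1 - e^(-kn/m))^k, which is minimized by choosing k ≈ (m/n) ln 2.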
Implementation of a Bloom filter

- Use a bit array
- Add an element
  - generate k indexes (one per hash function)
  - set those k bits to 1
- Test an element
  - generate the same k indexes
  - all k bits are 1 >> true; not all 1 >> false
(A sketch follows below.)
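A minimal, self-contained Java sketch of this idea (not the lecture's code; the double-hashing scheme is an assumption for illustration):

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;      // number of bits
    private final int k;      // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th index as h1 + i*h2 (mod m)
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = key.hashCode() * 31 + 17;   // crude second hash, for illustration only
        int idx = (h1 + i * h2) % m;
        return idx < 0 ? idx + m : idx;
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) bits.set(index(key, i));
    }

    public boolean contains(String key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(key, i))) return false;   // definitely absent
        return true;                                      // probably present
    }
}

For example, filter.add("42") followed by filter.contains("42") returns true, while contains() on a key that was never added usually returns false but may occasionally report a false positive.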
Example (10-bit array, k = 3)

index:              0 1 2 3 4 5 6 7 8 9
① initial state:    0 0 0 0 0 0 0 0 0 0
② add x(0,2,6):     1 0 1 0 0 0 1 0 0 0
③ add y(0,3,9):     1 0 1 1 0 0 1 0 0 1
④ contain m(1,3,9): bit 1 is 0 >> false (×)
⑤ contain n(0,2,9): bits 0, 2 and 9 are all 1 >> true (√), although n was never added: a false positive
Use in MapReduce join

- Run a separate sub-job to create the Bloom filter
- Broadcast the Bloom filter and use it in map() of the join job
- Drop the records that cannot join, and do the join in reduce() (sketch below)
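A hedged sketch of the filtering map(); Hadoop ships a Bloom filter implementation in org.apache.hadoop.util.bloom. The filter file handling, record layout and class name are assumptions:

import java.io.*;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public static class FilteringJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private BloomFilter filter = new BloomFilter();

    public void configure(JobConf conf) {
        try {
            // The sub-job wrote the serialized filter, which was then
            // broadcast to every mapper via the distributed cache.
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            DataInputStream in = new DataInputStream(
                    new FileInputStream(cached[0].toString()));
            filter.readFields(in);   // BloomFilter is a Writable
            in.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(",");
        // Emit only records whose join key might exist in the other dataset;
        // records that fail the test can never join and are dropped here.
        if (filter.membershipTest(new Key(fields[0].getBytes()))) {
            out.collect(new Text(fields[0]), new Text(value));
        }
    }
}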
References

- Chuck Lam, "Hadoop in Action"
- Jairam Chandar, "Join Algorithms using Map/Reduce"
THANK YOU