Join Algorithms In MapReduce

A Comparison of Join Algorithms
for Log Processing in MapReduce
Spyros Blanas, Jignesh M. Patel (University of Wisconsin-Madison)
Eugene J. Shekita, Yuanyuan Tian (IBM Almaden Research Center)
SIGMOD 2010
August 1, 2010
Presented by Hyojin Song
Contents
 Introduction
 Join Algorithms In MapReduce
 Experimental Evaluation
 Discussion
 Conclusion
2 / 30
Introduction(1/3)
 Log Processing
– Important type of data analysis commonly done with MapReduce
Log Table
– A log of events
 click-stream
 log of phone call records
 a sequence of transactions
– Processed to compute various statistics for business insight
 filtered
 aggregated
 mined for patterns
Call records           Number
2010.09.24.14:20.30    01191655603
2010.09.24.14:30.45    01046841397
2010.09.25.19:11.118   01926540846
2010.09.28.06:40.97    01098446512
2010.09.29.08:44.08    01013461655
……                     ……
Reference Table
– Log data often needs to be joined with reference data
 e.g. log data joined with user information
Number        Name
01191655603   송효진
01046841397   안철수
01926540846   한효주
01098446512   안인석
01013461655   마음이
……            ……
3 / 30
Introduction(2/3)
 MapReduce Framework
– Used to analyze large volumes of data
– Reasons for the success of MapReduce
 Simple programming framework
 The framework manages parallelization, fault tolerance, and load balancing
– Criticisms of MapReduce
 lack of a schema
 lack of a declarative query language
 lack of indexes
– Joins are difficult
 Not originally designed to combine information from several data sources
 Users fall back on simple but inefficient algorithms to perform joins
4 / 30
Introduction(3/3)
 The benefits of MapReduce for log processing
– Scalability
 China Mobile gathers 5-8TB of phone call records per day
 Facebook collects almost 6TB of new log data every day, with 1.7PB in total
– Schema free
 flexibility
 a log record may also change over time
– Simple scans are preferable (as opposed to index scans)
– Time-consuming work
 graceful fault-tolerance support (as opposed to a parallel RDBMS)
 The goal of this paper
– the implementation of several well-known join strategies in MapReduce
– comprehensive experiments to compare these join techniques
5 / 30
Contents
 Introduction
 Join Algorithms In MapReduce
Problem Statement
1. Repartition Join
2. Improved Repartition Join
3. Directed Join
4. Broadcast Join
5. Semi-Join
6. Per-split Semi-Join
 Experimental Evaluation
 Discussion
 Conclusion
6 / 30
Join Algorithms in MR
Problem Statement
 An equi-join between a log table L and a reference table R on a
single column, with |L| >> |R|
 Each join strategy can be further improved with some
preprocessing techniques
– The join strategies are well-known in the RDBMS literature
– Adapting them to MapReduce is not always straightforward
– Implementation details of these join algorithms are crucial
 Two additional functions are implemented: init() and close()
– These are called before and after each map or reduce task
7 / 30
Join Algorithms in MR
1. Repartition Join
 The most commonly used join strategy in the MapReduce
framework
– L and R are dynamically partitioned on the join key
– The corresponding pairs of partitions are joined
– Similar to a partitioned sort-merge join in a parallel RDBMS
 Example Tables (Log table & User table)
– Log table
 500,000 records
 Each log record has a lecture name and a grade
– User table
 10,000 records
– Join key is the student ID
8 / 30
Log Table
log      Student ID
DB B+    2008-2424
KRR A    2010-8281
Opt A-   2005-3682
ML C0    2009-0078
OS A+    2010-1004
NL D-    2008-0909
…        …
User Table
Student ID   Name
2008-0909    Ahn Jaemin
2010-1004    Kim Somin
2009-0078    Song Hyojin
2005-3682    Lee Taewhi
2010-8281    An Inseok
…            …
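The standard repartition join can be sketched in plain Python on a few rows of the example tables. This is a minimal single-process simulation, not Hadoop code: the function names (`map_log`, `map_user`, `shuffle`, `reduce_join`) and the tag strings "L" and "R" are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# A few rows from the example tables (log, student ID) and (student ID, name).
LOG = [("DB B+", "2008-2424"), ("KRR A", "2010-8281"), ("ML C0", "2009-0078")]
USERS = [("2009-0078", "Song Hyojin"), ("2010-8281", "An Inseok")]


def map_log(record):
    """Map side for L: emit the join key and the record tagged with 'L'."""
    log_entry, student_id = record
    return student_id, ("L", log_entry)


def map_user(record):
    """Map side for R: emit the join key and the record tagged with 'R'."""
    student_id, name = record
    return student_id, ("R", name)


def shuffle(pairs):
    """Group all tagged records by join key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Buffer every record of the key, split by tag, then cross the buffers."""
    b_r = [payload for tag, payload in values if tag == "R"]
    b_l = [payload for tag, payload in values if tag == "L"]
    return [(key, name, log_entry) for name, log_entry in product(b_r, b_l)]


def repartition_join(log, users):
    pairs = [map_log(r) for r in log] + [map_user(r) for r in users]
    output = []
    for key, values in sorted(shuffle(pairs).items()):
        output.extend(reduce_join(key, values))
    return output


result = repartition_join(LOG, USERS)
```

Note that `reduce_join` holds both buffers in memory for each join key, which is the weak point the later slides address.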
Join Algorithms in MR
1. Repartition Join
Map Phase
– Input: a split of R or L (distributed file system)
 L split: (DB B, 2008-2424), (KRR A, 2010-8281)
 R split: (Song, 2009-0078), (An, 2010-8281), …
 L split: (NL D, 2008-0909), (ML C, 2009-0078), (OPT A, 2005-3682), …
– Output: intermediate results keyed by the join key (local disk)
 2010-8281 → L: KRR A
 2008-2424 → L: DB B
 2010-8281 → R: An
 2008-0909 → L: NL D
 2009-0078 → L: ML C
 2009-0078 → R: Song
 2005-3682 → L: OPT A
Reduce Phase
– Buffer: the same pairs, shuffled to the reduce tasks
9 / 30
Join Algorithms in MR
1. Repartition Join
Reduce Phase
– Input: the buffered intermediate results (local disk)
– For each join key, the records are split into two buffers
 BR: records from R
 BL: records from L
 2010-8281: BR = {R: An}, BL = {L: KRR A}
 2009-0078: BR = {R: Song}, BL = {L: ML C}
– The cross product of BR and BL is written to the output file (distributed file system)
10 / 30
Student ID   Name           Log
2010-8281    An In Seok     KRR A
2009-0078    Song Hyo Jin   ML C
Join Algorithms in MR
1. Repartition Join
 Standard Repartition Join
– Potential problem
 All records for a given join key have to be buffered
– The buffered records may not fit in memory when
 The data is highly skewed
 The key cardinality is small
– Variants of the standard repartition join are used in Pig, Hive, and Jaql today
 They all suffer from the buffering problem
 Improved Repartition Join
– The output key is changed to a composite of the join key and the table tag
– The partitioning and grouping functions are customized
– Records from the smaller table R are buffered and L records are streamed
to generate the join output
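The three changes in the improved repartition join can be sketched as follows. This is a single-process simulation under stated assumptions: the tag values `R_TAG = 0` and `L_TAG = 1` are chosen so that R records sort ahead of L records within a join key, and `partition` and `reduce_partition` are hypothetical names standing in for the customized partitioning and grouping functions.

```python
from itertools import groupby

R_TAG, L_TAG = 0, 1  # assumption: R's tag sorts before L's tag

LOG = [("DB B+", "2008-2424"), ("KRR A", "2010-8281"), ("ML C0", "2009-0078")]
USERS = [("2009-0078", "Song Hyojin"), ("2010-8281", "An Inseok")]


def partition(composite_key, num_reducers):
    """Custom partitioner: route on the join key only, ignoring the tag."""
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers


def reduce_partition(sorted_pairs):
    """Custom grouping: group on the join key; buffer R, stream L."""
    for join_key, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        r_buffer = []
        for (_key, tag), payload in group:
            if tag == R_TAG:
                r_buffer.append(payload)  # only R records are buffered
            else:
                for name in r_buffer:     # each L record is streamed past BR
                    yield (join_key, name, payload)


def improved_repartition_join(log, users, num_reducers=2):
    # Map side: the output key is the composite (join key, table tag).
    pairs = [((sid, L_TAG), entry) for entry, sid in log]
    pairs += [((sid, R_TAG), name) for sid, name in users]
    partitions = [[] for _ in range(num_reducers)]
    for key, payload in pairs:
        partitions[partition(key, num_reducers)].append((key, payload))
    output = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # sort on the full composite key
        output.extend(reduce_partition(part))
    return sorted(output)


result = improved_repartition_join(LOG, USERS)
```

Because the sort places the R records of a key first, the reducer never needs to hold L records in memory.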
11 / 30
Join Algorithms in MR
2. Improved Repartition Join
Map Phase
– Input: a split of R or L (distributed file system)
 L split: (DB B, 2008-2424), (KRR A, 2010-8281)
 R split: (Song, 2009-0078), (An, 2010-8281), …
 L split: (NL D, 2008-0909), (ML C, 2009-0078), (OPT A, 2005-3682), …
– Output: intermediate results keyed by join key and table tag (local disk)
 (2010-8281, L) → L: KRR A
 (2008-2424, L) → L: DB B
 (2010-8281, R) → R: An
 (2008-0909, L) → L: NL D
 (2009-0078, L) → L: ML C
 (2009-0078, R) → R: Song
 (2005-3682, L) → L: OPT A
Reduce Phase
– Buffer: the same pairs, shuffled to the reduce tasks
12 / 30
Join Algorithms in MR
2. Improved Repartition Join
Reduce Phase
– Input: the buffered intermediate results, with the R records of each join key ahead of its L records (local disk)
– Only the R records of the current join key are held in the buffer BR
 2010-8281: BR = {R: An}; L: KRR A is streamed
 2009-0078: BR = {R: Song}; L: ML C is streamed
– L records are streamed against BR and the join output is written to the output file (distributed file system)
13 / 30
Student ID   Name           Log
2010-8281    An In Seok     KRR A
2009-0078    Song Hyo Jin   ML C
Join Algorithms in MR
3. Directed Join
 Preprocessing for Repartition Join (Directed Join)
– Both L and R have already been partitioned on the join key
 L is pre-partitioned on the join key
 Then at query time, matching partitions from L and R can be directly joined
– A map-only MapReduce job
 During the init phase, Ri is retrieved from the DFS, if it is not already in local storage
 A main memory hash table is built on Ri
14 / 30
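A directed join can be sketched as a map-only task that loads its matching partition Ri in init() and probes it with the records of Li. The class `DirectedJoinMapTask`, the helper `prepartition`, and the CRC32-based partition function are assumptions made for this sketch; the only requirement is that L and R are partitioned with the same function.

```python
import zlib
from collections import defaultdict

LOG = [("DB B+", "2008-2424"), ("KRR A", "2010-8281"), ("ML C0", "2009-0078")]
USERS = [("2009-0078", "Song Hyojin"), ("2010-8281", "An Inseok")]


def partition_of(join_key, num_partitions):
    """Deterministic partition function shared by L and R (an assumption)."""
    return zlib.crc32(join_key.encode("utf-8")) % num_partitions


def prepartition(records, key_index, num_partitions):
    """Preprocessing step: split a table into partitions on the join key."""
    parts = [[] for _ in range(num_partitions)]
    for record in records:
        parts[partition_of(record[key_index], num_partitions)].append(record)
    return parts


class DirectedJoinMapTask:
    def __init__(self, r_partition):
        self.r_partition = r_partition  # stands in for Ri stored on the DFS
        self.table = None

    def init(self):
        """Called once before map(): build the in-memory hash table on Ri."""
        self.table = defaultdict(list)
        for student_id, name in self.r_partition:
            self.table[student_id].append(name)

    def map(self, log_record):
        """Probe the hash table with one record of Li."""
        log_entry, student_id = log_record
        return [(student_id, name, log_entry)
                for name in self.table.get(student_id, [])]


def directed_join(log, users, num_partitions=3):
    l_parts = prepartition(log, 1, num_partitions)
    r_parts = prepartition(users, 0, num_partitions)
    output = []
    for l_i, r_i in zip(l_parts, r_parts):  # matching partitions only
        task = DirectedJoinMapTask(r_i)
        task.init()
        for record in l_i:
            output.extend(task.map(record))
    return sorted(output)


result = directed_join(LOG, USERS)
```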
Join Algorithms in MR
4. Broadcast Join
 Broadcast Join
– In most applications, |R| << |L|
– Instead of moving both R and L across the network, the smaller table R is broadcast to avoid the network overhead
– A map-only job
– Each map task uses a main-memory hash table for either L or R
15 / 30
Join Algorithms in MR
4. Broadcast Join
 Broadcast Join
– If R is smaller than a split of L
 The hash table is built on R
– If R is larger than a split of L
 The hash table is built on the split of L
 Preprocessing for Broadcast Join
– Most nodes in the cluster have a local copy of R in advance
– This avoids retrieving R from the DFS in the init() function
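The choice of build side in the broadcast join can be sketched for a single map task as below. The function name `broadcast_join_split` is hypothetical, and sizes are compared by record count here as a simplifying assumption; a real job would compare sizes in bytes.

```python
from collections import defaultdict

LOG = [("DB B+", "2008-2424"), ("KRR A", "2010-8281"), ("ML C0", "2009-0078")]
USERS = [("2009-0078", "Song Hyojin"), ("2010-8281", "An Inseok")]


def broadcast_join_split(l_split, r):
    """Join one split of L with the whole broadcast table R."""
    output = []
    table = defaultdict(list)
    if len(r) <= len(l_split):
        # R is the smaller side: build the hash table on R, probe with L.
        for student_id, name in r:
            table[student_id].append(name)
        for log_entry, student_id in l_split:
            for name in table.get(student_id, []):
                output.append((student_id, name, log_entry))
    else:
        # The split of L is smaller: build on the split, probe with R.
        for log_entry, student_id in l_split:
            table[student_id].append(log_entry)
        for student_id, name in r:
            for log_entry in table.get(student_id, []):
                output.append((student_id, name, log_entry))
    return sorted(output)


hash_on_r = broadcast_join_split(LOG, USERS)       # 2 R records, 3 L records
hash_on_l = broadcast_join_split(LOG[1:2], USERS)  # 2 R records, 1 L record
```

Both branches produce the same join result for a given split; only the memory footprint and the probing side differ.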
16 / 30
Join Algorithms in MR
5. Semi-Join
 Semi-Join
– In some applications, only a small fraction of R is referenced by L
 At Facebook, the user table has hundreds of millions of records
 Only a few million unique users are active per hour
– Avoids sending the records in R that will not join with L over the network
 Preprocessing for Semi-Join
– The first two phases of the semi-join can be moved to a preprocessing step
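The slide does not list the phases of the semi-join. The sketch below assumes the usual three-phase breakdown (extract the unique join keys of L, filter R with those keys, then join the filtered R with L); the function names and the extra inactive user row are illustrative assumptions.

```python
LOG = [("DB B+", "2008-2424"), ("KRR A", "2010-8281"), ("ML C0", "2009-0078")]
# "2010-1004" does not appear in LOG, so it plays the role of an inactive user.
USERS = [("2009-0078", "Song Hyojin"), ("2010-8281", "An Inseok"),
         ("2010-1004", "Kim Somin")]


def phase1_unique_keys(log):
    """Phase 1 (assumed): collect the unique join keys that occur in L."""
    return {student_id for _entry, student_id in log}


def phase2_filter_r(users, keys):
    """Phase 2 (assumed): keep only the R records referenced by L."""
    return [(sid, name) for sid, name in users if sid in keys]


def phase3_join(log, filtered_users):
    """Phase 3 (assumed): broadcast-style join of the filtered R with L."""
    table = {sid: name for sid, name in filtered_users}  # n-to-1 join
    return sorted((sid, table[sid], entry)
                  for entry, sid in log if sid in table)


keys = phase1_unique_keys(LOG)
filtered = phase2_filter_r(USERS, keys)
result = phase3_join(LOG, filtered)
```

Only `filtered`, not the whole of R, has to be shipped to the tasks that run the final join.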
17 / 30
Join Algorithms in MR
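6. Per-Split Semi-Join
 Per-Split Semi-Join
– The problem with the semi-join: not every record in the extracted part of R will join with a particular split Li of L
– Here, Ri holds only the records of R that join with Li, so Li can be joined with Ri directly
 Preprocessing for Per-split Semi-join
– Also benefits from moving its first two phases to a preprocessing step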
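A per-split semi-join can be sketched by repeating the key extraction and the filtering of R for every split Li. The phase boundaries and the names `extract_r_i` and `per_split_semi_join` are assumptions for illustration; the slide only states that Li is joined with Ri directly.

```python
USERS = [("2009-0078", "Song Hyojin"), ("2010-8281", "An Inseok"),
         ("2010-1004", "Kim Somin")]
# Two splits of the log table, each a list of (log, student ID) records.
L_SPLITS = [
    [("DB B+", "2008-2424"), ("KRR A", "2010-8281")],
    [("ML C0", "2009-0078"), ("OS A+", "2010-1004")],
]


def extract_r_i(l_split, users):
    """Phases 1 and 2 (assumed): unique keys of Li, then the matching part of R."""
    keys_i = {student_id for _entry, student_id in l_split}
    return [(sid, name) for sid, name in users if sid in keys_i]


def per_split_semi_join(l_splits, users):
    output = []
    for l_i in l_splits:
        r_i = extract_r_i(l_i, users)
        # Phase 3 (assumed): join Li directly with its own Ri.
        table = {sid: name for sid, name in r_i}  # n-to-1 join
        output.extend((sid, table[sid], entry)
                      for entry, sid in l_i if sid in table)
    return sorted(output)


r_0 = extract_r_i(L_SPLITS[0], USERS)
result = per_split_semi_join(L_SPLITS, USERS)
```

Every record in Ri is guaranteed to join with Li, at the cost of extracting a separate Ri per split.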
18 / 30
Contents
 Introduction
 Join Algorithms In MapReduce
 Experimental Evaluation
1. Environment
2. Datasets
3. MapReduce Time Breakdown
4. Experimental Results
 Discussion
 Conclusion
19 / 30
Experimental Evaluation
1. Environment
 System Specification
– All experiments run on a 100-node cluster
– Each node has a single 2.4GHz Intel Core 2 Duo processor
– 4GB of DRAM and two SATA disks
– Red Hat Enterprise Server 5.2 running Linux 2.6.18
 Network Specification
– The 100 nodes were spread across two racks
– Each node can execute two map and two reduce tasks concurrently
– Each rack had its own gigabit Ethernet switch
– The rack level bandwidth is 32Gb/s
– Under full load, 35MB/s cross-rack node-to-node bandwidth
 Hadoop version 0.19.0, HDFS (128MB block size)
20 / 30
Experimental Evaluation
2. Datasets
 Datasets
                   Event Log (L)         User Info (R)
Join column size   10 bytes              5 bytes
Record size        100 bytes (average)   100 bytes (exactly)
Total size         500GB                 10MB~100GB
• The join result is a 10-byte join key
• n-to-1 join
• Many users are inactive
• All the records in L always appear in the result
• The fraction of R that was referenced by L was fixed to be 0.1%, 1%, or 10%
• To simulate some active users, a Zipf distribution was used
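One way to generate such a workload is sketched below: a fixed fraction of R's keys is marked as referenced, and log records draw from those keys with Zipf-like weights. The exponent `s=1.0`, the seed, and the function names are assumptions; the slides do not give the parameters that were actually used.

```python
import random
from collections import Counter


def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: rank k gets weight 1 / k**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def generate_log_keys(r_keys, referenced_fraction, num_log_records,
                      s=1.0, seed=0):
    """Draw join keys for L from the referenced part of R, Zipf-weighted."""
    rng = random.Random(seed)
    n_ref = max(1, round(len(r_keys) * referenced_fraction))
    referenced = r_keys[:n_ref]  # the "active" users
    weights = zipf_weights(n_ref, s)
    return rng.choices(referenced, weights=weights, k=num_log_records)


r_keys = [f"u{i:04d}" for i in range(1000)]
log_keys = generate_log_keys(r_keys, referenced_fraction=0.01,
                             num_log_records=5000)
most_active_user = Counter(log_keys).most_common(1)[0][0]
```

With a 1% referenced fraction, only 10 of the 1000 keys can ever appear in the log, and the lowest-ranked key is drawn most often.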
21 / 30
Experimental Evaluation
3. MapReduce Time Breakdown
22 / 30
Experimental Evaluation
3. MapReduce Time Breakdown
 MapReduce Time Breakdown
– What transpires during the execution of a MapReduce job
– The overhead of various execution components of MapReduce
– System Environment
 The standard repartition join algorithm
 500GB log table and 30MB reference table
 1% of the reference table actually referenced by the log records
 4000 map tasks and 200 reduce tasks
 Each node was assigned 40 map and 2 reduce tasks
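The task counts are consistent with the cluster setup, as the short calculation below shows. It assumes binary units, one map task per 128MB HDFS block of the log table, and that the 30MB reference table is small enough to ignore.

```python
GB = 1024 ** 3
MB = 1024 ** 2

NODES = 100
log_table_bytes = 500 * GB
block_bytes = 128 * MB  # HDFS block size from the environment slide

# One map task per block of the log table.
map_tasks = log_table_bytes // block_bytes
reduce_tasks = 200

map_tasks_per_node = map_tasks // NODES
reduce_tasks_per_node = reduce_tasks // NODES

print(map_tasks, map_tasks_per_node, reduce_tasks_per_node)
```

Since each node runs two map tasks at a time, its 40 map tasks execute in roughly 20 waves, while its 2 reduce tasks fit in a single wave.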
23 / 30
Experimental Evaluation
3. MapReduce Time Breakdown
 Interesting Observations on MapReduce
– The map phase was clearly CPU-bound
– The reduce phase was limited by the network bandwidth
 Writing the three copies of the join result to HDFS
– The disk and the network activities were moderate and periodic during the map phase
 The peaks were related to the output generation in the map tasks and the shuffle phase in the reduce tasks
– Almost idle for about 30 seconds between the 9 min and 10 min mark
 Waiting for the slowest map task
– By enabling independent and concurrent map tasks, almost all CPU, disk and network activities can be overlapped
24 / 30
Experimental Evaluation
4. Experimental Results
[Figures: experimental results with no preprocessing and with preprocessing]
25 / 30
Experimental Evaluation
4. Experimental Results
26 / 30
Contents
 Introduction
 Join Algorithms In MapReduce
 Experimental Evaluation
 Discussion
 Conclusion
27 / 30
Discussion
 Choosing the Right Strategy
– To determine what is the right join strategy for a given circumstance
– To provide an important first step for query optimization
28 / 30
Contents
 Introduction
 Join Algorithms In MapReduce
 Experimental Evaluation
 Discussion
 Conclusion
29 / 30
Conclusion
 Joining log data with reference data in MapReduce has emerged
as an important part of
– Analytic operations for enterprise customers
– Web 2.0 companies
 A series of join algorithms was designed on top of MapReduce
– Without requiring any modification to the actual framework
– Many details for efficient implementation were proposed
 Two additional functions: init(), close()
 Practical preprocessing techniques
 Future work
– Multi-way joins
– Indexing methods to speed up join queries
– Optimization module (selecting appropriate join algorithms)
– New programming models to extend the MapReduce framework
30 / 30