Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li
Speaker: Razvan Belet
Outline
• Motivating Scenarios
• Background Knowledge
• Parallel Set-Similarity Join
– Self Join
– R-S Join
• Evaluation
• Conclusions
• Strengths & Weaknesses
Scenario: Detecting Plagiarism
• Before publishing a journal issue, editors have to make sure
there are no plagiarized papers among the hundreds of
papers to be included
Scenario: Near-Duplicate Elimination
• The archive of a search engine can contain multiple
copies of the same page
• Reasons: re-crawling, different hosts serving
redundant copies of the same page, etc.
Problem Statement
Problem Statement:
Given two collections of objects/items/records, a similarity
metric sim(o1, o2), and a threshold λ, find the pairs of
objects/items/records satisfying sim(o1, o2) ≥ λ
Solution:
• Similarity Join
Motivation (2)
• Some of the collections are enormous:
– Google N-gram database: ~1 trillion records
– GenBank: 416 GB of data
– Facebook: 400 million active users
Try to process this data in
a parallel, distributed way
=> MapReduce
Outline
• Motivating Scenarios
• Background Knowledge
• Parallel Set-Similarity Join
– Self Join
– R-S Join
• Evaluation
• Conclusions
Background Knowledge
• Join
• Similarity Join
• Set-Similarity Join
Background Knowledge: Join
• Logical operator heavily used in Databases
• Whenever records in 2 tables need to be associated
=> use a JOIN
• Associates records in the 2 input tables based on a
predicate (pred)
Consider this information need: for each
employee, find the department he works in
Table Employees
LastName  | DepartmentID
Rafferty  | 31
Jones     | 33
Steinberg | 33
Robinson  | 34
Smith     | 34
John      | NULL

Table Departments
DepartmentID | DepartmentName
31           | Sales
33           | Engineering
34           | Clerical
35           | Marketing
Background Knowledge: Join
• Example: for each employee, find the department he works in

EMPLOYEES                     DEPARTMENTS
LastName  | DepID             DepartmentID | DepartmentName
Rafferty  | 31                31           | Sales
Jones     | 33                33           | Engineering
Steinberg | 33                34           | Clerical
Robinson  | 34                35           | Marketing
Smith     | 34
John      | NULL

JOINpred with pred: EMPLOYEES.DepID = DEPARTMENTS.DepartmentID

JOIN RESULT
LastName  | DepartmentName
Rafferty  | Sales
Jones     | Engineering
Steinberg | Engineering
…         | …
Background Knowledge: Similarity Join
• Special type of join, in which the predicate (pred) is a
similarity metric/function: sim(obj1,obj2)
T1(a, b, c)    Similarity Joinpred    T2(d, e, c)

pred: sim(T1.c, T2.c) > threshold

• Return the pair (obj1, obj2) if pred holds:
sim(obj1, obj2) > threshold
• Result rows have schema (a, b, c, d, e)
Background Knowledge: Similarity Join
• Examples of sim(obj1, obj2) functions:

sim(paper1, paper2) = (# of common words) / (# of total words in the 2 papers)

sim(Si, Tj) = |Si ∩ Tj| / |Si ∪ Tj|   (Jaccard similarity)
Si: most common words in page i
Tj: most common words in page j
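Both example functions are easy to express in code. The following Python sketch (illustrative, not from the paper) implements the word-overlap measure and the Jaccard measure; the sample inputs are made up.

def overlap_sim(paper1: str, paper2: str) -> float:
    # (# of common words) / (# of total words in the 2 papers)
    words1, words2 = paper1.split(), paper2.split()
    common = set(words1) & set(words2)
    return len(common) / (len(words1) + len(words2))

def jaccard_sim(s: set, t: set) -> float:
    # |S ∩ T| / |S ∪ T|
    return len(s & t) / len(s | t) if s | t else 0.0

print(overlap_sim("set similarity join", "parallel similarity join"))  # 2/6
print(jaccard_sim({"John", "W.", "Smith"}, {"Smith", "John"}))          # 2/3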
Similarity Join
• sim(obj1, obj2), with obj1, obj2: documents, records in DB
tables, user profiles, images, etc.
• Particular class of similarity joins:
(string/text-)similarity join: obj1, obj2 are strings/texts
T1(a, b, Name)                   T2(d, e, Name)
Name                             Name
John W. Smith                    Smith, John
Marat Safin                      Safin, Marat Michailowitsch
Rafael P. Nadal                  Nadal, Rafael Parera
…                                …

SimilarityJoinpred with
pred: sim(T1.Name, T2.Name) > 2
sim(T1.Name, T2.Name) = # of common words
• Many real-world applications => of particular interest
Set-Similarity Join(SSJoin)
• SSJoin: a powerful primitive for supporting
(string-)similarity joins
• Input: 2 collections of sets
• Goal: Identify all pairs of highly similar sets
S1 = {word1, word2, …, wordn}        T1 = {word1, word2, …, wordn}
S2 = {…}                             T2 = {…}
…                                    …
Sn = {…}                             Tn = {…}

SSJoinpred with
pred: sim(Si, Ti) > 0.3
sim(Si, Ti) = |Si ∩ Ti| / |Si ∪ Ti|
Set-Similarity Join
• How can a (string-)similarity join be
reduced to a SSJoin?
(SimilarityJoin is implemented based on SSJoin)
• Example: tokenize each string into its set of words

T1                               T2
Name                             Name
{John, W., Smith}                {Smith, John}
{Marat, Safin}                   {Safin, Marat, Michailowitsch}
{Rafael, P., Nadal}              {Nadal, Rafael, Parera}
…                                …

SSJoinpred with
pred: sim(T1.Name, T2.Name) > 0.5
sim(Si, Ti) = |Si ∩ Ti| / |Si ∪ Ti|
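As a small, hedged illustration of this reduction, the Python sketch below tokenizes the example name strings into word sets and applies the Jaccard predicate from the slide; the tokenizer and the nested-loop join are illustrative simplifications.

def tokenize(s: str) -> frozenset:
    # split a name string into its set of words, dropping commas
    return frozenset(s.replace(",", " ").split())

left  = ["John W. Smith", "Marat Safin", "Rafael P. Nadal"]
right = ["Smith, John", "Safin, Marat Michailowitsch", "Nadal, Rafael Parera"]

for a in left:
    for b in right:
        sa, sb = tokenize(a), tokenize(b)
        sim = len(sa & sb) / len(sa | sb)
        if sim > 0.5:   # pred from the slide
            print(a, "<->", b, round(sim, 2))
# note: the Nadal pair scores exactly 0.5 and therefore just misses pred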
Set-Similarity Join
• Most SSJoin algorithms are signature-based:
INPUT: set collections R and S and threshold λ
1. For each r ∈ R, generate signature-set Sign(r)
2. For each s ∈ S, generate signature-set Sign(s)
3. Generate all candidate pairs (r, s), r ∈ R, s ∈ S, satisfying
Sign(r) ∩ Sign(s) ≠ ∅        (filtering phase)
4. Output any candidate pair (r, s) satisfying sim(r, s) ≥ λ
(post-filtering phase)
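The recipe above can be condensed into a few lines of Python. This is a semantic sketch, not the paper's implementation: signature is a placeholder for any correct scheme, the toy signature used in the demo (the set's smallest token) is invented for illustration, and a real system groups records by signature token instead of enumerating all pairs.

from itertools import product

def ssjoin(R, S, sim, signature, lam):
    # R, S: collections of frozensets; signature(x) returns a set of tokens
    sign_r = {r: signature(r) for r in R}                 # step 1
    sign_s = {s: signature(s) for s in S}                 # step 2
    candidates = [(r, s) for r, s in product(R, S)        # step 3: filtering
                  if sign_r[r] & sign_s[s]]
    return [(r, s) for r, s in candidates                 # step 4: post-filtering
            if sim(r, s) >= lam]

jaccard = lambda a, b: len(a & b) / len(a | b)
toy_signature = lambda x: {min(x)}                        # illustrative only
R = [frozenset({"A", "B", "C"})]
S = [frozenset({"A", "B"})]
print(ssjoin(R, S, jaccard, toy_signature, 0.5))          # one surviving pair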
Set-Similarity Join
• Signatures:
– Have a filtering effect: the SSJoin algorithm compares
only candidate pairs, not all pairs (in the post-filtering phase)
– Determine the efficiency of the SSJoin algorithm: the
smaller the number of candidate pairs, the better
– Ensure correctness: Sign(r) ∩ Sign(s) ≠ ∅ whenever
sim(r, s) ≥ λ
Set-Similarity Join: Signatures Example
• One possible signature scheme: prefix-filtering
• Compute a global ordering of tokens:
Marat … W. … Safin … Rafael … Nadal … P. … Smith … John
• Compute the signature of each input set: take the prefix of
length n under the global ordering
Name sets: {John, W., Smith}, {Marat, Safin}, {Rafael, P., Nadal}, …

Sign({John, W., Smith}) = [W., Smith]
Sign({Marat, Safin}) = [Marat, Safin]
Sign({Rafael, P., Nadal}) = [Rafael, Nadal]
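A possible Python rendering of this signature scheme follows. The prefix length |s| - ceil(λ·|s|) + 1 is the standard choice for a Jaccard threshold λ (the slide abbreviates it as "length n"); the ordering dictionary reproduces the slide's example and λ = 0.5 is assumed.

from math import ceil

def prefix_signature(s: set, order: dict, lam: float) -> list:
    # sort the set by the global ordering, then keep the standard prefix
    tokens = sorted(s, key=lambda t: order[t])
    n = len(tokens) - ceil(lam * len(tokens)) + 1
    return tokens[:n]

# global ordering from the slide: rarest token first
order = {t: i for i, t in enumerate(
    ["Marat", "W.", "Safin", "Rafael", "Nadal", "P.", "Smith", "John"])}

print(prefix_signature({"John", "W.", "Smith"}, order, 0.5))    # ['W.', 'Smith']
print(prefix_signature({"Rafael", "P.", "Nadal"}, order, 0.5))  # ['Rafael', 'Nadal']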
Set-Similarity Join
• Filtering Phase: Before doing the actual
SSJoin, cluster/group the candidates
(Tokenized records from both inputs are hashed on their signature
tokens into candidate buckets: cluster/bucket1, cluster/bucket2, …,
cluster/bucketN)

• Run the SSJoin on each cluster => less workload
Outline
• Motivating Scenarios
• Background Knowledge
• Parallel Set-Similarity Join
– Self Join
– R-S Join
• Evaluation
• Conclusions
• Strengths & Weaknesses
Parallel Set-Similarity Join
• Method comprises 3 stages:

Stage I: Token Ordering
– compute data statistics for good signatures

Stage II: RID-Pair Generation
– group candidates based on signature & compute the SSJoin

Stage III: Record Join
– generate the actual pairs of joined records
Explanation of input data
• RID = Row ID
• a: join column
• "A B C" is a string, e.g.:
– Address: "14th Saarbruecker Strasse"
– Name: "John W. Smith"
Stage I: Data Statistics
Stage I: Token Ordering. Two alternatives:
• Basic Token Ordering (BTO)
• One-Phase Token Ordering (OPTO)
Token Ordering
• Creates a global ordering of the tokens in
the join column, based on their frequency
a (join column)
RID 1: "A B D A A"
RID 2: "B B D A E"

Global ordering (based on frequency):
E (freq 1), D (freq 2), B (freq 3), A (freq 4)
Basic Token Ordering(BTO)
• 2 MapReduce cycles:
– 1st : computing token frequencies
– 2nd: ordering the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
map:
• tokenize the join value of each record
• emit each token with a count of 1
reduce:
• for each token, compute the total count (frequency)
Basic Token Ordering – 2nd MapReduce cycle
map:
• interchange key and value: emit (frequency, token)
reduce (use only 1 reducer):
• emit the tokens, now sorted by frequency
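Both cycles can be simulated in plain Python (ordinary functions stand in for Hadoop mappers and reducers, so this is a sketch of the data flow, not the actual Hadoop job). With the two records from the Token Ordering example it reproduces the global ordering E, D, B, A.

from collections import defaultdict

records = {1: "A B D A A", 2: "B B D A E"}   # RID -> join value

# 1st cycle, map: tokenize each join value, emit (token, 1)
mapped = [(tok, 1) for a in records.values() for tok in a.split()]
# 1st cycle, reduce: total count (frequency) per token
freq = defaultdict(int)
for tok, one in mapped:
    freq[tok] += one

# 2nd cycle, map: interchange key and value, emit (frequency, token)
pairs = [(f, tok) for tok, f in freq.items()]
# 2nd cycle, reduce (single reducer): keys arrive sorted, emit the tokens
global_ordering = [tok for f, tok in sorted(pairs)]
print(global_ordering)   # ['E', 'D', 'B', 'A']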
One-Phase Token Ordering (OPTO)
• Alternative to Basic Token Ordering (BTO):
– uses only one MapReduce cycle (less I/O)
– sorts the tokens in memory, instead of using a
second MapReduce cycle
OPTO – Details
map:
• tokenize the join value of each record
• emit each token with a count of 1
reduce:
• for each token, compute the total count (frequency)
• in the reducer's tear-down (close) method, sort the
tokens by frequency in memory and emit them
Stage II: Group Candidates & Compute SSJoin
Stage II: RID-Pair Generation. Alternatives:
• Routing: Individual Tokens Grouping vs. Grouped Tokens Grouping
• Kernel: Basic Kernel vs. PPJoin+
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records
satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the
previous stage
RID-Pair Generation: Map Phase
• scan input records and for each record:
– project it on RID & join attribute
– tokenize it
– extract prefix according to global ordering of tokens obtained in the
Token Ordering stage
– route tokens to appropriate reducer
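A hedged Python sketch of this mapper is shown below; the function name rid_pair_map and the fixed prefix length of 2 are illustrative choices that mirror the routing example on the next slide, not the paper's code.

def rid_pair_map(rid: int, join_value: str, order: dict, prefix_len: int = 2):
    # project on (RID, join attribute), tokenize, sort by the global
    # ordering, and route the record once per prefix token
    tokens = sorted(join_value.split(), key=lambda t: order[t])
    for key_token in tokens[:prefix_len]:
        yield key_token, (rid, join_value)   # the key decides the reducer

# global ordering by ascending frequency (A and B are the rarest)
order = {"A": 0, "B": 1, "E": 2, "D": 3, "G": 4, "C": 5, "F": 6}
print(list(rid_pair_map(1, "A B C", order)))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]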
Grouping/Routing Strategies
• Goal: distribute candidates to the right
reducers to minimize the reducers' workload
• Like hashing (projected) records into the
corresponding candidate buckets
• Each reducer handles one or more
candidate buckets
• 2 routing strategies:
Using Individual Tokens
Using Grouped Tokens
Routing: using individual tokens
• Treats each prefix token as a key
• For each (projected) record, generates a (key, value) pair
for each of its prefix tokens:
Example:
• Given the global ordering:

Token:     A   B   E   D   G   C   F
Frequency: 10  10  22  23  23  40  48

“A B C”
=> prefix of length 2: A, B
=> generate/emit 2 (key, value) pairs:
• (A, (1, “A B C”))
• (B, (1, “A B C”))
Grouping/Routing: using individual tokens
• Advantage:
– high-quality grouping of candidates (pairs of
records that have no chance of being similar
are never routed to the same reducer)
• Disadvantage:
– high replication of data (the same records might
be checked for similarity in multiple reducers,
i.e. redundant work)
Routing: Using Grouped Tokens
• Multiple tokens are mapped to one synthetic key
(different tokens can be mapped to the same key)
• For each record, generates a (key, value) pair
for each group its prefix tokens belong to:
Routing: Using Grouped Tokens
Example:
• Given the global ordering:

Token:     A   B   E   D   G   C   F
Frequency: 10  10  22  23  23  40  48

“A B C” => prefix of length 2: A, B
Suppose A, B belong to group X and
C belongs to group Y
=> generate/emit 2 (key, value) pairs:
• (X, (1, “A B C”))
• (Y, (1, “A B C”))
Grouping/Routing: Using Grouped Tokens
• The groups of tokens (X, Y) are formed by assigning
tokens to groups in a round-robin manner
Token:     A   B   E   D   G   C   F
Frequency: 10  10  22  23  23  40  48

Group1 = {A, D, F}    Group2 = {B, G}    Group3 = {E, C}
• Groups will be balanced w.r.t. the sum of the
frequencies of the tokens belonging to each
group (see the sketch below)
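The round-robin assignment is a one-liner in Python; the sketch below (illustrative, not the paper's code) walks the global ordering and deals tokens to groups like cards, reproducing the example's three groups.

tokens_by_freq = [("A", 10), ("B", 10), ("E", 22), ("D", 23),
                  ("G", 23), ("C", 40), ("F", 48)]    # ascending frequency

num_groups = 3
groups = [[] for _ in range(num_groups)]
for i, (tok, freq) in enumerate(tokens_by_freq):
    groups[i % num_groups].append(tok)     # round-robin assignment

print(groups)   # [['A', 'D', 'F'], ['B', 'G'], ['E', 'C']]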
Grouping/Routing: Using Grouped Tokens
• Advantage:
– replication of data is less pervasive
• Disadvantage:
– quality of grouping is not as high (records
having no chance of being similar are sent to
the same reducer, which checks their
similarity)
RID-Pair Generation: Reduce Phase
• This is the core of the entire method
• Each reducer processes one or more buckets
• In each bucket, the reducer looks for pairs of join-attribute
values satisfying the join predicate
• If the similarity of the 2 candidates >= threshold
=> output their RIDs together with their similarity
RID-Pair Generation: Reduce Phase
• Computing similarity of the candidates in a
bucket comes in 2 flavors:
• Basic Kernel : uses 2 nested loops to verify each
pair of candidates in the bucket
• Indexed Kernel : uses a PPJoin+ index
RID-Pair Generation: Basic Kernel
• Straightforward method for finding candidates
satisfying the join predicate
• Quadratic complexity: O(#candidates²)

reduce:
  foreach candidate in bucket
    foreach cand in bucket \ {candidate}
      if sim(candidate, cand) >= threshold
        emit((candidateRID, candRID), sim)
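The pseudocode translates directly into runnable Python. The sketch below assumes Jaccard as the similarity function and iterates with i < j so each unordered pair is verified once:

def basic_kernel(bucket, threshold):
    # bucket: list of (rid, token_set); yields ((rid1, rid2), sim)
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            (rid1, s1), (rid2, s2) = bucket[i], bucket[j]
            sim = len(s1 & s2) / len(s1 | s2)
            if sim >= threshold:
                yield (rid1, rid2), sim

bucket = [(1, {"A", "B", "C"}), (2, {"A", "B"}), (3, {"D", "E"})]
print(list(basic_kernel(bucket, 0.5)))   # [((1, 2), 0.666...)]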
RID-Pair Generation: PPJoin+
• Uses a special index data structure
• Not so straightforward to implement
• Much more efficient

reduce:
  probe the PPJoinIndex with the join-attribute value of current_candidate
    => a list of RIDs satisfying the join predicate
  add current_candidate to the PPJoinIndex
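For intuition only, here is a greatly simplified Python stand-in for the indexed kernel: it streams candidates through an inverted index on tokens, probing before inserting. The real PPJoin+ index additionally applies prefix, positional, and suffix filtering, so this should not be read as the actual algorithm.

from collections import defaultdict

def indexed_kernel(bucket, threshold):
    index = defaultdict(list)             # token -> [(rid, token_set), ...]
    for rid, s in bucket:
        seen = set()                      # avoid re-checking the same RID
        for tok in s:                     # probe the index
            for rid2, s2 in index[tok]:
                if rid2 not in seen:
                    seen.add(rid2)
                    sim = len(s & s2) / len(s | s2)
                    if sim >= threshold:
                        yield (rid2, rid), sim
        for tok in s:                     # add the current candidate
            index[tok].append((rid, s))

bucket = [(1, {"A", "B", "C"}), (2, {"A", "B"}), (3, {"D", "E"})]
print(list(indexed_kernel(bucket, 0.5)))  # [((1, 2), 0.666...)]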
Stage III: Generate pairs of joined records
Stage III: Record Join. Two alternatives:
• Basic Record Join (BRJ)
• One-Phase Record Join (OPRJ)
Record Join
• Until now we have only pairs of RIDs, but we need
actual records
• Use the RID pairs generated in the previous stage to
join the actual records
• Main idea:
– bring in the rest of each record (everything except the
RID, which we already have)
• 2 approaches:
– Basic Record Join (BRJ)
– One-Phase Record Join (OPRJ)
Record Join: Basic Record Join
• Uses 2 MapReduce cycles
– 1st cycle: fills in the record information for each half of each pair
– 2nd cycle: brings the two filled-in halves together
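A single-process Python sketch of the two BRJ cycles follows (data and variable names are illustrative; the real jobs read the record file and the RID-pair file as MapReduce inputs):

from collections import defaultdict

records = {1: "A B D A A", 2: "B B D A E"}    # RID -> full record
rid_pairs = [((1, 2), 0.6)]                   # Stage II output (made-up sim)

# 1st cycle: key each pair by both of its RIDs, join with the records
by_rid = defaultdict(list)
for (r1, r2), sim in rid_pairs:
    by_rid[r1].append(((r1, r2), sim))
    by_rid[r2].append(((r1, r2), sim))
halves = defaultdict(list)                    # pair id -> filled-in halves
for rid, pair_refs in by_rid.items():
    for pair_id, sim in pair_refs:
        halves[pair_id].append((rid, records[rid], sim))

# 2nd cycle: group by pair id to bring both halves together
for pair_id, hs in sorted(halves.items()):
    (r1, rec1, sim), (r2, rec2, _) = sorted(hs)
    print(pair_id, rec1, "|", rec2, sim)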
Record Join: One Phase Record Join
• Uses only one MapReduce cycle
R-S Join
• Challenge: we now have 2 different record sources =>
2 different input streams
• A MapReduce job works on a single input stream
• The 2nd and 3rd stages are affected
• Solution: extend the (key, value) pairs so that they include a
relation tag for each record
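A minimal sketch of this tagging, assuming a mapper that routes on tokens as in Stage II (the tuple layout is illustrative):

def tagged_map(relation: str, rid: int, join_value: str):
    # carry the source relation name in the value so the reducer can
    # tell the two input streams apart
    for tok in join_value.split():
        yield tok, (relation, rid, join_value)

print(list(tagged_map("R", 1, "A B")))
# [('A', ('R', 1, 'A B')), ('B', ('R', 1, 'A B'))]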
Outline
• Motivating Scenarios
• Background Knowledge
• Parallel Set-Similarity Join
– Self Join
– R-S Join
• Evaluation
• Conclusions
• Strengths & Weaknesses
Evaluation
• Cluster: 10-node IBM x3650 cluster, running Hadoop
• Data sets:
• DBLP: 1.2M publications
• CITESEERX: 1.3M publications
• Consider only the header of each paper (i.e. author, title,
date of publication, etc.)
• Data size synthetically increased (by various factors)
• Measure:
• Absolute running time
• Speedup
• Scaleup
Self-Join running time
• Best algorithm: BTO-PK-OPRJ
• Most expensive stage:
RID-pair generation
Self-Join Speedup
• Fixed data size, varying the
cluster size
• Best time: BTO-PK-OPRJ
Self-Join Scaleup
• Increase data size and
cluster size together by
the same factor
• Best time: BTO-PK-OPRJ
R-S Join Performance
• Mostly the same behavior as in the self-join case
Outline
• Motivating Scenarios
• Background Knowledge
• Parallel Set-Similarity Join
– Self Join
– R-S Join
• Evaluation
• Conclusions
• Strengths & Weaknesses
Conclusions
• Efficient way of computing Set-Similarity Joins
• Useful in many data cleaning scenarios
• SSJoin + MapReduce: a solution for huge
datasets
• Very efficient when based on prefix-filtering and
PPJoin+
• Scales up nicely
Strengths & Weaknesses
• Strengths:
– More efficient than single-node/local SSJoin
– More failure-tolerant than single-node SSJoin
– Uses powerful filtering methods (routing strategies)
– Uses the PPJoinIndex (a data structure optimized for SSJoin)
• Weaknesses:
– This implementation is applicable only to string-based input
data
– Assumes the token dictionary and the RID-pair list fit in main
memory
– Repeated tokenization across stages
– Evaluation based on synthetically increased data
Questions
Thank you!