Efficient Processing of k Nearest Neighbor Joins using MapReduce
INTRODUCTION
• k nearest neighbor join (kNN join) is a special type of join
that combines each object in a dataset R with the k objects
in another dataset S that are closest to it.
• As a combination of the k nearest neighbor (kNN) query
and the join operation, kNN join is an expensive operation.
• Most existing work relies on a centralized indexing
structure such as the B+-tree or the R-tree, which cannot
be deployed directly in a distributed and parallel
environment.
AN OVERVIEW OF KNN JOIN USING MAPREDUCE
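• Basic strategy: $R = \bigcup_{1 \le i \le N} R_i$, where $R_i \cap R_j = \emptyset$ for $i \ne j$; each
subset $R_i$ is distributed to a reducer. $S$ has to be sent to
every reducer to be joined with $R_i$; finally $R \propto S = \bigcup_{1 \le i \le N} R_i \propto S$.
Cost (objects sent to reducers): $|R| + N \cdot |S|$.
• H-BRJ: splits both $R$ and $S$ into $\sqrt{n}$ subsets,
$R = \bigcup_{1 \le i \le \sqrt{n}} R_i$ and $S = \bigcup_{1 \le i \le \sqrt{n}} S_i$.
• Better strategy: send each reducer only a subset $S_i \subseteq S$ such that
$R_i \propto S = R_i \propto S_i$, so that $R \propto S = \bigcup_{1 \le i \le N} R_i \propto S_i$.
Cost: $|R| + \alpha \cdot |S|$, where $\alpha$ is the average number of times an object of $S$ is replicated.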
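The two cost expressions above can be compared with a few lines of arithmetic. This is only an illustrative sketch: the function names and the dataset sizes are made up for the example, and "cost" is taken to mean the number of objects shipped to reducers.

```python
def shuffle_cost_basic(r_size: int, s_size: int, n_reducers: int) -> int:
    """Basic strategy: each reducer gets one R_i plus a full copy of S."""
    return r_size + n_reducers * s_size


def shuffle_cost_replicated(r_size: int, s_size: int, alpha: float) -> float:
    """Better strategy: each object of S is shipped alpha times on average."""
    return r_size + alpha * s_size


# Hypothetical sizes, chosen only to show the gap between the two strategies.
basic = shuffle_cost_basic(1_000_000, 1_000_000, 16)
better = shuffle_cost_replicated(1_000_000, 1_000_000, 2.5)
print(basic, better)
```

The smaller the average replication factor, the closer the cost gets to shipping R and S exactly once each.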
AN OVERVIEW OF KNN JOIN USING MAPREDUCE
• In summary, to minimize the join cost, we need to
1. find a good partitioning of $R$;
2. find the minimal set $S_i$ for each $R_i \subseteq R$, given a
partitioning of $R$.
※ The minimum set is $S_i = \bigcup_{1 \le j \le |R_i|} KNN(r_j, S)$ with $r_j \in R_i$. However, it
is impossible to know the k nearest neighbors of every $r_j$
a priori.
HANDLING KNN JOIN USING MAPREDUCE
DATA PREPROCESSING
• A good partitioning of R for optimizing kNN join should
cluster objects based on their proximity.
• Random Selection
• Farthest Selection
• k-means Selection
※ It is not easy to find pivots.
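The slide only names the three pivot selection strategies. As a rough sketch of one of them, the code below shows one possible reading of farthest selection: start from a random object, then repeatedly add the object whose summed distance to the pivots chosen so far is largest. The function name, the `seed` parameter, and the use of Euclidean distance are assumptions made for illustration.

```python
import math
import random


def farthest_selection(objects, num_pivots, seed=0):
    """Pick pivots that are spread out (assumed reading of 'farthest selection')."""
    rng = random.Random(seed)
    pivots = [rng.choice(objects)]          # first pivot is chosen at random
    target = min(num_pivots, len(set(objects)))
    while len(pivots) < target:
        candidates = [o for o in objects if o not in pivots]
        # Next pivot: the object with the largest summed distance to current pivots.
        nxt = max(candidates, key=lambda o: sum(math.dist(o, p) for p in pivots))
        pivots.append(nxt)
    return pivots


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
print(farthest_selection(points, 3))
```

Random selection would replace the loop with a plain random sample, and k-means selection would use cluster centers instead.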
First MapReduce Job
• Perform data partitioning and collect statistics for
each partition.
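A minimal single-machine sketch of what this job might compute, assuming each object is assigned to its closest pivot and that the statistics are the object count plus the smallest and largest distance to the pivot (the L(P) and U(P) used on later slides). The function name and the dictionary layout are hypothetical.

```python
import math
from collections import defaultdict


def partition_with_stats(objects, pivots):
    """Assign each object to its closest pivot and summarise every partition."""
    parts = defaultdict(list)
    for o in objects:
        dists = [math.dist(o, p) for p in pivots]
        i = min(range(len(pivots)), key=dists.__getitem__)  # index of closest pivot
        parts[i].append((o, dists[i]))
    stats = {}
    for i, members in parts.items():
        ds = [d for _, d in members]
        # L and U are the min and max distance from the pivot to its members.
        stats[i] = {"count": len(ds), "L": min(ds), "U": max(ds)}
    return dict(parts), stats


parts, stats = partition_with_stats(
    [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0), (10.0, 0.0)],
    [(0.0, 0.0), (10.0, 0.0)],
)
print(stats)
```

In an actual MapReduce job the assignment would run in the mappers and the per-partition summary would be merged afterwards; that wiring is omitted here.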
Second MapReduce Job
• Distance bound of kNN
$ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|$
$\theta_i = \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|$ ①
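The bound can be sketched as follows, under the assumption that $s$ lies in a partition of $S$ with pivot $p_j$ and that $KNN(P_i^R, S)$ stands for the k objects of $S$ with the smallest upper bounds. The function name and the input layout are hypothetical, and the sketch scans all objects of S rather than per-partition summaries.

```python
import heapq
import math


def theta(pivot_i, u_i, s_partitions, k):
    """Upper bound of the kNN distance for the R-partition with pivot pivot_i.

    s_partitions: list of (pivot_j, [objects of that S-partition]).
    u_i: U(P_i^R), the largest pivot-to-object distance inside the R-partition.
    """
    ubs = []
    for pivot_j, members in s_partitions:
        gap = math.dist(pivot_i, pivot_j)
        for s in members:
            # ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|
            ubs.append(u_i + gap + math.dist(pivot_j, s))
    # theta_i is the largest of the k smallest upper bounds.
    return max(heapq.nsmallest(k, ubs))


s_parts = [
    ((0.0, 0.0), [(1.0, 0.0), (2.0, 0.0)]),
    ((10.0, 0.0), [(10.0, 0.0), (11.0, 0.0)]),
]
print(theta((0.0, 0.0), 1.0, s_parts, k=2))
```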
Second MapReduce Job
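• Finding $S_i$ for $R_i$
$lb(s, P_i^R) = \max\{0, |p_i, p_j| - U(P_i^R) - |s, p_j|\}$ ②
if $lb(s, P_i^R) > \theta_i$ ③
then $s \notin KNN(P_i^R, S)$
$LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - \theta_i$
if $|s, p_j| \ge LB(P_j^S, P_i^R)$
then $s$ may belong to $KNN(P_i^R, S)$ and is assigned to $S_i$,
i.e. $|s, p_j| \in [LB(P_j^S, P_i^R), U(P_j^S)]$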
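As a small sketch of this filtering step, the code below computes the partition-level bound once per S-partition and keeps only the objects whose distance to their own pivot reaches it. The function names and input layout are hypothetical; Euclidean distance is assumed.

```python
import math


def lower_bound(pivot_i, u_i, theta_i, pivot_j):
    """LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - theta_i."""
    return math.dist(pivot_i, pivot_j) - u_i - theta_i


def select_candidates(pivot_i, u_i, theta_i, s_partitions):
    """Collect S_i: objects s with |s, p_j| >= LB(P_j^S, P_i^R)."""
    s_i = []
    for pivot_j, members in s_partitions:
        bound = lower_bound(pivot_i, u_i, theta_i, pivot_j)
        s_i.extend(s for s in members if math.dist(s, pivot_j) >= bound)
    return s_i


s_parts = [
    ((0.0, 0.0), [(1.0, 0.0)]),                           # LB = -6: everything kept
    ((10.0, 0.0), [(10.0, 0.0), (7.0, 0.0), (6.0, 0.0)]),  # LB = 4: only (6, 0) kept
]
print(select_candidates((0.0, 0.0), 1.0, 5.0, s_parts))
```

Because the test only needs the distance from $s$ to its own pivot, it can be evaluated without computing any distance between $s$ and the objects of $R$.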
Second MapReduce Job
• In this way, objects in each partition of $R$ and their
potential k nearest neighbors are sent to the
same reducer. By parsing the key-value pair (k2, v2),
the reducer can derive the partition $P_i^R$ and the subset
$S_i$ that consists of $P_{j_1}^S, \ldots, P_{j_M}^S$.
• For each $r \in P_i^R$, to reduce the number of distance
computations, we first sort the partitions of $S_i$ in
ascending order of the distance from their pivots to
pivot $p_i$.
※ Compute $\theta_i \leftarrow \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|$.
※ $\theta_i$ can be refined, but I think it is useless.
Second MapReduce Job
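• Define, for $o \in P_j^S$,
$d(o, HP(p_i, p_j)) = \dfrac{|o, p_i|^2 - |o, p_j|^2}{2 \cdot |p_i, p_j|}$.
if $d(o, HP(p_i, p_j)) > \theta$
then $\forall q \in P_i^R$, $|o, q| > \theta$
For $o \in P_i^S$, $|q, o| \le \theta$ only if
$\max\{L(P_i^S), |p_i, q| - \theta\} \le |p_i, o| \le \min\{U(P_i^S), |p_i, q| + \theta\}$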
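Both pruning rules can be sketched in a few lines. The function names are hypothetical and Euclidean distance is assumed; note that the range test is only a necessary condition, so an object that passes it still needs an exact distance check.

```python
import math


def hyperplane_distance(o, p_i, p_j):
    """d(o, HP(p_i, p_j)) for an object o that is closer to p_j than to p_i."""
    return (math.dist(o, p_i) ** 2 - math.dist(o, p_j) ** 2) / (2 * math.dist(p_i, p_j))


def may_be_within(o, q, pivot, l_part, u_part, theta):
    """Range test: False means |q, o| > theta for sure; True means 'check exactly'.

    pivot, l_part and u_part describe the partition that contains o.
    """
    d_po = math.dist(pivot, o)
    d_pq = math.dist(pivot, q)
    return max(l_part, d_pq - theta) <= d_po <= min(u_part, d_pq + theta)


# o = (8, 0) is 3 units beyond the hyperplane x = 5 between the two pivots.
print(hyperplane_distance((8.0, 0.0), (0.0, 0.0), (10.0, 0.0)))
# With theta = 3, o = (8, 0) is pruned for q = (4, 0), while o = (6, 0) is not.
print(may_be_within((8.0, 0.0), (4.0, 0.0), (10.0, 0.0), 0.0, 5.0, 3.0))
print(may_be_within((6.0, 0.0), (4.0, 0.0), (10.0, 0.0), 0.0, 5.0, 3.0))
```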
MINIMIZING REPLICATION OF S
• $s$ is replicated only if $|s, p_j| \ge LB(P_j^S, P_i^R)$, so a large
$LB(P_j^S, P_i^R)$ keeps objects with small $|s, p_j|$ from being replicated
=> split the dataset at a finer granularity, so that the bound of the
kNN distances for all objects in each partition of $R$ becomes
tighter.
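• $R = \bigcup_{1 \le i \le N} G_i$, $G_i \cap G_j = \emptyset$, $i \ne j$.
$s$ is assigned to $S_i$ only if $|s, p_j| \ge LB(P_j^S, G_i)$,
where $LB(P_j^S, G_i) = \min_{\forall P_i^R \in G_i} LB(P_j^S, P_i^R)$.
$RP(S) = \sum_{\forall G_i} \sum_{\forall P_j^S} |\{s \mid s \in P_j^S \wedge |s, p_j| \ge LB(P_j^S, G_i)\}|$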
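The replication count can be written down directly from these definitions. In the sketch below each partition of $R$ is represented by a hypothetical tuple (pivot, U, theta) and each partition of $S$ by (pivot, objects); the names and the toy data are assumptions for illustration only.

```python
import math


def lb_partition(p_i, u_i, theta_i, p_j):
    """LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - theta_i."""
    return math.dist(p_i, p_j) - u_i - theta_i


def lb_group(group, p_j):
    """LB(P_j^S, G_i): the smallest bound over the R-partitions in the group."""
    return min(lb_partition(p_i, u_i, t_i, p_j) for p_i, u_i, t_i in group)


def replication(groups, s_partitions):
    """RP(S): total number of (object of S, group) assignments."""
    total = 0
    for group in groups:
        for p_j, members in s_partitions:
            bound = lb_group(group, p_j)
            total += sum(1 for s in members if math.dist(s, p_j) >= bound)
    return total


a = ((0.0, 0.0), 1.0, 5.0)    # (pivot, U, theta) of one R-partition
b = ((10.0, 0.0), 1.0, 5.0)
s_parts = [
    ((0.0, 0.0), [(0.0, 0.0), (1.0, 0.0)]),
    ((10.0, 0.0), [(10.0, 0.0), (6.0, 0.0)]),
]
print(replication([[a], [b]], s_parts), replication([[a, b]], s_parts))
```

In this toy input, placing both partitions of R in one group ships fewer copies of S than keeping them apart, which is the effect the grouping step is after.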
MINIMIZING REPLICATION OF S
• Geometric Grouping
• Greedy Grouping
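minimize the size of $RP(S, G_i \cup \{P_j^R\}) - RP(S, G_i)$,
but computing it exactly is rather costly, so a partition $P_j^S$ is counted as a whole
whenever $\exists s \in P_j^S$, $|s, p_j| \ge LB(P_j^S, G_i)$:
$RP(S, G_i) \approx \bigcup_{\forall P_j^S \subset S} \{P_j^S \mid LB(P_j^S, G_i) \le U(P_j^S)\}$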
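One greedy step under this approximation might look like the sketch below. It is not the slide author's algorithm in full: the function names, the tuple layouts, and the choice to measure "size" as the number of objects in the counted partitions are all assumptions.

```python
import math


def lb_group(group, p_j):
    """Smallest LB(P_j^S, P_i^R) over the R-partitions (pivot, U, theta) in group."""
    return min(math.dist(p_i, p_j) - u_i - t_i for p_i, u_i, t_i in group)


def approx_rp(group, s_summaries):
    """Approximate RP(S, G): count whole S-partitions with LB <= U(P_j^S).

    s_summaries: list of (pivot_j, U(P_j^S), number of objects in P_j^S).
    """
    return sum(n for p_j, u_j, n in s_summaries if lb_group(group, p_j) <= u_j)


def greedy_pick(group, candidates, s_summaries):
    """Choose the R-partition whose addition increases approx_rp the least."""
    base = approx_rp(group, s_summaries)
    return min(candidates, key=lambda c: approx_rp(group + [c], s_summaries) - base)


group = [((0.0, 0.0), 1.0, 2.0)]
candidates = [((10.0, 0.0), 1.0, 2.0), ((0.0, 10.0), 1.0, 2.0)]
s_summaries = [((0.0, 0.0), 1.0, 10), ((10.0, 0.0), 1.0, 20), ((0.0, 10.0), 1.0, 30)]
print(greedy_pick(group, candidates, s_summaries))
```

Only per-partition summaries are needed here, which is what makes the approximation cheaper than counting individual objects.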
EXPERIMENTAL EVALUATION
The End!
Thanks