Efficient Processing of k Nearest Neighbor Joins using MapReduce INTRODUCTION • k nearest neighbor join (kNN join) is a special type of join that combines each object in a dataset R with the k objects in another dataset S that are closest to it. • As a combination of the k nearest neighbor (kNN) query and the join operation, kNN join is an expensive operation. • Most of the existing work rely on some centralized indexing structure such as the B+-tree and the R-tree, which cannot be accommodated in such a distributed and parallel environment directly. AN OVERVIEW OF KNN JOIN USING MAPREDUCE • basic strategy:R=U1≤i≤N Ri, where Ri∩Rj = ∅, i ≠ j; each subset Ri is distributed to a reducer. S has to be sent to each reducer to be joined with Ri; finally R∝S = U1≤i≤N Ri∝ S. |R|+N·|S|. • H-BRJ: splits both R and S into √n R=U1≤i≤ √n Ri S=U1≤i≤ √nSi. • Better strategy: Ri∝S=Ri∝Si and R∝S=U1≤i≤NRi∝Si. |R|+α·|S| AN OVERVIEW OF KNN JOIN USING MAPREDUCE • In summary, for the purpose of minimizing the join cost, we need to 1. find a good partitioning of R; 2. find the minimal set of Si for each Ri ∈ R, given a partitioning of R. ※ The minimum set of Si is Si =U1≤j≤|Ri|KNN(ri, S). However,it is impossible to find out the k nearest neighbors for all ri apriori. HANDLING KNN JOIN USING MAPREDUCE DATA PREPROCESSING • A good partitioning of R for optimizing kNN join should cluster objects based on their proximity. • Random Selection • Farthest Selection • k-means Selection ※ It is not easy to find pivots. First MapReduce Job • perform data partitioning and collect some statistics for each partition. Second MapReduce Job • Distance Bound of kNN ub(s,PiR) = U(PiR) + |pi,pj| + |pj,s| θi= max |ub(s, PiR )| ∀s∈KNN(PiR,S) ① Second MapReduce Job • Finding Si for Ri lb(s, PiR ) = max{0, |pi, pj| − U(PiR ) − |s, pj |} ② if (lb(s, PiR )>θi) ③ then sKNN(PiR,S) LB(PjS,PiR) = |pi, pj|- U(PiR ) -θi if (|s,pj| ≥LB(PjS,PiR)) then sKNN(PiR,S) s ∈ [LB(PjS,PiR),U(PjS)] Second MapReduce Job • In this way, objects in each partition of R and their potential k nearest neighbors will be sent to the same reducer. By parsing the key value pair (k2, v2), the reducer can derive the partition PiR and subset Si that consists of Pj1S , . . . ,PjMS • ∀r ∈ PiR , in order to reduce the number of distance computations, we first sort the partitions from Si by the distances from their pivots to pivot pi in the ascending order. ※ compute θi ← max∀s∈KNN(PRi,S)|ub(s,PRi )| ※ Refine θi but I think it is useless. Second MapReduce Job • define d(o,HP(pi, pj)) = | o , pi | | o, pj | 2 2 | pi, pj | 2 . if d(o,HP(pi, pj)) > θ then ∀q∈PiR |o,q|> θ if max{L(PiS), |pi, q| − θ} ≤ |pi,o| ≤ min{U(PiO ), |pi, q|+ θ} then |q, o| ≤ θ MINIMIZING REPLICATION OF S • |s, pj| ≥ LB(PjS, PiR ) => large LB(PjS, PiR) keep small |s, pj| =>split the dataset into finer granularity and the bound of the kNN distances for all objects in each partition of R will become tighter. • R =U1≤i≤N Gi, Gi ∩ Gj = ∅, i = j. s is assigned to Si only if |s, pj| ≥ LB(PjS, Gi ). where LB(PjS, Gi ) = min P ∈G LB(PjS, PiR ) ∀ R i i RP(S) =∑∀Gi∑∀P |{s|s ∈ PjS∧ |s, pj| ≥ LB(PjS ,Gi)}| S j MINIMIZING REPLICATION OF S • Geometric Grouping • Greedy Grouping minimize the size of RP(S,Gi ∪ {PjR}) − RP(S,Gi) but it is rather cost, so ∃s ∈ PSl , |s, pj| ≤ LB(PjS ,Gi) RP(S,Gi) ≈∀P ⊂S{PjS |LB(PjS ,Gi) ≤ U(PjS )} S j EXPERIMENTAL EVALUATION EXPERIMENTAL EVALUATION EXPERIMENTAL EVALUATION EXPERIMENTAL EVALUATION EXPERIMENTAL EVALUATION The End! Thanks