Reverse Spatial and Textual k Nearest Neighbor Search
Jiaheng Lu, Ying Lu (Renmin University of China); Gao Cong (Nanyang Technological University, Singapore)

Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion

Motivation
• If we add a new shop at location Q, which existing shops will be influenced?
• Influence factors:
  – Spatial distance (example results: D, F)
  – Textual similarity, i.e., services/products offered (example results: F, C)
[Figure: map of shops tagged clothes, food, or sports around the query location Q]

Problems of Finding Influential Sets
• Traditional query: the reverse k nearest neighbor query (RkNN).
• Our new query: the reverse spatial and textual k nearest neighbor query (RSTkNN).

Problem Statement
• Spatial-textual similarity describes the similarity between objects based on both spatial proximity and textual similarity.
• The textual part uses the extended Jaccard similarity between weight vectors:
  EJ(a, b) = (a · b) / (||a||² + ||b||² − a · b)

Problem Statement (cont'd)
• The RSTkNN query finds the objects that have the query object as one of their k most spatial-textually similar objects.

Related Work
• Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
• (Hyper) Voronoi cell/plane pruning strategies (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
• 60-degree-pruning method (Stanoi et al., SIGMOD 2000)
• Branch and bound, based on Lp-norm metric spaces (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Why RSTkNN is challenging:
• The combined space loses Euclidean geometric properties.
• The text space is high-dimensional.
• k and α differ from query to query.
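The extended Jaccard formula above, blended with spatial proximity through the query parameter α, can be sketched as follows. The EJ function is as given on the slide; the linear combination and the distance normalization in `sim_st` are assumptions, since the deck does not spell out the exact SimST formula.

```python
import math

def extended_jaccard(a, b):
    """Extended Jaccard (EJ) textual similarity between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na2 = sum(x * x for x in a)
    nb2 = sum(y * y for y in b)
    return dot / (na2 + nb2 - dot) if dot else 0.0

def sim_st(p, q, alpha, max_dist):
    """Spatial-textual similarity: alpha weights spatial proximity against EJ.
    Normalizing distance by max_dist to get a proximity in [0, 1] is an
    assumption; the slides only state that both components are combined."""
    d = math.dist(p["loc"], q["loc"])
    spatial = 1 - d / max_dist
    return alpha * spatial + (1 - alpha) * extended_jaccard(p["vct"], q["vct"])
```

With α = 1 the query degenerates to a purely spatial RkNN, with α = 0 to a purely textual one, which is why a fixed precomputation cannot serve all queries.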
Baseline Method
• For each object o in the database, precompute its spatial NNs and textual NNs.
• Given a query q with parameters k and α, merge the two precomputed lists with the Threshold Algorithm to find o's spatial-textual kNN o'.
• If q is more similar to o than o' is, then o is a result; if q is no more similar than o', it is not.
• Inefficient: the approach lacks a data structure tailored to the combined space.

Intersection and Union R-tree (IUR-tree)
• An R-tree in which each entry stores, besides its MBR, an intersection vector (per-term minimum weight) and a union vector (per-term maximum weight) over all text vectors in its subtree.
• Running example: objects p1(4, 4), p2(0, 1), p3(1, 0), p4(3, 2.5), p5(3.5, 1.5) and query q(0.5, 2.5), each with a weight vector over terms d1, d2; leaves N1 = {p1}, N2 = {p2, p3}, N3 = {p4, p5} under root N4.
[Figure: IUR-tree over p1–p5 showing MBRs and the IntVct/UniVct pairs of N1–N4]

Main Idea of the Search Strategy
• For an entry E of the IUR-tree, kNNL(E) and kNNU(E) are lower and upper bounds on the similarity between E's objects and their k-th most similar object.
• Prune E when the query q is no more similar than kNNL(E).
• Report E as a result when q is more similar than kNNU(E).
[Figure: entry E with bound interval [L, U]; query objects q1, q2, q3 falling below, inside, and above the interval]

How to Compute the Bounds
Similarity approximations between entries E and E':
• MinST(E, E'): ∀o ∈ E, ∀o' ∈ E': SimST(o, o') ≥ MinST(E, E')
• TightMinST(E, E'): ∃o' ∈ E' such that ∀o ∈ E: SimST(o, o') ≥ TightMinST(E, E')
• MaxST(E, E'): ∀o ∈ E, ∀o' ∈ E': SimST(o, o') ≤ MaxST(E, E')
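These entry-level bounds are derived from the intersection/union vectors the tree maintains. A minimal sketch of those vectors, representing each child's text vector as a term-to-weight dict (an illustrative layout, not the paper's storage format):

```python
def node_vectors(child_vectors):
    """IntVct / UniVct of an IUR-tree node: per-term minimum and maximum
    weight over the text vectors of its children (missing term = weight 0)."""
    terms = set().union(*child_vectors)
    int_vct = {t: min(c.get(t, 0) for c in child_vectors) for t in terms}
    uni_vct = {t: max(c.get(t, 0) for c in child_vectors) for t in terms}
    return int_vct, uni_vct
```

The union vector caps how textually similar any object in the subtree can be to the query (feeding MaxST), while the intersection vector guarantees a floor (feeding MinST), all without touching the leaves.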
Example for Computing Bounds
• Entries traveled so far: N1, N2, N3. Given k = 2, compute kNNL(N1) and kNNU(N1).
[Figure: MBRs of N1, N2, N3 and query q(0.5, 2.5)]
• Compute kNNL(N1): in decreasing order, TightMinST(N1, N3) = 0.564, TightMinST(N1, N2) = 0.179, MinST(N1, N3) = 0.370, MinST(N1, N2) = 0.095. TightMinST(N1, N3) guarantees one object of N3 with similarity ≥ 0.564, and MinST(N1, N3) guarantees the second with similarity ≥ 0.370, so kNNL(N1) = 0.370.
• Compute kNNU(N1): in decreasing order, MaxST(N1, N3) = 0.432, MaxST(N1, N2) = 0.150. N3 contains two objects (= k), each with similarity at most 0.432, so kNNU(N1) = 0.432.

Overview of the Search Algorithm
• RSTkNN algorithm:
  – Traverse from the IUR-tree root.
  – Progressively update the lower and upper bounds.
  – Apply the search strategy: prune unrelated entries into Pruned; report result entries into Ans; add candidate objects into Cnd.
  – Final verification: for each object in Cnd, decide whether it is a result by tightening its bounds while expanding entries from Pruned.

Example: Execution of the RSTkNN Algorithm (IUR-tree, k = 2, α = 0.6)
Tree: root N4 with children N1 = {p1}, N2 = {p2, p3}, N3 = {p4, p5}; entries are annotated with their (kNNL, kNNU) bounds.
• Step 1: initialize N4.CLs; EnQueue(U, N4). U = {N4(0, 0)}.
• Step 2: DeQueue(U, N4); EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1). Mutual bound updates among the pairs N1–N2, N1–N3, N2–N3 yield N1(0.37, 0.432), N2(0.21, 0.619), N3(0.323, 0.619).
• Step 3: DeQueue(U, N3); Answer.add(p4); Candidate.add(p5). Mutual updates (p5 with N2; p4 with N2) yield p4(0.21, 0.619), p5(0.374, 0.374).
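The prune/report/expand loop driving these steps can be sketched as a priority-queue traversal. `classify` is a hypothetical stand-in for the kNNL/kNNU tests, and the sketch folds the candidate list and final verification into a plain 'prune' verdict, which the real algorithm handles separately:

```python
import heapq
from itertools import count

def rst_knn_traversal(root, children, classify):
    """Skeleton of the RSTkNN branch-and-bound loop (bound bookkeeping elided).
    classify(entry) returns 'prune' when q is no more similar than kNNL(entry),
    'result' when q is more similar than kNNU(entry), and 'expand' otherwise."""
    answers, pruned = [], []
    tick = count()                     # tie-breaker so the heap never compares entries
    queue = [(0.0, next(tick), root)]  # the priority would be a similarity upper bound
    while queue:
        _, _, entry = heapq.heappop(queue)
        verdict = classify(entry)
        if verdict == 'prune':
            pruned.append(entry)
        elif verdict == 'result':
            answers.append(entry)
        else:                          # expand: enqueue the node's children
            for child in children.get(entry, []):
                heapq.heappush(queue, (0.0, next(tick), child))
    return answers, pruned
```

Run on the example tree with verdicts matching the slides (N1 pruned, p2/p3/p4 reported, p5 eventually pruned), it reproduces the final answer set {p2, p3, p4}.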
Example: Execution of the RSTkNN Algorithm (IUR-tree, k = 2, α = 0.6, cont'd)
• Step 4: DeQueue(U, N2); Answer.add(p2, p3); Pruned.add(p5). Mutual updates: p2 with p4, p5; p3 with p2, p4, p5. Since U and the candidate set are now both empty, the algorithm ends. Results: p2, p3, p4.

Cluster IUR-tree: CIUR-tree
• In an IUR-tree, the texts within one index node can be very different, which loosens the bounds.
• The CIUR-tree enhances the IUR-tree with textual clusters: each entry additionally records which clusters its objects belong to and how many objects fall in each, e.g., the leaf annotations C1:1, C2:2, and C1:1, C3:1 in the running example.
[Figure: CIUR-tree over p1–p5 with per-entry cluster annotations and IntVct/UniVct pairs]

Optimizations
• Motivation: obtain tighter bounds during CIUR-tree traversal, and purify the textual description in index nodes.
• Outlier detection and extraction (ODE-CIUR): extract subtrees with outlier clusters, take the outliers into special account, and compute their bounds separately.
• Text-entropy based optimization (TE-CIUR): define TextEntropy to capture the distribution of text clusters in a CIUR-tree entry; traverse entries with higher TextEntropy (more diverse texts) first.

Experimental Study
• Setup: CPU 2.0 GHz; memory 4 GB; implementation in C/C++.
• Compared methods: baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE.
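The TextEntropy that TE-CIUR uses to order traversal can be sketched as the Shannon entropy of an entry's cluster-count distribution; the paper's exact definition may differ, so treat this as an assumption:

```python
import math

def text_entropy(cluster_counts):
    """Shannon entropy of the text-cluster distribution in a CIUR-tree entry.
    Higher entropy = more diverse texts, so the entry is expanded earlier
    under TE-CIUR (assumed formulation, not the paper's verbatim definition)."""
    total = sum(cluster_counts)
    probs = [c / total for c in cluster_counts if c]
    return -sum(p * math.log2(p) for p in probs)
```

Entries whose objects span many clusters have loose intersection/union vectors; expanding them first tightens the bounds where they are weakest.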
• Environment: Windows XP; page size 4 KB.
• Datasets:
  – ShopBranches (Shop): extended from a small real dataset.
  – GeographicNames (GN): real data.
  – CaliforniaDBpedia (CD): generated by combining locations in California with documents from DBpedia.
• Metrics: total query time; number of page accesses.

Dataset statistics:
  Statistics                       Shop       CD          GN
  Total # of objects               304,008    1,555,209   1,868,821
  Total unique words in dataset    3,933      21,578      222,409
  Average # of words per object    45         47          4

Scalability
[Figure: total query time as the dataset grows from 0.2K to 4M objects; (1) log-scale and (2) linear-scale versions]

Effect of k
[Figure: (a) query time and (b) page accesses for varying k]

Conclusion
• Proposed a new query type, RSTkNN.
• Presented a hybrid index, the IUR-tree.
• Presented an efficient search algorithm to answer RSTkNN queries.
• Showed the enhanced variant CIUR-tree and two optimizations, ODE-CIUR and TE-CIUR, to further improve search processing.
• Extensive experiments confirm the efficiency and scalability of the algorithms.

Reverse Spatial and Textual k Nearest Neighbor Search
Thanks! Q&A

Backup: a straightforward method
1. Compute the RSkNN and RTkNN results separately.
2. Combine the two result sets to obtain the RSTkNN results.
Infeasible: there is no sensible way to combine the two result sets.
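For completeness, the RSTkNN definition itself (slide "Problem Statement (cont'd)") can be checked brute force, which is what any combination scheme would have to match: o is an answer iff q ranks among o's k most similar objects. `sim` stands for any spatial-textual similarity function; this O(n² log n) sketch is a reference oracle, not a competitor to the index-based algorithm.

```python
def rst_knn_brute(objects, q, k, sim):
    """Brute-force RSTkNN: o is an answer iff q is among o's k most
    spatial-textually similar objects under sim. Reference/testing only."""
    results = []
    for o in objects:
        rivals = [p for p in objects if p is not o] + [q]
        ranked = sorted(rivals, key=lambda p: sim(o, p), reverse=True)
        if any(p is q for p in ranked[:k]):
            results.append(o)
    return results
```

Note the asymmetry this makes explicit: the result set is driven by each o's own kNN ranking, not by q's, which is why separately computed RSkNN and RTkNN lists cannot simply be intersected or unioned.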