Presentation at Aalborg University
Reverse Spatial and Textual k Nearest Neighbor Search
Jiaheng Lu, Renmin University of China
August 11, 2011

Research experience
- Associate Professor, Renmin University of China: XML data management, spatial data management, cloud data management
- Post-doc, University of California, Irvine: data integration, approximate string matching
- PhD, National University of Singapore: XML data management

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
- Approximate string matching
- Reverse Spatial and Textual k Nearest Neighbor Search

XML twig query processing
XPath: Section[Title]/Paragraph//Figure
Twig pattern: a Section node with a Title child and a Paragraph child, where the Paragraph has a Figure descendant.

XML twig query processing (cont.)
Problem statement: given a query twig pattern Q and an XML database D, compute all the answers to Q in D.
Example:
- Document nodes: s1, s2, t1, t2, p1, f1
- Query nodes: section, title, figure
- Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)

XML twig query processing (cont.)
Several efficient pattern matching algorithms:
- TJFast (VLDB 05)
- iTwigJoin (SIGMOD 05)
- TwigStackList (CIKM 04)
- TreeMatch (TKDE 10)
Current work: distributed XML twig pattern processing

XML twig query processing: publications
- Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542
- Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
- Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
- Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
- Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
- Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
- Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
- Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
- Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
- ...

XML keyword search
Background: XQuery vs. keyword search
Query: papers by "Mike"
- XQuery (complicated): for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings
- Keyword search: Mike, inproceedings

XML keyword search
The proposed keyword search returns the set of smallest trees containing all keywords.
[Figure: example XML tree rooted at bib, with author, name, hobby, publications, inproceedings, article, title, and year nodes, and a sample keyword query.]

Effectiveness
- Capture the user's search intention
  - Identify the target that the user intends to search for
  - Infer the predicate constraint that the user intends to search via
- Result ranking
  - Rank the query results according to their objective relevance to the user's search intention

XML keyword search: publications
- Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934
- Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
- Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
- Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
- Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
- Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
- Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
- ...

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
- Approximate string matching
- Reverse Spatial and Textual k Nearest Neighbor Search

Motivation: Data Cleaning
Example (source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008): a name that should clearly be "Niels Bohr".
- Real-world data is dirty
  - Typos
  - Inconsistent representations (PO Box vs. P.O. Box)
- Approximately check against a clean dictionary

Motivation: Record Linkage
We want to link records belonging to the same entity.

Table 1:
Name                    Phone   Age
Brad Pitt               ...     ...
Arnold Schwarzeneger    ...     ...
George Bush             ...     ...
Angelina Jolie          ...     ...
Forrest Whittaker       ...     ...

Table 2:
Name                    Hobbies   Address
Brad Pitt               ...       ...
Forest Whittacker       ...       ...
George Bush             ...       ...
Angelina Jolie          ...       ...
Arnold Schwarzenegger   ...       ...

No exact match! The same entity may have similar representations:
- Arnold Schwarzeneger versus Arnold Schwarzenegger
- Forrest Whittaker versus Forest Whittacker

Motivation: Query Relaxation
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together

What is Approximate String Search?
Queries against a collection:
- Find all entries similar to "Forrest Whitaker"
- Find all entries similar to "Arnold Schwarzenegger"
- Find all entries similar to "Brittany Spears"
What do we mean by "similar to"?
- Edit distance
- Jaccard similarity
- Cosine similarity
- Dice
- Etc.
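A minimal sketch of two of the measures listed above, edit distance and Jaccard similarity. The function names are hypothetical, and using character 2-grams as the token set for Jaccard is an assumption made here for illustration; the slides do not fix a tokenization at this point.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def grams(s: str, q: int = 2) -> set:
    """Set of character q-grams of s (assumed tokenization)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s: str, t: str, q: int = 2) -> float:
    """Jaccard similarity of the two gram sets: |A & B| / |A | B|."""
    a, b = grams(s, q), grams(t, q)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    pairs = [
        ("Forrest Whittaker", "Forest Whittacker"),
        ("Arnold Schwarzeneger", "Arnold Schwarzenegger"),
    ]
    for s, t in pairs:
        print(s, "|", t, "|", edit_distance(s, t), "|", round(jaccard(s, t), 3))
```

Edit distance counts character-level changes, while Jaccard compares sets of tokens, so the two measures can rank the same pair of strings differently.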
String collection (People): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger, ...
The "similar to" predicate can help our described applications!
How can we support these types of queries efficiently?

Approximate Query Answering
Main idea: use q-grams as signatures for a string.
- Sliding a window of size 2 over "irvine" gives the 2-grams {ir, rv, vi, in, ne}.
- Intuition: similar strings share a certain number of grams.
- An inverted index on grams supports finding all data strings that share enough grams with a query.

Approximate Query Example
Query: "irvine", edit distance 1
2-grams: {ir, rv, vi, in, ne}
[Figure: the query grams are looked up in the inverted lists of string IDs, which are kept for 2-grams such as in, tf, vi, ir, ef, rv, ne, un.]
- Each edit operation can "destroy" at most q grams.
- Answers must share at least T = 5 - 1 * 2 = 3 grams (5 query grams, 1 edit operation, q = 2).
- T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list merging; T is called the merging threshold.
- Candidates = {1, 5, 9}. These may contain false positives, so the real similarity still needs to be computed.

Approximate string matching: publications
- Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324
- Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
- Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
- Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
- ...

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
- Approximate string matching
- Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)

Motivation
If we add a new shop at Q, which shops will be influenced?
Influence factors:
- Spatial distance. Results: D, F
- Textual similarity (services/products ...). Results: F, C
[Figure: map of shops labeled by category (clothes, food, sports) around the new shop Q.]

Problems of finding Influential Sets
- Traditional query: reverse k nearest neighbor query (RkNN)
- Our new query: reverse spatial and textual k nearest neighbor query (RSTkNN)

Problem Statement
Spatial-textual similarity
- Describes the similarity between such objects based on both spatial proximity and textual similarity.
[Formula: spatial-textual similarity function]

Problem Statement (cont.)
RSTkNN query: find the objects that have the query object as one of their k most spatial-textual similar objects.

Related Work
- Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
- (Hyper) Voronoi cell/plane pruning strategy (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
- 60-degree pruning method (Stanoi et al., SIGMOD 2000)
- Branch and bound, based on Lp-norm metric space (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Challenging features:
- Euclidean geometric properties are lost.
- The text space is high-dimensional.
- k and α differ from query to query.

Intersection and Union R-tree (IUR-tree)
[Figure: objects p1 to p5 and the query q(0.5, 2.5) in the plane, and the IUR-tree built on them: root N4 with entries N1 = {p1}, N2 = {p2, p3}, N3 = {p4, p5}.]

Objects:
Object   x     y     Text vector   word1   word2
p1       4     4     ObjVct1       1       1
p2       0     1     ObjVct2       1       1
p3       1     0     ObjVct3       5       5
p4       3     2.5   ObjVct4       8       8
p5       3.5   1.5   ObjVct5       1       1
q        0.5   2.5   ObjVctQ       8       8

Entries of the root N4:
Entry   Bounding rectangle   # objects   Text vectors   IntVct (word1, word2)   UniVct (word1, word2)
N1      [4,4] [4,4]          1           IntUniVct1     (1, 1)                  (1, 1)
N2      [0,0] [1,1]          2           IntUniVct2     (1, 1)                  (5, 5)
N3      [3,1.5] [3.5,2.5]    2           IntUniVct3     (1, 1)                  (8, 8)

Leaf entries:
Node   Object   Bounding rectangle      Text vector
N1     p1       [4,4] [4,4]             ObjVct1
N2     p2       [0,1] [0,1]             ObjVct2
N2     p3       [1,0] [1,0]             ObjVct3
N3     p4       [3,2.5] [3,2.5]         ObjVct4
N3     p5       [3.5,1.5] [3.5,1.5]     ObjVct5

Overview of Search Algorithm
RSTkNN algorithm:
- Traverse the IUR-tree from the root.
- Progressively update lower and upper bounds.
- Apply the search strategy: put pruned (unrelated) entries in Pruned; report result entries in Ans; add candidate objects to Cnd.
- Final verification: for objects in Cnd, decide whether they are results by updating the candidates' bounds, expanding entries in Pruned.

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6
[Figure, repeated on every step: the IUR-tree (root N4; N1 = {p1}, N2 = {p2, p3}, N3 = {p4, p5}), the objects p1(4, 4), p2(0, 1), p3(1, 0), p4(3, 2.5), p5(3.5, 1.5) with text vectors ObjVct1 to ObjVct5, and the query q(0.5, 2.5) with ObjVctQ.]

Step 1
- Operations: Initialize N4.CLs; EnQueue(U, N4)
- U: N4(0, 0)

Step 2
- Operations: DeQueue(U, N4); EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1)
- Mutual-effect: N1 and N2; N1 and N3; N2 and N3
- Pruned: N1(0.37, 0.432)
- U: N3(0.323, 0.619), N2(0.21, 0.619)

Step 3
- Operations: DeQueue(U, N3); Answer.add(p4); Candidate.add(p5)
- Mutual-effect: p4 with N2; p5 with p4, N2
- Pruned: N1(0.37, 0.432)
- U: N2(0.21, 0.619)
- Answer: p4(0.21, 0.619)
- Candidate: p5(0.374, 0.374)

Step 4
- Operations: DeQueue(U, N2); Answer.add(p2, p3); Pruned.add(p5)
- Since U and Cand are now both empty, the algorithm ends.
- Results: p2, p3, p4.
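For comparison with the index-based run above, the RSTkNN definition can also be evaluated by brute force. The sketch below is not the algorithm from the talk. It assumes a similarity of the form alpha * (normalized spatial proximity) + (1 - alpha) * (extended Jaccard of the text vectors), with distances normalized by the largest pairwise distance; the slides do not spell out the similarity function, so this form, the names `Obj`, `sim_st`, and `rstknn_bruteforce`, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Obj:
    name: str
    loc: tuple  # (x, y) location
    vec: tuple  # text vector (term weights)


def ext_jaccard(u, v):
    """Extended Jaccard similarity of two weight vectors (assumed text measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0


def sim_st(a, b, alpha, max_dist):
    """Assumed spatial-textual similarity: weighted sum of proximity and text similarity."""
    spatial = 1.0 - dist(a.loc, b.loc) / max_dist if max_dist else 1.0
    return alpha * spatial + (1.0 - alpha) * ext_jaccard(a.vec, b.vec)


def rstknn_bruteforce(objects, q, k, alpha):
    """Names of objects that would have q among their k most similar objects."""
    everyone = list(objects) + [q]
    max_dist = max(dist(a.loc, b.loc) for a in everyone for b in everyone)
    result = []
    for o in objects:
        to_q = sim_st(o, q, alpha, max_dist)
        # Count existing objects that are strictly more similar to o than q is.
        closer = sum(
            1 for p in objects
            if p is not o and sim_st(o, p, alpha, max_dist) > to_q
        )
        if closer < k:
            result.append(o.name)
    return result


if __name__ == "__main__":
    # Hypothetical toy data, not the example from the slides.
    shops = [
        Obj("a", (0.0, 0.0), (1, 0)),
        Obj("b", (1.0, 0.0), (1, 0)),
        Obj("c", (10.0, 0.0), (0, 1)),
    ]
    q = Obj("q", (1.5, 0.0), (0, 1))
    print(rstknn_bruteforce(shops, q, k=1, alpha=1.0))  # spatial only
    print(rstknn_bruteforce(shops, q, k=1, alpha=0.0))  # textual only
```

This check compares every pair of objects, so it is quadratic in the dataset size; avoiding that cost through bounds and pruning is what the index-based algorithm is for.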
State after the last step of the example:
- Pruned: N1(0.37, 0.432), p5(0.374, 0.374)
- Answer: p4, p2, p3
- U and Candidate: empty
- Mutual-effect: p2 with p4, p5; p3 with p2, p4, p5

Cluster IUR-tree: CIUR-tree
- IUR-tree: texts in an index node could be very different.
- CIUR-tree: an enhanced IUR-tree that incorporates textual clusters.
[Figure: the objects p1 to p5, the query q(0.5, 2.5), and the tree (root N4; N1, N2, N3) as before, with each object assigned to a text cluster and each index entry annotated with its cluster counts.]

Objects:
Object   Cluster   word1   word2
p1       C1        1       1
p2       C2        5       5
p3       C2        5       5
p4       C3        8       8
p5       C1        1       1
q        -         8       8

Index entries:
Entry   Cluster counts   Text vectors
N1      C1:1             IntUniVct1
N2      C2:2             IntUniVct2
N3      C1:1, C3:1       IntUniVct3

Optimizations
Motivation
- Give a tighter bound during CIUR-tree traversal
- Purify the textual description in the index node
Outlier Detection and Extraction (ODE-CIUR)
- Extract subtrees with outlier clusters
- Take the outliers into special account and calculate their bounds separately
Text-entropy based optimization (TE-CIUR)
- Define TextEntropy to depict the distribution of text clusters in an entry of the CIUR-tree
- Visit entries with higher TextEntropy first, i.e. entries whose texts are more diverse

Experimental Study
Experimental setup
- OS: Windows XP; CPU: 2.0GHz; Memory: 4GB; Page size: 4KB; Language: C/C++
Compared methods
- baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE
Datasets
- ShopBranches (Shop): extended from a small real dataset
- GeographicNames (GN): real data
- CaliforniaDBpedia (CD): generated by combining locations in California with documents from DBpedia
Metrics
- Total query time
- Page access number

Statistics                      Shop      CD          GN
Total # of objects              304,008   1,555,209   1,868,821
Total unique words in dataset   3,933     21,578      222,409
Average # words per object      45        47          4

Scalability
[Figures: (1) log-scale version and (2) linear-scale version; axis ticks 0.2K, 3K, 40K, 550K, 4M.]

Effect of k
[Figure: query time.]

Conclusion
- Propose a new query problem, RSTkNN.
- Present a hybrid index, the IUR-tree.
- Show the enhanced variant, the CIUR-tree, and two optimizations, ODE-CIUR and TE-CIUR, to further improve search processing.

Current and future work
- Distributed XML query processing
- Cloud-based data management

Thank you
Q&A