Developing 3-in-1 Index Structures on Complex Structure Similarity Search Presented by Xiaoli Wang Supervisor: Anthony K. H. Tung 2015/4/13 SOC@NUS 1 Thesis defense by Xiaoli Wang Outline Motivation Complex structures Limitations on existing systems Our solutions Inverted index for graph similarity search Inverted index for sequence similarity search A 3-in-1 inverted indexing tool and its applications Conclusion and future work 2015/4/13 SOC@NUS 2 Thesis defense by Xiaoli Wang Why complex structures? Relational database Data model: table Real objects Sequence, tree, and graph Actor Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samual Jackson Iron man 2008 Sci-Fi Samuel Jackson The man 2006 Crime Chemical compound <user> <person><name>John</name></person> </user> <user> <person> <name>Mary</name> <id>u200</id> </person> </user> 2015/4/13 Protein structure Program flow Coil Image Fingerprint SOC@NUS Letter Shape 3 Thesis defense by Xiaoli Wang How to manage complex structures? Conventional indexing and searching systems Restricted to a given data model Hard to adapt to support various complex structures Waste resource to re-design various storage systems Access method Access method Access method Like query Sequence database Tree database Graph database Storage … Biology data 2015/4/13 Documents Program flow SOC@NUS Image 4 Thesis defense by Xiaoli Wang Real application: social reading systems Book duplicate detection problem 2015/4/13 SOC@NUS 5 Thesis defense by Xiaoli Wang Real application: social reading systems Book duplicate detection problem Graph solution[1] 1. Adam Schenker. Graph-theretic techniques for web content mining. PhD thesis. University of South Florida. 2003. 2015/4/13 SOC@NUS 6 Thesis defense by Xiaoli Wang Real application: social reading systems Book duplicate detection problem Book annotation migration problem Sequence solution[1] 1. Xiaoli Wang, etc. Efficient and Effective KNN Sequence Search with Approximate n-grams. In PVLDB. 2013. 2015/4/13 SOC@NUS 7 Thesis defense by Xiaoli Wang Real application: social reading systems Book duplicate detection problem Book annotation migration problem Annotation search by snapping problem Sequence solution[1] 1. Xiaoli Wang, etc. Efficient and Effective KNN Sequence Search with Approximate n-grams. In PVLDB. 2013. 2015/4/13 SOC@NUS 8 Thesis defense by Xiaoli Wang Real application: social reading systems Book duplicate detection problem Graph matching problem Book annotation migration problem Sequence matching problem Annotation search by snapping problem Sequence similarity search problem Build an information management tool to support the storage and retrieval of sequences and graphs 2015/4/13 SOC@NUS 9 Thesis defense by Xiaoli Wang Objectives Building a 3-in-1 unified indexing system Design a unified and effective indexing mechanism Support similarity search on various complex structures Scope of this thesis Focus on the similarity search problems Select edit distance as similarity measure Tree is considered as a specific graph 2015/4/13 SOC@NUS 10 Thesis defense by Xiaoli Wang Objectives Building a 3-in-1 unified indexing system Design a unified and effective indexing mechanism Support storage and retrieval of all complex structures Step 1: Propose an effective index to support efficient graph similarity search Step 2: Extend the above index to further support efficient sequence similarity search Step 3: Design and implement the unified 3-in-1 index to support various complex structure search 2015/4/13 SOC@NUS 11 Thesis defense by Xiaoli Wang Outline Motivation Complex structures Limitations on existing systems Our solutions Inverted index for graph similarity search Inverted index for sequence similarity search A 3-in-1 inverted indexing tool and its applications Conclusion and future work 2015/4/13 SOC@NUS 12 Thesis defense by Xiaoli Wang Graph similarity search problem Given a graph database, we want to find its most similar graphs based on graph edit distance A query graph a c Similar graphs to the query b a c a c a c a b b d b e d f c …… a c b A graph database Graph edit distance: λ(g1, g2) = minimum # of operations (insertion, deletion, substitution) to transform graph g1 into graph g2 2015/4/13 SOC@NUS 13 Thesis defense by Xiaoli Wang Existing works A naive solution: No pruning and poor scalability 100,000 full scan candidates 100,000 GED computation answers 50 C-Star (VLDB'09): Too many GED bound computations exact graph edit GED bound computationUnpractical GED computation candidates answers distance computations full scan 100,000 100 50 K-AT (TKDE'10): Loose bound – too many false positives Expensive GED bound Tighter bound GED bound computation GED computation answers computation candidates 100,000 70% scan index 90,000 50 Loose GED bound 2015/4/13 SOC@NUS 14 Thesis defense by Xiaoli Wang Our solution Build an index to reduce scan ratio Use tighter GED bounds to reduce candidate size GED bound computation 100,000 5% scan index candidates GED computation 100 answers 50 Inverted index based on the star decomposition The graph is decomposed into smaller stars Inverted index is designed to store the reference between graphs and stars Efficient CA-based search framework is proposed 2015/4/13 SOC@NUS 15 Thesis defense by Xiaoli Wang Star decomposition Star structure A star structure is a labelled, single-level, rooted tree which can be represented by a 3-tuple s = (r, L, l), where r is the root vertex, L is the set of leaves and l is a labelling function Star representation for a graph A graph can be broken into a multi-set of star structures c b a c b a c g1 2015/4/13 c c b a c b a c c SOC@NUS a c S(g1) 16 Thesis defense by Xiaoli Wang Star decomposition Star edit distance Given two star structures s1 and s2, λ(s1, s2) = T(r1, r2) + d(L1, L2) Where T(r1, r2) = 0 if l(r1) = l(r2); otherwise T(r1, r2) = 1 d(L1, L2) = ||L1| − |L2|| + M(L1, L2) M(L1, L2) = max{| ΨL1|, | ΨL2|} − |ΨL1∩ΨL2| Example: given s1 = abcc, and s2 = dcc, T(r1, r2) = 1, as l(a) ≠ l(d); d(L1, L2) = |3-2| + 3 – 2 = 2; λ(s1, s2) = 1 + 2 = 3. 2015/4/13 a b SOC@NUS c s1 d c c c s2 17 Thesis defense by Xiaoli Wang Mapping distance: distance between two star representations a a c c g2 g1 b ε ε b d s0 ε s1 s0 0 s2 3 2 0 0 1 0 0 s4 3 s5 1 2 0 0 2 0 0 s0 s0 0 s2 3 s1 2 s3 2 ε 5 1 2 5 s4 3 s5 1 2 1 5 2 2 5 b a a 0 a c s0 b d s2 1 a c d s4 1 a b a c s0 b c s1 c b s3 s5 b d c S(g1) 5 ε S(g2) The mapping distance: μ(g1, g2) = 0 + 1 + 1 + 5 = 7 2015/4/13 SOC@NUS 18 Thesis defense by Xiaoli Wang Observations on star decomposition The lower bound based on mapping distance is shown to be tight, and can be computed in cubic time If we use the star structure as the filtering feature in a feature-based inverted index, then the derived tighter bounds can be applied for graph pruning Graphs have more similar stars can be more similar to each other If we can have a method to access database graphs in an increasing order of edit distance to a query graph, then a CA-based filtering framework can work well to reduce the scan ratio 2015/4/13 SOC@NUS 19 Thesis defense by Xiaoli Wang CA-based filtering strategy Sorted lists for a query graph Lists are sorted increasingly in order of star edit distances Stars of a database graph are accessed by increasing dissimilarity to stars of the query graph a query graph q Graph identity 2015/4/13 q: s1 q: s2 g1: 0 g4: 0 g2: 1 g1: 1 g3: 2 g2: 3 … g6.: 5 g6.: 5 … g4: 6 g3: 9 SOC@NUS λ(s, s2), where s is a star in S(g4) Sorted lists 20 Thesis defense by Xiaoli Wang CA-based filtering strategy Use mapping distance as the aggregation value When CA halts, all unseen graphs are sure to have mapping distances larger than the current threshold value. They can be safely filtered out, as mapping distance is a lower bound of graph edit distance. a query graph q q: s1 Seen Possibly unseen 2015/4/13 q: s2 g1: 0 g4: 0 g2: 1 g1: 1 g3: 2 g2: 3 … g6.: 5 g6.: 5 … g4: 6 g3: 9 SOC@NUS If λ(g6, q) ≥ ω, g6 can be safely filtered out ω = sum(2, 3) = 5 > d (= 4) Threshold value 21 Thesis defense by Xiaoli Wang Overview of the search framework Two-level index Upper-level inverted index between graphs and stars Lower-level inverted index between vertex labels and stars 2015/4/13 a c a b d b c a c e d b a c a f … b A graph database SOC@NUS c 22 Thesis defense by Xiaoli Wang Overview of the search framework a c g1 b a s0 c b a s0 c a b s2 d a b s1 c a c s4 d a c s3 b b d s5 c Star decomposition b d a c g2 b ε Two lists for s0 and s3 s0: abc gid frep g2 1 g1 1 ε s3: cab gid feq g2 1 Two-level index Upper-level inverted index between graphs and stars Lower-level inverted index between vertex labels and stars 2015/4/13 a c a b d b c a c e d b a c a f … b A graph database SOC@NUS c 23 Thesis defense by Xiaoli Wang Overview of the search framework Two lists for vertex labels a and b b a s0 b a s0 c a b s2 d a b s1 c a c s4 d a c s3 b b d s5 c c ε a b sid frep sid feq s1 1 s0 1 s2 1 s3 1 s3 1 s5 1 s4 1 Two-level index Upper-level inverted index between graphs and stars Lower-level inverted index between vertex labels and stars 2015/4/13 a c a b d b c a c e d b a c a f … b A graph database SOC@NUS c 24 Thesis defense by Xiaoli Wang Overview of the search framework a query graph q Similar graphs to the query q: s2 q: s1 The graph similarity search construct graph score-sorted lists using the top-k stars returned from the lower-level The top-k sub-unit query Construct a score-sorted list Run CA algorithm for graph pruning Return top-k stars using TA method Two-level index Upper-level inverted index between graphs and stars Lower-level inverted index between vertex labels and stars 2015/4/13 a c a b d b c a c e d b a c a f … b A graph database SOC@NUS c 25 Thesis defense by Xiaoli Wang Experimental results: query performance Response Time(sec) Candidate Size(K) k-AT C-Tree SEGOS 30 10 25 20 1 15 10 0.1 5 0 0 0.01 2 4 6 8 10 12 14 16 18 20 Edit Distance Threshold(0 to 20) 10 30 25 1 0.1 20 15 10 0.01 5 0 0.001 5 10 15 20 25 30 35 40 Data Size(5K to 40K) 2015/4/13 SOC@NUS 26 Thesis defense by Xiaoli Wang Outline Motivation Complex structures Limitations on existing systems Our solutions Inverted index for graph similarity search Inverted index for sequence similarity search A 3-in-1 inverted indexing tool and its applications Conclusion and future work 2015/4/13 SOC@NUS 27 Thesis defense by Xiaoli Wang Sequence similarity search problem KNN Query in a sequence database : Given a sequence database, we want to find its k most similar sequences based on sequence edit distance A query sequence Similar sequences A sequence database Sequence edit distance: λ(s1,s2) = minimum # of operations (insertion, deletion, substitution) to transform sequence s1 into sequence s2 2015/4/13 SOC@NUS 28 Thesis defense by Xiaoli Wang Existing works A naive solution: No pruning and poor scalability 100,000 candidates 100,000 full scan SED computation answers 50 Trie-based methods: Too complex time and space cost Poor scalability on long strings SED computation candidates and large databases answers 70% scan index 100,000 50 Gram-based methods: Too many false positives SED bound computation 100,000 2015/4/13 index SED computation candidates Poor scalability 90,000 answers on large 50 databases Loose GED bound SOC@NUS 29 Thesis defense by Xiaoli Wang Our solution Extend the proposed indexing mechanism on graph search to support efficient string search Use tighter SED bounds to reduce candidate size SED bound computation candidates 100,000 index SED computation 100 answers 50 Inverted index based on the n-gram decomposition The sequence is decomposed into smaller n-grams Similar n-grams are used to tightly bound SED Efficient CA-based filtering strategy is proposed 2015/4/13 SOC@NUS 30 Thesis defense by Xiaoli Wang The n-gram based decomposition i ntroduct ion Sliding Window 5-grams Intuition: Similar strings must share enough number of common n-grams Intuition: Similar strings must share enough number of approximate n-grams Approximate n-gram: n-gram with certain edit distance 2015/4/13 SOC@NUS 31 Thesis defense by Xiaoli Wang Count filtering with approximate n-grams Lemma Consider two strings s1 and s2. If s1 and s2 are within an edit distance of τ, then s1 and s2 must share at least ɸ(s1, s2) = max{|s1|, |s2|} - n + 1 - η(τ, t, n) n-grams with gram edit distance ≤ t. η(τ, t, n) = max{1, n-2×t}+(n-t) ×(τ-1) η(τ, t, n) is the maximum # of n-grams affected by τ edit operations to have gram edit distance >t. For example, η(τ, 0, n) = τ×n. This means that τ edit operations will affect at most τ×n n-grams to have gram edit distance >0. 2015/4/13 SOC@NUS 32 Thesis defense by Xiaoli Wang Mapping distance: distance between two n-gram representations s=“Emm Wo” q=“EmmaW” ng0 ng1 ng3 ng0 0 2 3 ng2 2 1 2 ng4 3 2 1 ng5 3 3 3 Emm 0 ng0 1 mm_ ng2 m_W ng4 _Wo ng5 Emm ng0 mma ng1 maW ng3 ε 3 3 1 3 3 3 G(s) ε G(q) The mapping distance: μ(s, q) = 0 + 1 + 1 + 3 = 5 2015/4/13 SOC@NUS 33 Thesis defense by Xiaoli Wang The lower bound on mapping distance Given two strings s1 and s2. The gram mapping distance μ (s1, s2) between s1 and s2 satisfies μ(s1, s2) ≤ (3n-2)×λ(s1, s2) Given an edit distance threshold value of τ. If we use the summation of gram edit distances as the aggregation function in the CA method, then the threshold value can be computed as τ×(3n-2) 2015/4/13 SOC@NUS 34 Thesis defense by Xiaoli Wang CA-based filtering strategy Algorithm halts: The aggregation value: 2+2+2+2+2+1+1+1+1 = 14 The maximum value in the heap is: max { λs| s∈ top-1} = 1 The CA halts: 14 > 1×(3n-2) = 13 For any unseen sequence, its mapping distance is larger than the aggregation value. This means its edit distance is larger than the maximum edit distance value in the top-k heap. It can be safely filtered out. Top-1 heap GED=0 L0 20 30 L1 20 30 L2 20 30 L3 20 30 L4 20 30 L5 20 30 GED=1 GED=2 2015/4/13 20 30 40 20 30 20 30 20 30 20 30 10 20 30 40 L6 20 L7 20 10 30 10 20 30 20 30 40 L8 10 20 20 ID 10 20 30 40 SOC@NUS Intermediate results Freq.: 0 Freq.: 1 0 -8 8 6 7 0 -- μ(L) -1 3 -- λ -1 3 -- 35 Thesis defense by Xiaoli Wang Overview of the search framework Two-level index Upper-level inverted index between sequences and n1-grams Lower-level inverted index between n1-grams and n2-grams A sequence database 2015/4/13 SOC@NUS 36 Thesis defense by Xiaoli Wang Overview of the search framework The upper-level inverted index ng0 ng1 ng2 ng3 ng4 ng5 ng6 ng7 ng8 ng9 trodu uctio uctiv ntrod oduct roduc ction intro ctive ducti 10 10 20 10 10 10 10 10 20 10 20 20 20 20 20 20 5-grams ID Sequence 10 Introduction 20 introductive Two-level index Upper-level inverted index between sequences and n1-grams Lower-level inverted index between n1-grams and n2-grams A sequence database 2015/4/13 SOC@NUS 37 Thesis defense by Xiaoli Wang Overview of the search framework The upper-level inverted index ng0 ng1 ng2 ng3 ng4 ng5 ng6 ng7 ng8 ng9 trodu uctio uctiv ntrod oduct roduc ction intro ctive ducti 10 10 20 10 10 10 10 10 20 10 20 20 20 20 20 20 2-grams The lower-level inverted index nt ct du in io iv od on ro uc ti ve tr ng3 ng1 ng0 ng7 ng1 ng2 ng0 ng6 ng0 ng1 ng1 ng8 ng0 5-grams ng7 ng2 ng4 ng6 ng8 ng3 ng3 ng2 ng2 ng3 ID Sequence ng4 ng5 ng4 ng5 ng4 ng6 ng7 10 Introduction ng6 ng9 ng5 ng7 ng5 ng8 20 introductive ng8 ng9 ng9 ng9 Two-level index Upper-level inverted index between sequences and n1-grams Lower-level inverted index between n1-grams and n2-grams A sequence database 2015/4/13 SOC@NUS 38 Thesis defense by Xiaoli Wang Overview of the search framework a query sequence q Similar sequences to the query q: ng2 q: ng1 The sequence KNN search Construct score-sorted lists using the n1-grams from the lower-level The n1-gram similarity search Update value of t Return n1-grams with distance ≤ t Accumulate the count of similar n1-grams, and run CA algorithm to check the halting condition Two-level index Upper-level inverted index between sequences and n1-grams Lower-level inverted index between n1-grams and n2-grams A sequence database 2015/4/13 SOC@NUS 39 Xiaoli WANG: Technical Meeting Thesis defense by Xiaoli Wang Experimental results Dataset: 50K, average length=350, max length=5000 100 query sequences Quality of count filtering 2015/4/13 KNN query performance SESAME@IDMI 40 Thesis defense by Xiaoli Wang Outline Motivation Complex structures Limitations on existing systems Our solutions Inverted index for graph similarity search Inverted index for sequence similarity search A 3-in-1 inverted indexing tool and its applications Conclusion and future work 2015/4/13 SOC@NUS 41 Thesis defense by Xiaoli Wang The 3-in-1 storage system Our proposed inverted index can support efficient graph and sequence similarity search Challenges on inverted index Various complex structures => unified key and value data structure Various query processing algorithms => general list processing method 2015/4/13 SOC@NUS 42 Thesis defense by Xiaoli Wang Challenge 1: unified index structures Key: smaller substructures graph (star), sequence (n-gram), tree (binary branch[1]) A key is a string Posting list: a list of values value (the complex structure linked to the key) A value is an integer of document number Unified inverted index (Terms with posting list of document ids) stars n-grams binary branches Graph data Sequence data Tree data 1. Yang, R. etc. Similarity Evaluation on Tree-structured Data. In SIGMOD 2005. 2015/4/13 SOC@NUS 43 Thesis defense by Xiaoli Wang Challenge 1: unified index structures Key: smaller substructures graph (star), sequence (n-gram), tree (binary branch[1]) A key is a string Posting list: a list of values value (the complex structure linked to the key) A value is an integer of document number ID Sequence 10 Introduction 20 introductive The upper-level inverted index 5-grams trodu uctio uctiv ntrod oduct roduc ction intro ctive ducti 10 20 10 20 10 20 10 20 10 20 10 10 20 20 10 20 1. Yang, R. etc. Similarity Evaluation on Tree-structured Data. In SIGMOD 2005. 2015/4/13 SOC@NUS 44 Thesis defense by Xiaoli Wang Challenge 2: common list processing List scanning: CA-based method Mapping distance as the aggregation value Count filtering as the pruned technique Implementation on top of Lucene[1] Application layer Graph similarity search Sequence similarity search Tree similarity search List processing Lucene index Index layer 1. http://lucene.apache.org/. 2015/4/13 SOC@NUS 45 Thesis defense by Xiaoli Wang Real application: social reading systems Book duplicate detection problem Graph matching problem Book annotation migration problem Sequence matching problem Annotation search by snapping problem Sequence similarity search problem Build an information management tool to support the storage and retrieval of sequences and graphs 2015/4/13 SOC@NUS 46 Thesis defense by Xiaoli Wang Real application: social reading systems An information management tool for Readpeer.com[1] 1. http://readpeer.com/. 2015/4/13 SOC@NUS 47 Thesis defense by Xiaoli Wang Real application: social reading systems An information management tool for Readpeer.com[1] Efficient annotation retrieval: about 650ms Book duplicate detection: about 1 sec 1. http://readpeer.com/. 2015/4/13 SOC@NUS 48 Thesis defense by Xiaoli Wang Outline Motivation Complex structures Limitations on existing systems Our solutions Inverted index for graph similarity search Inverted index for sequence similarity search A 3-in-1 inverted indexing tool and its applications Conclusions and future work 2015/4/13 SOC@NUS 49 Thesis defense by Xiaoli Wang Conclusions Build a unified 3-in-1 index for various complex structures Propose an inverted index to support efficient graph similarity search Further extend the inverted index to support efficient sequence similarity search Implement a unified inverted index to support various similarity search problems on complex structures The unified inverted index helps to build an effective information management tool in our real social reading system 2015/4/13 SOC@NUS 50 Thesis defense by Xiaoli Wang Future works Implement the unified index using distributed system Other Challenges on inverted index Index size is proportional to data size => disk base storage List processing may be slow with multiple I/Os => parallel processing to distribute I/Os into different machines 2015/4/13 SOC@NUS 51 Thesis defense by Xiaoli Wang Future works Implement the unified index using distributed system 2015/4/13 SOC@NUS 52 Thesis defense by Xiaoli Wang Future works Extend the unified inverted index to support more types of queries on other data models Nested structured data High dimensional data (e.g., image or video query processing) 2015/4/13 SOC@NUS 53 Thesis defense by Xiaoli Wang Thank You! Q&A 2015/4/13 SOC@NUS 54