Efficient Merging and Filtering Algorithms for Approximate String Searches Jiaheng Lu, University of California, Irvine Joint work with Chen Li, Yiming Lu 1 Example: a movie database Find movies starred Schwarrzenger. Star Keanu Reeves Title The Matrix Year 1999 Genre Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime 2 In general: Gap between Queries and Data Errors in the query The user doesn’t remember a string exactly The user unintentionally types a wrong string Query: Schwarrzenger. Data : Schwarzenegger 3 … … Data may not clean Errors in the database: Data often is not clean by itself, especially true in data integration and cleansing Relation R Star Relation S Star Keanu Reeves Keanu Reeves Samuel Jackson Samuel L. Jackson Schwarzenegger Schwarzenegger Samuel Jackson Samuel L. Jackson 4 Query may include error 5 Problem definition: approximate string searches Collection of strings s Search Query q Star Keanu Reeves Samuel Jackson Schwarzenegger Samuel Jackson … Output: strings s that satisfy Sim(q,s)≤δ 6 Example Similarity Function: Edit Distance A widely used metric to define string similarity Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 7 Example: approximate string searches Collection of strings s Search Query q Tom Hanks Star Tom Hank Thomas Hanks Ton Hank Tom J. Hanks … Output: strings s that satisfy ed(q,s)≤2 8 Outline Problem motivation Preliminary Grams Inverted lists Merge algorithms Filtering technique Conclusion 9 String Grams q-grams For example: 2-gram u n i v e r s a l (un),(ni),(iv),(ve),(er),(rs),(sa),(al) 10 Inverted lists Convert strings to gram inverted lists id 0 1 2 3 4 strings rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc 4 0 2 1 0 0 1 4 1 3 3 3 1 2 4 2 3 4 2 4 11 Main Example Query ed(s,q)≤1 stick (st,ti,ic,ck) Grams Data id strings 0 rich 1 stick 2 stich 3 stuck 4 static st 1,2,3,4 ti 1,2,4 ic 0,1,2,4 ck 1,3 Merge Candidate string ids {1,2,3,4} count >=2 Double check for the real edit distance ck 1,3 ic 0,1,2,4 Final answers st 1,2,3,4 {1,2,3} ta 4 ti … 1,2,4 Performance bottleneck! 12 Sub-problem definitions: Given multiple inverted lists with integer values in increasing order and a threshold T, we find all values whose number of occurrences ≥ T. 13 Example Count threshold: 4 1 10 5 3 13 7 5 15 13 13 15 10 13 Result: 13 14 Outline Problem motivation Preliminary Merge algorithms Two previous algorithms Our proposed three algorithms Filtering technique Conclusion 15 Five Merge Algorithms HeapMerger MergeOpt [Sarawagi,SIGMOD 2004] [Sarawagi,SIGMOD 2004] Previous New ScanCount MergeSkip DivideSkip 16 Two previous algorithms (1) Heap-based Algorithm Push to heap …… Min-heap Count # of the occurrences of each element by a heap 17 Example of HeapMerger [Sarawagi et al 2004] 1 minHeap 10 13 5 15 1 10 5 3 13 7 5 15 13 13 15 10 13 Count threshold ≥ 4 18 Five Merge Algorithms MergeOpt HeapMerger [Sarawagi 2004] [Sarawagi 2004] Previous New ScanCount MergeSkip DivideSkip 19 Two previous algorithms (2) MergeOpt Algorithm Binary search Long Lists: T-1 Short Lists 20 Example of MergeOpt [Sarawagi et al 2004] Min-heap 1 10 5 3 13 7 5 15 13 13 15 10 13 Long Lists: 3 Short Lists: 2 Count threshold ≥ 4 21 Can we run faster? 22 Five Merge Algorithms HeapMerger MergeOpt Previous New ScanCount MergeSkip DivideSkip 23 Our new algorithms (1) ScanCount Algorithm Use an array to record # of occurrences of each 24 element ScanCount Example 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 0 1 0 2 0 1 0 0 2 0 0 4 0 1 Result:13 1 10 5 3 13 7 5 15 13 13 15 10 13 Count threshold ≥ 4 25 Five Merge Algorithms HeapMerger MergeOpt Previous New ScanCount MergeSkip DivideSkip 26 Our new algorithms (2) MergeSkip algorithm …… Min-heap Pop T-1 Jump T-1 27 Example of MergeSkip minHeap 1 10 5 3 13 7 5 15 13 13 15 10 13 Count threshold ≥ 4 28 Example of MergeSkip 1 minHeap 5 13 10 15 1 10 5 3 13 7 5 15 13 13 15 10 13 Count threshold ≥ 4 29 Example of MergeSkip minHeap Pop 1, 5,10 13 15 1 10 5 3 13 7 5 15 13 13 15 10 13 Count threshold ≥ 4 30 Example of MergeSkip minHeap Pop 1, 5,10 13 15 1 10 5 Jump 3 13 7 ≥ 13 5 15 13 13 15 10 13 Count threshold ≥ 4 31 Example of HeapMerger minHeap 13 13 13 13 15 1 10 5 3 13 7 5 15 13 13 15 Result:13 10 13 Count threshold ≥ 4 32 Five Merge Algorithms HeapMerger MergeOpt Previous New ScanCount MergeSkip DivideSkip 33 Our new algorithms (3) DivideSkip Algorithm MergeSkip Binary search Long Lists: dynamic size Short Lists 34 Size of long lists How many lists are treated as long lists? Cost: MergeOpt Binary search Long Lists Short Lists 35 Size of long lists How many lists are treated as long lists? Cost: MergeSkip Binary search Long Lists Short Lists 36 Decide L value A good balance in the tradeoff: # of long lists = T / ( μ logM +1) 37 Empirically verification Our formula about “L” achieves the best result over other options. 38 Experimental data sets Three real data sets have various string lengths and data sizes DBLP data IMDB data Google Web corpus 39 Performance (DBLP data) DivideSkip is the best one Running time per query with various algorithms 40 # of elements reading (DBLP data) DivideSkip is the best one DivideSkip skips reading the most elements 41 Outline Problem motivation Preliminary Merge algorithms Filtering technique Length, positional filter [Gravano et al. VLDB 2001] Filter tree Conclusion and future work 42 Length Filtering Length: 10 s: By length only! Ed(s,t) ≤ 2 t: Length: 19 43 Positional Filtering Positional Gram For example: string abcd: {(ab,1),(bc,2),(cd,3)} Ed(s,t) ≤ 2 s a b (ab,1) t a b (ab,12) 44 Filter tree root 1 2 aa ab 3 … … zy 1 2 5 12 17 28 44 n Gram level zz … Length level m Position level Inverted list 45 Surprising experimental results(DBLP) Heap MergeOpt No filter Length 115.42 11.98 Length+Pos 3.64 Wisely use 14.22 1.40 filters, 6.78 more filters may be bad! ScanCount 30.91 2.68 2.14 MergeSkip 10.12 1.09 2.65 DivideSkip 2.23 0.76 1.96 Conclusion Three new merge algorithms We run faster Surprising experimental results Wisely use filters, more filters may be bad! Thank you! 48 Backup : related work Approximate string matching [Navarro 2001] Fuzzy lookup in Varied length Grams [Li et al 2007] 49 Reference 1. 2. 3. [Arasu 2006] A. Arasu and V. Ganti and R. Kaushik “Efficient Exact Set-similarity Joins” in VLDB 2006 [Chaudhuri 2003] S. Chaudhuri ,K Ganjam, V. Ganti and R. Motwani “Robust and Efficient Fuzzy Match for online Data Cleaning” in SIGMOD 2003 [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava “Approximate string joins in a database almost for free” in VLDB 2001 50 Reference 4. [Li 2007] C. Li, B Wang and X. Yang “VGRAM:Improving performance of approximate queries on string collections using variablelength grams ” in VLDB 2007 5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching” in Computing survey 2001 6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates” in ACM SIGMOD 2004 51