Filter Algorithms for Approximate String Matching
Stefan Burkhardt

Outline
• Motivation
• Filter Algorithms
• Gapped q-grams
• Experimental Analysis

Problems and Motivation
• Why?
• Approximate String Matching
• Edit and Hamming Distance

Motivation
• Computational Biology: EST clustering, assembly, genome comparison (e.g. Human/Mouse)
• Information Retrieval: phonebooks, dictionaries, search engines
• Many more...

The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
find all substrings y of S with d(P, y) ≤ k.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
• d(x,y) = Hamming distance: the k-mismatches problem
• d(x,y) = edit distance: the k-differences problem

Filter Algorithms
• How?
• BLAST
• The q-gram Lemma and QUASAR

Filter Algorithm
1. Filtration phase: apply the filter criterion to S and P. The result is a set of potential matches.
2. Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches.

BLAST (Altschul, Karlin, et al.)
• Sequential scan of S locates all q-grams that match P
• Iterative extension with cutoff to find good matches
Problems for high similarity:
• the sequential scan is quite time-consuming
• single q-grams are unspecific

Indexed Filter Algorithm
S is preprocessed into an index. The indexed filter algorithm uses P and the index to produce the potential matches, which the exact algorithm then verifies as before.
Pro:
• potentially faster evaluation of the filter criterion
Con:
• preprocessing time
• extra space required
• only good for some filter criteria
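As a point of reference for the k-mismatches problem stated above, here is a minimal Python sketch of the naive exact scan that compares P with every window of S. The function names are illustrative and not taken from the slides; the example strings are the P and S shown on the problem slide. A filter algorithm aims to run this kind of exact check only on the potential matches instead of on all of S.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def k_mismatch_positions(pattern: str, text: str, k: int) -> list[int]:
    """Start positions of all substrings y of text with d(pattern, y) <= k."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if hamming_distance(pattern, text[i:i + m]) <= k
    ]


# Example strings from the problem slide.
P = "GAT"
S = "ACTGATAACGTTAGCCATGG"
exact = k_mismatch_positions(P, S, 0)      # k = 0: exact occurrences only
one_error = k_mismatch_positions(P, S, 1)  # k = 1: at most one mismatch
print(exact, one_error)
```

The scan costs O(|P| · |S|) character comparisons, which is the work a filtration phase tries to avoid for most of S.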
QUASAR (Burkhardt, Rivals et al. 1999)
• Filter criterion: q-gram lemma (Jokinen, Ukkonen 1991)
• Index structure: lookup table (Jokinen, Ukkonen 1991) with suffix array (Manber, Myers 1990)
• Match detection: overlapping rectangles in the DP matrix

The q-gram Lemma (Jokinen, Ukkonen 1991)
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least
t = |P| - q + 1 - kq
substrings of length q (q-grams).
Example: P = TCGATTAC, y = TCGAATAC, |P| = 8, q = 3
• q-grams of P: TCG CGA GAT ATT TTA TAC
• total number of q-grams: |P| - q + 1 = 6
• each error can 'destroy' q matching q-grams, so k errors lose at most kq q-grams

Match Detection (Jokinen, Ukkonen 1991)
• overlapping rectangles of width 2|P| in the DP matrix
• a rectangle with at least t hits is a potential match
[Figure: DP matrix of S against P, divided into overlapping rectangles with 3, 3, 2, 2 and 1 hits; with t = 3, the rectangles with 3 hits are potential matches.]
• QUASAR (Burkhardt, Rivals et al. 1999): wider rectangles are efficient in practice (2048 for QUASAR)

QUASAR (Burkhardt, Rivals et al. 1999)
• BLAST for the verification of the potential matches
• wider rectangles as match regions
• index is a combination of lookup table and suffix array
• used for EST clustering at the DKFZ in Heidelberg
• searches for EST clustering about 30 times faster than BLAST

Gapped q-grams
• A new (old?) idea
• Hamming Distance
• Finding good shapes

Gapped q-grams
General idea:
• use gapped q-grams
• call the arrangement of gaps the shape
Gapped 3-shape: ##.# (# = match, . = don't care)
Example: TCGATTAC yields TC.A CG.T GA.T AT.A TT.C
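The q-gram lemma example and the gapped q-grams listed above can be reproduced with a short Python sketch. It treats a contiguous q-gram as the special case of a shape without gaps (### for q = 3). The helper names are hypothetical, and `aligned_hits` assumes the Hamming case, where P and y have equal length and q-grams are compared position by position.

```python
def shape_qgrams(s: str, shape: str) -> list[str]:
    """All q-grams of s under shape: '#' keeps a character, '.' is a don't-care."""
    span = len(shape)
    return [
        "".join(c if mark == "#" else "." for c, mark in zip(s[i:i + span], shape))
        for i in range(len(s) - span + 1)
    ]


def lemma_threshold(pattern_len: int, q: int, k: int) -> int:
    """q-gram lemma threshold t = |P| - q + 1 - k*q (contiguous q-grams only)."""
    return pattern_len - q + 1 - k * q


def aligned_hits(p: str, y: str, shape: str) -> int:
    """Number of positions at which p and y have the same q-gram (Hamming case)."""
    return sum(a == b for a, b in zip(shape_qgrams(p, shape), shape_qgrams(y, shape)))


# Slide example: y differs from P by a single mismatch.
P, Y = "TCGATTAC", "TCGAATAC"
contiguous = shape_qgrams(P, "###")
gapped = shape_qgrams(P, "##.#")
t = lemma_threshold(len(P), q=3, k=1)
hits = aligned_hits(P, Y, "###")
print(contiguous, gapped, t, hits)
```

In this example the single mismatch destroys three of the six contiguous q-grams, so the number of surviving hits equals the threshold given by the lemma.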
Previous work
• Califano, Rigoutsos (1993)
• Pevzner, Waterman (1995)
• Lehtinen, Sutinen, Tarhio (1996)
• no exact threshold given for the general case
• limited attention paid to the choice of shapes

Recently
• Buhler (2001): multiple shapes
• Ma, Tromp, Li (2002): PatternHunter
• threshold t = 1

The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors.

Classic 3-shape ###, k = 3, error placement OOXOOXOOXOO (X = error):
  OOX OXO XOO OOX OXO XOO OOX OXO XOO
  Every q-gram contains an error, so t = 0: no filter!

Gapped 3-shape ##.#, k = 3, error placement OOOXXOOXOOO:
  OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
  One q-gram (OO.O) remains, so t = 1.

• gapped shapes can have higher(!) thresholds t than ungapped shapes
• no simple formula for t
• we used a DP-based approach to compute t

Finding good shapes
[Figure: tradeoff plot with the number of q-gram hits (filtration time, depends on q) on one axis and the number of potential matches (verification time) on the other. Good filters are low on both axes, bad filters are high on both, and a tradeoff line runs between them. A second build of the figure places a "?" in the good-filter region.]

Finding good shapes
For |P| = 13, k = 3 and q = 3, the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason:
  ##.#
   ##.#
  -----  5 characters
  ###
   ###
  ----   4 characters
A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.
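For small inputs, both the threshold t and the number of characters that t matching shapes must cover can be checked by exhaustive search. The Python sketch below does this by brute force over all error placements (mismatches only). It is not the DP-based approach mentioned on the slides, and the function names are illustrative.

```python
from itertools import combinations


def shape_offsets(shape: str) -> list[int]:
    """Offsets of the '#' (match) positions inside a shape such as '##.#'."""
    return [i for i, mark in enumerate(shape) if mark == "#"]


def threshold(pattern_len: int, shape: str, k: int) -> int:
    """Worst-case number of q-grams left intact by k mismatches (brute force)."""
    offsets = shape_offsets(shape)
    starts = range(pattern_len - len(shape) + 1)
    worst = len(starts)
    for errors in combinations(range(pattern_len), k):
        hit = set(errors)
        # A q-gram survives if none of its '#' positions carries an error.
        surviving = sum(all(s + o not in hit for o in offsets) for s in starts)
        worst = min(worst, surviving)
    return worst


def min_covered_chars(pattern_len: int, shape: str, t: int) -> int:
    """Fewest distinct characters covered by t shape occurrences at distinct starts."""
    offsets = shape_offsets(shape)
    starts = range(pattern_len - len(shape) + 1)
    return min(
        len({s + o for s in chosen for o in offsets})
        for chosen in combinations(starts, t)
    )


for shape in ("###", "##.#"):
    print(shape, threshold(11, shape, 3), threshold(13, shape, 3),
          min_covered_chars(13, shape, 2))
```

The search enumerates every choice of k error positions, so it is only practical for short patterns and small k; it serves here to confirm the slide examples, not as an efficient method.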
Finding good shapes
  CGACGATTGAT
     ##.#
      ##.#
     -----
  ACTCGATTAGA
For t = 2 and the shape ##.#, the minimum coverage is 5.
We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.

Finding good shapes
[Figure: tradeoff plot as before, now with the minimum coverage c_m as the measure for the number of potential matches and q as the measure for the number of q-gram hits. Good filters have high c_m; bad filters have low c_m.]

Finding good shapes
• compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6
[Figure: number of shapes with a given minimum coverage for k = 5, q = 8. Horizontal axis: minimum coverage (8 to 22). Vertical axis: number of shapes (0 to 600). One curve each for t = 1 to t = 5, with the median, the best and the contiguous shape marked.]

Experimental Analysis
• Speed and Filtration Efficiency
• The Heuristic Zone
• A few different Filters

Speed and Filtration Efficiency
[Figure: number of matches and number of hits (logarithmic scales, powers of 2) for k = 5, |P| = 50, |S| = 50 Mbp, plotted against minimum coverage and against q (6 to 12). Series: gapped (Hamming) and contiguous.]

Describing Filter Properties
From Hits to Matches
Filters usually have 3 "recognition zones" depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with k marked on the error axis.]
The Heuristic Zone
Problem: behaviour in the heuristic zone is hard to predict.
[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|). The heuristic zone is highlighted; k and |P| - c_m are marked on the error axis.]

A simple idea: Sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the heuristic zone.

A few different Filters
[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30), |P| = 50, 1000 samples for each error level. Filters shown: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10); BLAST (k=3, q=11; k=4, q=10).]
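The sampling procedure can be sketched in a few lines of Python. This sketch makes simplifying assumptions that are not from the slides: it covers mismatches only, it represents each sample string just by the set of its error positions, and it counts a sample as recognized when at least t q-grams of the given shape stay error-free at their original positions. The names `surviving_qgrams` and `recognition_rate` are hypothetical.

```python
import random


def surviving_qgrams(error_positions: set[int], pattern_len: int, shape: str) -> int:
    """q-grams of the shape whose '#' positions all avoid the error positions."""
    offsets = [i for i, mark in enumerate(shape) if mark == "#"]
    return sum(
        all(s + o not in error_positions for o in offsets)
        for s in range(pattern_len - len(shape) + 1)
    )


def recognition_rate(pattern_len: int, shape: str, t: int, errors: int,
                     samples: int = 1000, seed: int = 0) -> float:
    """Percentage of random samples with `errors` mismatches that reach t hits."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same rate
    recognized = 0
    for _ in range(samples):
        # Step 1: one sample = `errors` distinct random mismatch positions.
        error_positions = set(rng.sample(range(pattern_len), errors))
        # Step 2: apply the filter criterion (at least t surviving q-grams).
        if surviving_qgrams(error_positions, pattern_len, shape) >= t:
            recognized += 1
    # Step 3: report the recognized fraction in percent.
    return 100.0 * recognized / samples


# Contiguous filter for |P| = 50, k = 3, q = 11: t = 50 - 11 + 1 - 3*11 = 7.
shape, t = "#" * 11, 7
for i in range(0, 31, 5):
    print(i, recognition_rate(50, shape, t, errors=i))
```

Up to k errors the rate has to be 100% by the q-gram lemma; the values for larger i give an empirical picture of the heuristic zone under the stated simplifications.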
[Figure: recognition rate (50% to 100%) against the number of errors (0 to 15), |P| = 50, 1000 samples for each error level. Filters shown: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10); BLAST (k=3, q=11; k=4, q=10).]

Conclusion - Future Work
Our Work:
• Significant sensitivity improvement over existing filters
• Required modifications easy to implement
• Methods for describing filter properties
Future Work:
• Combination of "orthogonal" shapes into one filter
• Use of word neighborhoods
• Database of filter properties for good shapes