defense - Burkhardt, Stefan

Filter Algorithms for Approximate String Matching
Stefan Burkhardt

Outline
• Motivation
• Filter Algorithms
• Gapped q-grams
• Experimental Analysis

Motivation
Computational biology:
• EST clustering
• Assembly
• Genome comparison (e.g. human/mouse)
Information retrieval:
• Phonebooks
• Dictionaries
• Search engines
Many more...

The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
find all substrings y of S with d(P, y) ≤ k.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
• d(x,y) = Hamming distance: the k-mismatches problem
• d(x,y) = edit distance: the k-differences problem

Filter Algorithms
A filter algorithm takes S and P and works in two phases:
1. Filtration phase: apply the filter criterion to obtain potential matches
2. Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches

BLAST (Altschul, Karlin, et al.)
• Sequential scan of S locates all q-grams matching P
• Iterative extension with cutoff to find good matches
Problems for high similarity:
• sequential scan is quite time-consuming
• single q-grams are unspecific

Indexed filter algorithms
S is preprocessed into an index. The indexed filter algorithm uses P and the index to produce potential matches, which the exact algorithm then verifies as before.
Pro:
• potentially faster evaluation of the filter criterion
Con:
• preprocessing time
• extra space required
• only good for some filter criteria
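
The slides do not say which exact algorithm is used in the verification phase. As a minimal sketch of the k-mismatches problem defined above, the naive scan below checks every window of S against P; the function names hamming and k_mismatches are illustrative and not taken from the talk.

```python
def hamming(x, y):
    """Number of mismatching positions of two equal-length strings."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))


def k_mismatches(pattern, target, k):
    """Start positions i such that target[i:i+|P|] is within Hamming distance k of pattern."""
    m = len(pattern)
    return [i for i in range(len(target) - m + 1)
            if hamming(pattern, target[i:i + m]) <= k]


if __name__ == "__main__":
    S = "ACTGATAACGTTAGCCATGG"
    print(k_mismatches("GAT", S, 0))  # exact occurrences of GAT
    print(k_mismatches("GAT", S, 1))  # occurrences with at most one mismatch
```

This scan costs O(|P| |S|) character comparisons, which is the cost a filter tries to avoid by restricting verification to the potential matches.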
QUASAR (Burkhardt, Rivals et al. 1999)
• Filter criterion: q-gram lemma (Jokinen, Ukkonen 1991)
• Index structure: lookup table (Jokinen, Ukkonen 1991) with suffix array (Manber, Myers 1990)
• Match detection: overlapping rectangles in the DP matrix

The q-gram lemma (Jokinen, Ukkonen 1991)
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
Example: |P| = 8, q = 3
P = TCGATTAC has the q-grams TCG, CGA, GAT, ATT, TTA, TAC
y = TCGAATAC differs from P in one position
• total number of q-grams: |P| - q + 1 = 6
• each error can "destroy" up to q matching q-grams, so k errors lose at most kq q-grams

Match detection (Jokinen, Ukkonen 1991)
• overlapping rectangles of width 2|P| in the DP matrix
• a rectangle with at least t hits is a potential match
[Figure: DP matrix of S against P covered by overlapping rectangles containing 3, 3, 2, 2 and 1 hits; threshold t = 3.]
• QUASAR (Burkhardt, Rivals et al. 1999): wider rectangles are efficient in practice (2048 for QUASAR)

QUASAR (Burkhardt, Rivals et al. 1999)
• BLAST for the verification of the potential matches
• wider rectangles as match regions
• index is a combination of lookup table and suffix array
• used for EST clustering at the DKFZ in Heidelberg
• searches for EST clustering about 30 times faster than BLAST

Gapped q-grams
• A new (old?) idea
• Hamming distance
• Finding good shapes

General idea
• use gapped q-grams
• call the arrangement of gaps the shape
Example: gapped 3-shape ##.# (# = match, . = don't care)
TCGATTAC contains the gapped q-grams TC.A, CG.T, GA.T, AT.A, TT.C
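
The two examples above (the q-gram lemma for TCGATTAC and the gapped q-grams of the shape ##.#) can be reproduced with a few lines of Python. This is only a sketch: the helper names qgrams, lemma_threshold and shared_qgrams are assumptions made for illustration, and contiguous q-grams are treated as the special case of a shape without gaps.

```python
from collections import Counter


def qgrams(s, shape):
    """All q-grams of s under a shape such as '###' or '##.#'.

    Characters under '#' are kept, characters under '.' are masked by '.'.
    """
    span = len(shape)
    return ["".join(c if m == "#" else "." for c, m in zip(s[i:i + span], shape))
            for i in range(len(s) - span + 1)]


def lemma_threshold(m, q, k):
    """t = |P| - q + 1 - k*q from the q-gram lemma (contiguous q-grams only)."""
    return m - q + 1 - k * q


def shared_qgrams(p, y, shape):
    """Number of q-grams common to p and y, counted with multiplicity."""
    common = Counter(qgrams(p, shape)) & Counter(qgrams(y, shape))
    return sum(common.values())


if __name__ == "__main__":
    P, y = "TCGATTAC", "TCGAATAC"
    print(qgrams(P, "###"))            # the six contiguous 3-grams of P
    print(lemma_threshold(len(P), 3, 1), shared_qgrams(P, y, "###"))
    print(qgrams(P, "##.#"))           # the five gapped q-grams of P
```

For the example pair, the single mismatch leaves exactly three shared 3-grams, which meets the bound t = 8 - 3 + 1 - 3 = 3 with equality.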
Previous work
• Califano, Rigoutsos (1993)
• Pevzner, Waterman (1995)
• Lehtinen, Sutinen, Tarhio (1996)
• no exact threshold given for the general case
• limited attention paid to the choice of shapes
Recently:
• Buhler (2001): multiple shapes
• Ma, Tromp, Li (2002): Pattern Hunter
• threshold t = 1

The threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors.
Example with k = 3 and |P| = 11 (O = matching position, X = error):
• classic 3-shape ###, error placement OOXOOXOOXOO: the q-grams are OOX, OXO, XOO, OOX, OXO, XOO, OOX, OXO, XOO. Every q-gram contains an error, so t = 0: no filter!
• gapped 3-shape ##.#, error placement OOOXXOOXOOO: the q-grams are OO.X, OO.X, OX.O, XX.O, XO.X, OO.O, OX.O, XO.O. One q-gram (OO.O) survives, so t = 1.
• gapped shapes can have higher(!) thresholds t than ungapped shapes
• no simple formula for t
• we used a DP-based approach to compute t

Finding good shapes
[Figure: trade-off diagram. One axis is the number of q-gram hits (filtration time), which is high for low q and low for high q; the other is the number of potential matches (verification time). A trade-off line separates good filters (low on both) from bad filters (high on both).]

For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: two overlapping occurrences of ##.# cover at least 5 characters, while two overlapping occurrences of ### cover only 4. A random match therefore requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.
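
The thresholds and the counts of 5 and 4 covered characters quoted above can be checked by exhaustive enumeration for small patterns. The sketch below is not the DP-based approach mentioned on the slide; it is a brute-force stand-in that assumes Hamming distance (errors are mismatches, and all shape occurrences lie on the same alignment). The function names are illustrative.

```python
from itertools import combinations


def offsets(shape):
    """Positions of the '#' characters inside a shape."""
    return [i for i, c in enumerate(shape) if c == "#"]


def threshold(shape, m, k):
    """Worst-case number of q-grams that survive k mismatches in a pattern of length m."""
    offs = offsets(shape)
    starts = range(m - len(shape) + 1)
    best = None
    for errors in combinations(range(m), k):
        err = set(errors)
        # A q-gram survives if none of its '#' positions is hit by an error.
        surviving = sum(all(s + o not in err for o in offs) for s in starts)
        best = surviving if best is None else min(best, surviving)
    return best


def min_coverage(shape, m, t):
    """Fewest distinct pattern positions covered by t distinct placements of the shape."""
    offs = offsets(shape)
    starts = range(m - len(shape) + 1)
    return min(len({s + o for s in placement for o in offs})
               for placement in combinations(starts, t))


if __name__ == "__main__":
    for shape in ("###", "##.#"):
        print(shape,
              threshold(shape, 11, 3),     # the |P| = 11 example
              threshold(shape, 13, 3),     # the |P| = 13 example
              min_coverage(shape, 13, 2))  # characters covered by t = 2 shapes
```

The enumeration visits every choice of k error positions, so it is only practical for short patterns and small k; for |P| = 50 a smarter method such as the DP mentioned above would be needed.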
[Figure: CGACGATTGAT aligned with ACTCGATTAGA. The two strings agree on the five characters CGATT, which are covered by two overlapping occurrences of the shape ##.#.]
For t = 2 and the shape ##.# the minimum coverage is 5.
We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.

[Figure: the trade-off diagram with the minimum coverage c_m on one axis. Good filters have a high c_m and few potential matches; bad filters have a low c_m. The other axis is the number of q-gram hits, which is high for low q.]

• compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6
[Figure: histogram for k = 5 and q = 8 of the number of shapes (0 to 600) with a given minimum coverage (8 to 22), with separate series for t = 1 to t = 5. The median, the best shape and the contiguous shape are marked.]

Experimental Analysis
• Speed and filtration efficiency
• The heuristic zone

[Figure: speed and filtration efficiency for k = 5, |P| = 50, |S| = 50 Mbps. One panel shows the number of matches against the minimum coverage, the other the number of hits against q (6 to 12), comparing gapped (Hamming) and contiguous shapes.]

Describing filter properties
Filters usually have 3 "recognition zones" depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Figure: recognition rate (100% to 0%) against the number of errors, with axis marks at 0, k, |P| - c_m and |P|.]

Problem: behaviour in the heuristic zone is hard to predict.

A simple idea: sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the heuristic zone.

[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30) for |P| = 50 with 1000 samples for each error level. Filters shown: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10); BLAST (k=3, q=11; k=4, q=10).]
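
The three sampling steps can be sketched as follows. This is an illustration under stated assumptions, not the experimental setup of the talk: errors are mismatches only (Hamming distance), the alphabet is ACGT, a sample counts as recognized when at least t shape occurrences match at the same positions, and the names mutate, matching_qgrams and recognition_rate are hypothetical.

```python
import random


def matching_qgrams(p, y, shape):
    """Number of shape placements at which p and y agree on every '#' position."""
    offs = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(len(p) - len(shape) + 1)
    return sum(all(p[s + o] == y[s + o] for o in offs) for s in starts)


def mutate(p, i, rng, alphabet="ACGT"):
    """Copy of p with exactly i positions replaced by a different character."""
    y = list(p)
    for pos in rng.sample(range(len(p)), i):
        y[pos] = rng.choice([c for c in alphabet if c != p[pos]])
    return "".join(y)


def recognition_rate(p, shape, t, i, samples=1000, seed=0):
    """Percentage of samples with i errors that still reach the threshold t."""
    rng = random.Random(seed)
    recognized = sum(matching_qgrams(p, mutate(p, i, rng), shape) >= t
                     for _ in range(samples))
    return 100.0 * recognized / samples


if __name__ == "__main__":
    P = "ACGTACGTACGTA"  # |P| = 13, arbitrary example pattern
    for errors in range(0, 11):
        print(errors, recognition_rate(P, "##.#", 2, errors))
```

With |P| = 13, the shape ##.# and t = 2, error levels up to k = 3 lie in the guarantee zone, so the rate there is 100% by construction; the interesting output is how quickly the rate drops beyond that point.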
[Figure: recognition rate (50% to 100%) against the number of errors (0 to 15) for |P| = 50 with 1000 samples for each error level. Filters shown: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10); BLAST (k=3, q=11; k=4, q=10).]

Conclusion - Future Work
Our work:
• Significant sensitivity improvement over existing filters
• Required modifications are easy to implement
• Methods for describing filter properties
Future work:
• Combination of "orthogonal" shapes into one filter
• Use of word neighborhoods
• Database of filter properties for good shapes
