defense - Burkhardt, Stefan

Filter Algorithms for
Approximate String Matching
Stefan Burkhardt
Outline
 Motivation
 Filter Algorithms
 Gapped q-grams
 Experimental Analysis
Problems and Motivation
Why ?
Approximate String Matching
Edit and Hamming Distance
Motivation
Computational Biology:
 EST Clustering
 Assembly
 Genome comparison (e.g. Human/Mouse)
Information Retrieval
 Phonebooks
 Dictionaries
 Search Engines
Many more…
The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
Find all substrings y of S with d(P, y) ≤ k
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
The global approximate string matching problem
d(x,y) = Hamming Distance: the k-mismatches problem
d(x,y) = Edit Distance: the k-differences problem
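As a baseline for both problem variants, here is a minimal sketch of the exact (unfiltered) searches. The function names are illustrative and not from the slides; the edit-distance version is the standard column-wise dynamic programme in which a match may start anywhere in the text.

```python
def hamming_matches(pattern, text, k):
    """Start positions where pattern occurs in text with at most k mismatches."""
    m = len(pattern)
    starts = []
    for start in range(len(text) - m + 1):
        mismatches = sum(a != b for a, b in zip(pattern, text[start:start + m]))
        if mismatches <= k:
            starts.append(start)
    return starts


def edit_match_ends(pattern, text, k):
    """End positions (exclusive) of substrings of text within edit distance k of pattern."""
    m = len(pattern)
    # column[i] = best edit distance between pattern[:i] and a substring
    # of text ending at the current text position
    column = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = column[0]  # column[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost, column[i] + 1, column[i - 1] + 1)
            prev_diag = column[i]
            column[i] = new
        if column[m] <= k:
            ends.append(j)
    return ends


S = "ACTGATAACGTTAGCCATGG"
print(hamming_matches("GAT", S, 1))
print(edit_match_ends("GAT", S, 0))
```

Both functions inspect every position of S, which is the cost a filter algorithm tries to avoid.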
Filter Algorithms
How?
BLAST
The q-gram Lemma and QUASAR
Filtration phase: the filter algorithm applies a filter criterion to S and P and reports potential matches
Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches
Filter Algorithms
BLAST (Altschul, Karlin, et al.):
Sequential scan of S locates all q-grams that match P
Iterative extension with cutoff to find good matches
Problems for high similarity:
 sequential scan is quite time-consuming
 single q-grams are unspecific
Filter Algorithms
Indexed filter: S is preprocessed into an index; the indexed filter algorithm evaluates the filter criterion for P on the index and reports potential matches
Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches
Pro: potentially faster evaluation of the filter criterion
Con: preprocessing time
     extra space required
     only good for some filter criteria
Filter Algorithms
QUASAR (Burkhardt, Rivals et al. 99):
Filter Criterion:
q-gram Lemma (Jokinen, Ukkonen 91)
Index Structure:
Lookup table (Jokinen, Ukkonen 91)
with suffix array (Manber, Myers 90)
Match Detection:
overlapping rectangles in DP-Matrix
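To illustrate what a q-gram index offers the filter, here is a small sketch that sorts all q-gram start positions of S by their q-gram (in effect a suffix array truncated to depth q) and answers lookups by binary search. This is a stand-in under stated assumptions, not QUASAR's actual data layout; `build_qgram_index` and `lookup` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


def build_qgram_index(text, q):
    """Sort the start positions of all q-grams of text by the q-gram itself."""
    positions = sorted(range(len(text) - q + 1), key=lambda i: text[i:i + q])
    keys = [text[i:i + q] for i in positions]  # q-grams in sorted order
    return positions, keys


def lookup(index, qgram):
    """Return the sorted start positions of qgram in the indexed text."""
    positions, keys = index
    lo = bisect_left(keys, qgram)
    hi = bisect_right(keys, qgram)
    return sorted(positions[lo:hi])


index = build_qgram_index("ACTGATAACGTTAGCCATGG", 2)
print(lookup(index, "AT"))
```

The index is built once per target S; each q-gram of P then costs two binary searches instead of a scan of S.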
Filter Algorithms
The q-gram Lemma
(Jokinen, Ukkonen, 1991)
For a pattern P, a substring y of S and
a value k, matches between P and y
with at most k errors share at least
t = |P| - q + 1 - (kq)
substrings of length q (q-grams).
Example: P = TCGATTAC, y = TCGAATAC (one mismatch)
|P| = 8, q = 3
q-grams of P: TCG, CGA, GAT, ATT, TTA, TAC
total # of q-grams: |P| - q + 1 = 6
Each error can 'destroy' up to q matching q-grams
=> k errors lose at most kq q-grams
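The example can be checked with a few lines of code. This is a minimal sketch with illustrative function names; shared q-grams are counted with multiplicity.

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def lemma_threshold(m, q, k):
    """t = |P| - q + 1 - kq from the q-gram lemma."""
    return m - q + 1 - k * q


def shared_qgrams(p, y, q):
    """Number of q-grams common to p and y, counted with multiplicity."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(y, q))
    return sum(common.values())


P, y, q, k = "TCGATTAC", "TCGAATAC", 3, 1
print(qgrams(P, q))
print(lemma_threshold(len(P), q, k), shared_qgrams(P, y, q))
```

For this pair the single mismatch destroys exactly q = 3 q-grams, so the number of shared q-grams equals the threshold t = 3.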
Filter Algorithms
Match Detection (Jokinen, Ukkonen 91) :
overlapping rectangles of width 2|P| in DP-Matrix
rectangle with at least t hits => potential match
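Figure: DP matrix of P against S with overlapping rectangles holding 3, 3, 2, 2 and 1 hits; with t=3 the two rectangles with 3 hits are potential matches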
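The rectangle scheme can be sketched as follows for the k-mismatches case: S is cut into blocks of width 2|P| that overlap by |P|, every q-gram hit is credited to the blocks that contain it, and blocks reaching t hits are verified exactly. The function names, the Hamming-only verification and the block bookkeeping are assumptions made for illustration, not the implementation behind the slides.

```python
from collections import Counter


def candidate_blocks(pattern, text, q, k):
    """Blocks (start, end) of width 2|P| in text that hold at least t q-gram hits."""
    m = len(pattern)
    t = m - q + 1 - k * q
    if t < 1:
        raise ValueError("threshold below 1: the q-gram lemma gives no filter")
    pattern_qgrams = Counter(pattern[i:i + q] for i in range(m - q + 1))
    hits = Counter()
    for j in range(len(text) - q + 1):
        count = pattern_qgrams.get(text[j:j + q], 0)
        if count:
            block = j // m            # block starting at block * m
            hits[block] += count
            if block > 0:             # the previous block overlaps this position
                hits[block - 1] += count
    return [(b * m, min(b * m + 2 * m, len(text)))
            for b in sorted(hits) if hits[b] >= t]


def verify(pattern, text, k, blocks):
    """Exact check of every window inside the candidate blocks (Hamming distance)."""
    m = len(pattern)
    found = set()
    for lo, hi in blocks:
        for start in range(lo, hi - m + 1):
            window = text[start:start + m]
            if sum(a != b for a, b in zip(pattern, window)) <= k:
                found.add(start)
    return sorted(found)


text = "GGGGGGGG" + "TCGAATAC" + "GGGGGGGG"
blocks = candidate_blocks("TCGATTAC", text, q=3, k=1)
print(blocks, verify("TCGATTAC", text, 1, blocks))
```

Only the blocks returned by the filter are passed to the exact check; the rest of S is never verified.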
QUASAR (Burkhardt, Rivals et al. 1999):
wider rectangles are efficient in practice (2048 for QUASAR)
Filter Algorithms
QUASAR (Burkhardt, Rivals et al. 1999) :
 BLAST for the verification of the potential matches
 wider Rectangles as Match Regions
 Index is a combination of Lookup Table and Suffix Array
 used for EST-Clustering at the DKFZ in Heidelberg
 searches for EST-Clustering about 30 times faster than BLAST
Gapped q-grams
 A new (old?) idea
 Hamming Distance
 Finding good shapes
A new (old ?) idea
Hamming Distance
Finding good shapes
Gapped q-grams
General idea:
 use gapped q-grams
 call the arrangement of gaps the shape
Example: gapped 3-shape ##.# (# = match, . = don't care)
TCGATTAC
TC.A
CG.T
GA.T
AT.A
TT.C
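Extracting gapped q-grams takes only a few lines. A minimal sketch, assuming a shape is written as a string of `#` (match) and `.` (don't care); the function name is illustrative.

```python
def gapped_qgrams(s, shape):
    """Apply shape at every position of s; don't-care positions are shown as '.'."""
    span = len(shape)
    grams = []
    for start in range(len(s) - span + 1):
        window = s[start:start + span]
        grams.append("".join(ch if mark == "#" else "."
                             for ch, mark in zip(window, shape)))
    return grams


print(gapped_qgrams("TCGATTAC", "##.#"))
```

With the contiguous shape `###` the same function returns the ordinary q-grams.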
Gapped q-grams
Previous work...
 Califano, Rigoutsos (1993)
 Pevzner, Waterman (1995)
 Lehtinen, Sutinen, Tarhio (1996)
 no exact threshold for the general case given
 limited attention paid to choice of shapes
Recently...
Buhler (2001) : Multiple Shapes
Ma, Tromp, Li (2002) : Pattern Hunter
 threshold t = 1
Gapped q-grams
The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors
classic 3-shape ###, k=3, worst case OOXOOXOOXOO (O = match, X = error):
q-grams: OOX, OXO, XOO, OOX, OXO, XOO, OOX, OXO, XOO
t=0: no filter!
gapped 3-shape ##.#, k=3, worst case OOOXXOOXOOO:
q-grams: OO.X, OO.X, OX.O, XX.O, XO.X, OO.O, OX.O, XO.O
t=1
 gapped shapes can have higher(!) thresholds t than ungapped shapes
 no simple formula for t
 we used a DP-based approach to compute t
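For small parameters the threshold can be checked by brute force, following the definition directly: try every placement of k mismatches and keep the smallest number of surviving q-grams. This sketch is not the DP-based approach mentioned above; it enumerates all placements and is only meant to reproduce the small examples. The function name is hypothetical.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Surviving q-grams of shape in a pattern of length m under the worst k mismatches."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = None
    for errors in combinations(range(m), k):
        bad = set(errors)
        # a q-gram survives if none of its match positions carries an error
        surviving = sum(all(s + o not in bad for o in offsets) for s in starts)
        best = surviving if best is None else min(best, surviving)
    return best


print(threshold("###", 11, 3), threshold("##.#", 11, 3))
```

For |P| = 11 and k = 3 this gives t = 0 for `###` and t = 1 for `##.#`.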
Gapped q-grams
Finding good shapes
Figure: trade-off plot. Axes: # of potential matches (verification time, low to high) against # of q-gram hits (filtration time, low to high) and q (low to high). Labels: good filters, trade-off line, bad filters
Figure (next build): the estimate # of q-gram hits ≈ |S| / |Σ|^q is added to the trade-off plot; the # of potential matches is still marked '?'
Gapped q-grams
Finding good shapes
For |P|=13, k=3 and q=3 the shapes ##.# and ### both have a threshold of t=2. However, the gapped shape returns fewer potential matches.
Reason:
##.#
 ##.#
-----  5
###
 ###
----   4
A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.
Gapped q-grams
Finding good shapes
CGACGATTGAT
   ##.#
    ##.#
   -----
ACTCGATTAGA
For t=2 and the shape ##.# the minimum coverage is 5
We define the minimum coverage cm as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S
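The definition of minimum coverage can be turned into a small exhaustive search: place t copies of the shape at distinct offsets and count the positions covered by at least one `#`. A minimal sketch; the function name and the bounded search window are assumptions made for illustration.

```python
from itertools import combinations


def minimum_coverage(shape, t):
    """Fewest characters covered by t copies of shape at distinct start positions."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    # Assumption: the first copy starts at 0 and an optimal arrangement keeps
    # every other start below (t - 1) * span.
    limit = max(1, (t - 1) * span)
    best = None
    for rest in combinations(range(1, limit), t - 1):
        covered = {start + o for start in (0,) + rest for o in offsets}
        best = len(covered) if best is None else min(best, len(covered))
    return best


print(minimum_coverage("##.#", 2), minimum_coverage("###", 2))
```

For t = 2 this gives 5 for `##.#` and 4 for `###`.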
Gapped q-grams
Finding good shapes
Figure: trade-off plot with both estimates: # of potential matches ≈ |S| / |Σ|^cm (high cm for good filters, low cm for bad filters) and # of q-gram hits ≈ |S| / |Σ|^q
Gapped q-grams
Finding good shapes
• compute t and minimum coverage for all shapes with
|P|=50 and k=3,4,5,6
Figure: number of shapes with given minimum coverage (y-axis, 0 to 600) against minimum coverage (x-axis, 8 to 22) for k = 5, q=8, grouped by threshold t=1 to t=5; the median, best and contiguous shapes are marked
Experimental Analysis
 Speed and Filtration Efficiency
 The Heuristic Zone
A few different Filters
Speed and Filtration Efficiency
The Heuristic Zone
Experimental Analysis
Figure: number of matches and number of hits (both on log scales) against q = 6 to 12, with the corresponding minimum coverage, for gapped (Hamming) and contiguous shapes; k=5, |P| = 50, |S| = 50Mbps
Describing Filter Properties
From Hits to Matches
Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|)
Figure (with zone boundaries): on the errors axis from 0 to |P|, the guarantee zone extends to k errors and the negative zone begins at |P|-mc errors (mc: minimum coverage); the heuristic zone lies in between
Experimental Analysis
Problem:
Behaviour in the Heuristic Zone is hard to predict
Experimental Analysis
A simple idea:
Sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the Heuristic Zone
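The three sampling steps can be sketched as follows for a contiguous q-gram filter under Hamming distance. The random DNA patterns, the substitution-only error model, the fixed seed and all function names are assumptions made for illustration; the experiments on the slides may have been set up differently.

```python
import random


def mutate(pattern, errors, rng, alphabet="ACGT"):
    """Copy of pattern with exactly `errors` positions substituted."""
    chars = list(pattern)
    for pos in rng.sample(range(len(chars)), errors):
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


def qgram_filter(pattern, sample, q, t):
    """Accept if at least t q-grams agree at the same position (Hamming setting)."""
    hits = sum(pattern[i:i + q] == sample[i:i + q]
               for i in range(len(pattern) - q + 1))
    return hits >= t


def recognition_rate(errors, m=50, q=11, k=3, samples=1000, seed=0):
    """Percentage of samples with `errors` mismatches that the filter recognizes."""
    rng = random.Random(seed)
    t = m - q + 1 - k * q          # threshold from the q-gram lemma
    recognized = 0
    for _ in range(samples):
        pattern = "".join(rng.choice("ACGT") for _ in range(m))
        recognized += qgram_filter(pattern, mutate(pattern, errors, rng), q, t)
    return 100.0 * recognized / samples


for i in (0, 3, 6, 9, 12):
    print(i, recognition_rate(i))
```

Up to k errors the rate is 100% by the q-gram lemma; beyond that the printed values trace the heuristic zone of this particular filter.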
Experimental Analysis
|P| = 50
1000 samples for each error level
Figure: recognition rate (0% to 100%) against the number of errors (0 to 30) for
 contiguous: k=3, q=11; k=4, q=9
 gapped, edit: k=3, q=11; k=4, q=11; k=5, q=10
 BLAST: k=3, q=11; k=4, q=10
Figure (detail): the same curves for recognition rates from 50% to 100% and 0 to 15 errors
Conclusion - Future Work
Our Work:
 Significant sensitivity improvement over existing filters
 Required modifications easy to implement
 Methods for describing filter properties
Future Work:
 Combination of 'orthogonal' shapes into one filter
 Use of word neighborhoods
 Database of filter properties for good shapes