Approximate Substring Matching
over Uncertain Strings
Tingjian Ge Zheng Li
University of Kentucky
Motivation
• As a consequence of the burgeoning growth of data and the
cost/technology constraints against producing completely clean data,
there is much uncertainty in the data itself.
Text retrieval
Computational biology
Signal processing
Motivation
• The amount of text data is increasing at an unprecedented rate. Managing
the sheer amount of (often noisy) text data has become more challenging
than ever.
• Approximate substring matching has many applications.
– The deterministic case is well studied.
– But approximate pattern matching over uncertain texts remains
largely unexplored.
Example : Pattern Matching in DNA Sequence
• Suppose a certain DNA pattern, say AAATTT, and its variations (with small
edit distance) are known to be the cause of a low or nonfunctioning
protein. Thus we want to do approximate pattern matching in DNA
sequences.
• Challenges are :
– A single DNA sequence can be a few million to a few hundred million
characters.
– It has uncertainty due to a number of factors in the high-throughput
sequencing technologies.
Example : Holter Monitor Application
• For each heartbeat, the annotation software gives a symbol such as N
(Normal beat), L (Left bundle branch block beat), etc. Quite often, the ECG
signal of each beat may have ambiguity.
• A doctor might be interested in locating a pattern such as “NNAV”, in order
to verify a specific diagnosis.
Outline
• Motivation
• Preliminaries
• A new semantics
• Multilevel filtering index
• Two verification algorithms
• Experiments
Q-gram Index
• The most frequently used indexing method for approximate matching is
the q-gram index.
• To use the q-gram index, partition the pattern into k+1 pieces, where k is the
edit distance threshold. Each error can affect at most one piece, so with no
more than k errors at least one piece must have an exact match in the text.
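The partitioning step can be sketched as follows. This is a minimal illustration, not the authors' index: the helper names `split_pattern` and `candidate_positions` are hypothetical, pieces are cut to near-equal length, and a plain `str.find` scan stands in for a lookup in a q-gram position list.

```python
def split_pattern(pattern, k):
    """Split pattern into k + 1 contiguous, non-empty pieces of near-equal length.

    Returns a list of (offset_in_pattern, piece) pairs.
    """
    pieces = k + 1
    if len(pattern) < pieces:
        raise ValueError("pattern is too short for k + 1 non-empty pieces")
    base, extra = divmod(len(pattern), pieces)
    result, start = [], 0
    for i in range(pieces):
        length = base + (1 if i < extra else 0)
        result.append((start, pattern[start:start + length]))
        start += length
    return result


def candidate_positions(text, pattern, k):
    """Return sorted text offsets where an alignment of pattern is worth verifying.

    A candidate is produced wherever some piece occurs exactly in the text;
    the offset is the piece hit shifted back by the piece's offset in the pattern.
    """
    candidates = set()
    for offset, piece in split_pattern(pattern, k):
        hit = text.find(piece)
        while hit != -1:
            candidates.add(max(0, hit - offset))
            hit = text.find(piece, hit + 1)  # allow overlapping occurrences
    return sorted(candidates)


print(candidate_positions("GGAAACTTTGG", "AAATTT", 1))
```

Because insertions and deletions can shift the true start by up to k characters, each candidate only marks a neighborhood that a verification step still has to check.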
[Figure: the pattern string is split into k+1 pieces; each q-gram (Q-gram1, Q-gram2, …) points to its position list (pos1, pos2, …).]
Verification Algorithm
• A DP algorithm that computes edit distance
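\[ d[i, j] = \min\{\, d[i, j-1] + 1,\; d[i-1, j] + 1,\; d[i-1, j-1] + c(p[i], x[j]) \,\} \]
\[ c(p[i], x[j]) = \begin{cases} 0, & \text{if } p[i] = x[j] \\ 1, & \text{if } p[i] \neq x[j] \end{cases} \]
[Figure: DP table for the pattern “has” (index i) against the text “This”, with the insertion (Ins.) and substitution (Sub.) steps marked.]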
The edit distance between “has” and
“This” is 2 in this example.
The value in each cell is the edit distance
between the prefixes of the pattern and of the
text read so far.
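A minimal sketch of this DP in Python, keeping only two rows of the table at a time. The function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(p, x):
    """Edit distance between pattern p and text x (unit-cost insert, delete, substitute)."""
    prev = list(range(len(x) + 1))           # row for the empty pattern prefix
    for i in range(1, len(p) + 1):
        curr = [i] + [0] * len(x)            # first column: i deletions
        for j in range(1, len(x) + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1,    # d[i, j-1] + 1
                          prev[j] + 1,        # d[i-1, j] + 1
                          prev[j - 1] + cost) # d[i-1, j-1] + c(p[i], x[j])
        prev = curr
    return prev[len(x)]


print(edit_distance("has", "This"))
```

For “has” and “This” one optimal script inserts “T” and substitutes “a” with “i”, which gives the distance of 2 stated on the slide.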
(k, τ)-Matching Query
• We propose the (k, τ)-matching query, which is based on a pattern string p, a
set of uncertain text strings {Xi} (1 ≤ i ≤ r), and threshold parameters k, τ,
and asks for all substrings X of the Xi’s such that Pr[d(p, X) ≤ k] > τ.
• Another semantics is the EED (expected edit distance), which computes
the expected edit distance between p and X and checks whether it is < k’.
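To make the two semantics concrete, the sketch below evaluates both by brute force over all possible worlds. It is only an illustration under stated assumptions: each position of the uncertain string is an independent distribution given as a `{character: probability}` dict, the function names are hypothetical, and enumerating every world is exponential in the number of uncertain positions, so this is not a practical algorithm.

```python
from itertools import product


def edit_distance(p, x):
    """Unit-cost edit distance between p and x."""
    prev = list(range(len(x) + 1))
    for i in range(1, len(p) + 1):
        curr = [i] + [0] * len(x)
        for j in range(1, len(x) + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[len(x)]


def possible_worlds(uncertain):
    """Yield (string, probability) for every possible world of an uncertain string."""
    for combo in product(*(position.items() for position in uncertain)):
        prob = 1.0
        for _, q in combo:
            prob *= q
        yield "".join(ch for ch, _ in combo), prob


def match_probability(p, uncertain, k):
    """Pr[d(p, X) <= k], summed over all possible worlds."""
    return sum(prob for s, prob in possible_worlds(uncertain)
               if edit_distance(p, s) <= k)


def expected_edit_distance(p, uncertain):
    """EED: probability-weighted average of d(p, X) over all possible worlds."""
    return sum(prob * edit_distance(p, s) for s, prob in possible_worlds(uncertain))


def is_k_tau_match(p, uncertain, k, tau):
    """(k, tau)-matching test for one substring: Pr[d(p, X) <= k] > tau."""
    return match_probability(p, uncertain, k) > tau


X = [{"A": 1.0}, {"T": 0.5, "G": 0.5}]   # hypothetical two-character uncertain string
print(match_probability("AT", X, 0), expected_edit_distance("AT", X))
```

The (k, τ) test thresholds each world first and then aggregates probabilities, whereas EED aggregates distances first and thresholds afterwards.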
Semantics
• (k, τ)-matching query vs. EED
– EED first summarizes possible worlds and then applies the threshold,
hence many algorithms developed for the deterministic case are
inapplicable.
– More importantly, EED may either miss real matches or need an
unduly big threshold, so that many false positives may mix in.
[Figure: number of matches (10^0 to 10^4) vs. text string size (×10^6) on DNA data, for (k, τ)-matching and for EED.]
Example : Semantics
Consider this approximate pattern matching task on a DNA sequence:
Pattern p, length = 20.
[Figure: the characters of p, X1, and X2 laid out position by position, with the alternatives of each uncertain character stacked.]
Substring X1 has an exact match, but 9 characters are uncertain. EED(p, X1) = 6.
Substring X2 has no exact match, with 5 errors and 2 uncertain characters. EED(p, X2) = 6.
To find X1 with EED, the threshold should be at least 6, which means we cannot avoid X2.
But with the (k, τ)-matching query, we can use a (2, 0)-matching query to select X1 and avoid
X2 at the same time.
Left/Right Signature of a Q-gram
[Figure: left signature and right signature of a q-gram located at positions pos to pos+q of text string x.]
The hash function maps a character to a two-bit value (e.g., G → 01).
• The edit distance of signatures is no more than the edit distance of
original strings if we treat each two-bit value in the signature as a
character when computing the edit distance.
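The lower-bound property can be checked with a small sketch. The hash below (`ord(ch) % 4`) is an assumption made for illustration, not the hash used on the slide. The property holds for any fixed character-to-value mapping, because applying the mapping to both strings turns every edit script into a script of equal or smaller cost.

```python
def hash2(ch):
    """Hypothetical hash: map a character to a two-bit value in 0..3."""
    return ord(ch) % 4


def signature(s):
    """Signature of s: the sequence of two-bit values of its characters."""
    return tuple(hash2(ch) for ch in s)


def edit_distance(a, b):
    """Unit-cost edit distance between two sequences (strings or tuples)."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[len(b)]


for a, b in [("DCAB", "DCCB"), ("NROB", "RODO"), ("BEST", "BOAT")]:
    # Signature distance never exceeds the distance of the original strings.
    print(a, b, edit_distance(signature(a), signature(b)), edit_distance(a, b))
```

This is why a signature distance above the threshold is enough to discard a candidate, while a signature distance within the threshold decides nothing on its own.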
Multilevel Filtering Index
[Figure: a multilevel filtering index for a q-gram, based on measuring signature distance. Directory entries carry 4-byte tags and lead either to a short position list or to a next-level directory.]
The 4-byte tag in the figure contains the left/right signature of the q-gram.
Best Matching Prefix
DP table for p = “GAP” (rows) against x = “GGAPP” (columns):
            G  G  A  P  P
         0  1  2  3  4  5
    G    1  0  1  2  3  4
    A    2  1  1  1  2  3
    P    3  2  2  2  1  2
“1” is the smallest in the last row. Hence, 1 is x’s prefix distance from p
and GGAP is the best matching prefix of x for p.
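The prefix distance can be read off the last row of the ordinary edit-distance table, as in this sketch. The function name `best_matching_prefix` is hypothetical, and ties are broken in favor of the shortest prefix, which is an assumption.

```python
def best_matching_prefix(p, x):
    """Return (prefix distance, best matching prefix of x for p).

    The prefix distance is the smallest value in the last DP row, i.e. the
    minimum edit distance between the whole of p and any prefix of x.
    """
    prev = list(range(len(x) + 1))
    for i in range(1, len(p) + 1):
        curr = [i] + [0] * len(x)
        for j in range(1, len(x) + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost)
        prev = curr
    best_len = min(range(len(x) + 1), key=lambda j: prev[j])  # first minimum wins
    return prev[best_len], x[:best_len]


print(best_matching_prefix("GAP", "GGAPP"))
```

With p = “GAP” and x = “GGAPP” the last row is 3 2 2 2 1 2, so the sketch reports distance 1 and the prefix “GGAP”, matching the slide.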
Dynamic Programming Verification Using Signatures
• The values on each diagonal of the DP table form a non-decreasing
sequence, as shown in the figure.
• Let d_l (d_r) be p_l’s (p_r’s) prefix distance from x_l (x_r). The
verification requires:
d_l + d_r ≤ k
• We need to do this mapping to accommodate the fixed length of
signatures in the index tag.
[Figure: DP table over the text signature x’[1..m’] and the pattern signature p’[1..m’], marking p’[l−k], p’[l], p’[l+k] against x’[l], and x’[m’−k] to x’[m’+k].]
Dynamic Programming Verification
Using Signatures
• When the pattern is exhausted, we do the mapping as follows.
[Figure: the last pattern row p’[m’] is mapped onto text positions x’[m’−k] to x’[m’+k] around x’[l].]
Example : Using the Index
• Input:
X: … … [DC][CB][BEST][RO][DO]… …
P: [DC][AB][BEST][NR][OB]
k = 2
• Q-gram “BEST” matches. We then use this q-gram’s multilevel
signature index for filtering.
[Figure: two DP tables over pattern and text, DPl for the left side of the q-gram and DPr for the right side.]
Example : Using the Index
• Input:
X: … … [DC][CB][BEST][RO][DO]… …
P: [DC][AB][BEST][NR][OB]
k = 2
• Use the first-level signatures in the index tag:
d_l + d_r = 1 + 1 = 2 ≤ k
• Need to expand to the next level of the index.
[Figure: DPl (text B’ C’ against pattern B’ A’ C’ D’) and DPr (text R’ O’ against pattern N’ R’ O’ B’), computed over first-level signatures.]
Example : Using the Index
• Input:
X: … … [DC][CB][BEST][RO][DO]… …
P: [DC][AB][BEST][NR][OB]
k = 2
• Use the second-level signatures in the index tag:
d_l + d_r = 1 + 2 > k
• This candidate position is filtered out.
[Figure: DPl (text B C C D against pattern B A C D) and DPr (text R O D O against pattern N R O B). The pattern is exhausted in this case.]
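The filtering decision in this example can be sketched as below, using the left and right strings in the order the DP tables list them (pattern B A C D against text B C C D on the left, pattern N R O B against text R O D O on the right). The function names are hypothetical, and the sketch works on raw characters rather than on hashed tags read from an index.

```python
def prefix_distance(p, x):
    """Minimum edit distance between the whole of p and any prefix of x."""
    prev = list(range(len(x) + 1))
    for i in range(1, len(p) + 1):
        curr = [i] + [0] * len(x)
        for j in range(1, len(x) + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost)
        prev = curr
    return min(prev)  # smallest value in the last row


def passes_filter(p_left, x_left, p_right, x_right, k):
    """Keep the candidate only if d_l + d_r <= k."""
    d_l = prefix_distance(p_left, x_left)
    d_r = prefix_distance(p_right, x_right)
    return d_l + d_r <= k


print(prefix_distance("BACD", "BCCD"), prefix_distance("NROB", "RODO"))
print(passes_filter("BACD", "BCCD", "NROB", "RODO", 2))
```

Here d_l = 1 and d_r = 2, so the sum exceeds k = 2 and the candidate is dropped without running a full verification.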
Outline
• Motivation
• Preliminaries
• Semantics
• Multilevel filtering index
• Two verification algorithms
– Bounds based on CDF
– Bounds based on local perturbation
• Experiments
Verification Algorithms
• The goal of verification is to conclude whether a candidate
position selected by an index is a true match.
• We present two algorithms, each of which gives an upper and
a lower bound of the probability that d(p, x) ≤ k.
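One way such bounds can settle a candidate, given the query condition Pr[d(p, X) ≤ k] > τ, is sketched below. The function name `decide` and the three-way outcome are assumptions made for illustration; what happens to an undecided candidate (for example, trying the other bounding algorithm) is not specified here.

```python
def decide(lower, upper, tau):
    """Classify a candidate from bounds lower <= Pr[d(p, X) <= k] <= upper."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if lower > tau:
        return "match"        # even the lower bound clears the threshold
    if upper <= tau:
        return "non-match"    # even the upper bound fails to exceed the threshold
    return "undecided"        # the bounds straddle tau


print(decide(0.6, 0.9, 0.5), decide(0.1, 0.3, 0.5), decide(0.4, 0.7, 0.5))
```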
Bounds Based on Cumulative
Distribution Functions (CDF)
• The basic verification consists of two symmetric runs of a DP
algorithm. We describe how we change such a DP algorithm
to accommodate uncertain characters.
• Our key idea is to compute (at most) k+1 pairs of values in
each cell, i.e., {(F_l[j], F_u[j]) | 0 ≤ j ≤ k}, where F_l[j] ≤ Pr[D ≤ j] ≤ F_u[j]
and D denotes the edit distance.
A Basic Step
Consider a basic step: how do we get D from its 3 neighbors D1, D2, D3?
p1 = Pr[C = c] // probability of a match at cell D
p2 = 1 − p1
\[ F_l[j] = p_1 F_l^{(1)}[j] + p_2 F_l^{(\arg\min_i D_i)}[j-1] \]
(Pick one fixed neighbor cell to use.)
\[ F_u[j] = p_1 F_u^{(1)}[j] + p_2 \min\Big( \sum_{i=1}^{3} F_u^{(i)}[j-1],\; 1 \Big) \]
(Use the union of upper bounds from the three neighbors.)
argmin_i D_i returns the index value i that minimizes D_i; the minimization is
defined as the D_i (1 ≤ i ≤ 3) that has the greatest F_l^{(i)}[0] (i.e., the one that
has a small distance value with the highest probability).
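A minimal sketch of this basic step, under assumptions that go beyond the slide: D1 is taken to be the diagonal neighbor (the one D equals on a match), each neighbor is passed as a pair of lists `(F_l, F_u)` of length k+1, and both bounds at index −1 are treated as 0. The names `basic_step` and `point_mass` are hypothetical.

```python
def point_mass(d, k):
    """(F_l, F_u) of a cell whose distance is known to be exactly d."""
    cdf = [1.0 if j >= d else 0.0 for j in range(k + 1)]
    return cdf, list(cdf)


def basic_step(neighbors, p1, k):
    """Combine the bounds of D1, D2, D3 into the bounds of cell D.

    neighbors: [(Fl1, Fu1), (Fl2, Fu2), (Fl3, Fu3)], with D1 assumed diagonal.
    p1: probability that the characters at this cell match.
    """
    p2 = 1.0 - p1
    fl1, fu1 = neighbors[0]
    # Fixed neighbor for the lower bound: the one with the greatest F_l[0].
    pick = max(range(3), key=lambda i: neighbors[i][0][0])
    fl_pick = neighbors[pick][0]
    fl, fu = [], []
    for j in range(k + 1):
        lower_prev = fl_pick[j - 1] if j > 0 else 0.0
        # Union bound over the three neighbors, capped at 1.
        union_prev = min(sum(n[1][j - 1] for n in neighbors), 1.0) if j > 0 else 0.0
        fl.append(p1 * fl1[j] + p2 * lower_prev)
        fu.append(p1 * fu1[j] + p2 * union_prev)
    return fl, fu


k = 2
cells = [point_mass(0, k), point_mass(1, k), point_mass(1, k)]
print(basic_step(cells, 0.4, k))
```

With certain neighbors D1 = 0, D2 = D3 = 1 and a match probability of 0.4, both bounds coincide: D is 0 with probability 0.4 and at most 1 otherwise.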
Example : Bounds Based on CDF
• Consider p = “CAT” and X is “C” followed by four characters, each of which
has the same distribution G.1A.4T.5, denoting that it is G (A, T) with
probability 0.1 (0.4, 0.5). K = 2.
• Take the cell at the 3rd row and the 4th column as an example. How do we
compute Fl[j] & Fu[j]?
In this cell, argmin_i D_i is D3. The upper bound is a mixture of the upper
bound in D1 and the union of those in all three neighbours.
[Figure: DP table for p = “CAT” against x (C followed by G.1A.4T.5 characters); each cell lists its (F_l[j], F_u[j]) pairs.]
Bounds Based on Local Perturbation
• Adjacent and remote possible worlds: Given a (k, τ)-matching query on a
pattern p and an uncertain text X, we say that a possible world (p.w.) w of X, denoted
x(w), is adjacent to p if d(p, x(w)) ≤ k. We say that it is remote to p
if d(p, x(w)) > k.
Perturbation
[Figure: an initial adjacent/remote p.w. over uncertain characters such as G/A/T and A/T, followed by more adjacent/remote p.w.s obtained after perturbation.]
How to Get an Initial Adjacent/Remote Possible World
k = 1
• Get a closest p.w. (an adjacent p.w. with the smallest distance) on the
optimal path in the DP table.
DP table for p = “CAT” (rows) against x (columns):
            C  G.1A.4T.5  G.1A.4T.5  G.1A.4T.5
         0  1  2          3          4
    C    1  0  1          2          3
    A    2  1  0          1          2
    T    3  2  1          0          1
• Use a randomized algorithm to get a remote p.w.
• If D1 and D contain the same distance value, the corresponding variable
in the text string is called a crucial variable of this p.w.
[Figure: cell D with its three neighbors D1, D2, D3.]
Perturbation – How?
Let δ be the difference between k and the
edit distance between this p.w. and the pattern.
Suppose there are c crucial variables in
this closest/remote p.w.
[Figure: text string x, where each uncertain character takes T, A, or G with probability .5, .4, or .1.]
For a closest p.w.:
Pr(at most δ crucial variables (of the total
c crucial variables) change their values from
those on the optimal path) is a lower bound
of Pr[d(p, X) ≤ k].
For a remote p.w.:
We could get an upper bound similarly.
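The probability named in the lower bound can be computed with a short DP over the crucial variables. This is a sketch under the assumption that the crucial variables are independent; `keep_probs[i]` is the probability that variable i keeps the value it has in the chosen possible world, and the function name is hypothetical.

```python
def prob_at_most_changes(keep_probs, delta):
    """Pr[at most delta of the independent variables change their value]."""
    dist = [1.0]  # dist[m] = Pr[exactly m of the variables seen so far changed]
    for q in keep_probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, pr in enumerate(dist):
            nxt[m] += pr * q              # variable keeps its value
            nxt[m + 1] += pr * (1.0 - q)  # variable changes
        dist = nxt
    return sum(dist[:max(delta, 0) + 1]) if delta >= 0 else 0.0


# Two crucial variables, each keeping its value with probability 0.5.
print(prob_at_most_changes([0.5, 0.5], 1))
```

For two variables that each keep their value with probability 0.5, at most one change happens with probability 0.75.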
Setup of Experiment
• We examine the behavior of the (k, τ)-pattern matching query using signature
filtering and verification bounds with the following datasets.
• The DNA dataset.
– Raw datasets of sequencing runs of Escherichia coli 536 from the NCBI SRA
(Sequence Read Archive) database.
– Use Bowtie to align the short DNA sequences with the complete
Escherichia coli genome reference. The mapping reports output by Bowtie
show positions within the DNA that have more than one possible value.
• The protein dataset.
• Synthetic datasets.
– We generate a few synthetic datasets based on the two real datasets
above.
– Vary the parameter values of data (such as the uncertainty ratio θ) or the
size of the data.
Experiments
[Figure: running time for various settings. Execution time (seconds) vs. text string size (×10^6); series: sig.; bounds / no sig.; bounds / no sig.; bounds 2 / sig.; no bounds / no sig.; no bounds.]
[Figure: varying θ. Execution time (seconds) for θ from 0.1 to 0.5.]
[Figure: using larger synthetic data. Execution time (seconds) vs. text string size (in GB), with signatures and without signatures.]
[Figure: breakdown of I/O and CPU costs. Execution time (seconds), split into I/O time and CPU time, for 2GB(i), 2GB(ii), 4GB(i), 4GB(ii), where (i) is with signatures and (ii) is without signatures.]
Experiments
[Figure: # of positions to be verified vs. text string size (×10^6); series: no sig. / sig. / no sig., after bounds / sig., after bounds.]
[Figure: varying |p| (DNA). Execution time (seconds) for |p| from 10 to 50.]
[Figure: varying |p| (protein). Execution time (seconds) for |p| from 10 to 50.]
[Figure: varying threshold k. Execution time (seconds) for k from 0 to 4.]
Legend: signatures; bounds / no signatures; bounds / signatures; no bounds / no signatures; no bounds.
Conclusions and Future Work
• We study a real and unsolved problem of approximate substring matching
over uncertain texts.
– Proposed a novel semantics and demonstrated its advantages over an
alternative one introduced by previous work.
– Developed a q-gram based index to handle uncertain texts.
– Proposed two efficient verification algorithms.
• As future work, we plan to study the matching problem under correlated
uncertainty to address a wider range of applications.
Thank You!
• Questions?