Indexing Text with
Approximate q-grams
Adriano Galati & Marjolijn Elsinga
Overview
• Approximate string matching
  - Neighborhood generation
  - Reduction to Exact Searching
  - Intermediate Partitioning
• Indexing text using q-grams
• Filtration condition
• Finding approximate q-grams
  - Trie data structure
  - Non-deterministic automaton (NFA)
• Parameters
Approximate string matching
Text $T_{1..n}$, pattern $P_{1..m}$
Goal: retrieve all occurrences of P in T whose edit distance is at most k
Edit distance $ed(A, B)$: the minimum number of character insertions, deletions and substitutions needed to turn A into B
Solutions
This is one of the most investigated problems in computer science, with all kinds of solutions
In on-line versions of the problem the pattern can be preprocessed, but the text cannot
Classical solution: dynamic programming over a matrix, in O(mn) time
Classical solution
Fill a matrix $C_{0..m,0..n}$ where $C_{i,j}$ is the minimum edit distance between $P_{1..i}$ and a suffix of $T_{1..j}$
Initialize the borders with $C_{i,0} = i$ and $C_{0,j} = 0$
Fill internal cells with $C_{i,j} = C_{i-1,j-1}$ if $P_i = T_j$, else $C_{i,j} = 1 + \min(C_{i-1,j}, C_{i-1,j-1}, C_{i,j-1})$
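A minimal Python sketch of this recurrence, keeping only one column of the matrix at a time (function and variable names are ours, not the paper's):

```python
def approx_search(P, T, k):
    """Report end positions j in T where some substring ending at j
    matches P with edit distance <= k. O(mn) time, O(m) space."""
    m = len(P)
    C = list(range(m + 1))      # column 0: C[i][0] = i
    occ = []
    for j, t in enumerate(T, 1):
        prev_diag = C[0]        # C[i-1][j-1]
        C[0] = 0                # C[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            left = C[i]         # C[i][j-1]
            if P[i - 1] == t:
                C[i] = prev_diag
            else:
                C[i] = 1 + min(C[i - 1], prev_diag, left)
            prev_diag = left
        if C[m] <= k:
            occ.append(j)
    return occ

print(approx_search("surgery", "survey of surgeons", 2))  # -> [6, 15, 16, 17]
```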
Solution (2)
If the text is large, on-line algorithms are not practical and preprocessing becomes necessary
Focus: sequence-retrieving indexes, with no restrictions on the patterns and the occurrences
Approaches:
• Neighborhood Generation
• Reduction to Exact Searching
• Intermediate Partitioning
Neighborhood Generation
The set of strings matching a pattern P with at most k errors, $U_k(P)$, is finite
Therefore it can be enumerated
Each string in $U_k(P)$ can be searched using a data structure
This structure is designed for exact matching
Neighborhood Generation (2)
+ O(n) space and construction time
- Not optimized for secondary memory
- Inefficient in space requirements
Promising for searching short patterns only
Reduction to Exact Searching
Indexes based on filters
A filter checks a condition simpler than the matching condition, discarding large parts of the text
Main principle: if two strings A and B match with k errors, and k+s non-overlapping samples are extracted from A, then at least s of these must appear without errors in B
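A small Python sketch of this filter with s = 1, i.e. k+1 pieces of which at least one must appear verbatim; the splitting and window arithmetic are our simplifications:

```python
def exact_filter_candidates(P, T, k):
    """Split P into k+1 non-overlapping pieces; any occurrence of P with
    <= k errors contains at least one piece unchanged (pigeonhole).
    Returns candidate windows (start, end) of T still to be verified."""
    parts = k + 1
    L = max(1, len(P) // parts)
    windows = []
    for i in range(parts):
        piece = P[i * L:(i + 1) * L] if i < parts - 1 else P[(parts - 1) * L:]
        pos = T.find(piece)
        while pos != -1:
            start = max(0, pos - i * L - k)   # earliest possible match start
            windows.append((start, min(len(T), start + len(P) + 2 * k)))
            pos = T.find(piece, pos + 1)
    return windows
```

Each window would then be verified, for instance with the dynamic programming sketch above.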
Reduction to Exact Searching (2)
+ Can be built in linear time and needs O(n) space
+ With some methods it is possible to make an index that takes less space than the text itself
- Based on suffix trees or on indexing all the q-grams
Intermediate Partitioning
Reduces the search of the pattern to approximate (instead of exact) search of its pieces
Main principle: if two strings A and B match with at most k errors, and j disjoint substrings are taken from A, then at least one of these appears in B with at most $\lfloor k/j \rfloor$ errors
Split the pattern in j pieces, search each piece in the index allowing $\lfloor k/j \rfloor$ errors, and extend the approximate matches to complete occurrences (see the sketch below)
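A hedged Python sketch of the whole scheme; `approx_find(piece, text, e)` stands for any piece-level approximate search returning end positions (for instance the DP sketch shown earlier), and the verification window is our simplification:

```python
def partition_search(P, T, k, j, approx_find):
    """Pigeonhole: an occurrence of P with <= k errors contains some piece
    with <= k // j errors. Search pieces, then verify whole windows."""
    e = k // j
    L = max(1, len(P) // j)
    occ = set()
    for i in range(j):
        piece = P[i * L:(i + 1) * L] if i < j - 1 else P[(j - 1) * L:]
        for end in approx_find(piece, T, e):
            lo = max(0, end - len(piece) - i * L - k)   # window around the hit
            hi = min(len(T), lo + len(P) + 2 * k)
            occ.update(lo + o for o in approx_find(P, T[lo:hi], k))
    return sorted(occ)

# e.g.: partition_search("surgery", text, k=2, j=2, approx_find=approx_search)
```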
Question (Ingmar)
I think the main principle is incorrect. Take
AAABBBBBB
BBBBBBBBB
These match with k=3 errors. If we take the disjoint substrings AAA BBB BBB, then j=3. Now they say that one of these will appear in the other with $\lfloor 3/3 \rfloor = 1$ errors. However, AAA matches with 3 errors, BBB with 0 and BBB with 0
Answer
The pattern is split in j pieces, and each piece is searched in the index allowing $\lfloor k/j \rfloor$ errors
AAA BBB BBB
BBB BBB BBB
We match BBB with ABB and not with AAA and AAB, because it is not possible to match them with more than $\lfloor k/j \rfloor$ errors (with k=3 and j=3), unless we change the parameters
Intermediate Partitioning (2)
+ An optimizing point between neighborhood generation (worse with longer pieces) and reduction to exact searching (worse with shorter pieces)
Has been used on the patterns, but not yet on the text itself
Indexing text using q-grams
Steps:
• Filtering the text
• Finding approximate q-grams
Advantages:
• Takes little space
• Offers an alternative tradeoff
• The user can decide what is important: saving space or better performance
Filtration condition
Based on locating approximate matches of pattern q-grams in the text
Leads to a filtration condition tolerating higher error levels than exact q-gram matching
Condition for an approximate match
Two strings A and B with $ed(A, B) \le k$
Write $A = A_1 x_1 A_2 x_2 \ldots x_{j-1} A_j$
Then at least one string $A_i$ appears in B with at most $\lfloor k/j \rfloor$ errors
Only the q-grams for which this holds will be used for searching
Example: Condition
A: CCTC TCTC CCCT
B: CCCC CTCT TCTC
We see: k=8
We take: j=3
Now $e = \lfloor 8/3 \rfloor = 2$, so at least one $A_i$ appears in B with at most 2 errors (checked in the sketch below)
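A self-contained Python check of this example; `best_match_distance` (our name) computes what the paper later calls bed, the minimum edit distance between a piece and any substring of B:

```python
def best_match_distance(piece, B):
    """Minimum edit distance between `piece` and any substring of B."""
    m = len(piece)
    C = list(range(m + 1))
    best = m
    for ch in B:
        prev_diag, C[0] = C[0], 0
        for i in range(1, m + 1):
            left = C[i]
            C[i] = prev_diag if piece[i - 1] == ch else 1 + min(C[i - 1], prev_diag, left)
            prev_diag = left
        best = min(best, C[m])
    return best

A_pieces = ["CCTC", "TCTC", "CCCT"]
B = "CCCCCTCTTCTC"
print([best_match_distance(p, B) for p in A_pieces])  # at least one value <= 2
```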
Question (Peter)
“Note that it is possible that $j \cdot \lfloor k/j \rfloor < k$, so we are not only ‘distributing’ errors across pieces, but also ‘removing’ some of them”
How does this work?
Answer
With k=5 and j=3 we get $e = \lfloor 5/3 \rfloor = 1$, so the pieces together account for only $j \cdot e = 3 < k = 5$ errors
Pattern split: $A = A_1\, x_1\, A_2\, x_2\, A_3$
Q-grams vs. Q-samples
Q-grams overlap
Q-samples do not overlap
String: ABCDEF
Q-grams: {ABC, BCD, CDE, DEF}
Q-samples: {ABC, DEF}
In a q-gram index, all the text q-grams are stored with their positions in increasing order
In a q-sample index, only some text q-grams are stored (see the sketch below)
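In Python, the two notions differ only in the step of the extraction loop (h = q = 3 reproduces the slide's example):

```python
def q_grams(s, q):
    """All overlapping q-grams."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def q_samples(s, q, h):
    """Non-overlapping samples: one q-gram every h characters (h >= q)."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, h)]

print(q_grams("ABCDEF", 3))       # ['ABC', 'BCD', 'CDE', 'DEF']
print(q_samples("ABCDEF", 3, 3))  # ['ABC', 'DEF']
```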
Constructing q-samples
We need to extract j pieces from each potential pattern occurrence in the text
So: take a q-sample every h text characters
We need to guarantee that j q-samples lie inside any occurrence of P
The minimal length of an occurrence of P is m-k, which gives
$h = \left\lfloor \frac{m-k-q+1}{j} \right\rfloor$
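As a trivial Python helper (our naming):

```python
def sample_interval(m, k, q, j):
    """Largest h that still guarantees j complete q-samples inside any
    approximate occurrence of P (shortest occurrence has length m - k)."""
    return (m - k - q + 1) // j

print(sample_interval(m=20, k=2, q=4, j=3))  # -> 5
```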
Question (Jacob)
Could you please explain how the restriction on h is derived?
Answer
The number of q-samples inside a text area of length $\ell$ is at least $\lfloor (\ell - q + 1)/h \rfloor$
The shortest occurrence of P has length $\ell = m - k$
Requiring j q-samples inside any occurrence gives $\left\lfloor \frac{m-k-q+1}{h} \right\rfloor \ge j$
Hence $h \le \frac{m-k-q+1}{j}$
Next step
The best match distance (bed) is calculated for each test sequence of q-samples
This is the distance between the q-sample sequence and the involved text area
A text area is examined further only if its bed is at most k
Algorithm
Each q-sample sequence has its own counter M
M indicates the number of errors produced by the q-sample sequence and is initialized to $M = j \cdot (e + 1)$
So: we start by assuming that each q-sample gives enough errors to disallow a match (a sketch follows below)
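A hedged Python sketch of this counting idea, following our reading of the slides rather than the paper's exact bookkeeping; `bed_values` holds the best match distance of each of the j q-samples against its corresponding pattern block:

```python
def area_error_estimate(bed_values, e):
    """Start pessimistic: each q-sample is assumed to contribute e + 1
    errors (M = j * (e + 1)). A q-sample whose bed is <= e replaces its
    pessimistic e + 1 by the actual value. The text area is examined
    further only if the resulting M is at most k."""
    M = len(bed_values) * (e + 1)
    for bed in bed_values:
        if bed <= e:
            M += bed - (e + 1)
    return M
```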
Error-environment
After calculating M for each q-sample sequence, we obtain the e-environment of each q-sample sequence
This is the set of possible q-samples that appear inside the q-sample sequence with at most e errors
Finishing
Now every text area has its own e-environments, connected to it through the q-samples
They can be checked with dynamic programming
Finding approximate q-grams
Find all the text q-samples that appear inside a given pattern block $Q_i$
Note: it is not necessary to generate all of $U_e^q(Q_i)$, since we are interested only in the text q-samples (positions)
$I_e^q(Q_i) = \{\, r \in 1..\lfloor n/h \rfloor \mid bed(d_r, Q_i) \le e \,\}$
Finding approximate q-grams (2)
• Idea: store all the different text q-samples in a trie data structure
• We fill in a matrix $C_{0..q,\,0..|Q|}$ such that $C_{i,l}$ is the sed between $S_{1..i}$ and a suffix of $Q_{1..l}$
• S is relevant $\iff$ $C_{q,l} \le e$ for some $l$
• In a trie traversal of the q-samples, the characters of S are obtained one by one
Question (Laurence)
Can you please show how the matrix in fig. 4 of section 3 is built? It is a bit unclear to me how the matrix is initialized and how the different cells are filled.
Answer
$C_{i,j} = C_{i-1,j-1}$ if $S_i = Q_j$, else $C_{i,j} = 1 + \min(C_{i-1,j},\, C_{i-1,j-1},\, C_{i,j-1})$
Answer
S = "survey" (rows) against Q = "surgery" (columns):

        s   u   r   g   e   r   y
    0   0   0   0   0   0   0   0
s   1   0   1   1   1   1   1   1
u   2   1   0   1   2   2   2   2
r   3   2   1   0   1   2   2   3
v   4   3   2   1   1   2   3   3
e   5   4   3   2   2   1   2   3
y   6   5   4   3   2   2   2   2
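A short Python sketch (our code) that reproduces this matrix:

```python
def sed_matrix(S, Q):
    """C[i][l]: minimum edit distance between S[1..i] and a suffix of Q[1..l]."""
    C = [[0] * (len(Q) + 1) for _ in range(len(S) + 1)]
    for i in range(1, len(S) + 1):
        C[i][0] = i
        for l in range(1, len(Q) + 1):
            if S[i - 1] == Q[l - 1]:
                C[i][l] = C[i - 1][l - 1]
            else:
                C[i][l] = 1 + min(C[i - 1][l], C[i - 1][l - 1], C[i][l - 1])
    return C

for row in sed_matrix("survey", "surgery"):
    print(row)  # reproduces the table above row by row
```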
Finding approximate q-grams (4)
When we reach the leaf nodes (depth q) we check if there is a cell with value $\le e$; if so, the corresponding text is reported
Complexity: $O(|Q| \cdot q) = O(mq)$
Finding approximate q-grams (3)
Pruning:
• The values from one row to the next are nondecreasing
• If all the values of a row are larger than e, at that point we can abandon that branch of the trie (see the sketch below)
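A hedged Python sketch of the traversal; the trie node layout (a `children` dict and a `positions` list at the leaves) is our assumption:

```python
def trie_search(node, Q, e, row, depth, q, out):
    """DFS over the trie of text q-samples: each edge appends one character
    of S and computes one new matrix row, so shared prefixes share rows.
    A branch is abandoned as soon as all values of its row exceed e."""
    if depth == q:                       # leaf: a complete q-sample
        if min(row) <= e:
            out.extend(node.positions)   # report its text positions
        return
    for ch, child in node.children.items():
        new = [depth + 1]                # border cell C[depth+1][0]
        for l in range(1, len(Q) + 1):
            if ch == Q[l - 1]:
                new.append(row[l - 1])
            else:
                new.append(1 + min(row[l], row[l - 1], new[l - 1]))
        if min(new) <= e:                # pruning: rows only grow downwards
            trie_search(child, Q, e, new, depth + 1, q, out)

# initial call: trie_search(root, Q, e, [0] * (len(Q) + 1), 0, q, out=[])
```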
Finding approximate q-grams (5)
Alternative way:
• Model the search with a non-deterministic automaton (NFA)
Finding approximate q-grams (6)
Consider the NFA for $e = 2$ errors
Every row denotes the number of errors seen
Every column represents matching a prefix of S
Horizontal arrows represent matching a character
All the other arrows increment the number of errors (a bit-parallel simulation is sketched below)
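A hedged Python sketch of simulating such an automaton bit-parallel, in the style of Wu and Manber; this is a standard formulation, not necessarily the exact layout of the paper's figure:

```python
def nfa_search(S, text, e):
    """Bit i of R[d] is set iff S[1..i+1] matches a suffix of the text read
    so far with d errors (row d of the NFA). Reports match end positions."""
    m = len(S)
    B = {}                                    # match mask per character
    for i, c in enumerate(S):
        B[c] = B.get(c, 0) | (1 << i)
    R = [(1 << d) - 1 for d in range(e + 1)]  # row d starts with d deletions
    ends = []
    for j, c in enumerate(text, 1):
        old = R[:]
        R[0] = ((old[0] << 1) | 1) & B.get(c, 0)         # horizontal: match
        for d in range(1, e + 1):
            R[d] = ((((old[d] << 1) | 1) & B.get(c, 0))  # match a character
                    | old[d - 1]                          # insertion
                    | (old[d - 1] << 1)                   # substitution
                    | (R[d - 1] << 1)                     # deletion
                    | 1)
        if R[e] & (1 << (m - 1)):
            ends.append(j)
    return ends

print(nfa_search("survey", "surgery", 2))  # -> [5, 6, 7]
```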
Question (Bogdan)
I can imagine how the trie can be used together with the
matrix in order to benefit from common prefixes of
certain q-samples (by reusing the rows of the matrix
which are already computed for the common prefix).
However, I don't see how this can be done in the case of
the NFA. If it can't be done, this would mean that the
algorithm has to be run separately for each q-tuple,
which probably makes the NFA approach much worse.
Am I right to think that or is there a way to run the NFA in
a "smarter" way so as to benefit from common prefixes?
Answer (Bogdan)
Yes, you are right: the algorithm has to run for each q-tuple, but you have to consider its complexity, which is linear, O(e)
Parameters of the Problem
• With a smaller e, the search of the e-environment is cheaper
• A larger e gives more exact estimates of the actual number of errors, but at a higher cost to search the e-environment
• As j grows, longer test sequences with fewer errors per piece are used: the cost to find the relevant q-samples decreases, but the amount of text verification increases
Parameters of the Problem (2)
1. Notice: since the index of this approach only stores non-overlapping q-samples, its space requirement is small
2. Notice: the space consumption of the index depends on the interval h
Parameters of the Problem (3)
• The standard implementation of a q-gram index stores all the locations of all the q-grams of the text
• The number of q-grams is $n - q + 1$
• Storing a position takes $\log n$ bits
• $\Rightarrow$ space consumption is $O(n \log n)$
• Ratio between this method and the standard approach:
$vr = \frac{(n/h)\,\log(n/h)}{n\,\log n} \le \frac{1}{h}$
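A quick numeric check of this bound (our numbers):

```python
from math import log2

n, h = 10**6, 10
vr = (n / h) * log2(n / h) / (n * log2(n))
print(round(vr, 3), "<=", 1 / h)  # 0.083 <= 0.1
```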
Question (Bogdan)
Could you please explain what the "columns" used in the 5th section are?
• The table shows how the error level increases the number of processed columns of the matrix or NFA
Question (Lee/Bram)
The article talks about disjoint, non-overlapping q-grams. At the end they say that allowing overlapping q-grams will probably enhance the scheme. Any idea how our current algorithms would have to be changed for that, and what the advantages are?
http://www.cs.utexas.edu/users/mobios/MoBIoSPapers/2003-IndexingProteinSequences-TR-0406.pdf
Question (Lee)
In the second paragraph of section 4 they say “In that particular case we can avoid the use of counters…” Can you explain that?
Answer
The error counters M are initialized at a high value
After that, all pattern blocks are compared to the corresponding text piece, and the counter value is updated to a lower value
In this particular case, when $e = \lfloor k/j \rfloor$, the error counter can get as low as k+1, which is still higher than the threshold k
Any other questions?