Matching and Indexing Sequences of Different Lengths

Tolga Bozkaya
Nasser Yazdani
Meral Özsoyoǧlu
Department of Computer Engineering and Science
Case Western Reserve University
Cleveland, OH 44106
{bozkaya, yazdani, ozsoy}@ces.cwru.edu
Abstract
In this paper, we consider the problem of efficient matching and
retrieval of sequences of different lengths. Most of the previous
research is concentrated on similarity matching and retrieval of
sequences of the same length using the Euclidean distance metric. For
similarity matching of sequences, we use a modified version of the
edit distance function, and consider two sequences matching if a
majority of the elements in the sequences match. In the matching
process a mapping among non-matching elements is created to
check if there are unacceptable deviations among them. This
means that two matching sequences should have lengths that are
comparable. For efficient retrieval of matching sequences, we
propose an indexing scheme which is totally based on lengths
and relative distances between sequences. We use vp-trees as the
underlying distance-based index structures in our method.
1 Introduction
The problem of matching sequences with respect to a
similarity measure is encountered in a variety of applications
such as text and information processing, genetics, time series
analysis, scientific databases, etc. It is important for these
applications to efficiently identify and retrieve data
items (sequences) similar to a given query item. The results of
these queries can be used for different purposes such as
information retrieval, data mining, classification, etc.
In this paper, we address the general problem of indexing
and matching sequences of different lengths while putting
more emphasis on numerical sequences from scientific
experiments since that was our motivating application. Still,
our methods are general and can be applied to other data
domains with sequences.
In our work, we use a modified version of the edit distance
function to compute the distance between two sequences,
and to find the correspondence among the elements of
the sequences. To answer similarity match queries, we
certainly need an efficient index structure. We cannot
use conventional index structures, since we do not assume
any geometry on the domain of sequences. Instead, we
designed an indexing scheme which is based on the lengths
of sequences and relative distances between sequences. A
distance-based index structure, the vp-tree [Uhl91], is used as
the underlying index structure in our method. The indexing
scheme is used as a major filtering mechanism to eliminate
distant sequences in processing a similarity match query.
The rest of the paper is organized as follows. In
section 2 we provide a brief overview of the previous work
on the sequence matching problem. In section 3, we give
the motivation for our work, and present the definition
for matching sequences. Section 4 contains our methods
for computing the similarity between sequences, and the
matching process. In section 5, we propose a general method
for indexing sequences (of different lengths) for similarity
match queries with respect to the matching process of section
4. Section 6 concludes.
2 Related Work
To our knowledge, [AFS93] is the first work which proposes
a solution for similarity matching of sequences. In [AFS93],
it is assumed that all sequences are of the same length, and
each sequence is considered as a point in an N-dimensional
space. Then, two sequences are considered similar when the
Euclidean distance between them is less than a threshold
value $\epsilon$. The authors use the R*-tree [BKSS90] as the index
structure. Sequences are represented as K-dimensional
points using K features for each sequence. The Discrete Fourier
Transform (DFT) is used for feature extraction since it
preserves the Euclidean distance.
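The distance-preserving property mentioned above can be checked numerically. The following is a minimal sketch, not the implementation of [AFS93]: the function names, the sample values and the choice K = 3 are assumptions made for illustration. It uses a unitary DFT (scaled by 1/sqrt(n)) and shows that the distance is unchanged in the frequency domain, and that keeping only the first K coefficients can only shrink it, which is why filtering in the K-dimensional feature space does not lose true matches.

```python
import cmath
import math


def dft(x):
    """Unitary DFT: the 1/sqrt(n) factor makes the transform distance preserving."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length real or complex vectors."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


# Arbitrary sample sequences of the same length (illustrative values only).
x = [1.0, 2.0, 4.0, 3.0, 0.5, 2.5, 3.5, 1.5]
y = [1.5, 2.5, 3.0, 3.5, 1.0, 2.0, 4.0, 1.0]
K = 3  # number of retained coefficients (hypothetical choice)

full = euclidean(x, y)                                 # distance in the time domain
in_frequency_domain = euclidean(dft(x), dft(y))        # same value, up to rounding
in_feature_space = euclidean(dft(x)[:K], dft(y)[:K])   # never larger than `full`
```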
Faloutsos et al. extend the method proposed in [AFS93]
to locate subsequences that match a query sequence or a subsequence of it [FRM94]. DFT is used for feature extraction,
which preserves the Euclidean distance between sequences.
The fact that it is a distance preserving transformation makes
DFT attractive for indexing. However, it can be used only
for sequences of the same length. Also, it is not very effective
for sequences with mostly uncorrelated elements (such
as random vectors). Another feature extraction method for
sequences is the Eigenvector method [Fuk72]. The Eigenvector method also preserves Euclidean distance; however,
it also works only with sequences of the same length.
Agrawal et al. [ALSS95] give a method to retrieve similar
sequences in the presence of noise, scaling and translation in
time series.
This research is partially supported by the National Science Foundation grant IRI 92-24660, and the National Science Foundation FAW award
IRI-90-24152.
In [LT94], a modified version of the edit distance function
is used to find the matching text to a hand-written text
using pen-stroke data. This approach is similar to ours
in the sense that we also use a modified version of the edit distance function to compute similarity between sequences.
However, we use different cost functions and we also create
a one-to-one mapping between matching and non-matching
elements of sequences, making deletions and insertions (via
interpolation) if necessary.
A sequence matching method for sequences of different lengths based on dynamic programming is proposed in
[YO96]. A modified version of the Longest Common Subsequence (LCS) [CLR90] was used for actual matching of
the elements of two sequences. For filtering, a feature-based
indexing mechanism is used, where the length, mean, and
variance (first two moments) of each sequence are used as
features.
Indexing sequences to efficiently handle similarity
matching queries is also an important problem. In terms
of indexing, we see that a good portion of the previous
work [AFS93, FRM94] concentrates on matching sequences
where the similarity between sequences is assumed to
be Euclidean. However, note that for many domains and
applications (such as text databases, genetic sequences) the
Euclidean distance function cannot be used directly as a
similarity metric because the domain is non-numeric and
sequences may be of different lengths. In these cases,
indexing methods for Euclidean spaces are usually not
applicable. Still, there are other distance-based index
structures [Uhl91, Bri95] that do not assume any geometry of
the application domain, but only depend on the fact that the
distance function is metric (see section 5 for the definition). The
simple idea behind these structures is to use some reference
object(s) to partition the search space with respect to the
distances of data objects to the reference object(s). At the
time of search, some of these partitions are not searched any
further depending on their relative distances to the reference
point(s) and the distance(s) between the query point and the
reference point(s). The vp-tree [Uhl91], a distance-based
index structure, is explained in more detail in section 5, since
it is used as part of our indexing scheme.
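The reference-object idea described above can be illustrated with a single reference object. This is only a sketch under stated assumptions, not the structure of [Uhl91] or [Bri95]: `hamming`, `pivot_range_search` and the word list are hypothetical names and data chosen for the example, and Hamming distance stands in for any metric distance function. A data object is skipped without computing its distance to the query whenever the triangle inequality already shows it cannot be within the search radius.

```python
def hamming(a, b):
    """A simple metric on equal-length strings: number of differing positions."""
    return sum(ca != cb for ca, cb in zip(a, b))


def pivot_range_search(data, pivot, query, r, dist):
    """Return (matches, number of query-to-data distance computations)."""
    # Distances to the reference object would normally be precomputed at build time.
    pivot_dists = [dist(pivot, x) for x in data]
    dq = dist(pivot, query)
    matches, computed = [], 0
    for x, dx in zip(data, pivot_dists):
        # Triangle inequality: d(query, x) >= |d(pivot, query) - d(pivot, x)|.
        if abs(dq - dx) > r:
            continue  # x cannot be within r of the query, so it is filtered out
        computed += 1
        if dist(query, x) <= r:
            matches.append(x)
    return matches, computed


# Illustrative data; the first word doubles as the reference object.
words = ["karolin", "kathrin", "kerstin", "carolin", "marolin", "zzzzzzz"]
matches, computed = pivot_range_search(words, "karolin", "karolyn", 1, hamming)
```

The filter never discards a true answer as long as the distance function satisfies the triangle inequality, which is why these structures require a metric.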
3 Motivation and Problem Definition
In an earlier work [YOO94], we proposed a feature-based
indexing method for exact as well as similarity matching
of damage zone shapes (areas) in polypropylene that result
under high stress at low temperatures [SHB92]. Such queries
are important to study the suitability of different materials
for many engineering applications. The indexing method
extracts some features such as the number of vertices in the
polygonal approximation of the shapes and the number of
components, and indexes them with respect to these features.
Extracted features are represented as a feature vector for
each image. [YO96] introduces a method for matching
sequences of damage zone shapes in particular, and images
from lab experiments in general. This method also uses the
idea of feature extraction and representation of images by
their feature vectors where each feature is represented by
a numerical value. Therefore, each image sequence can be
represented by an $N \times M$ matrix where N is the number of
features for each image and M is the number of images
in the sequence. The extracted features can also be the
amount of change of some features like area, width, etc., in
successive frames. Using such features would be helpful
in studying change patterns by identifying sequences with
similar patterns via similarity match queries.
For the feature extraction method to be effective, the
following conditions must hold. Let R and S be two image
sequences and $F_R$ and $F_S$ be their feature vectors. Then,
- If R and S match $\Rightarrow$ $F_R$ and $F_S$ match.
- If R is similar to S $\Rightarrow$ $F_R$ is similar to $F_S$.
Using the feature vector representation, the problem of
matching image sequences is transformed into the problem of
matching sequences of numerical values in an N-dimensional space. The number of images is not guaranteed to be the same for different sequences,
even for the sequences obtained by repeating one experiment
(since sampling rates may be different, and some data elements might be lost in the sampling period). We should be
able to compare these sequences since such comparisons are
important to verify the results of previous experiments and
identify common patterns in these experiments.
In this paper, we use the definition for matching two
sequences of different lengths given in [YO96], which uses
the same idea of transforming the sequence matching problem
to numeric sequence matching in the way discussed above.
Before we generalize and give Definition 3.1 for matching
sequences in scientific experiments, we would like to address
the following points for motivation.
- The relative times that the corresponding samples are
  taken are almost the same in both sequences. This means
  that the lengths of sequences should be close to each
  other to be matched.
- The elements of both sequences are taken from the
  lifetime of the experiment in a rather uniform manner.
- Two sequences can be considered matching (or similar)
  if the majority of their elements match.
- In numeric sequences from scientific domains, since the
  elements are real numbers obtained during the experiment with a limited precision, elements from different
  sequences should be matched based on proximity. In
  non-numeric sequences, matching is usually done based
  on equality.
Definition 3.1 Assume two sequences S and Q have lengths
M and N respectively, and $s_{i_1}, s_{i_2}, \ldots, s_{i_K}$ are matching
$q_{j_1}, q_{j_2}, \ldots, q_{j_K}$. The sequences S and Q match each other
if
1. $\min\{N, M\} / \max\{N, M\} \ge \alpha$ where $\alpha \le 1$. Here, $\alpha$ is referred to as the length aspect ratio, and it is preferably
close to 1.
2. $distance(s_{i_k}, q_{j_k}) \le \epsilon$ for all k matching elements, $1 \le k \le K$, where K is the number of matching elements. $\epsilon$ is referred to as the matching distance.
3. $K \ge \beta \min\{N, M\}$ where $\beta \le 1$. $\beta$ is referred to as the
matching coefficient, and it is also preferably close to 1
for higher selectivity.
4. A one-to-one mapping can be found between all unmatched elements (making insertions or deletions if necessary) of the two sequences such that for all mapped
elements $s_i$ and $q_j$, the relation $distance(s_i, q_j) < \delta$ holds ($\delta \ge 0$). (Most likely, $\delta$ will be a function of $\epsilon$ if
$\epsilon > 0$.)
Note that the distance between two elements $s_i$ and $q_j$
can be defined in a different way for each domain. For
numeric sequences, $distance(s_i, q_j)$ is simply $|s_i - q_j|$. This
simple distance function can't be used for every domain,
such as multi-dimensional vector sequences or non-numeric
sequences. As mentioned before, in most applications, the
elements of non-numeric sequences can be matched based on
equality ($\epsilon = 0$). In that case, the distance between any two
elements is defined to be 0 if they are equal, and a positive
number if they are not.
The method proposed in [YO96], which is based on
the Longest Common Subsequence problem, does not provide any mapping for the unmatched elements and does not check condition 4 of the definition above. Note that
mapping unmatched elements is important for some domains (e.g., scientific experiments) where it is not desirable
for matching sequences to have large deviations along the
unmatched elements.
Most of the sequence matching methods, in an environment with a large set of data sequences, work in two phases.
In the first phase, a finite number of data sequences are
filtered out by searching in an index structure. These sequences are hypothesized as matching candidates with the
given query sequence. In the second phase, all hypothesized
sequences are verified for actual matching. Our method also
works in two phases. In section 4, we explain our method
for matching two sequences. Indexing and filtering will be
discussed in section 5.
4 Matching Process
Since our method uses a modified version of edit distance
for approximate text matching, first, we briefly review the
edit distance and approximate text matching problem. Next,
we discuss our method for similarity matching of numeric
sequences of different lengths.
4.1 Edit Distance and Approximate Text Matching
The edit distance between two alphanumeric sequences
(referred to as text and pattern) is defined [CR94] as the
minimum number of operations that are needed to change
a query text into a pattern. Three operations, delete, insert
and change, are allowed. Assuming the cost of each of these
operations is 1, the edit distance is the minimum number of
operations needed to obtain a pattern from a text. A dynamic
programming solution for this problem is given in [CR94].
Let $D[i, j]$ denote the minimum edit distance between two
texts $Q[1 \ldots i]$ and $S[1 \ldots j]$. Then, the minimum distance
is defined recursively as follows,
$$
D[i,j] =
\begin{cases}
j & \text{if } i = 0; \\
i & \text{if } j = 0; \\
D[i-1,j-1] & \text{if } i, j > 0 \text{ and } q_i \text{ is equal to } s_j; \\
\min\{D[i-1,j-1]+1,\; D[i-1,j]+1,\; D[i,j-1]+1\} & \text{otherwise.}
\end{cases}
\qquad (1)
$$
The values in the last part of this formula correspond to
change, delete and insert respectively. Formula (1) also introduces an algorithm to find the edit distance between a text
Q and a pattern S with lengths N and M in $O(MN)$ time. A particular case of the edit distance
algorithm gives the longest common subsequence (LCS) of
two sequences [CR94]. Let us assume $D[i, j]$ is the minimal number of deletions and insertions, not changes, necessary to transform a text $Q[1 \ldots i]$ into $S[1 \ldots j]$. Evaluation of $D[i, j]$ is equivalent to computation of the length
of the LCS between $Q[1 \ldots i]$ and $S[1 \ldots j]$. Indeed, this is a
restricted version of edit distance where changes are not allowed. However, we will refer to it as the edit distance in
this paper.
Assume $C[i, j]$ is the length of the LCS of $Q[1 \ldots i]$
and $S[1 \ldots j]$. The following lemma shows the relationship
between $C[i, j]$ and $D[i, j]$ for two sequences Q and S with
lengths N and M. Due to space limitations, for the proof
of the following lemma and the other proofs, please refer to
[BYO97].
Lemma 4.1 [CR94]
$2C[Q, S] = M + N - D[Q, S]$ and
$2C[i, j] = i + j - D[i, j]$ for $0 \le i \le N$ and $0 \le j \le M$.
This is an interesting result. It indicates that finding the
restricted edit distance is the dual of finding the LCS;
therefore, any solution to it automatically gives the
answer to the LCS of Q and S.
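The relationship between the restricted edit distance and the LCS can be checked on a small example. The sketch below is an illustration under assumptions rather than the code of [CR94]: the function names and the `allow_change` flag are introduced here, elements are matched on equality, and the strings are arbitrary test inputs. With changes allowed it follows Formula (1); with changes disallowed it computes the restricted edit distance.

```python
def edit_distance(q, s, allow_change=True):
    """Edit distance by dynamic programming; D[i][j] compares q[:i] with s[:j]."""
    n, m = len(q), len(s)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        D[0][j] = j                      # i = 0: insert j elements
    for i in range(n + 1):
        D[i][0] = i                      # j = 0: delete i elements
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if q[i - 1] == s[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                best = min(D[i - 1][j] + 1, D[i][j - 1] + 1)    # delete, insert
                if allow_change:
                    best = min(best, D[i - 1][j - 1] + 1)       # change
                D[i][j] = best
    return D[n][m]


def lcs_length(q, s):
    """Length of the longest common subsequence, computed independently."""
    n, m = len(q), len(s)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if q[i - 1] == s[j - 1]:
                C[i][j] = C[i - 1][j - 1] + 1
            else:
                C[i][j] = max(C[i - 1][j], C[i][j - 1])
    return C[n][m]


q, s = "kitten", "sitting"
full = edit_distance(q, s)                              # changes allowed
restricted = edit_distance(q, s, allow_change=False)    # insertions and deletions only
# Lemma 4.1: 2 * C[Q, S] = M + N - D[Q, S] for the restricted distance.
duality_holds = 2 * lcs_length(q, s) == len(q) + len(s) - restricted
```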
4.2 Edit Distance and Sequence Matching
We now explain how the edit distance can be applied to find
matching sequences of different lengths.
1. Instead of checking equality between two elements $q_i$
and $s_j$ from two sequences Q and S (respectively), we
check if the elements are within a matching distance
from each other. In other words, it is checked whether
the relation $distance(s_j, q_i) \le \epsilon$ holds.
2. Change is not allowed in computing the edit distance.
3. For numeric sequences, in insertion, the new elements
are calculated by interpolation instead of just inserting
elements from other sequences. For instance, the value
$(q_i + q_{i+1})/2$ would be inserted between $q_i$ and $q_{i+1}$.
Formula 2 below calculates the edit distance between two
sequences Q and S.
$$
D[i,j] =
\begin{cases}
j & \text{if } i = 0; \\
i & \text{if } j = 0; \\
D[i-1,j-1] & \text{if } i, j > 0 \text{ and } q_i \text{ matches } s_j; \\
\min\{D[i-1,j]+1,\; D[i,j-1]+1\} & \text{otherwise.}
\end{cases}
\qquad (2)
$$
The mapping between matched and unmatched elements
can be found easily from this formula.
Example 4.1 Consider two sequences Q = <2.2, 3.9,
2.9, 1.9, 4.5, 3.2> and S = <3.2, 4.2, 2.1, 3.3, 4.1, 4.3,
3.1>, and let us assume the matching distance is $\epsilon = 0.5$.
Applying Formula 2 [BYO97] to sequences Q and S gives the alignment
shown in Figure 1. As Figure 1 illustrates, some
elements of Q are deleted and also some new elements are
inserted by interpolation. The matching elements from the
two sequences are shown by vertical lines. The method
extends and projects Q in such a way that both sequences
have the same length in the end.
[Figure 1: An alignment for sequences Q and S, marking the deleted elements and the new inserted elements.]
Besides a bad alignment,
there is a subtle flaw in this method. It does not use the
unmatched elements of the query sequence. Hence, Formula
2 is modified as follows:
$$
D[i,j] =
\begin{cases}
j & \text{if } i = 0; \\
i & \text{if } j = 0; \\
D[i-1,j-1] & \text{if } i, j > 0 \text{ and } q_i \text{ matches } s_j; \\
\min\{D[i-1,j-1]+2,\; D[i-1,j]+1,\; D[i,j-1]+1\} & \text{otherwise.}
\end{cases}
\qquad (3)
$$
This means changes are allowed; however, weight 2 is given
to changes, while weight 1 is given to each deletion and
insertion. The following lemma shows that Formula 3 finds
the same edit distance as Formula 2 does.
Lemma 4.2 Formulas (3) and (2) are equivalent in the sense
that they compute the same edit distance.
In computing the values of $D[i, j]$ in Formula 3, we give
the preference from left to right. Therefore, whenever two
or three expressions give the same value for the minimum,
we choose change or insertion first. Deletion will be the
last choice. Nevertheless, in order to apply Formula 3 to
the sequence matching problem, the original values are kept
in case of changes. Applying Formula 3 to the sequences
of Example 4.1 gives the alignment shown in Figure 2.
The dashed boxes show the mapping among unmatched
elements. As shown in Figure 2, it gives a better
mapping and a better alignment than the one obtained using
Formula 2.
[Figure 2: An alignment obtained by applying Formula 3, marking the new inserted elements.]
The running time for finding the edit distance between
two sequences Q and S by applying Formulas 2 and 3
is $O(MN)$ where N and M are the lengths of Q and S
respectively. However, since the definition for matching
of two sequences (Definition 3.1) puts a restriction on
the number of non-matching elements, the formulas, and
consequently the algorithm, can be modified in order to make
it more efficient. The following lemma states this fact.
Lemma 4.3 Assume Q and S with lengths N and M are
two sequences which match each other with a matching
coefficient $\beta$, and $D[S, Q]$ denotes the edit distance between
S and Q. Then, $D[S, Q] \le M + N - 2\beta \min\{M, N\}$.
Lemma 4.3 bounds the value of the edit distance between two
matching sequences, and this can be used to expedite the
matching process by giving a tool to exit the algorithm when
two sequences do not match. It also gives a powerful tool to
limit the range of operations in finding the edit distance.
Let us assume $l_i$ and $r_i$ are two lines which identify the
boundary of the range and are defined as follows,
$$l_i = \max\{0,\; i - N + \beta \min\{M, N\}\}$$
$$r_i = \min\{M,\; i + M - \beta \min\{M, N\}\}$$
Then, the edit distance of
two matching sequences Q and S, and, consequently, the
correspondence between their elements, can be found using
the recursive formula below. The formula is defined for $D[i, j]$ in the
range $l_i \le j \le r_i$.
$$
D[i,j] =
\begin{cases}
j & \text{if } i = 0 \text{ and } 0 \le j \le r_i; \\
i & \text{if } j = 0 \text{ and } i \le N - \lfloor \beta \min\{M, N\} \rfloor; \\
D[i-1,j-1] & \text{if } i > 0,\; l_i \le j \le r_i \text{ and } q_i \text{ matches } s_j; \\
\min\{D[i-1,j-1]+2,\; D[i,j-1]+1,\; D[i-1,j]+1\} & \text{if } i > 0,\; l_i < j < r_i \text{ and } q_i \text{ does not match } s_j; \\
\min\{D[i-1,j-1]+2,\; D[i-1,j]+1\} & \text{if } i > 0,\; j = l_i > 0; \\
\min\{D[i-1,j-1]+2,\; D[i,j-1]+1\} & \text{if } i > 0,\; j = r_i > 0.
\end{cases}
\qquad (4)
$$
This formula computes the modified restricted edit distance
function. However, for simplicity, we will call it the modified
edit distance.
Theorem 4.1 Assume that we have two sequences Q and S
with lengths N and M respectively. Without loss of
generality, let $M \le N$, and $\alpha N \le M$ so that their lengths
satisfy the first condition of Definition 3.1. Let $D[Q, S]$ and
$ED[Q, S]$ denote the modified edit distance (Formula 4),
and the edit distance (Formula 3) respectively between Q
and S.
1. If the sequences S and Q match each other with matching
coefficient $\beta$, then $D[Q, S] = ED[Q, S]$.
2. If the sequences S and Q do not match each other, then
$D[Q, S] \ge ED[Q, S]$.
Before analyzing the running time of the algorithm for the
edit distance, we mention the following lemma from [YO96],
where $\alpha$ is the length aspect ratio (LAR) for sequences Q and
S.
Lemma 4.4 Assume two sequences, Q and S with lengths
N and M, respectively, are matching. Then, $\alpha N \le M \le N/\alpha$.
Theorem 4.2 The edit distance of two sequences Q and S
with lengths N and M, length aspect ratio $\alpha$ and matching
coefficient $\beta$ can be found in $O(MN[1 - \beta^2 \alpha])$ in the worst
case. Furthermore, the correspondence between the elements of S and Q
can be constructed from the table used to find the edit
distance in $O(M + N)$.
The values of $\alpha$ and $\beta$ determine the running time of the
algorithm. For small values of $\alpha$ and $\beta$, the running time
approaches $O(MN)$. If both $\alpha$ and $\beta$ approach 1, the
running time approaches $O(N)$, which is expected since
the problem becomes the whole matching of two
sequences with the same length.
5 Indexing and Filtering for Similarity Search Queries on Sequences
Applying a sequence matching algorithm to a large set of
sequences in a sequential manner is very time consuming.
The main challenge is to efficiently hypothesize a small
set of sequences as matching candidates for a given query
sequence. Here, we propose an indexing scheme for efficient
processing of similarity search queries when the restricted edit
distance between the sequences is used as the similarity
measure. We will present this indexing scheme as a general
solution for the problem of matching sequences of different
lengths. The matching of elements is based on equality,
that is, the distance between any two sequence elements will
be zero if they are equal, and a positive number if they are
not ($\epsilon = 0$). Note that edit distance is not metric if elements
of sequences are matched based on proximity with a nonzero $\epsilon$ (see Example 5.1). A metric distance function is
required for an index structure to be used. Also, note the
difference between the distance between sequence elements
and the distance between sequences. Below, the distance
function refers to the distance function for the sequences if
not specified otherwise.
A distance function $d(x, y)$ is metric if it satisfies the
following simple conditions:
(i) $d(x, y) = d(y, x)$
(ii) $0 < d(x, y)$ for $x \ne y$
(iii) $d(x, x) = 0$
(iv) $d(x, y) \le d(x, z) + d(y, z)$ (triangle inequality)
The edit distance function, when elements are matched
with respect to their equality ($\epsilon = 0$), is a metric distance
function. The following example shows that edit distance is
not metric if elements are matched based on proximity for a
non-zero $\epsilon$.
Example 5.1 Consider three sequences of length 2, x = <0.5, 0.5>,
y = <0.55, 0.55>, z = <0.6, 0.6>, and $\epsilon = 0.09$.
Note that although y matches both z and x, x and z
do not match at all, causing the violation of the triangle
inequality ($d(x, z) > d(x, y) + d(y, z)$).
We have previously proposed [YO96] a technique for
indexing sequences based on some of their features, namely,
lengths and the first two moments (mean and variance).
Each data sequence is represented by a point in the feature
space. Any multi-dimensional point access method can
be used as the index structure. Similarity match queries
are transformed into range queries in the feature space.
However, this method does not guarantee that the search
process does not miss any probable matching sequence in
the similarity matching process. This problem (referred to
as the false dismissal problem) originates from the fact that
estimated values for mean and variance may have errors.
Therefore, the search bounds which are computed based on
these parameters may contain errors that may cause false
dismissal of some matching sequences.
Assuming the sequence elements are matched based on
equality, we present here an indexing method which guarantees that there won't be any false dismissals. A framework
and an analysis for indexing for similarity search when sequence
elements are matched approximately can be found
in [BYO97]. Our indexing structure is constructed as follows. First, we group the data sequences with respect to
their lengths. Each group accommodates sequences that are
lengthwise close to each other. Second, the data sequences
in each group are indexed by a vp-tree [Uhl91]. Having constructed the index structure, a similarity search proceeds in
two phases. First, we identify the groups that accommodate
data sequences that may match the query sequence. This
is done based on the length of the query sequence. Then,
in the second phase, the vp-trees for these identified groups
are searched to filter out the data sequences that are distant
from the query sequence, and to find the ones that are within
matching distance to the query sequence.
In the following subsections, we explain how grouping
and indexing are done.
5.1 Grouping Data Sequences by Length
For a similarity search query, the simplest way to filter out
many distant sequences is to discard all the data sequences
whose lengths do not satisfy the first condition of Definition
3.1 with respect to a given LAR $\alpha$. For this purpose, we
classify the sequences into sets $c_1, c_2, \ldots, c_m$ with respect to
their lengths. Then, all sequences in any given class $c_j$ have
the same length, which we refer to as $l(c_j)$. Next,
we can index the sequences in each class separately, ending
up with an index structure for each class. We do not favor
this approach for several reasons. First, there will be
too many different indices and many of them will be visited
during a similarity search query. Consider a query sequence
of length N. We would have to search through the indices
of all the classes $c_j$ where $\alpha N \le l(c_j) \le N/\alpha$ (Lemma
4.4). Although the main objective is to minimize the number
of distance computations, searching through many shallow
index structures would increase the I/O considerably.
We propose to group these classes with respect to the
length of the sequences they have, and the LAR $\alpha$. By grouping a number of these classes and building the indices on
these groups we hope to increase the efficiency. The classes
$c_1, \ldots, c_m$ are further grouped into sets $g_1, \ldots, g_n$ ($n \ll m$)
using the following procedure.
Grouping classes by lengths:
input: classes $c_1, \ldots, c_m$ output: groups $g_1, \ldots, g_n$
1. let i = 1; j = 1.
2. $c_i$ will be in $g_j$. Let $c_k$ be the class such that $l(c_k)$ is
the smallest (among other classes) that is greater than
or equal to $l(c_i)/\alpha$ (i.e., $l(c_k) \ge \lceil l(c_i)/\alpha \rceil$). For all
t, $k \ge t > i$, $c_t$ will also be in $g_j$. We will refer to the $l(c_i)$
and $l(c_k)$ values as $min(g_j)$ and $max(g_j)$ respectively.
3. i = k + 1; j = j + 1;
4. if $i \le m$, go to step 2.
Example 5.2 Assume that we have $c_1, \ldots, c_{35}$ where $l(c_i)$
= 89 + i for i = 1, ..., 35. Let $\alpha$ be equal to 0.9. Since
$\lceil (l(c_1) = 90)/\alpha \rceil = 100 = l(c_{11})$, $c_1, \ldots, c_{11}$ are put into $g_1$. $c_{12}, \ldots, c_{24}$
go into $g_2$ since $\lceil (l(c_{12}) = 101)/\alpha \rceil = 113 = l(c_{24})$. The
last group $g_3$ takes $c_{25}, \ldots, c_{35}$.
Note that, because of the way we do the grouping, all
sequences in a given group $g_j$ have comparable lengths with
respect to each other as any two sequences in $g_j$ have the
length aspect ratio at most $\alpha$.
Each of these groups will be indexed separately by a vp-tree. Our objective in forming the groups $g_1, \ldots, g_n$ was
to limit the number of index structures visited during a
similarity search query. The following theorem formally
states the upper bound for the number of groups to be
searched for a similarity search query. Its proof follows from
Lemma 4.4.
Theorem 5.1 There will be at most 3 groups visited for a
similarity search query.
We made the implicit assumption that all the queries are
specified with respect to the same length aspect ratio, $\alpha$,
the value which we used for grouping. Note that it is also
possible to specify queries using different values for $\alpha$. This
certainly affects the number of groups searched in the query,
where we may not be able to guarantee that at most
3 groups will be visited. The following theorem elaborates
more on these bounds.
Theorem 5.2 Let $\alpha$ be the LAR for the application on which the
grouping is based and $\alpha'$ be the LAR specified for the query.
1. if $\alpha' \ge \alpha^{1/2}$, then there will be at most 2 groups visited
for the query.
2. if $\alpha^{i+1} \le \alpha' \le \alpha^{i}$ then the maximum number of groups
to be visited is 2i + 3.
5.2 Indexing the Groups by VP-trees
We use vp-trees (vantage point trees) [Uhl91] as the index
structure for the groups of sequences. The vantage point tree
is a balanced distance-based index structure that could be
useful in any metric data space. At each level of the tree, a
vantage point is picked among the data points that will be
indexed below that level. After that, the distances between
that vantage point and the other points are computed, and
the points are sorted with respect to their distances from the
vantage point. This sorted list is divided into m groups of
equal cardinality where m is the order of the tree. Each
of these subgroups of points is indexed at the next level in
the same way. The vp-tree does not make any assumption
about the geometry of the data space, but only assumes that
the distance function used is metric. That is why we made
the assumption that the sequence elements could be matched
only if they are equal.
The similarity search algorithm for vp-trees is also simple,
and only based on filtering out branches that index distant
points using the triangle inequality. The search proceeds as
follows. Assume we have a query item Q and we want all
data items that are within distance r of Q. We start from
the root of the vp-tree. The vantage point of the root node
(vproot ) partitions the data space into spherical cuts, where
each branch below that node (root) accommodates the points
that fall into one of these spherical cuts. These branches
are searched with respect to the inner and outer radii of the
spherical cut they correspond to. So, if a branch has the inner
radius Ri and the outer radius Ro, it is searched only if:
dist(vproot, Q) + r ≥ Ri and dist(vproot, Q) − r ≤ Ro
otherwise the branch is not searched. The search proceeds
the same way, continuing from the roots of the branches that
qualify.
The edit distance function is used when constructing the
vp-trees since it is a metric distance function (with the
assumption we made). At search time, the distances between
the query point and the vantage points are calculated
using the edit distance function to direct the search properly.
However, the distances between the query point and the data
points (sequences) in the leaves (the points that are not
vantage points) are calculated using the modified edit
distance function. Note that the modified edit distance
function overestimates the actual edit distance, as
shown in Theorem 4.1. Part 1 of Theorem 4.1 guarantees
that we do not dismiss any of the matching sequences,
since the modified edit distance function computes the exact
distance for such sequences. The overestimation of the
actual distances for the other sequences does not hurt, as
we would dismiss them in the end anyway since they do not
match. In conclusion, using the modified edit distance function
in the verification phase does not violate our correctness
guarantee (i.e., no false dismissals) while providing a faster
way of identifying matching sequences. Note that we
cannot use the modified edit distance for construction of the
vp-trees and throughout the full search process since it is
not metric (it may overestimate). We only use the modified
edit distance function to check the data points in the leaves
to see if they match the query sequence. Also note that the
vantage points that match the query point with respect to the
specified similarity measure are reported early in the search,
since the exact edit distance between those vantage points
and the query point has already been calculated.
The vp-tree is a static index structure, i.e., it is built on a
static set of data items. Insertions are possible, but only
at the expense of violating the balance of the vp-tree.
Therefore, we will assume that we are given a fixed set
of data sequences that won’t change (or will change very little)
throughout the application. This means that once the groups
g1, ..., gn are formed, there will not be any insertions to or
deletions from them.
The process to answer a similarity match query is rather
simple. First, we find all the groups that may have data
sequences which could possibly match the query sequence.
This is strictly done based on the value of (LAR) and the
length of Q. Next, we search the vp-trees for these groups to
find data sequences that may match Q.
When searching the vp-trees, the search tree is pruned
with respect to a threshold value. The search looks for data
sequences whose (edit) distances from the query sequence
are less than or equal to that threshold. The following lemma
states the value for that threshold for a given query sequence.
Lemma 5.1 For any given query sequence Q with length N ,
The distance between Q and any matching sequence is less
than or equal to N (1 + 1= ? 2).
Note that the value given above for the threshold is an upper
bound which is simply used to direct the search. A data
sequence S actually matches a query sequence
Q if and only if the edit distance between them satisfies the
condition given in Lemma 4.3.
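The threshold-directed search can be illustrated with a small sketch. It is not the authors' implementation: the names Node, build, range_search, leaf_dist and leaf_size are hypothetical, the textbook unit-cost edit distance stands in for the edit distance defined earlier in the paper, and leaf points are accepted by a simple distance test rather than by the condition of Lemma 4.3. A branch is visited only when the ball of radius r around the query can intersect the spherical cut [Ri, Ro] of that branch.

```python
import random


def edit_distance(a, b):
    """Textbook unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


class Node:
    """Internal vp-tree node: a vantage point plus one child per spherical cut."""

    def __init__(self, vantage, cuts):
        self.vantage = vantage
        self.cuts = cuts  # list of (inner_radius, outer_radius, child)


def build(points, dist, order=2, leaf_size=4, rng=None):
    """Build a vp-tree of the given order; leaves are plain lists of points."""
    rng = rng or random.Random(0)
    points = list(points)
    if len(points) <= leaf_size:
        return points
    vantage = points.pop(rng.randrange(len(points)))  # random vantage point
    ranked = sorted((dist(vantage, p), i) for i, p in enumerate(points))
    cuts = []
    for k in range(order):
        chunk = ranked[k * len(ranked) // order:(k + 1) * len(ranked) // order]
        if chunk:
            child = build([points[i] for _, i in chunk], dist, order, leaf_size, rng)
            cuts.append((chunk[0][0], chunk[-1][0], child))
    return Node(vantage, cuts)


def range_search(tree, query, r, dist, leaf_dist=None):
    """Return stored points within distance r of the query."""
    leaf_dist = leaf_dist or dist  # assumption: a cheaper check may be used in leaves
    if isinstance(tree, list):  # leaf: verify each candidate directly
        return [p for p in tree if leaf_dist(query, p) <= r]
    d = dist(tree.vantage, query)
    hits = [tree.vantage] if d <= r else []  # exact distance is already known
    for inner, outer, child in tree.cuts:
        # Visit a cut only if the query ball can intersect it.
        if d + r >= inner and d - r <= outer:
            hits.extend(range_search(child, query, r, dist, leaf_dist))
    return hits
```

Because the pruning test relies only on the triangle inequality, the dist argument must be a metric; leaf_dist is used only for the final check of leaf points, which mirrors the role the text gives to the modified edit distance.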
We can actually come up with a better (smaller) threshold
value for searching the vp-tree of each group since we
know the minimum and the maximum lengths of the data
sequences accommodated in each group. The following
theorem presents these better threshold values in searching
for matching sequences in a specific group.
Theorem 5.3 For any given query sequence Q with length
N , the distance between Q and any matching sequence in a
group gi is less than or equal to
N + max(gi ) ? 2N ,
N + N= ? 2N ,
N + max(gi ) ? 2min(gi) ,
if max(gi ) N
if min(gi ) N
if min(gi ) < N < max(gi )
5.3 Experimental Results
In the experiments we have done for testing our indexing
scheme, the data set consists of around 10000 integer
sequences of lengths ranging from 17 to 40. These integer
sequences are obtained by rounding real number sequences
taken from time series of stock data, and simulated stock
data with the use of statistical formulas. We compared our
indexing scheme with the sequential search method in terms
of the number of distance computations required. In Figure
3, we display the results for four different cases where we
varied the order of vp-trees we used in indexing the groups.
The terms vpt2, vpt3, vpt4, vpt5 refer to the cases where
vp-trees of order 2, 3, 4, and 5 are used respectively. We
show the ratio of exact edit distance computations and the
modified edit distance computations for each of these vp-trees. These ratios are obtained by taking the average over
100 queries. In this particular application, with the indexing
scheme only 21-23 percent of the distance computations
are done as compared to using the sequential method. In
terms of the number of distance computations, the sequential
method makes on the average 3153 distance computations
while with the indexing method the average number of
distance computations varies between 658-713. Depending
on the distance distribution among the sequences, the gain
in percentage can be much higher in different applications
for other data domains such as genetic sequences or text
sequences. Another observation is that if higher order vp-trees are used, we end up doing more modified edit distance
computations and fewer exact edit distance computations, due
to the fact that more data sequences are accommodated in
the leaves if the order is high. In Figure 3, the results for
vpt4 do not seem to conform to this trend; however, this
can be considered an exception since the performance
of vp-trees depends very much on the random function
used to pick the vantage points. Making more modified
edit distance computations may be desirable since an exact
edit distance computation is more costly compared to a
modified edit distance computation. On the other hand, it
is not very desirable to end up with shallow high order vp-trees since that would increase the total number of distance
computations.
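The reported percentages can be checked from the average counts quoted above. The sketch below only redoes that arithmetic; the function name percent_of_sequential is ours.

```python
def percent_of_sequential(indexed, sequential=3153):
    """Share of the sequential method's distance computations, in percent."""
    return 100 * indexed / sequential


# Average counts per query quoted in the text: 658 to 713 with the index.
for indexed in (658, 713):
    print(f"{indexed} / 3153 = {percent_of_sequential(indexed):.1f}%")
```

The two values round to 21 and 23 percent, which agrees with the range stated in the text.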
[Figure 3: bar chart of the ratio of distance computations with the indexing scheme versus without the indexing scheme (sequential search), for vpt2, vpt3, vpt4 and vpt5. Series: total, edit, and modified edit distance computations. Vertical axis from 0 to 0.25.]
Figure 3: Efficiency of the indexing scheme vs sequential search
6 Conclusion
We propose a method for similarity matching of sequences
with different lengths. The method uses a modified version
of the edit distance algorithm, which is used for approximate
text matching and is based on dynamic programming. For
numerical sequences, our method also includes a mapping
process for unmatched elements of sequences. We also
provide an indexing mechanism which is used for efficient
filtering of distant (in terms of similarity) sequences in a
similarity match query. Our indexing method avoids false
dismissals when the distance function used to compute the
similarity distance between sequences is metric. This is
not the case for edit distance when it is used for numerical
sequences and the sequence elements are approximately
matched (i.e., if they are within a matching distance of
each other). In this case, as discussed in [BYO97], we can
still make use of the indexing scheme by rounding
the numerical values and matching sequence elements based
on equality after rounding. However, this introduces the
possibility of false dismissals into the filtering process.
Designing an indexing mechanism for efficient filtering
of numerical sequences for similarity matching queries
without any false dismissals is a challenging future research
problem.
References
[AFS93] R. Agrawal, C. Faloutsos and A. Swami, "Efficient
Similarity Search In Sequence Databases", FODO Conference,
Evanston, Illinois, Oct. 1993.
[ALSS95] R. Agrawal, K.I. Lin, H.S. Sawhney and K. Shim, "Fast
Similarity Search in the Presence of Noise, Scaling,
and Translation in Time-Series Databases", Proc. of the
21st VLDB Conf., 1995.
[BKSS90] N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger,
"The R*-tree: An Efficient and Robust Access Method
for Points and Rectangles", Proc. of the ACM SIGMOD
Conf., 1990.
[Bri95] S. Brin, "Near Neighbor Search in Large Metric
Spaces", VLDB Conf., 1995.
[BYO97] T. Bozkaya, N. Yazdani, Z.M. Ozsoyoglu, "Matching
and Indexing Sequences of Different Lengths", Tech.
Report, CES, CWRU, 1997.
[CLR90] T.H. Cormen, C.E. Leiserson and R.L. Rivest,
"Introduction to Algorithms", MIT Press, 1990.
[CR94] M. Crochemore and W. Rytter, "Text Algorithms",
Oxford Univ. Press, 1994.
[Fuk72] K. Fukunaga, "Introduction to Statistical Pattern
Recognition", Academic Press, New York, 1972.
[FRM94] C. Faloutsos, M. Ranganathan and Y. Manolopoulos,
"Fast Subsequence Matching in Time-Series
Databases", ACM SIGMOD Conf., 1994.
[LT94] D. Lopresti, A. Tomkins, "On the Searchability of
Electronic Ink", in IWFHR 94.
[LT95] D. Lopresti, A. Tomkins, "Block Edit Models for
Approximate String Matching", in SSAWSP 95.
[SHB92] J. Snyder, A. Hiltner, E. Baer, "Analysis of the
wedge-shaped damage zone in edge-notched polypropylene",
Jour. of Materials Sci. (27), 1992.
[Uhl91] J.K. Uhlmann, "Satisfying General Proximity/Similarity
Queries with Metric Trees", Information Processing Letters,
v40, p175-179, 1991.
[YOO94] N. Yazdani, M. Ozsoyoglu and G. Ozsoyoglu, "A
Framework For Feature-Based Indexing for Spatial
Databases", Proceeding of 7th Int. Conf. on Statistical
and Scientific Database, 1994.
[YO96] N. Yazdani and M. Ozsoyoglu, "Sequence Matching of
Images", Proceeding of 8th Int. Conf. on Statistical and
Scientific Database, 1996.