Matching and Indexing Sequences of Different Lengths

Tolga Bozkaya, Nasser Yazdani, Meral Özsoyoğlu
Department of Computer Engineering and Science
Case Western Reserve University
Cleveland, OH 44106
{bozkaya, yazdani, ozsoy}@ces.cwru.edu

Abstract

In this paper, we consider the problem of efficient matching and retrieval of sequences of different lengths. Most of the previous research is concentrated on similarity matching and retrieval of sequences of the same length using the Euclidean distance metric. For similarity matching of sequences, we use a modified version of the edit distance function, and consider two sequences matching if a majority of the elements in the sequences match. In the matching process a mapping among non-matching elements is created to check if there are unacceptable deviations among them. This means that two matching sequences should have lengths that are comparable. For efficient retrieval of matching sequences, we propose an indexing scheme which is totally based on lengths and relative distances between sequences. We use vp-trees as the underlying distance-based index structures in our method.

1 Introduction

The problem of matching sequences with respect to a similarity measure is encountered in a variety of applications such as text and information processing, genetics, time series analysis, scientific databases, etc. It is important for these applications to efficiently identify and retrieve data items (sequences) similar to a given query item. The results of these queries can be used for different purposes such as information retrieval, data mining, classification, etc. In this paper, we address the general problem of indexing and matching sequences of different lengths while putting more emphasis on numerical sequences from scientific experiments, since that was our motivating application. Still, our methods are general and can be applied to other data domains with sequences.

In our work, we use a modified version of the edit distance function to compute the distance between two sequences, and to find the correspondence among the elements of the sequences. To answer similarity match queries, we certainly need an efficient index structure. We cannot use conventional index structures, since we do not assume any geometry on the domain of sequences. Instead, we designed an indexing scheme which is based on the lengths of sequences and relative distances between sequences. A distance-based index structure, the vp-tree [Uhl91], is used as the underlying index structure in our method. The indexing scheme is used as a major filtering mechanism to eliminate distant sequences in processing a similarity match query.

The rest of the paper is organized as follows. In section 2 we provide a brief overview of the previous work on the sequence matching problem. In section 3, we give the motivation for our work, and present the definition for matching sequences. Section 4 contains our methods for computing the similarity between sequences, and the matching process. In section 5, we propose a general method for indexing sequences (of different lengths) for similarity match queries with respect to the matching process of section 4. Section 6 concludes.

2 Related Work

To our knowledge, [AFS93] is the first work which proposes a solution for similarity matching of sequences. In [AFS93], it is assumed that all sequences are of the same length, and each sequence is considered as a point in an N-dimensional space. Then, two sequences are considered similar when the Euclidean distance between them is less than a given threshold value. The authors use the R*-tree [BKSS90] as the index structure. Sequences are represented as K-dimensional points using K features for each sequence. The Discrete Fourier Transform (DFT) is used for feature extraction since it preserves the Euclidean distance. Faloutsos et al. extend the method proposed in [AFS93] to locate subsequences that match a query sequence or a subsequence of it [FRM94].
DFT is used for feature extraction, which preserves the Euclidean distance between sequences. The fact that it is a distance-preserving transformation makes DFT attractive for indexing. However, it can be used only for sequences of the same length. Also, it is not very effective for sequences with mostly uncorrelated elements (such as random vectors). Another feature extraction method for sequences is the Eigenvector method [Fuk72]. The Eigenvector method also preserves the Euclidean distance; however, it too works only with sequences of the same length.

Agrawal et al. [ALSS95] give a method to retrieve similar sequences in the presence of noise, scaling and translation in time series. In [LT94], a modified version of the edit distance function is used to find the text matching a hand-written text using pen-stroke data. This approach is similar to ours in the sense that we also use a modified version of the edit distance function to compute similarity between sequences. However, we use different cost functions, and we also create a one-to-one mapping between matching and non-matching elements of sequences, making deletions and insertions (via interpolation) if necessary. A sequence matching method for sequences of different lengths based on dynamic programming is proposed in [YO96]. A modified version of the Longest Common Subsequence (LCS) algorithm [CLR90] was used for actual matching of the elements of two sequences. For filtering, a feature-based indexing mechanism is used, where the length, mean, and variance (first two moments) of each sequence are used as features.

[Footnote: This research is partially supported by the National Science Foundation grant IRI 92-24660, and the National Science Foundation FAW award IRI-90-24152.]

Indexing sequences to efficiently handle similarity matching queries is also an important problem.
In terms of indexing, we see that a good portion of the previous work [AFS93, FRM94] concentrates on matching sequences where the similarity between sequences is assumed to be Euclidean. However, note that for many domains and applications (such as text databases, genetic sequences) the Euclidean distance function cannot be used directly as a similarity metric because the domain is non-numeric and sequences may be of different lengths. In these cases, indexing methods for Euclidean spaces are usually not applicable. Still, there are other distance-based index structures [Uhl91, Bri95] that do not assume any geometry of the application domain, but only depend on the fact that the distance function is metric (see section 5 for the definition). The simple idea behind these structures is to use some reference object(s) to partition the search space with respect to the distances of data objects to the reference object(s). At the time of search, some of these partitions are not searched any further, depending on their relative distances to the reference point(s) and the distance(s) between the query point and the reference point(s). The vp-tree [Uhl91], a distance-based index structure, is explained in more detail in section 5, since it is used as part of our indexing scheme.

3 Motivation and Problem Definition

In an earlier work [YOO94], we proposed a feature-based indexing method for exact as well as similarity matching of damage zone shapes (areas) in polypropylene that result under high stress at low temperatures [SHB92]. Such queries are important to study the suitability of different materials for many engineering applications. The indexing method extracts some features, such as the number of vertices in the polygonal approximation of the shapes and the number of components, and indexes the shapes with respect to these features. Extracted features are represented as a feature vector for each image.

[YO96] introduces a method for matching sequences of damage zone shapes in particular, and images from lab experiments in general. This method also uses the idea of feature extraction and representation of images by their feature vectors, where each feature is represented by a numerical value. Therefore, each image sequence can be represented by an $N \times M$ matrix where $N$ is the number of features for each image and $M$ is the number of images in the sequence. The extracted features can also be the amount of change of some features like area, width, etc., in successive frames. Using such features would be helpful in studying change patterns by identifying sequences with similar patterns via similarity match queries.

For the feature extraction method to be effective, the following conditions must hold. Let $R$ and $S$ be two image sequences and $F_R$ and $F_S$ be their feature vectors. Then,

- If $R$ and $S$ match $\Longrightarrow$ $F_R$ and $F_S$ match.
- If $R$ is similar to $S$ $\Longrightarrow$ $F_R$ is similar to $F_S$.

Using the feature vector representation, the problem of matching image sequences is transformed into the problem of matching sequences of numerical values in an $N$-dimensional space. The number of images is not guaranteed to be the same for different sequences, even for the sequences obtained by repeating one experiment (since sampling rates may be different, and some data elements might be lost in the sampling period). We should be able to compare these sequences, since such comparisons are important to verify the results of previous experiments and identify common patterns in these experiments.

In this paper, we use the definition for matching two sequences of different lengths given in [YO96], which uses the same idea of transforming the sequence matching problem to numeric sequence matching in the way discussed above. Before we generalize and give Definition 3.1 for matching sequences in scientific experiments, we would like to address the following points for motivation.

- The relative times at which the corresponding samples are taken are almost the same in both sequences. This means that the lengths of sequences should be close to each other to be matched. The elements of both sequences are taken from the lifetime of the experiment in a rather uniform manner.
- Two sequences can be considered matching (or similar) if a majority of their elements match.
- In numeric sequences from scientific domains, since the elements are real numbers obtained during the experiment with a limited precision, elements from different sequences should be matched based on proximity. In non-numeric sequences, matching is usually done based on equality.

Definition 3.1 Assume two sequences $S$ and $Q$ have lengths $M$ and $N$ respectively, and $s_{i_1}, s_{i_2}, \ldots, s_{i_K}$ are matching $q_{j_1}, q_{j_2}, \ldots, q_{j_K}$. The sequences $S$ and $Q$ match each other if

1. $\dfrac{\min\{N, M\}}{\max\{N, M\}} \ge \varepsilon$ where $\varepsilon \le 1$. Here, $\varepsilon$ is referred to as the length aspect ratio, and it is preferably close to 1.
2. $\mathrm{distance}(s_{i_k}, q_{j_k}) \le \delta$ for all matching elements, $1 \le k \le K$, where $K$ is the number of matching elements. $\delta$ is referred to as the matching distance.
3. $K \ge \sigma \min\{N, M\}$ where $\sigma \le 1$. $\sigma$ is referred to as the matching coefficient, and it is also preferably close to 1 for higher selectivity.
4. A one-to-one mapping can be found between all unmatched elements (making insertions or deletions if necessary) of the two sequences such that for all mapped elements $s_i$ and $q_j$, the relation $\mathrm{distance}(s_i, q_j) < \gamma$ holds ($\gamma \ge 0$). (Most likely, $\gamma$ will be a function of $\delta$ if $\delta > 0$.)

Note that the distance between two elements $s_i$ and $q_j$ can be defined in a different way for each domain. For numeric sequences, $\mathrm{distance}(s_i, q_j)$ is simply $|s_i - q_j|$. This simple distance function cannot be used for every domain, for example multi-dimensional vector sequences or non-numeric sequences. As mentioned before, in most applications, the elements of non-numeric sequences can be matched based on equality ($\delta = 0$). In that case, the distance between any two elements is defined to be 0 if they are equal, and a positive number if they are not.

The method proposed in [YO96], which is based on the Longest Common Subsequence problem, does not provide any mapping for the unmatched elements and does not check condition 4 of the definition above. Note that mapping unmatched elements is important for some domains (e.g., scientific experiments) where it is not desirable for matching sequences to have large deviations along the unmatched elements.

Most of the sequence matching methods, in an environment with a large set of data sequences, work in two phases. In the first phase, a finite number of data sequences are selected by searching an index structure. These sequences are hypothesized as matching candidates for the given query sequence. In the second phase, all hypothesized sequences are verified for actual matching. Our method also works in two phases. In section 4, we explain our method for matching two sequences. Indexing and filtering will be discussed in section 5.

4 Matching Process

Since our method uses a modified version of the edit distance for approximate text matching, we first briefly review the edit distance and the approximate text matching problem. Next, we discuss our method for similarity matching of numeric sequences of different lengths.

4.1 Edit Distance and Approximate Text Matching

The edit distance between two alphanumeric sequences (referred to as text and pattern) is defined [CR94] as the minimum number of operations that are needed to change a text into a pattern. Three operations, delete, insert and change, are allowed. Assuming the cost of each of these operations is 1, the edit distance is the minimum number of operations needed to obtain a pattern from a text. A dynamic programming solution for this problem is given in [CR94]. Let $D[i, j]$ denote the minimum edit distance between two texts $Q[1 \ldots i]$ and $S[1 \ldots j]$. Then, the minimum distance is defined recursively as follows,

$$D[i,j] = \begin{cases} j & \text{if } i = 0, \\ i & \text{if } j = 0, \\ D[i-1,j-1] & \text{if } i, j > 0 \text{ and } q_i \text{ is equal to } s_j, \\ \min\{D[i-1,j-1]+1,\; D[i-1,j]+1,\; D[i,j-1]+1\} & \text{otherwise.} \end{cases} \tag{1}$$

The values in the last part of this formula correspond to change, delete and insert respectively. Formula (1) also introduces an algorithm to find the edit distance between a text $Q$ and a pattern $S$ with lengths $N$ and $M$ in $O(MN)$ time.

A particular case of the edit distance algorithm gives the longest common subsequence (LCS) of two sequences [CR94]. Let us assume $D[i, j]$ is the minimal number of deletions and insertions, not changes, necessary to transform a text $Q[1 \ldots i]$ into $S[1 \ldots j]$. Evaluation of $D[i, j]$ is equivalent to computation of the length of the LCS between $Q[1 \ldots i]$ and $S[1 \ldots j]$. Indeed, this is a restricted version of the edit distance where changes are not allowed. However, we will refer to it as the edit distance in this paper. Assume $C[i, j]$ is the length of the LCS of $Q[1 \ldots i]$ and $S[1 \ldots j]$. The following lemma shows the relationship between $C[i, j]$ and $D[i, j]$ for two sequences $Q$ and $S$ with lengths $N$ and $M$. Due to space limitations, for the proof of the following lemma and the other proofs, please refer to [BYO97].

Lemma 4.1 [CR94] $2C[Q, S] = M + N - D[Q, S]$ and $2C[i, j] = i + j - D[i, j]$ for $0 \le i \le N$ and $0 \le j \le M$.

This is an interesting result. It indicates that finding the restricted edit distance is dual to finding the LCS; therefore, any solution to it automatically gives the LCS of $Q$ and $S$.

4.2 Edit Distance and Sequence Matching

We now explain how the edit distance can be applied to find matching sequences of different lengths.

1. Instead of checking equality between two elements $q_i$ and $s_j$ from two sequences $Q$ and $S$ (respectively), we check if the elements are within a matching distance $\delta$ from each other.
In other words, it is checked whether the relation distance(sj ; qj ) holds. In computing the values of D[i; j ]s in Formula 3, we give the preference from left to right. Therefore, whenever two or three expressions give the same value for the minimum, we choose change or insertion first. Deletion will be the last choice. Nevertheless, in order to apply Formula 3 to the sequence matching problem, the original values are kept in case of changes. Applying Formula 3 to the sequences of Example 4.1 gives the alignment shown in Figure 2. The dashed boxes show the mapping among unmatched elements. As it is shown in Figure 2, it gives a better mapping and a better alignment than the one obtained using Formula 2. 2. Change is not allowed in computing the edit distance. 3. For numeric sequences, in insertion, the new elements are calculated by interpolation instead of just inserting elements from other sequences. For instance, the value qi +qi+1 would be inserted between q and q . i i+1 2 Formula 2 below calculates the edit distance between two sequences Q and S . 8> < D[i; j ] = > : j If i=0; i If j=0; D[i-1,j-1] If i; j > 0 and qi matches sj ; min D[i-1,j]+1, D[i,j-1]+1 Otherwise ; (2) f g The mapping between matched and unmatched elements can be found easily from this formula. Example 4.1 Consider two sequences Q =<2.2, 3.9, 2.9, 1.9, 4.5, 3.2> and S =<3.2, 4.2, 2.1, 3.3, 4.1, 4.3, 3.1> and let us assume the matching distance as = 0:5 by applying [BYO97] formula 2 to sequences Q and S which is shown in Figure 1. As Figure 1 illustrates, some elements of Q are deleted and also some new elements are inserted by interpolation. The matching elements from the two sequences are shown by vertical lines. The method extends and projects Q in such a way that both sequences have the same length in the end. 
Besides a bad alignment, S 3.2 4.2 2.9 2.4 2.1 3.3 4.1 1.9 2.8 3.7 4.3 S 3.2 4.2 2.1 3.3 4.1 4.3 3.1 Q 2.2 3.9 3.4 2.9 1.9 4.5 3.2 The new inserted elements Figure 2: An alignment obtained by applying Formula 3. The running time for finding the edit distance between two sequences Q and S with applying Formulas 2 and 3 is O(MN ) where N and M are the lengths of Q and S respectively. However, since the definition for matching of two sequences (Definition 3.1) puts a restriction on the number of non-matching elements, the formulas, and consequently the algorithm, can be modified in order to make it more efficient. The following lemma states this fact. 3.1 3.2 Lemma 4.3 Assume Q and S with lengths N and M are two sequences which match each other with a matching coefficient , and D[S; Q] denotes the edit distance between S and Q. Then, D[S; Q] M + N ? 2minfM; N g. there is a subtle flaw in this method. It does not use the unmatched elements of the query sequence. Hence, formula 2 is modified as follows: Lemma 4.3 bounds the value of edit distance between two matching sequences and this can be used to expedite the matching process by giving a tool to exit the algorithm when two sequences do not match. It also gives a powerful tool to limit the range of operations in finding the edit distance. Let us assume li and ri are two lines which identify the boundary of the range and are defined as follows, Q 2.2 3.9 Deleted elements 4.5 New inserted elements Figure 1: An alignment for sequences Q and S . 8 >> < D[i; j ] = > >: j If i=0; i If j=0; D[i-1,j-1] If i; j > 0 and qi matches sj ; min D[i-1,j-1]+2, Otherwise ; D[i-1,j]+1,D[i,j-1]+1 (3) f li = maxf0; i ? N + minfM; N gg g ri = minfM; i + M ? minfM; N gg This means changes are allowed, however, weight 2 is given to changes, while weight 1 is given to each deletion and insertion. The following lemma shows that Formula 3 finds the same edit distance as Formula 2 does. 
Then, the recursive formula to find the edit distance of two matching sequences Q and S , and, consequently, the correspondence between their elements, can be found using the formula below. The formula is defined for D[i; j ] in the Lemma 4.2 Formulas (3) and (2) are equivalent in the sense that they compute the same edit distance. 4 range li j ri . 8 >> >> >> >< D[i; j ] = > >> >> >> : If i=0 and 0 If j=0 and j i jr set of sequences as matching candidates for a given query sequence. Here, we propose an indexing scheme for efficient processing of similarity search queries when restricted edit distance between the sequences is used as the similarity measure. We will present this indexing scheme as a general solution for the problem of matching sequences of different lengths. The matching of elements are based on equality, that is, the distance between any two sequence elements will be zero if they are equal, and a positive number if they are not ( = 0). Note that, edit distance is not metric if elements of sequences are matched based on proximity with a nonzero (See Example 5.1). A metric distance function is required for an index structure to be used. Also, note the difference between the distance between sequence elements and the distance between sequences. Below, the distance function refers to the distance function for the sequences if not specified otherwise. i i N ? bminfM; N gc; If i > 0; li j ri and qi matches sj ; If i > 0; li < j < ri and min D[i-1,j-1]+2, D[i,j-1]+1,D[i-1,j]+1 qi does not match sj ; min D[i-1,j-1]+2, If i > 0; j = li > 0; D[i-1,j]+1 min D[i-1,j-1]+2, Ifi > 0; j = ri > 0; D[i,j-1]+1 D[i-1,j-1] f f f g g g (4) This formula computes the modified restricted edit distance function. However, for simplicity, we will call it modified edit distance. Theorem 4.1 Assume that we have two sequences Q and with lengths N and M respectively. Without loss of generality, let M N , and N M so that their lengths satisfy the first condition of definition 3.1. 
Let D[Q; S ] and ED[Q; S ] denote the modified edit distance (Formula 4), and the edit distance (Formula 3) respectively between Q and S . S A distance function d(x; y ) is metric if it satisfies the following simple conditions: (i) d(x; y ) = d(y; x) (ii) 0 d(x; y ) for x 6= y (iii) d(x; x) = 0 (iv) d(x; y ) d(x; z ) + d(y; z ) (triangle inequality) 1. If the sequences S and Q match each other with matching coefficient , then D[Q; S ] = ED[Q; S ]. The edit distance function, when elements are matched with respect to their equality ( = 0), is a metric distance function. The following example shows that edit distance is not metric if elements are matched based on proximity for a non-zero . 2. If the sequences S and Q do not match each other, then D[Q; S ] ED[Q; S ] Before analyzing the running time of the algorithm for the edit distance, we mention the following lemma from [YO96], where is the length aspect ratio (LAR) for sequences Q and S. Example 5.1 Consider three sequences of length 2 x :< 0:5; 0:5 >, y :< 0:55; 0:55 >, z :< 0:6; 0:6 > and :0.09. Note that although y matches with both z and x, x and z do not match at all, causing the violation of the triangle inequality (d(x; z ) > d(x; y ) + d(y; z )). Lemma 4.4 Assume two sequences, Q and S with lengths N and M , respectively, are matching. Then, N M N . Theorem 4.2 The edit distance of two sequences Q and S with lengths N and M , length aspect ratio and matching coefficient can be found in O(MN [1 ? 2 ]) in the worst case. Furthermore, the corresponding element of S and Q can be constructed from the table in order to find the edit distance in O(M + N ). We have previously proposed [YO96] a technique for indexing sequences based on some of their features, namely, lengths and the first two moments (mean and variance). Each data sequence is represented by a point in the feature space. Any multi-dimensional point access method can be used as the index structure. 
Similarity match queries are transformed into range queries in the feature space. However, this method does not guarantee that the search process does not miss any probable matching sequence in similarity matching process. This problem (referred to as false dismissal problem) originates from the fact that estimated values for mean and variance may have errors. Therefore, the search bounds which are computed based on these parameters may contain errors that may cause false dismissal of some matching sequences. Asuming the sequence elements are matched based on equality, we present here an indexing method which guarantees that there won’t be any false dismissals. A framework and an analysis for indexing for similarity search when se- The values of and determine the running time of the algorithm. For small values of and , the running time approach to O(MN ). If both and approach to 1, the running time approaches O(N ) which is expected since the problem is changed into the whole matching of two sequences with the same length. 5 Indexing and Filtering for Similarity Search Queries on Sequences Applying a sequence matching algorithm to a large set of sequences in a sequential manner is very time consuming. The main challenge is to efficiently hypothesize a small 5 = = 100 = l(c11 ), c1 ; ::; c11 are put into g1 . c12 ; ::; c24 go into g2 since d(l(c12 ) = 101)=e = 113 = l(c24 ). The last group g3 takes c25 ; ::; c35 . Note that, because of the way we do the grouping, all sequences in a given group gj have comparable lengths with respect to each other as any two sequences in gj have the length aspect ratio at most . Each of these groups will be indexed separately by a vptree. Our objective in forming the groups g1 ; ::; gn was to limit the number of index structures visited during a similarity search query. The following theorem formally states the upper bound for the number of groups to be searched for a similarity search query. Its proof follows from Lemma 4.4. 
90 quence elements are matched approximately can be found in [BYO97]. Our indexing structure is constructed as follows. First, we group the data sequences with respect to their lengths. Each group accommodates sequences that are lengthwise close to each other. Second, the data sequences in each group are indexed by a vp-tree [Uhl91]. Having constructed the index structure, a similarity search proceeds in two phases. First, we identify the groups that accommodate data sequences that may match the query sequence. This is done based on the length of the query sequence. Then, in the second phase, the vp-trees for these identified groups are searched to filter out the data sequences that are distant from the query sequence, and to find the ones that are within matching distance to the query sequence. In the following subsections, we explain how grouping and indexing is done. Theorem 5.1 There will be at most 3 groups visited for a similarity search query. 5.1 Grouping Data Sequences by Length For a similarity search query, the simplest way to filter out many distant sequences is to discard all the data sequences whose lengths do not satisfy the first condition of Definition 1 with respect to a given LAR . For this purpose, we classify the sequences into sets c1 ; c2 ; ::; cm with respect to their lengths. Then, all sequences in any given class cj have the same length which we would refer to as l(cj ). Next, we can index the sequences in each class separately, ending up with an index structure for each class. We do not favor this approach due to several reasons. First, there will be too many different indices and many of them will be visited during a similarity search query. Consider a query sequence of length N . We would have to search through the indices of all the classes cj where (N l(cj ) N=) (Lemma 4.4). 
Although the main objective is to minimize the number of distance computations, searching through many shallow index structures would increase the I/O considerably. We propose to group these classes with respect to the length of the sequences they have, and the LAR . By grouping a number of these classes and building the indices on these groups we hope to increase the efficiency. The classes c1 ; :::; cm are further grouped into sets g1; ::; gn (n << m) using the following procedure. We made the implicit assumption that all the queries are specified with respect to the same length aspect ratio, , the value which we used for grouping. Note that it is also possible to specify queries using different values for . This certainly affects the number of groups searched in the query where we may not be able to guarantee the fact that at most 3 groups will be visited. The following theorem elaborates more on these bounds. Theorem 5.2 Let be the LAR for the application which the grouping is based and 0 be the LAR specified for the query. 1. if 0 1=2 , then there will be at most 2 groups visited for the query. 2. if i+1 0 i then the maximum number of groups to be visited is 2i + 3. 5.2 Indexing the Groups by VP-trees We use vp-trees (vantage point trees) [Uhl91] as the index structure for the groups of sequences. The vantage point tree is a balanced distance-based index structure that could be useful in any metric data space. At each level of the tree, a vantage point is picked among the data points that will be indexed below that level. After that, the distance between that vantage point and the other points are computed, and the points are sorted with respect to their distances from the vantage point. This sorted list is divided into m groups of equal cardinality where m is the order of the tree. Each of these subgroup of points are indexed at the next level in the same way. 
Grouping classes by lengths:
input: classes $c_1, \ldots, c_m$
output: groups $g_1, \ldots, g_n$
1. Let $i = 1$; $j = 1$.
2. $c_i$ will be in $g_j$. Let $c_k$ be the class such that $l(c_k)$ is the smallest (among other classes) that is greater than or equal to $l(c_i)/\alpha$ (i.e., $l(c_k) \ge \lceil l(c_i)/\alpha \rceil$). For all $t$, $k \ge t > i$, $c_t$ will also be in $g_j$. We will refer to the $l(c_i)$ and $l(c_k)$ values as $\min(g_j)$ and $\max(g_j)$ respectively.
3. $i = k + 1$; $j = j + 1$.
4. If $i \le m$, go to step 2.

Example 5.2 Assume that we have $c_1, \ldots, c_{35}$ where $l(c_i) = 89 + i$ for $i = 1, \ldots, 35$. Let $\alpha$ be equal to 0.9. Since $l(c_1) =$

The vp-tree does not make any assumption about the geometry of the data space; it only assumes that the distance function used is a metric. That is why we made the assumption that sequence elements can be matched only if they are equal. The similarity search algorithm for vp-trees is also simple, and is based only on filtering out branches that index distant points using the triangle inequality.

The search proceeds as follows. Assume we have a query item Q and we want all data items that are within distance r of Q. We start from the root of the vp-tree. The vantage point of the root node ($vp_{root}$) partitions the data space into spherical cuts, where each branch below that node (root) accommodates the points that fall into one of these spherical cuts. These branches are searched with respect to the inner and outer radii of the spherical cut they correspond to. So, if a branch has the inner radius $R_i$ and the outer radius $R_o$, it is searched only if:

$dist(vp_{root}, Q) + r \ge R_i$ and $dist(vp_{root}, Q) - r \le R_o$

otherwise the branch is not searched. The search proceeds the same way, continuing from the roots of the branches that qualify.

The edit distance function is used when constructing the vp-trees since it is a metric distance function (with the assumption we made). At search time, the distances between the query point and the vantage points are calculated using the edit distance function to direct the search properly. However, the distances between the query point and the data points (sequences) in the leaves (the points that are not vantage points) are calculated using the modified edit distance function. Note that the modified edit distance function overestimates the actual edit distance, as was shown in Theorem 4.1. Part 1 of Theorem 4.1 guarantees that we do not dismiss any of the matching sequences, since the modified edit distance function computes the exact distance for such sequences. The overestimation of the actual distances for the other sequences does not hurt, as we would dismiss them in the end anyway since they do not match. In conclusion, using the modified edit distance function in the verification phase does not violate our correctness guarantee (i.e., no false dismissals) while providing a faster way of identifying matching sequences. Note that we cannot use the modified edit distance for the construction of the vp-trees and throughout the full search process, since it is not metric (it may overestimate). We only use the modified edit distance function to check the data points in the leaves to see if they match the query sequence. Also note that the vantage points that match the query point with respect to the specified similarity measure are reported early in the search, since the exact edit distance between those vantage points and the query point has already been calculated.

The vp-tree is a static index structure, i.e., it is built on a static set of data items. Insertions are possible, but only at the expense of violating the balance of the vp-tree. Therefore, we will assume that we are given a fixed set of data sequences that will not change (or will change very little) throughout the application. This means that once the groups $g_1, \ldots, g_n$ are formed, there will not be any insertions to or deletions from them.

The process to answer a similarity match query is rather simple. First, we find all the groups that may have data sequences which could possibly match the query sequence. This is done strictly based on the value of $\alpha$ (LAR) and the length of Q. Next, we search the vp-trees for these groups to find data sequences that may match Q. When searching the vp-trees, the search tree is pruned with respect to a threshold value. The search looks for data sequences whose (edit) distances from the query sequence are less than or equal to that threshold. The following lemma states the value of that threshold for a given query sequence.

Lemma 5.1 For any given query sequence Q with length N, the distance between Q and any matching sequence is less than or equal to $N(1 + 1/\alpha - 2\alpha)$.

Note that the value given above for the threshold is an upper bound which is simply used to direct the search. A data sequence S can actually be matched with a query sequence Q if and only if the edit distance between them satisfies the condition given in Lemma 4.3. We can actually come up with a better (smaller) threshold value for searching the vp-tree of each group, since we know the minimum and the maximum lengths of the data sequences accommodated in each group. The following theorem presents these better threshold values for searching for matching sequences in a specific group.

Theorem 5.3 For any given query sequence Q with length N, the distance between Q and any matching sequence in a group $g_i$ is less than or equal to

\[
\begin{cases}
N + \max(g_i) - 2\alpha N, & \text{if } \max(g_i) \le N \\
N + N/\alpha - 2\alpha N, & \text{if } \min(g_i) \ge N \\
N + \max(g_i) - 2\alpha \min(g_i), & \text{if } \min(g_i) < N < \max(g_i)
\end{cases}
\]

5.3 Experimental Results

In the experiments we performed to test our indexing scheme, the data set consists of around 10000 integer sequences with lengths ranging from 17 to 40. These integer sequences were obtained by rounding real-number sequences taken from time series of stock data and from stock data simulated with the use of statistical formulas. We compared our indexing scheme with the sequential search method in terms of the number of distance computations required. In Figure 3, we display the results for four different cases in which we varied the order of the vp-trees used to index the groups. The terms vpt2, vpt3, vpt4, and vpt5 refer to the cases where vp-trees of order 2, 3, 4, and 5 are used, respectively. We show the ratio of exact edit distance computations and of modified edit distance computations for each of these vp-trees. These ratios are obtained by taking the average over 100 queries. In this particular application, the indexing scheme performs only 21-23 percent of the distance computations performed by the sequential method: the sequential method makes 3153 distance computations on average, while the average for the indexing method varies between 658 and 713. Depending on the distance distribution among the sequences, the gain in percentage can be much higher in different applications for other data domains such as genetic sequences or text sequences. Another observation is that if higher-order vp-trees are used, we end up doing more modified edit distance computations and fewer exact edit distance computations, because more data sequences are accommodated in the leaves when the order is high.
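The measure used in this comparison, the number of distance computations, can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the implementation used in the experiments: it assumes an insert/delete edit distance with equality-only element matching, a binary vp-tree with a randomly chosen vantage point and a median split, a hypothetical leaf capacity `LEAF_SIZE`, and synthetic random digit sequences. It applies the branch-pruning rule of the vp-tree search described earlier and counts calls to the distance function for a sequential scan and for an indexed range search. For simplicity every comparison uses the exact distance; the modified edit distance of Section 4 is not reproduced.

```python
import random

def edit_distance(a, b):
    """Insert/delete edit distance (assumed cost model): |a| + |b| - 2 * LCS(a, b)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]

class CountingDistance:
    """Wraps a distance function and counts how many times it is called."""
    def __init__(self, fn):
        self.fn, self.calls = fn, 0
    def __call__(self, a, b):
        self.calls += 1
        return self.fn(a, b)

LEAF_SIZE = 4  # hypothetical leaf capacity, chosen only for illustration

def build(items, dist, rng):
    """Binary vp-tree: random vantage point, median split on distance to it."""
    if len(items) <= LEAF_SIZE:
        return ("leaf", list(items))
    items = list(items)
    vp = items.pop(rng.randrange(len(items)))
    scored = sorted(((dist(vp, s), s) for s in items), key=lambda p: p[0])
    mid = len(scored) // 2
    branches = []
    for part in (scored[:mid], scored[mid:]):
        # Each branch stores the inner and outer radii of its spherical cut.
        branches.append((part[0][0], part[-1][0],
                         build([s for _, s in part], dist, rng)))
    return ("node", vp, branches)

def search(tree, q, r, dist, out):
    """Collect into `out` every indexed item within distance r of q."""
    if tree[0] == "leaf":
        out.extend(s for s in tree[1] if dist(q, s) <= r)
        return
    _, vp, branches = tree
    d = dist(vp, q)
    if d <= r:
        out.append(vp)
    for r_in, r_out, sub in branches:
        # Triangle-inequality filter: skip branches that cannot hold a result.
        if d + r >= r_in and d - r <= r_out:
            search(sub, q, r, dist, out)

rng = random.Random(0)
# Synthetic data: 300 random digit sequences with lengths 17..40.
data = [tuple(rng.randint(0, 9) for _ in range(rng.randint(17, 40)))
        for _ in range(300)]
query, radius = data[0], 10

seq_dist = CountingDistance(edit_distance)
found_seq = [s for s in data if seq_dist(query, s) <= radius]

tree = build(data, edit_distance, rng)  # construction cost is not counted
idx_dist = CountingDistance(edit_distance)
found_idx = []
search(tree, query, radius, idx_dist, found_idx)

print("sequential:", seq_dist.calls, "indexed:", idx_dist.calls)
```

The two result lists contain the same sequences, and the indexed search never computes more distances than the sequential scan because each indexed item is compared with the query at most once. How much smaller the indexed count is depends on the data, the radius, and the random vantage points, so this sketch is not expected to reproduce the figures reported in this section.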
In Figure 3, the results for vpt4 do not seem to conform to the trend toward more modified and fewer exact edit distance computations at higher orders; however, this can be considered an exception, since the performance of vp-trees depends very much on the random function used to pick the vantage points. Making more modified edit distance computations may be desirable, since an exact edit distance computation is more costly than a modified edit distance computation. On the other hand, it is not very desirable to end up with shallow high-order vp-trees, since that would increase the total number of distance computations.

Figure 3: Efficiency of the indexing scheme vs sequential search. The plot shows the ratio of distance computations with the indexing scheme versus without the indexing scheme (sequential search) for vpt2, vpt3, vpt4, and vpt5, with series for total distance computations, edit distance computations, and modified edit distance computations.

6 Conclusion

We propose a method for similarity matching of sequences with different lengths. The method uses a modified version of the edit distance algorithm, which is used for approximate text matching and is based on dynamic programming. For numerical sequences, our method also includes the mapping process of unmatched elements of sequences. We also provide an indexing mechanism which is used for efficient filtering of distant (in terms of similarity) sequences in a similarity match query. Our indexing method avoids false dismissals where the distance function used to compute the similarity distance between sequences is metric. This is not the case for edit distance when it is used for numerical sequences and the sequence elements are approximately matched (i.e., if they are within a matching distance of each other). In this case, as discussed in [BYO97], we can still make use of the indexing scheme by rounding the numerical values and matching sequence elements based on equality after rounding. However, this introduces the possibility of false dismissals into the filtering process. Designing an indexing mechanism for efficient filtering of numerical sequences for similarity matching queries without any false dismissals is a challenging future research problem.

References

[AFS93] R. Agrawal, C. Faloutsos and A. Swami, "Efficient Similarity Search In Sequence Databases", FODO Conference, Evanston, Illinois, Oct. 1993.
[ALSS95] R. Agrawal, K.I. Lin, H.S. Sawhney and K. Shim, "Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases", Proc. of the 21st VLDB Conf., 1995.
[BKSS90] N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", Proc. of the ACM SIGMOD Conf., 1990.
[Bri95] S. Brin, "Near Neighbor Search in Large Metric Spaces", VLDB Conf., 1995.
[BYO97] T. Bozkaya, N. Yazdani, Z.M. Ozsoyoglu, "Matching and Indexing Sequences of Different Lengths", Tech. Report, CES, CWRU, 1997.
[CLR90] T.H. Cormen, C.E. Leiserson and R.L. Rivest, "Introduction to Algorithms", MIT Press, 1990.
[CR94] M. Crochemore and W. Rytter, "Text Algorithms", Oxford Univ. Press, 1994.
[Fuk72] K. Fukunaga, "Introduction to Statistical Pattern Recognition", Academic Press, New York, 1972.
[FRM94] C. Faloutsos, M. Ranganathan and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases", ACM SIGMOD Conf., 1994.
[LT94] D. Lopresti, A. Tomkins, "On the Searchability of Electronic Ink", in IWFHR 94.
[LT95] D. Lopresti, A. Tomkins, "Block Edit Models for Approximate String Matching", in SSAWSP 95.
[SHB92] J. Snyder, A. Hiltner, E. Baer, "Analysis of the wedge-shaped damage zone in edge-notched polypropylene", Jour. of Materials Sci. (27), 1992.
[Uhl91] J.K. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees", Information Processing Letters, v40, p175-179, 1991.
[YOO94] N. Yazdani, M. Ozsoyoglu and G. Ozsoyoglu, "A Framework For Feature-Based Indexing for Spatial Databases", Proceeding of 7'th Int. Conf. on Statistical and Scientific Database, 1994.
[YO96] N. Yazdani and M. Ozsoyoglu, "Sequence Matching of Images", Proceeding of 8'th Int. Conf. on Statistical and Scientific Database, 1996.