Similarity Searches in Sequence Databases
Sang-Hyun Park
KMeD Research Group
Computer Science Department
University of California, Los Angeles
Contents






Introduction
Whole Sequence Searches
Subsequence Searches
Segment-Based Subsequence Searches
Multi-Dimensional Subsequence Searches
Conclusion
What is a Sequence?

A sequence is an ordered list of elements.
S = ⟨14.3, 18.2, 22.0, 22.4, 19.5, 17.1, 15.8, 15.1⟩
[Figure: temperature (°C) versus time, sampled every two hours from 8AM to 10PM]

Sequences are a principal data format in many
applications.
What is Similarity Search?


Similarity search finds sequences whose changing
patterns are similar to that of a query sequence.
Example




Detect stocks with similar growth patterns
Find persons with similar voice clips
Find patients whose brain tumors have similar evolution
patterns
Similarity search helps in clustering, data mining, and
rule discovery.
Classification of Similarity Search

Similarity Searches are classified as:



Whole sequence searches
Subsequence searches
Example




S = ⟨1, 2, 3⟩
Subsequences(S) = { ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩ }
In whole sequence searches,
the sequence S itself is compared with a query sequence Q.
In subsequence searches,
every possible subsequence of S can be compared with a query
sequence q.
Similarity Measure

Lp Distance Metric
$$L_p(S, Q) = \left( \sum_{i=1}^{n} |S[i] - Q[i]|^p \right)^{1/p}$$
L1 : Manhattan distance or city-block distance
L2 : Euclidean distance
L∞ : maximum distance over all element pairs
requires that two sequences should have the same length
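As a minimal sketch of the Lp metric above (the function name `lp_distance` and the sample values are illustrative assumptions, not taken from the slides), the three named cases can be computed as follows:

```python
import math

def lp_distance(s, q, p):
    """L_p distance between two equal-length numeric sequences."""
    if len(s) != len(q):
        raise ValueError("L_p distance requires sequences of the same length")
    diffs = [abs(a - b) for a, b in zip(s, q)]
    if math.isinf(p):
        # L-infinity: the largest distance over all element pairs
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Illustrative values (not from the slides)
s = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 3.0]
l1 = lp_distance(s, q, 1)           # Manhattan distance
l2 = lp_distance(s, q, 2)           # Euclidean distance
linf = lp_distance(s, q, math.inf)  # maximum element-pair distance
```

The explicit length check mirrors the restriction stated above: the metric is undefined for sequences of different lengths, which is what motivates the time warping distance.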
Similarity Measure (2)

Time Warping Distance




Originally introduced in the area of speech recognition
Allows sequences to be stretched along the time axis
⟨3,5,6⟩ → ⟨3,3,5,6⟩ → ⟨3,3,3,5,6⟩ → ⟨3,3,3,5,5,6⟩ → …
Each element of a sequence can be mapped to one or more
neighboring elements of another sequence.
Useful in applications where sequences may be of different
lengths or different sampling rates
Q = ⟨10, 15, 20⟩
S = ⟨10, 15, 16, 20⟩
Similarity Measure (3)

Time Warping Distance (2)


Defined recursively
Computed by dynamic programming technique, O(|S||Q|)
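$$D_{TW}(S, Q) = D_{BASE}(S[1], Q[1]) + \min \{\, D_{TW}(S, Q[2{:}-]),\ D_{TW}(S[2{:}-], Q),\ D_{TW}(S[2{:}-], Q[2{:}-]) \,\}$$
$$D_{BASE}(S[1], Q[1]) = |S[1] - Q[1]|^P$$
S[1] is the first element of S, and S[2:-] is the rest of S after its first element (likewise for Q).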
Similarity Measure (4)

Time Warping Distance (3)


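S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩
When using L1 as DBASE, DTW (S, Q) = 12
Cumulative distance table (rows: elements of S, first element at the bottom; columns: elements of Q):
S \ Q    3    4    3
  6     16   11   12
  6     13    9   10
  7     10    7    8
  6      6    4    5
  5      3    2    3
  4      1    1    2
Each cell holds | S[i] – Q[j] | + min (V1, V2, V3), where V1, V2, and V3 are the three adjacent cells already computed (below, left, and diagonally below-left).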
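The table-filling rule above can be sketched in a few lines. This is a minimal illustration, assuming |a – b|^p as the base distance; the function name `dtw` and its parameters are hypothetical. It fills the table from the first elements onward, which gives the same result as the recursive definition.

```python
def dtw(s, q, p=1):
    """Time warping distance by dynamic programming, O(|s||q|)."""
    inf = float("inf")
    n, m = len(s), len(q)
    # table[i][j]: distance between the first i elements of s and the first j of q
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            base = abs(s[i - 1] - q[j - 1]) ** p
            table[i][j] = base + min(table[i - 1][j],      # cell below
                                     table[i][j - 1],      # cell to the left
                                     table[i - 1][j - 1])  # diagonal cell
    return table[n][m]

S = [4, 5, 6, 7, 6, 6]
Q = [3, 4, 3]
result = dtw(S, Q)  # the slide's example, with L1 as the base distance
```

For the example sequences, `result` equals the value in the top-right cell of the table.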
False Alarm and False Dismissal

False Alarm



Candidates not similar to a query.
Minimize false alarms for efficiency
False Dismissal


Similar sequences not retrieved by index search
Avoid false dismissals for correctness
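[Figure: two diagrams of the data sequences, the candidates, and the similar sequences. Candidates outside the similar set are false alarms; similar sequences outside the candidate set are false dismissals.]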
Whole Sequence Searches
Problem Definition

Input
Set of data sequences {S}
Query sequence Q
Distance tolerance ε
Output
Set of data sequences whose distances to Q are within ε
Similarity Measure



Time warping distance function, DTW
L∞ as a distance function for each element pair
If the distance of every element pair is within ε, then
DTW(S,Q) ≤ ε.
Previous Approaches

Naïve Scan [Ber96]




Read every data sequence from database
Apply dynamic programming technique
For m data sequences with average length L, O(mL|Q|)
FastMap-Based Technique [Yi98]




Use FastMap technique for feature extraction
Map features into multi-dimensional points
Use Euclidean distance in index space for filtering
Could not guarantee “no false dismissal”
Previous Approaches (2)

LB-Scan [Yi98]





Read every data sequence from database
Apply the lower-bound distance function Dlb which satisfies
the following lower-bound theorem:
Dlb (S,Q) > ε ⇒ DTW (S,Q) > ε
Faster than the original time warping distance function
(O(|S|+|Q|) vs. O(|S||Q|))
Guarantee no false dismissal
Based on sequential scanning
Proposed Approach

Goal



No false dismissal
High query processing performance
Sketch



Extract a time-warping invariant feature vector
Build a multi-dimensional index
Use a lower-bound distance function for filtering
Proposed Approach (2)

Feature Extraction



F(S) = ⟨ First(S), Last(S), Max(S), Min(S) ⟩
F(S) is invariant to time warping transformation.
Distance Function for Feature Vectors
$$D_{FT}(F(S), F(Q)) = \max \{\, |\mathrm{First}(S) - \mathrm{First}(Q)|,\ |\mathrm{Last}(S) - \mathrm{Last}(Q)|,\ |\mathrm{Max}(S) - \mathrm{Max}(Q)|,\ |\mathrm{Min}(S) - \mathrm{Min}(Q)| \,\}$$
Proposed Approach (3)

Distance Function for Feature Vectors (2)



Satisfies lower-bounding theorem:
DFT (F(S),F(Q)) > ε ⇒ DTW (S,Q) > ε
More accurate than Dlb proposed in LB-Scan
Faster than Dlb (O(1) vs. O(|S|+|Q|))
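The feature vector and its distance function can be sketched as below. The names `features`, `d_ft`, and `dtw_linf` are hypothetical. The sketch assumes that, with L∞ as the element-pair distance, the time warping distance is the largest element-pair distance along the best warping path; the slides state the choice of L∞ but do not spell out this variant.

```python
def features(seq):
    """F(S) = (First, Last, Max, Min), unchanged when elements are repeated."""
    return (seq[0], seq[-1], max(seq), min(seq))

def d_ft(fs, fq):
    """Largest absolute difference over the four features, O(1)."""
    return max(abs(a - b) for a, b in zip(fs, fq))

def dtw_linf(s, q):
    """Assumed L-infinity time warping distance (max along the best path)."""
    inf = float("inf")
    n, m = len(s), len(q)
    t = [[inf] * (m + 1) for _ in range(n + 1)]
    t[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            base = abs(s[i - 1] - q[j - 1])
            t[i][j] = max(base, min(t[i - 1][j], t[i][j - 1], t[i - 1][j - 1]))
    return t[n][m]

S = [10, 15, 16, 20]
Q = [10, 15, 20]
lower = d_ft(features(S), features(Q))  # cheap bound from four numbers
exact = dtw_linf(S, Q)                  # full O(|S||Q|) computation
```

Here `lower` never exceeds `exact`, so a sequence whose feature distance is above the tolerance can be discarded without running the dynamic program.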
Proposed Approach (4)

Indexing



Build a multi-dimensional index from a set of feature vectors
Index entry ⟨ First(S), Last(S), Max(S), Min(S), Identifier(S) ⟩
Query Processing



Extract a feature vector F(Q)
Perform range queries in index space to find data points
included in the following query rectangle:
⟨ [First(Q) – ε, First(Q) + ε], [Last(Q) – ε, Last(Q) + ε],
[Max(Q) – ε, Max(Q) + ε], [Min(Q) – ε, Min(Q) + ε] ⟩
Perform post-processing to discard false alarms
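The three query-processing steps can be sketched as a filter-and-refine loop. This is only an illustration under stated assumptions: a linear scan over the feature vectors stands in for the multi-dimensional index, the L∞ time warping variant is assumed to take the maximum along the best warping path, and all names and data values are hypothetical.

```python
def features(seq):
    return (seq[0], seq[-1], max(seq), min(seq))

def dtw_linf(s, q):
    """Assumed L-infinity time warping distance (max along the best path)."""
    inf = float("inf")
    t = [[inf] * (len(q) + 1) for _ in range(len(s) + 1)]
    t[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(q) + 1):
            base = abs(s[i - 1] - q[j - 1])
            t[i][j] = max(base, min(t[i - 1][j], t[i][j - 1], t[i - 1][j - 1]))
    return t[len(s)][len(q)]

def range_query(index, fq, eps):
    """Identifiers whose feature point lies inside the query rectangle.
    A linear scan here; a multi-dimensional index would avoid visiting every entry."""
    return [sid for sid, fs in index.items()
            if all(abs(a - b) <= eps for a, b in zip(fs, fq))]

# Hypothetical database of four sequences
database = {
    "s1": [10, 15, 16, 20],
    "s2": [10, 12, 30, 20],
    "s3": [3, 5, 6],
    "s4": [10, 18, 12, 20],
}
index = {sid: features(seq) for sid, seq in database.items()}

Q, eps = [10, 15, 20], 1
candidates = range_query(index, features(Q), eps)
# Post-processing: compute the real distance and drop false alarms
answers = [sid for sid in candidates if dtw_linf(database[sid], Q) <= eps]
```

In this example "s4" survives the filter because its four features match those of Q, but its actual distance exceeds the tolerance, so post-processing removes it as a false alarm.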
Performance Evaluation

Implementation



Implemented with C++ on UNIX operating system
R-tree is used as a multi-dimensional index.
Experimental Setup



S&P 500 stock data set (m=545, L=232)
Random walk synthetic data set
SunSparc Ultra-5
Performance Evaluation (2)
Filtering Ratio

Better than LB-Scan
[Figure: filtering ratio (%) versus distance tolerance (2, 4, 6) for Naïve-Scan, LB-Scan, and Ours]
Performance Evaluation (3)
Query Processing Time

Faster than LB-Scan and Naïve-Scan
[Figure: elapsed time (sec) versus distance tolerance (2, 4, 6) for Naïve-Scan, LB-Scan, and Ours]
Subsequence Searches
Problem Definition

Input
Set of data sequences {S}
Query sequence q
Distance tolerance ε
Output
Set of subsequences whose distances to q are within ε
Similarity Measure


Time warping distance function, DTW
Any LP metric as a distance function for element pairs
Previous Approaches

Naïve-Scan [Ber96]



Read every data subsequence from database
Apply dynamic programming technique
For m data sequences with average length L, O(mL²|q|)
Previous Approaches (2)

ST-Index [Fal94]






Assume that the minimum query length (w) is known in
advance.
Place a sliding window of size w at every possible location.
Extract a feature vector from inside each window.
Map each feature vector to a point and group the resulting trails into MBRs
(Minimum Bounding Rectangles).
Use Euclidean distance in index space for filtering.
Cannot guarantee “no false dismissal”.
Proposed Approach

Goal




No false dismissal
High performance
Support diverse similarity measures
Sketch




Convert into sequences of discrete symbols
Build a sparse suffix tree
Use a lower-bound distance function for filtering
Apply branch-pruning to reduce the search space
Proposed Approach (2)

Conversion

Generate categories from the distribution of element values





Maximum-entropy method
Equal-interval method
DISC method
Convert element to the symbol of the corresponding
category
Example
A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]
S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩
SC = ⟨B, B, C, D, B, A⟩
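The conversion step can be sketched with the four categories of the example. The names `CATEGORIES` and `to_symbols` are hypothetical, and the category boundaries are simply taken as given rather than derived by one of the methods listed above.

```python
# (symbol, lower bound, upper bound) for each category in the example
CATEGORIES = [("A", 0.0, 1.0), ("B", 1.1, 2.0), ("C", 2.1, 3.0), ("D", 3.1, 4.0)]

def to_symbols(seq, categories=CATEGORIES):
    """Replace each element by the symbol of the category that contains it."""
    symbols = []
    for value in seq:
        for name, low, high in categories:
            if low <= value <= high:
                symbols.append(name)
                break
        else:
            raise ValueError(f"value {value} falls outside every category")
    return symbols

S = [1.3, 1.6, 2.9, 3.3, 1.5, 0.1]
SC = to_symbols(S)
```

For the example sequence, `SC` is the symbol sequence B, B, C, D, B, A.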
Proposed Approach (3)

Indexing


Extract suffixes from sequences of discrete symbols.
Example
From S1C = ⟨A, B, B, A⟩,
we extract four suffixes: ABBA, BBA, BA, A
Proposed Approach (4)

Indexing (2)

Build a suffix tree.





The suffix tree was originally proposed to retrieve substrings that exactly
match a query string.
A suffix tree consists of nodes and edges.
Each suffix is represented by the path from the root node to a
leaf node.
The labels on the path from the root to an internal node Ni
represent the longest common prefix of the suffixes under Ni.
A suffix tree is built with computation and space complexity
O(mL).
Proposed Approach (4)

Indexing (3)

Example : suffix tree from S1C = ⟨A, B, B, A⟩ and S2C = ⟨A, B⟩
[Figure: suffix tree whose edges are labeled A, B, and the end marker $; its leaves are S1C[1:-], S2C[1:-], S1C[4:-], S1C[2:-], S1C[3:-], and S2C[2:-]]
Proposed Approach (5)

Query Processing
[Figure: query (q, ε) → Index Searching over the suffix tree → candidates → Post Processing against the data sequences → answers]
Proposed Approach (6)

Index Searching

Visit each node of suffix tree by depth-first traversal.
Build lower-bound distance table for q and edge labels.
Inspect the last columns of newly added rows to find
candidates.
Apply branch-pruning to reduce the search space.

Branch-pruning theorem:



If all columns of the last row of the distance table have values
larger than the distance tolerance ε, adding more rows to the table
cannot yield values less than or equal to ε.
Proposed Approach (7)

Index Searching (2)

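Example : q = ⟨2, 2, 1⟩, ε = 1.5
Path N1 → N2 (edge A) → N3 (edge B):
q    2    2    1
A    1    2    2
B    1    1    1.1
The last column of row B is 1.1 ≤ ε, so the subsequence ⟨A, B⟩ is a candidate.
Path N1 → N2 (edge A) → N4 (edge D):
q    2    2    1
A    1    2    2
D    2.1  2.1  4.1
Every column of row D is larger than ε, so the branch below N4 is pruned.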
Proposed Approach (8)

Lower-Bound Distance Function DTW-LB
$$D_{BASE\text{-}LB}(A, v) = \begin{cases} 0 & \text{if } v \text{ is within the range of } A \\ (A.min - v)^P & \text{if } v \text{ is smaller than } A.min \\ (v - A.max)^P & \text{if } v \text{ is larger than } A.max \end{cases}$$
[Figure: the three positions of v relative to the range [A.min, A.max], with possible minimum distances 0, (A.min – v)^P, and (v – A.max)^P]
Proposed Approach (9)

Lower-Bound Distance Function DTW-LB (2)
$$D_{TW\text{-}LB}(s_C, q) = D_{BASE\text{-}LB}(s_C[1], q[1]) + \min \{\, D_{TW\text{-}LB}(s_C, q[2{:}-]),\ D_{TW\text{-}LB}(s_C[2{:}-], q),\ D_{TW\text{-}LB}(s_C[2{:}-], q[2{:}-]) \,\}$$
satisfies the lower-bounding theorem
DTW-LB(sC, q) > ε ⇒ DTW (s,q) > ε
computation complexity O(|sC||q|)
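The lower-bound table can be built one row per symbol while descending the suffix tree, which is what makes branch-pruning possible. The sketch below assumes P = 1 and the categories A to D from the conversion example; the names `base_lb`, `add_row`, and `can_prune` are hypothetical. It reproduces the rows for q = ⟨2, 2, 1⟩ and ε = 1.5 from the index-searching example.

```python
CATEGORIES = {"A": (0.0, 1.0), "B": (1.1, 2.0), "C": (2.1, 3.0), "D": (3.1, 4.0)}

def base_lb(symbol, v, p=1):
    """Smallest possible distance between v and any value in the category."""
    low, high = CATEGORIES[symbol]
    if v < low:
        return (low - v) ** p
    if v > high:
        return (v - high) ** p
    return 0.0

def add_row(prev_row, symbol, q, p=1):
    """Append the row for one more symbol; prev_row is None for the first row."""
    inf = float("inf")
    row = []
    for j, v in enumerate(q):
        if prev_row is None:
            best = 0.0 if j == 0 else row[j - 1]
        else:
            left = row[j - 1] if j > 0 else inf
            diag = prev_row[j - 1] if j > 0 else inf
            best = min(prev_row[j], left, diag)
        row.append(base_lb(symbol, v, p) + best)
    return row

def can_prune(row, eps):
    """Branch-pruning test: every column of the newest row exceeds eps."""
    return all(value > eps for value in row)

q, eps = [2, 2, 1], 1.5
row_a = add_row(None, "A", q)     # edge A out of the root
row_ab = add_row(row_a, "B", q)   # path A, B
row_ad = add_row(row_a, "D", q)   # path A, D
is_candidate_ab = row_ab[-1] <= eps
prune_ad = can_prune(row_ad, eps)
```

Because the row for A is computed once and reused for both children, the sketch also shows how paths that share a prefix share distance-table rows.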
Proposed Approach (10)

Computation Complexity
$$O\left( \frac{m L^2 |q|}{R_P R_D} + n L |q| \right)$$
m is the number of data sequences.
L is the average length of data sequences.
The left expression is for index searching.
The right expression is for post-processing.
RP (≥ 1) is the reduction factor by branch-pruning.
RD (≥ 1) is the reduction factor by sharing distance tables.
n is the number of subsequences requiring post-processing.
Proposed Approach (11)

Sparse Indexing





The index size is linear in the number of suffixes stored.
To reduce the index size, we build a sparse suffix tree (SST).
That is, we store the suffix SC[i:-] only if SC[i] ≠ SC[i–1].
Compaction Ratio
$$C = \frac{\text{number of total suffixes}}{\text{number of stored suffixes}}$$
Example



SC = ⟨A, A, A, A, C, B, B⟩
store only three suffixes (SC[1:-], SC[5:-], and SC[6:-])
compaction ratio C = 7/3
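The storage rule and the compaction ratio can be checked with a short sketch. The function name `stored_suffix_starts` is hypothetical, and the first suffix is assumed to be always stored because it has no predecessor.

```python
def stored_suffix_starts(sc):
    """1-based start positions i such that sc[i] differs from sc[i-1]."""
    return [i + 1 for i in range(len(sc)) if i == 0 or sc[i] != sc[i - 1]]

SC = list("AAAACBB")
starts = stored_suffix_starts(SC)
compaction_ratio = len(SC) / len(starts)  # total suffixes / stored suffixes
```

For the example, `starts` lists positions 1, 5, and 6, which gives the compaction ratio 7/3.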
Proposed Approach (12)

Sparse Indexing (2)




When traversing the suffix tree, we need to find non-stored
suffixes and compute their distances to q.
Assume that the first k elements of sC have the same value.
Then, sC[1:-] is stored but sC[i:-] (i=2,3,…,k) is not stored.
For non-stored suffixes,
we introduce another lower-bound distance function.
DTW-LB2 (sC[i:-], q) = DTW-LB(sC, q) – (i – 1) × DBASE-LB (sC[1], q[1])


DTW-LB2 satisfies the lower-bounding theorem.
DTW-LB2 is O(1) when DTW-LB(sC, q) is given.
Proposed Approach (13)

Sparse Indexing (3)

With sparse indexing, the complexity becomes:
$$O\left( \frac{m L^2 |q|}{C R_P R_D} + \left(1 - \frac{1}{C}\right) m L + n L |q| \right)$$
m is the number of data sequences.
L is the average length of data sequences.
C is the compaction ratio.
n is the number of subsequences requiring post-processing.
RP (≥ 1) is the reduction factor by branch-pruning.
RD (≥ 1) is the reduction factor by sharing distance tables.
Performance Evaluation

Implementation


Implemented with C++ on UNIX operating system
Experimental Setup


S&P 500 stock data set (m=545, L=232)
Random walk synthetic data set

Maximum-Entropy (ME) categorization
Disk-based suffix tree construction algorithm

SunSparc Ultra-5

Performance Evaluation (2)
Comparison with Naïve-Scan


increasing distance-tolerances
S&P 500 stock data set, |q|=20
[Figure: query processing time (sec) versus distance tolerance (5 to 50) for Naïve-Scan and SST]
Performance Evaluation (3)

Scalability Test


increasing average length of data sequences
random-walk data set, |q|=20, m=200
Query processing time (sec):
average length of data sequences    200      400      600      800      1000
Naïve-Scan                          52.84    215.08   486.05   864.08   1349.92
SST                                 2.49     10.17    23.98    42.27    82.89
Performance Evaluation (4)

Scalability Test (2)


increasing total number of data sequences
random-walk data set, |q|=20, L=200
Query processing time (sec):
total number of data sequences      100      3000     6000     10000
Naïve-Scan                          266      798.71   1596.36  2679.9
SST                                 21       60.35    124.49   215.92
Segment-Based Subsequence Searches
Introduction


We extend the proposed subsequence searching
method to large sequence databases.
In the retrieval of similar subsequences with time
warping distance function,




Sequential scanning is O(mL²|q|).
The proposed method is O(mL²|q| / R) (R ≥ 1).
The L² term causes severe performance degradation
when L is very large.
For a database with long sequences, we need a new
searching scheme whose cost is linear in L.
SBASS

We propose a new searching scheme: Segment-Based Subsequence Searching scheme (SBASS)




Sequences are divided into a series of piece-wise segments.
When a query sequence q with k segments is submitted, q is
compared with those subsequences which consist of k
consecutive data segments.
The lengths of segments may be different.
SS represents the segmented sequence of S.
S = ⟨4,5,8,9,11,8,4,3⟩
|S| = 8
SS = ⟨⟨4,5,8,9,11⟩, ⟨8,4,3⟩⟩
|SS| = 2
SBASS (2)
[Figure: a data sequence S divided into segments SS[1] to SS[5], and a query divided into segments qS[1] and qS[2]]
Only four subsequences of SS are compared with qS:
⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩, ⟨SS[4],SS[5]⟩
SBASS (3)

For SBASS scheme, we define the piece-wise time
warping distance function (where k = |qS| = |sS|).
$$D_{ptw}(s_S, q_S) = \left( \sum_{i=1}^{k} \left( D_{tw}(s_S[i], q_S[i]) \right)^P \right)^{1/P}$$


Sequential scanning for SBASS scheme is O(mL|q|).
We introduce an indexing technique with O(mL|q|/R)
(R ≥ 1).
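A sequential scan under the SBASS scheme can be sketched as follows. The names `dtw`, `piecewise_dtw`, and `sbass_scan` and the sample segments are hypothetical, and the per-segment time warping distance is assumed to use |a – b| as its base distance. Each window of k consecutive data segments is compared with the k query segments, segment by segment.

```python
def dtw(s, q):
    """Time warping distance with |a - b| as the base distance."""
    inf = float("inf")
    t = [[inf] * (len(q) + 1) for _ in range(len(s) + 1)]
    t[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(q) + 1):
            base = abs(s[i - 1] - q[j - 1])
            t[i][j] = base + min(t[i - 1][j], t[i][j - 1], t[i - 1][j - 1])
    return t[len(s)][len(q)]

def piecewise_dtw(s_segments, q_segments, p=1):
    """Combine the per-segment distances with an L_p-style sum."""
    if len(s_segments) != len(q_segments):
        raise ValueError("both sides need the same number of segments")
    total = sum(dtw(a, b) ** p for a, b in zip(s_segments, q_segments))
    return total ** (1.0 / p)

def sbass_scan(data_segments, query_segments, eps, p=1):
    """1-based start indices of the k-segment windows within eps of the query."""
    k = len(query_segments)
    hits = []
    for start in range(len(data_segments) - k + 1):
        window = data_segments[start:start + k]
        if piecewise_dtw(window, query_segments, p) <= eps:
            hits.append(start + 1)
    return hits

# Hypothetical segmented data sequence and query
SS = [[4, 5, 8, 9, 11], [8, 4, 3], [4, 9, 20], [18, 2]]
qS = [[5, 5, 8, 9, 11], [8, 4, 2]]
d1 = piecewise_dtw(SS[:2], qS, p=1)
d2 = piecewise_dtw(SS[:2], qS, p=2)
hits = sbass_scan(SS, qS, eps=2)
```

Only len(SS) – k + 1 windows are examined, instead of every possible subsequence of the original elements, which is where the reduction from the L² term comes from.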
Sketch of Proposed Approach

Indexing






Convert sequences to segmented sequences.
Extract a feature vector from each segment.
Categorize feature vectors.
Convert segmented sequences to sequences of symbols.
Construct suffix tree from sequences of symbols.
Query Processing


Traverse the suffix tree to find candidates.
Discard false alarms in post processing.
Segmentation

Approach




Divide at peak points.
Divide further if maximum deviation from interpolation line is
too large.
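Eliminate noise.
Compaction Ratio (C) = |S| / |SS|
[Figure: a sequence with one stretch marked as deviating too much from its interpolation line and small fluctuations marked as noise]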
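The first segmentation rule, dividing at peak points, can be sketched as below. This is a partial illustration only: it does not implement the further division by maximum deviation or the noise elimination, and the function name `split_at_peaks` is hypothetical. It assumes a non-empty sequence, and that a turning point closes the current segment while the next element opens a new one.

```python
def split_at_peaks(seq):
    """Start a new segment whenever the direction of change reverses."""
    segments = [[seq[0]]]
    direction = 0  # +1 rising, -1 falling, 0 not yet known
    for prev, cur in zip(seq, seq[1:]):
        step = (cur > prev) - (cur < prev)
        if step != 0 and direction != 0 and step != direction:
            segments.append([])  # prev was a turning point
        if step != 0:
            direction = step
        segments[-1].append(cur)
    return segments

S = [4, 5, 8, 9, 11, 8, 4, 3]
SS = split_at_peaks(S)
compaction_ratio = len(S) / len(SS)  # |S| / |SS|
```

Flat stretches do not count as a reversal, so a run of equal values stays inside the segment it occurs in.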
Feature Extraction

From each subsequence segment, extract a feature
vector:
(V1, VL,L, +, –)
[Figure: a segment annotated with V1, VL, L, and the + and – components]
Categorization and Index Construction

Categorization




Group similar feature vectors together using multidimensional categorization methods like Multi-attribute Type
Abstraction Hierarchy (MTAH).
Assign unique symbol to each category
Convert segmented sequences to sequences of symbols.
S = ⟨4,5,8,8,8,8,9,11,8,4,3⟩
SS = ⟨⟨4,5,8,8,8,8,9,11⟩, ⟨8,4,3⟩⟩
SF = ⟨(4,11,8,2,1), (8,3,3,0,1.5)⟩
SC = ⟨A, B⟩
From sequences of symbols, construct the suffix tree.
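The feature vectors in SF above are consistent with reading the five components as the first value, the last value, the segment length, and the largest deviations above and below the straight line joining the first and last elements. That reading is an assumption inferred from the example values, and the function name `segment_features` is hypothetical.

```python
def segment_features(segment):
    """(first value, last value, length, max deviation above, max deviation below)."""
    first, last, length = segment[0], segment[-1], len(segment)
    above = below = 0.0
    for i, value in enumerate(segment):
        # Height of the line from the first to the last element at position i
        if length == 1:
            line = first
        else:
            line = first + (last - first) * i / (length - 1)
        deviation = value - line
        above = max(above, deviation)
        below = max(below, -deviation)
    return (first, last, length, above, below)

SS = [[4, 5, 8, 8, 8, 8, 9, 11], [8, 4, 3]]
SF = [segment_features(segment) for segment in SS]
```

Under this reading the sketch reproduces both feature vectors of the example.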
Query Processing


For query processing, we calculate lower-bound
distances between symbols and keep them in a table.
Given the query sequence q and the distance
tolerance ε,
Convert q to qS and then to qC.
Search the suffix tree to find those subsequences whose
lower-bound distances to qC are within ε.
Discard false alarms in post processing.
Query Processing (2)
[Figure: q, ε → qS → qC → Index Searching over the suffix tree → candidates → Post Processing against the data sequences → answers]
Computation Complexity


Sequential scanning is O(mL|q|).
Complexity of the proposed search algorithm is:
$$O\left( \frac{m L |q|}{C^2 R_D} + n L |q| \right)$$
n is the number of subsequences contained in candidates.
C is the compaction ratio, or the average number of elements
in a segment.
RD (≥ 1) is the reduction factor by sharing edges of the suffix
tree.
Performance Evaluation



Test Set : Pseudo Periodic Synthetic Sequences
m = 100, L = 10,000
Achieved up to 6.5 times speed-up compared to
sequential scanning.
[Figure: time (sec) versus distance tolerance (0.2 to 1.0) for SeqScan and Our Approach]
Multi-Dimensional Subsequence Searches
Introduction


So far, we assumed that elements have single-dimensional numeric values.
Now, we consider multi-dimensional sequences.


Image Sequences
Video Streams
Medical Image Sequence
Introduction (2)

In multi-dimensional sequences, elements are
represented by feature vectors.
S = ⟨S[1], …, S[N]⟩, S[i] = (S[i][1], …, S[i][F])
Our proposed subsequence searching techniques are
extended to the retrieval of similar multi-dimensional
subsequences.
Introduction (3)

Multi-Dimensional Time Warping Distance
$$D_{MTW}(S, Q) = D_{MBASE}(S[1], Q[1]) + \min \{\, D_{MTW}(S, Q[2{:}-]),\ D_{MTW}(S[2{:}-], Q),\ D_{MTW}(S[2{:}-], Q[2{:}-]) \,\}$$
$$D_{MBASE}(S[1], Q[1]) = \sum_{i=1}^{F} \left( W_i \times |S[1][i] - Q[1][i]| \right)$$


F is the number of features in each element.
Wi is the weight of i-th dimension.
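The multi-dimensional distance differs from the one-dimensional case only in its base distance. A minimal sketch, with hypothetical names `weighted_base` and `mdtw` and illustrative weights and data:

```python
def weighted_base(a, b, weights):
    """Weighted sum of per-feature absolute differences between two elements."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

def mdtw(s, q, weights):
    """Time warping distance over sequences of feature vectors."""
    inf = float("inf")
    t = [[inf] * (len(q) + 1) for _ in range(len(s) + 1)]
    t[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(q) + 1):
            base = weighted_base(s[i - 1], q[j - 1], weights)
            t[i][j] = base + min(t[i - 1][j], t[i][j - 1], t[i - 1][j - 1])
    return t[len(s)][len(q)]

# Illustrative two-feature sequences and equal weights (not from the slides)
S = [(1, 1), (2, 2), (3, 3)]
Q = [(1, 1), (3, 3)]
distance = mdtw(S, Q, weights=(0.5, 0.5))
```

Changing the weights changes which features dominate the match without touching the warping logic.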
Sketch of Our Approach

Indexing





Categorize multi-dimensional element values using MTAH.
Assign unique symbols to categories.
Convert multi-dimensional sequences into sequences of
symbols.
Construct suffix tree from a set of sequences of symbols.
Query Processing



Traverse suffix tree.
Find candidates whose lower-bound distances to q are
within ε.
Do post processing to discard false alarms.
Application to KMeD
In the environment of KMeD, the proposed technique
is applied to the retrieval of medical image sequences
having similar spatio-temporal characteristics to those
of the query sequence.
KMeD [CCT:95] has the following features:




Query by both image and alphanumeric contents
Model temporal, spatial and evolutionary nature of objects
Formulate queries using conceptual and imprecise terms
Support cooperative processing
Application to KMeD (2)
Query
Medical Image Sequence
Attribute names and their relative weights: DistFromLV (0.6), Circularity (0.1), Size (0.3)
Distance tolerance
Application to KMeD (3)
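[Figure: system diagram with the components Query, User Model, Query Analysis, Contour Extraction, Feature Extraction, Distance Function, Similarity Searches, and Visual Presentation, together with the medical image sequences, the index structure, the matching sequences, and a feedback path]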
Conclusion
Summary







A sequence is an ordered list of elements.
Similarity search helps in clustering and data mining.
For sequences of different lengths or different sampling
rates, time warping distance is useful.
We proposed a whole sequence searching method that uses a
spatial access method and a lower-bound distance function.
We proposed a subsequence searching method that uses a suffix
tree and lower-bound distance functions.
We proposed a segment-based subsequence searching
method for large sequence databases.
We extended the subsequence searching method to the
retrieval of similar multi-dimensional subsequences.
Contribution





We proposed a tighter and faster lower-bound distance
function for efficient whole sequence searches without false
dismissal.
We demonstrated the feasibility of using the time warping
similarity measure on a suffix tree.
We introduced the branch-pruning theorem and a fast
lower-bound distance function for efficient subsequence
searches without false dismissal.
We applied categorization and sparse indexing for scalability.
We applied the proposed technique to a real application
(KMeD).