Approximate Point Set Pattern Matching
on Sequences and Planes
Tomoaki Suga,
Shinichi Shimozono*
Kyushu Inst. of Tech.
Fukuoka, Japan
Point Set Pattern Matching
Text: A set of points in, e.g., a plane
Pattern: A small set of points
Task: Find an occurrence of the pattern as a subset
PATTERN
TEXT
Approximate Point Set Matching in
Practice: Example
Analysis of 2D electrophoresis images
A set of spots on gel media plane
Searching digital music score by melody
Ringtone melodies, Internet content, online karaoke
Literature
Exact matching in d dimensions
Geometric algorithm by P. J. de Rezende & D. T. Lee, '95
Translation, scaling, and rotation in O(nm^d)
Allowing local distortions
Heuristic and hardness by Akutsu et al., '99
…NP-hard even in 1D matching
Approximate matching of point sequences
No skips, O(nm) time by V. Makinen '01
Allowing substitution in O(nm^3) time
Extension to 2-dimensional matching is NP-hard
Our Results
Approximate point set pattern matching in 1D
Pattern matches as a subset: Extends Makinen et al.
Simple, fast algorithm for a task that naively takes O(nm^2) time
Under an assumption that is reasonable for sequences in practice
Algorithm that guarantees O(nm) time
Linear in the text size, using a minimum query that takes constant time on average
Four-Russians speed-up
Observation connected to string matching
2D approximate point set pattern matching
With polynomial-time algorithm
1D Matching As a Target
As a basis of practical problems
Axes of 2D electrophoresis images are independent
Points in higher dimensions that have a primary axis
(sort order) … e.g., 3D structure of proteins
Musical score search
Pitch error (tone deafness) is usually fatal
Exact matching in Rhythm/Timing is impractical,
but indispensable to distinguish melodies
Point Set Matching in 1D
Text and Pattern: Strictly increasing sequences of Integers
$T = (t_1, \ldots, t_m)$, $P = (p_1, \ldots, p_n)$
An Occurrence of the Pattern:
A Subsequence of the Text
$T' = (t_{l(1)}, \ldots, t_{l(n)})$
Edit Distance for
Point Set Approximate Matching
Distance between two sequences of the same size:
$d(P, Q) = \sum_{i=2}^{n} \left| p_i - p_{i-1} - (q_i - q_{i-1}) \right|$
[Figure: point sequences P and Q]
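The distance above compares the gaps between consecutive points rather than the positions themselves, so shifting a whole sequence costs nothing. A minimal Python sketch follows; the function name `gap_distance` and the 0-based lists are illustrative assumptions, not taken from the slides.

```python
def gap_distance(p, q):
    """Sum of absolute differences between consecutive gaps of p and q.

    Assumes p and q are strictly increasing integer sequences of equal size.
    (Hypothetical helper name; indices are 0-based here.)
    """
    if len(p) != len(q):
        raise ValueError("sequences must have the same size")
    return sum(
        abs((p[i] - p[i - 1]) - (q[i] - q[i - 1]))
        for i in range(1, len(p))
    )


# A translated copy of the pattern has distance 0.
print(gap_distance([0, 2, 5], [10, 12, 15]))
# Moving the middle point by one tick changes two gaps by 1 each.
print(gap_distance([0, 2, 5], [10, 11, 15]))
```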
Approximate Matching and Recurrence
D(i, j) = distance between the first i points of the pattern and the best occurrence of them in the text ending at j
$D(i, j) = \min_{i-1 \le k < j} \left\{ D(i-1, k) + \left| p_i - p_{i-1} - (t_j - t_k) \right| \right\}$
First term: distance between the prefix sequences that are shorter by one point
Second term: difference of the last two distances
D(n, m) can be obtained by tabular computation … in O(nm^2) time
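The tabular computation can be sketched directly from the recurrence. This is a minimal illustration, not the authors' code: the name `naive_table`, the 0-based indexing, and the base case D(1, j) = 0 (a single pattern point matches any text point at no cost) are assumptions.

```python
import math


def naive_table(p, t):
    """Fill D by the recurrence with three nested loops, O(n m^2) time.

    p: pattern (size n), t: text (size m), both strictly increasing.
    D[i][j] is the cost of matching p[0..i] with the last point on t[j].
    """
    n, m = len(p), len(t)
    D = [[math.inf] * m for _ in range(n)]
    for j in range(m):
        D[0][j] = 0  # assumed base case: one point matches anywhere
    for i in range(1, n):
        gap = p[i] - p[i - 1]
        for j in range(i, m):
            # Try every text point k as the match of the previous pattern point.
            D[i][j] = min(
                D[i - 1][k] + abs(gap - (t[j] - t[k]))
                for k in range(i - 1, j)
            )
    return D


D = naive_table([0, 2, 5], [10, 11, 12, 15, 20])
print(D[-1])        # last row: cost of the best occurrence ending at each j
print(min(D[-1]))   # best occurrence ending anywhere in the text
```

Taking the minimum over the last row, rather than reading only the final cell, is one reading of "best occurrence ending at j"; it gives the best occurrence anywhere in the text.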
“Finite Resolution” Assumption on
a Class of Sequences
The ratio of distances between two contiguous points is bounded
Spots observed as stains on small gel media plane
450 ticks per second in typical MIDI sequences
Pattern
Text
The modified algorithm runs in O(nm) time if the sequences have finite resolution
The third (innermost) loop can be finished in constant time…
A Row can be Divided into
“Positive” Part & “Negative” Part
Values in the “Negative” part always decrease (their absolute values always increase)
In the “Negative” part, only the “Ex-Minimum” can be a candidate
Only a constant number of “Positive” cells exist if the sequences have finite resolution … O(nm) time
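[Figure: rows i-1 and i of the table, with k running leftward from j. Moving left, (t_j - t_k) goes from small (≥ 0) to large; cells with p_i - p_{i-1} - (t_j - t_k) < 0 form the “Negative” part.]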
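One way to read the positive/negative split is sketched below. In the negative part the cost is (D(i-1, k) - t_k) + t_j - (p_i - p_{i-1}), so a single running minimum of D(i-1, k) - t_k suffices there, and only the positive cells are scanned one by one. The name `finite_resolution_table`, the 0-based indexing, and the base case D(1, j) = 0 are assumptions; the scan is short only when the sequences have finite resolution.

```python
import math


def finite_resolution_table(p, t):
    """Fill D row by row, scanning only the "Positive" cells explicitly."""
    n, m = len(p), len(t)
    prev = [0] * m  # assumed base case: D(1, j) = 0
    table = [prev]
    for i in range(1, n):
        gap = p[i] - p[i - 1]
        cur = [math.inf] * m
        neg_best = math.inf  # running min of prev[k] - t[k] over negative cells
        b = i - 1            # first k that is still in the positive part
        for j in range(i, m):
            # Cells whose gap to t[j] exceeds the pattern gap turn negative
            # and stay negative for every later j.
            while b < j and t[j] - t[b] > gap:
                neg_best = min(neg_best, prev[b] - t[b])
                b += 1
            best = neg_best + t[j] - gap
            for k in range(b, j):  # positive cells, few under finite resolution
                best = min(best, prev[k] + gap - (t[j] - t[k]))
            cur[j] = best
        table.append(cur)
        prev = cur
    return table


print(finite_resolution_table([0, 2, 5], [10, 11, 12, 15, 20])[-1])
```

Each index k enters the negative part once per row, so the work outside the positive scan is linear in the text size per row.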
Guaranteed O(nm)-time Algorithm
Uses a “deque” that simulates the right-most path of the
Cartesian Tree [Gabow, et al., 1985]
Maintains the indices that can still become the minimum in the “Positive” part
The minimum is available in amortized constant time
[Figure: deque of indices k for column j, labeled “Push the latest index”, “Pop all larger ones”, “Remove if turned to negative”, and “Min.”]
Constant time on average for one iteration
… O(nm) time
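The deque operations can be sketched as a sliding-window minimum over the positive part, keyed by D(i-1, k) + t_k, combined with a running minimum of D(i-1, k) - t_k for the negative part. This is an illustrative reading rather than the authors' implementation: the name `deque_table`, the 0-based indexing, and the base case D(1, j) = 0 are assumptions.

```python
from collections import deque
import math


def deque_table(p, t):
    """Fill D in O(nm) time using a monotone deque per row."""
    n, m = len(p), len(t)
    prev = [0] * m  # assumed base case: D(1, j) = 0
    table = [prev]
    for i in range(1, n):
        gap = p[i] - p[i - 1]
        cur = [math.inf] * m
        window = deque()     # positive indices, keys increasing front to back
        neg_best = math.inf  # min of prev[k] - t[k] over the negative part
        b = i - 1            # first k that is still in the positive part
        for j in range(i, m):
            k = j - 1
            key = prev[k] + t[k]
            # Pop all larger ones, then push the latest index.
            while window and prev[window[-1]] + t[window[-1]] >= key:
                window.pop()
            window.append(k)
            # Indices that turned negative feed the running minimum ...
            while b < j and t[j] - t[b] > gap:
                neg_best = min(neg_best, prev[b] - t[b])
                b += 1
            # ... and are removed from the front of the deque.
            while window and window[0] < b:
                window.popleft()
            best = neg_best + t[j] - gap
            if window:
                f = window[0]  # minimum of the positive part
                best = min(best, prev[f] + t[f] + gap - t[j])
            cur[j] = best
        table.append(cur)
        prev = cur
    return table


print(deque_table([0, 2, 5], [10, 11, 12, 15, 20])[-1])
```

Every index is pushed and popped at most once per row, which is where the amortized constant time per cell comes from.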
Computational Results on
Real/Synthesized* MIDI Sequences
The simple algorithm that expects “Finite Resolution” is
faster than the guaranteed O(nm)-time algorithm
Pattern size = 11; time (sec.) for filling up the table (* = synthesized sequence)

Text Size   Naïve DP   Fin. Res.   Cartesian
3086        1.12       0.01        0.01
*18328      197        0.03        0.05
*37741      883        0.05        0.09
*386801     --         0.58        0.94

Solaris 9 x86 / Intel Pentium 4 800MHz
Four-Russians Speed-up for
Point Sequences with Finite Resolution
Idea from Arlazarov et al.: Filling tabular cells with precomputed values
O(nm/log n + n log n) time in the unit-cost RAM model
As one might expect, the finite-resolution assumption makes point sequences
behave like strings
Approximate Point Set Pattern
Matching on the Plane: Hardness
Results
Akutsu et al. (’95), allowing local distortions
NP-hard, even in 1D matching
V. Makinen & E. Ukkonen ('01), an extension of 1D
NP-hard; deciding the order of points in matching is hard
Q. Is there any non-trivial 2D approximate point set
matching computable in polynomial time?
Extending 1D Definition to
Approximate Matching on the Plane
Regard a set as sequences with two orders
Divide recursively by axis-parallel lines
P
Q
Recurrence for Edit Distance
Divide P and Q into two arbitrary parts, by either a
horizontal or a vertical line
if $|P[i,j]| = |Q[k,l]| = 1$ then $d(i,j;k,l) = 0$,
if $|P[i,j]| \ne |Q[k,l]|$ then $d(i,j;k,l) = \infty$, and
$d(i,j;k,l) = \min \left\{ \begin{array}{l} d([i,j]^-_R; [k,l]^-_R) + \left| p^R_{([i,j])} - p^R_{([i,j]^-)} - \left( q^R_{([k,l])} - q^R_{([k,l]^-)} \right) \right|, \\ d([i,j]^-_T; [k,l]^-_T) + \left| p^T_{([i,j])} - p^T_{([i,j]^-)} - \left( q^T_{([k,l])} - q^T_{([k,l]^-)} \right) \right| \end{array} \right\}$
How Pattern Matching Proceeds
The × points of the pattern should be aligned with the ○ points of
the text, by cutting and moving the bounding box
Polynomial-time Algorithm for 2D
Approximate Point Set Matching
Finds the best partition/direction by DP-like recursion
Results are stored in a cache for quadruples [i, j; k, l]
… O(n^2 m^2) space
O(n^2 m^4) time with pattern size n and text size m
Remarks & Future Work
Consider scaling in 1D
Tempo must be considered in musical sequence search
Looking for more applications
Applying 1D approximate matching to secondary structure search of
proteins