Approximate Point Set Pattern Matching on Sequences and Planes
Tomoaki Suga, Shinichi Shimozono*
Kyushu Inst. of Tech., Fukuoka, Japan

Point Set Pattern Matching
- Text: a set of points in, e.g., a plane
- Pattern: a small set of points
- Task: find an occurrence of the pattern as a subset
(Figure: a PATTERN point set and a TEXT point set)

Approximate Point Set Matching in Practice: Example
- Analysis of 2D electrophoresis images
  - A set of spots on a gel media plane
- Searching digital music scores by melody
  - Ringer melodies, Internet contents, online "Kara-Oke"

Literature
- Exact matching in $d$ dimensions
  - Geometric algorithm by P. J. de Rezende & D. T. Lee, '95
  - Translation, scaling, and rotation in $O(nm^d)$
- Allowing local distortions
  - Heuristic and hardness by Akutsu et al., '99: NP-hard even in 1D matching
- Approximate matching of point sequences
  - No skips, $O(nm)$ time, by V. Makinen '01
  - Allowing substitution in $O(nm^3)$ time
  - Extension to 2-dimensional matching is NP-hard

Our Results
- Approximate point set pattern matching in 1D
  - The pattern matches as a subset: extends Makinen et al.
  - A simple, fast algorithm for the $O(nm^2)$ task, under a reasonable assumption on sequences arising in practice
  - An algorithm that guarantees $O(nm)$ time: linear in the text size, using a minimum query that takes constant time on average
  - Four-Russian speed-up: an observation connected to string matching
- 2D approximate point set pattern matching
  - With a polynomial-time algorithm

1D Matching As a Target
- As a basis of practical problems
  - The axes of 2D electrophoresis images are independent
  - Points in higher dimensions that have a primary axis (sort order), e.g., the 3D structure of proteins
- Musical score search
  - A pitch error (tone deafness) is usually fatal
  - Exact matching of rhythm/timing is impractical, yet timing is indispensable for distinguishing melodies

Point Set Matching in 1D
- Text and pattern: strictly increasing sequences of integers
  $T = (t_1, \ldots, t_m)$, $P = (p_1, \ldots, p_n)$
- An occurrence of the pattern: a subsequence of the text
  $T' = (t_{l(1)}, \ldots, t_{l(n)})$

Edit Distance for Point Set Approximate Matching
- Distance between two sequences $P$ and $Q$ of the same size $n$:
  $$d(P, Q) = \sum_{i=2}^{n} \left| p_i - p_{i-1} - (q_i - q_{i-1}) \right|$$

Approximate Matching and Recurrence
- $D(i, j)$ = distance between the first $i$ points of the pattern and the best occurrence of them in the text ending at $t_j$
  $$D(i, j) = \min_{i-1 \le k < j} \left\{ D(i-1, k) + \left| p_i - p_{i-1} - (t_j - t_k) \right| \right\}$$
  - $D(i-1, k)$: distance for the prefix that is shorter by one point
  - $\left| p_i - p_{i-1} - (t_j - t_k) \right|$: difference between the last two gaps
- $D(n, m)$ can be obtained by tabular computation in $O(nm^2)$ time

"Finite Resolution" Assumption on a Class of Sequences
- The ratio between the distances of contiguous points is bounded
  - Spots observed as stains on a small gel media plane
  - 450 ticks per second in typical MIDI sequences
(Figure: a Pattern sequence and a Text sequence)
- The modified algorithm runs in $O(nm)$ time if the sequences have finite resolution
  - The third (innermost) iteration can be finished in constant time

A Row can be Divided into "Positive" Part & "Negative" Part
- Values in the "Negative" part always decrease, so their absolute values always increase
  - Only the "Ex-Minimum" (the previous minimum) can be a candidate
- Only a constant number of "Positive" cells exist if the sequences have finite resolution, hence $O(nm)$ time
(Figure: row $i-1$ for a fixed $j$, with $k$ ranging from $j$ down to $i-1$ and $t_j - t_k$ growing from small to large; cells with $p_i - p_{i-1} - (t_j - t_k) \ge 0$ form the "Positive" part, cells with $p_i - p_{i-1} - (t_j - t_k) < 0$ the "Negative" part)

Guaranteed $O(nm)$-time Algorithm
- Uses a "deque" simulating the right-most path of the Cartesian tree [Gabow, et al., 1985]
- Maintains the to-be-minimum indices in the "Positive" part
- The minimum is available in amortized constant time
- Remove an index once its term has turned negative
- Pop all larger ones
- Push the latest index
- Constant time on average for one iteration, hence $O(nm)$ time

Computational Results on Real/Synthesized* MIDI Sequences
- The simple algorithm that relies on "Finite Resolution" is faster than the guaranteed $O(nm)$-time algorithm
- Pattern size = 11; time (sec.) for filling up the table

  Text size   Naïve DP   Fin. Res.   Cartesian
  3086        1.12       0.01        0.01
  *18328      197        0.03        0.05
  *37741      883        0.05        0.09
  *386801     --         0.58        0.94

  (* synthesized sequence)
- Solaris 9 x86 / Intel Pentium 4 800MHz

Four-Russian Speed-up for Point Sequences with Finite Resolution
- Idea from Arlazarov et al.: fill tabular cells with precomputed values
- $O(nm/\log n + n \log n)$ time in the unit-cost RAM model
- As one might expect, the finite resolution assumption makes point sequences behave like strings

Approximate Point Set Pattern Matching on the Plane: Hardness Results
- Akutsu et al. ('95), allowing local distortions
  - NP-hard, even in 1D matching
- V. Makinen & E. Ukkonen ('01), an extension of 1D
  - NP-hard; deciding the order of points in the matching is hard
- Q. Is there any non-trivial 2D approximate point set matching computable in polynomial time?
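Returning to the 1D recurrence for $D(i, j)$: the following is a minimal sketch in Python, not the authors' implementation, of both the naive tabular computation and a deque-based row update in the spirit of the "Positive"/"Negative" split. The function names `naive_table` and `deque_table`, the 0-based indexing, and the base case $D(1, j) = 0$ (a single pattern point matches any text point at no cost) are assumptions made for illustration. In the "Positive" part the cost equals $(D(i-1,k) + t_k) + (p_i - p_{i-1}) - t_j$, so a sliding-window minimum of $D(i-1,k) + t_k$ suffices; in the "Negative" part it equals $(D(i-1,k) - t_k) + t_j - (p_i - p_{i-1})$, so a running minimum of $D(i-1,k) - t_k$ suffices.

```python
from collections import deque
from math import inf


def naive_table(P, T):
    """Fill D directly from the recurrence: O(n * m^2) time."""
    n, m = len(P), len(T)
    D = [[inf] * m for _ in range(n)]
    D[0] = [0] * m  # assumed base case: one pattern point matches anywhere
    for i in range(1, n):
        gap = P[i] - P[i - 1]
        for j in range(i, m):
            D[i][j] = min(D[i - 1][k] + abs(gap - (T[j] - T[k]))
                          for k in range(i - 1, j))
    return D


def deque_table(P, T):
    """Fill the same table in O(n * m) time, one amortized O(1) step per cell."""
    n, m = len(P), len(T)
    D = [[inf] * m for _ in range(n)]
    D[0] = [0] * m  # same assumed base case as above
    for i in range(1, n):
        gap = P[i] - P[i - 1]
        prev = D[i - 1]
        neg_min = inf     # min of prev[k] - T[k] over the "Negative" part
        window = deque()  # "Positive" indices, prev[k] + T[k] strictly increasing
        lo = i - 1        # smallest k that is still in the "Positive" part
        for j in range(i, m):
            k = j - 1  # the newest candidate enters on the right
            while window and prev[window[-1]] + T[window[-1]] >= prev[k] + T[k]:
                window.pop()  # pop all larger ones
            window.append(k)  # push the latest index
            # Indices whose term turned negative move to the running minimum.
            while lo < j and T[j] - T[lo] > gap:
                neg_min = min(neg_min, prev[lo] - T[lo])
                lo += 1
            while window and window[0] < lo:
                window.popleft()
            best = neg_min + T[j] - gap
            if window:
                f = window[0]
                best = min(best, prev[f] + T[f] + gap - T[j])
            D[i][j] = best
    return D


if __name__ == "__main__":
    D = deque_table([0, 2, 5], [1, 3, 4, 6, 9])
    print(min(D[-1]))  # 0: the subsequence (1, 3, 6) has the same gaps 2 and 3
```

The last row of the table holds, for every text position, the cost of the best occurrence ending there, so taking its minimum gives the best approximate occurrence anywhere in the text. The pointer `lo` is kept separately from the deque because an index discarded from the back of the deque may still be the best candidate once it falls into the "Negative" part.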
Extending 1D Definition to Approximate Matching on the Plane
- Regard a set as sequences with two orders
- Divide recursively by axis-parallel lines
(Figure: point sets P and Q divided by axis-parallel lines)

Recurrence for Edit Distance
- Divide $P$ and $Q$ into two arbitrary parts, by either a horizontal or a vertical line
- If $|P[i,j]| = |Q[k,l]| = 1$ then $d(i,j;k,l) = 0$
- If $|P[i,j]| \ne |Q[k,l]|$ then $d(i,j;k,l) = \infty$
- Otherwise
  $$d(i,j;k,l) = \min \left\{ \begin{array}{l} d([i,j]^-_R; [k,l]^-_R) + \left| p^R_{[i,j]} - p^R_{[i,j]^-} - \left( q^R_{[k,l]} - q^R_{[k,l]^-} \right) \right|, \\ d([i,j]^-_T; [k,l]^-_T) + \left| p^T_{[i,j]} - p^T_{[i,j]^-} - \left( q^T_{[k,l]} - q^T_{[k,l]^-} \right) \right| \end{array} \right\}$$

How Pattern Matching Proceeds
- Points of the pattern (x) should be aligned onto points of the text (o) by cutting and moving the bounding box

Polynomial-time Algorithm for 2D Approximate Point Set Matching
- Finds the best partition/direction by DP-like recursion
- Results are stored in a cache for quadruples $[i, j; k, l]$: $O(n^2 m^2)$ space
- $O(n^2 m^4)$ time with pattern size $n$ and text size $m$

Remarks & Future Works
- Consider scaling in 1D
  - Tempo must be considered in musical sequence search
- Looking for more applications
  - 1D approximate matching for secondary structure search of proteins