Fast Approximate Point Set Matching for Information Retrieval
Raphaël Clifford and Benjamin Sach
ben.sach.05@bristol.ac.uk

Contents
• What’s the problem?
• What use is it?
• Is it (3-SUM) hard?
• How have we solved it?
• How good is our solution?

The Maximal Subset Matching problem
Given a pattern, P, of size m and a text, T, of size n:
We want to find the largest “match” of P in T.
This is also referred to as the “constellation” problem (originally by B. Chazelle).

What is a “match”?
A point p_i in P matches a point t_j in T with a shift, v, if: p_i + v = t_j
A subset M of P is a subset match if there exists a shift, v, with which all points in M match points in T.
The Maximal Subset Matching problem is to find the size of the largest subset match for a given P and T.

Application to Music Information Retrieval
• Allows for matches shifted in time and pitch
• Intrinsically handles polyphonic music, which traditional string-based methods do not

Other Applications:
• Protein structure alignment
• Pharmacophore identification
• Image registration
• Model-based object recognition

Is Maximal Subset Matching hard?
3-SUM is… Given a set T of n integers: is there a triple $a, b, c \in T$ such that $a + b + c = 0$?
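For concreteness, the 3-SUM question just stated can be decided in quadratic time by sorting the input and scanning it with two pointers. The following is a minimal Python sketch of that standard textbook approach, not code from the talk. The function name `has_three_sum` is an illustrative choice, and the sketch assumes the triple must use three distinct positions of the input, a detail the definition above leaves open.

```python
def has_three_sum(values):
    """Return True if three entries at distinct positions sum to zero."""
    xs = sorted(values)
    n = len(xs)
    for i in range(n - 2):
        # For a fixed smallest element xs[i], walk two pointers inwards.
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = xs[i] + xs[lo] + xs[hi]
            if total == 0:
                return True
            if total < 0:
                lo += 1   # sum too small: move the low pointer up
            else:
                hi -= 1   # sum too large: move the high pointer down
    return False
```

The outer loop runs n times and the inner scan is linear, so the sketch does quadratic work after the initial sort.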
• There is a simple algorithm to solve 3-SUM in O(n^2) time
• No solutions of lower complexity are known
• It is conjectured that this is a lower bound
“Many fundamental geometric problems fall in this class”
Maximal Subset Matching has been proven to be 3-SUM hard

The Algorithms
MSMBP
• Bit-parallel implementation
• O(nm) time
• O(n) space with very low constants
• Cross-correlation implemented via bit-sets
MSMFT
• FFT-based implementation
• O(n*log(m)) time
• O(n) space
• Cross-correlation implemented via Fast Fourier Transforms

The Structure
1. Randomly project the pattern and the text into 1D
2. “Length reduce” the data to decrease sparsity
3. Perform a cross-correlation at each alignment of the length-reduced pattern and text
4. Find the shift of the length-reduced pattern that gave the largest value in the cross-correlation
5. Using the “improved estimate”, infer the shift in the original data
6. Return the size of the match with this shift

(a) Randomised Projection and (b) Length Reduction
• Projected pattern points are mapped to h(x) in the pattern binary array
• Projected text points are mapped to h(x) and h2(x) in the text binary array
• Both arrays are of length r*n, where n is the number of text points
Using hash functions:
g(x) = ax mod q, h(x) = g(x) mod s and h2(x) = (g(x) + q) mod s
Where:
q = a random prime in [2N,…,4N] (N is the maximum of the projected values of P’ and T’)
a = a random value in [1,…,q-1] (see Cole and Hariharan [3])
s = r*n, where r > 1 is a constant

Why does this work?
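One way to build intuition is to check the key property numerically. The sketch below is a minimal Python rendering of the hash functions g, h and h2 defined above. The concrete values of `q`, `a` and `s` are illustrative assumptions (in the scheme above q is a random prime in [2N,…,4N] and a is drawn at random), and the helper names `make_hashes` and `additive_property_holds` are ours. It checks that hashing two values separately and adding the results modulo s always lands on h or h2 of their sum.

```python
import random

def make_hashes(q, a, s):
    """Build g, h and h2 for fixed parameters q (a prime), a and s."""
    def g(x):
        return (a * x) % q
    def h(x):
        return g(x) % s
    def h2(x):
        return (g(x) + q) % s
    return g, h, h2

# Illustrative parameters only: 2003 is prime; a and s are arbitrary choices.
q, a, s = 2003, 1234, 96
g, h, h2 = make_hashes(q, a, s)

def additive_property_holds(x, y):
    """Compare (h(x) + h(y)) mod s with h(x + y) or h2(x + y)."""
    lhs = (h(x) + h(y)) % s
    # Which of the two hashes applies depends on whether g wraps past q.
    expected = h(x + y) if g(x) + g(y) < q else h2(x + y)
    return lhs == expected

# Spot-check the property on random pairs with a fixed seed.
rng = random.Random(0)
assert all(additive_property_holds(rng.randrange(1000), rng.randrange(1000))
           for _ in range(10_000))
```

This is why each text point is written to the text array twice, at h(x) and at h2(x): a matching pattern point shifted in the reduced array must land on one of the two.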
Lemma 1:
$$(h(x) + h(y)) \bmod s = \begin{cases} h(x + y) & \text{if } g(x) + g(y) < q, \\ h_2(x + y) & \text{otherwise.} \end{cases}$$

Significance:
• If some point matches so that $p + v = t$, then $(h(p) + h(v)) \bmod s$ matches either $h(t)$ or $h_2(t)$
• By counting the number of 1’s in common at each alignment we can estimate the true subset match in the original data

Proof:
$$(h(x) + h(y)) \bmod s = (g(x) \bmod s + g(y) \bmod s) \bmod s = (g(x) + g(y)) \bmod s$$
(as $h(x) = g(x) \bmod s$).
If $g(x) + g(y) < q$, then $g(x + y) = g(x) + g(y)$, so the sum equals $h(x + y)$.
If $g(x) + g(y) \ge q$, then $g(x + y) = g(x) + g(y) - q$ (as $g(x) = ax \bmod q$), so the sum equals $(g(x + y) + q) \bmod s = h_2(x + y)$.

Estimating the Size of the Largest Subset Match
• Estimation based on projected and length-reduced matches: high variance, which grows linearly as the number of true matches decreases (discussed in the paper)
• An improved estimate:
1. Find the best match of the length-reduced pattern in the text.
2. Determine in O(m) time which points in the reduced pattern match the text at that shift.
3. Look up, using a precalculated hash table, where each of the matching points was mapped from in the 1D projections, P’ and T’.
4. Now we have a shift for each pair of points in P’ and T’. This may have rare inconsistencies due to collisions. We therefore perform a count and take the most frequent shift.
5. Finally, we return the size of the match at this shift.

When does this work?

Bit-Parallel Cross-correlation (MSMBP)
We store the reduced pattern and text arrays as bitsets and perform a bit-parallel correlation using ANDs and counts:
– The correlation of two machine words can be found in constant time using an AND followed by a count of the number of 1’s in the result
– The count is implemented using a look-up table
– Each reduced array is of size r*n, so the bitset has O(n) words, which gives each correlation in O(n) time
– We need to find the correlation at each shift
– To shift the text we must shift every word in the text, which takes O(n) time again
$$\bigl(\underbrace{O(n)}_{\text{correlation}} + \underbrace{O(n)}_{\text{shift}}\bigr) \cdot \underbrace{O(n)}_{\text{alignments}} = O(n^2)$$
Therefore, naively, this method takes O(n^2) time

Bit-Parallel Cross-correlation (MSMBP)
We reduce this complexity by taking advantage of the sparseness of the reduced pattern array when m << n:
– p has O(n) words but only O(m) non-zero values:
• we only store these, at worst m words
• this reduces each correlation computation to O(m) time
However, we also need to reduce the number of shifts required:
|01010010|01000100|01011011|10000100|10100100|10010010|…
By use of pointer arithmetic, we can align the data to any constant*b alignment (where b is the byte size) in constant time
|10100100|10001000|10110111|00001001|01001001|00100100|…
A single full shift of t gives us access to alignments c*b + 1 for any c
So by calculating the correlations out of order, we need to perform only b shifts
This results in an O(nm) time complexity algorithm

FFT Cross-correlation (MSMFT)
Uses the same steps as MSMBP, except that the cross-correlation step is implemented using FFTs (Fast Fourier Transforms).
This uses the property of the FFT that, for numerical strings p and t, the products
$$p \cdot t^{(i)} \stackrel{\text{def}}{=} \sum_{j=1}^{m} p_j \, t_{i+j-1}, \qquad 1 \le i \le n,$$
(where $t^{(i)}$ is the length-m substring of t beginning at position i) can be calculated accurately and efficiently in O(n*log(m)) time (thanks to the FFTW team for the implementation used, see [5]).

Speed Comparisons (1)
Increasing Text size with proportional Pattern size (25%, 75%)
(P3 is the queue
based method of Ukkonen et al. [7], with complexity O(n*m*log(m)))

Speed Comparisons (2)
Increasing Text size with fixed Pattern size (40 points)
Constant Text size (960000 points) with increasing Pattern size

Accuracy Tests
Match %: The percentage of the pattern that existed in the text
Actual: The sizes of the actual best matches
Run 1, 2, 3: The sizes of the matches found by the algorithm in each test
Avr. Diff: The average percentage of the largest present match that was returned
The text used was 4000 points in both cases

Match %    Actual    Run 1    Run 2    Run 3    Avr. Diff
90%        180       180      180      180      100%
75%        150       150      150      150      100%
25%        50        50       50       50       100%
10%        20        4        5        5        23%

Match % (1st, 2nd)    Actual (1st, 2nd)    Run 1    Run 2    Run 3    Avr. Diff
100%, 10%             200, 20              200      200      200      100%
100%, 50%             200, 100             200      200      200      100%
100%, 90%             200, 180             200      200      200      100%
100%, 99%             200, 198             200      200      200      100%
75%, 10%              150, 20              150      150      150      100%
75%, 65%              150, 130             150      150      150      100%
75%, 70%              150, 140             150      150      140      98%
75%, 73%              150, 146             150      150      150      100%
50%, 10%              100, 20              100      100      100      100%
50%, 40%              100, 80              100      100      100      100%
50%, 45%              100, 90              100      100      90       97%
25%, 5%               50, 10               50       50       50       100%
25%, 15%              50, 30               50       50       50       100%
25%, 20%              50, 40               40       50       50       93%

Only MSMBP was used for accuracy testing, as the two algorithms differ only in performance

Conclusions
• We have presented two algorithms, MSMBP with O(nm) and MSMFT with O(n*log(m)) time complexity, both with O(n) space
• We have shown that these are efficient on large random point sets
• We have also shown that the accuracy is very high, even in situations theorised in the paper to have a lower probability of success
• We have shown experimentally speed-ups of several orders of magnitude in some cases without a significant decrease in accuracy

The authors would like to thank Manolis Christodoulakis for the original implementation of the MSMFT algorithm and the EPSRC for the funding of the second author.

Questions?
(from xkcd.com)
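Supplementary sketch: the bit-parallel cross-correlation step of MSMBP can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. It uses Python's arbitrary-precision integers as bitsets and `bin(...).count("1")` as the population count, standing in for the word-by-word AND and look-up-table count described in the talk. It performs one plain shift per alignment (the naive scheme), ignores wrap-around, and all function names are hypothetical.

```python
def to_bitset(positions):
    """Pack a collection of non-negative array indices into one integer."""
    bits = 0
    for pos in positions:
        bits |= 1 << pos
    return bits

def correlate(pattern_bits, text_bits, shift):
    """Count the 1s shared by the text and the pattern moved up by `shift`."""
    return bin((pattern_bits << shift) & text_bits).count("1")

def best_shift(pattern_positions, text_positions, length):
    """Return (shift, count) for the alignment with the most shared 1s.

    Shifts 0 .. length-1 are tried; ties go to the smallest shift.
    """
    p = to_bitset(pattern_positions)
    t = to_bitset(text_positions)
    scores = [(correlate(p, t, shift), shift) for shift in range(length)]
    count, shift = max(scores, key=lambda pair: (pair[0], -pair[1]))
    return shift, count
```

For example, `best_shift([0, 2, 5], [1, 4, 6, 9, 10], 11)` returns `(4, 3)`: moving the pattern up by 4 places its points on 4, 6 and 9, all of which are set in the text.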