Fast Approximate Point Set Matching for Information Retrieval

Raphaël Clifford and Benjamin Sach
ben.sach.05@bristol.ac.uk
Contents
• What’s the problem?
• What use is it?
• Is it (3-SUM) hard?
• How have we solved it?
• How good is our solution?
The Maximal Subset Matching problem
Given a pattern, P, of size m and a text, T, of size n:
We want to find the largest “match” of P in T
This is also referred to as the “constellation” problem (originally by B. Chazelle)
What is a “match”?
A point p_i in P matches a point t_j in T with a shift, v, if:
p_i + v = t_j
A subset M of P is a subset match if:
There exists a shift, v, with which all points in M match points in T
The Maximal Subset Matching problem is to find the size of the largest subset match for a given P and T
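To make the definition concrete, here is a minimal brute-force sketch. It is an exact baseline, not one of the approximate algorithms presented in these slides. It assumes points are tuples of integers and that no point is repeated within P or within T; the function name `max_subset_match` is illustrative only.

```python
from collections import Counter

def max_subset_match(pattern, text):
    """Return (size, shift) of the largest subset match of pattern in text.

    Every pair (p, t) votes for the shift v = t - p. With distinct points,
    the number of votes for v equals the number of pattern points that
    match a text point under v, so the most popular shift is the answer.
    """
    votes = Counter()
    for p in pattern:
        for t in text:
            shift = tuple(tc - pc for pc, tc in zip(p, t))
            votes[shift] += 1
    if not votes:
        return 0, None
    shift, size = votes.most_common(1)[0]
    return size, shift

if __name__ == "__main__":
    P = [(0, 0), (1, 2), (3, 1)]
    T = [(5, 5), (6, 7), (9, 9), (2, 2)]
    print(max_subset_match(P, T))
```

This sketch inspects all n*m pairs and stores a counter entry for each distinct shift, which is the cost the approximate methods aim to avoid on large inputs.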
Application to Music Information Retrieval
• Allows for matches shifted in time and pitch
• Intrinsically handles polyphonic music, which traditional string-based methods do not
Other Applications:
• Protein structure alignment
• Pharmacophore identification
• Image registration
• Model-based object recognition
Is Maximal Subset Matching hard?
3-SUM is…
Given a set T
of n integers:
Is there a triple a, b, c ∈ T
such that a + b + c = 0?
• There is a simple algorithm to solve 3-SUM in O(n^2) time
• No truly subquadratic, O(n^(2-ε)), solutions are known
• It is conjectured that no such solution exists, so that quadratic time is essentially a lower bound
“Many fundamental geometric problems fall in this class”
Maximal Subset Matching has been proven to be 3-SUM hard
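The simple quadratic algorithm for 3-SUM mentioned above can be sketched as follows. This is the standard sort-and-two-pointers approach, shown under the assumption that the three values must come from distinct positions in the input (some formulations allow reuse); the name `has_three_sum` is illustrative.

```python
def has_three_sum(values):
    """Return True if three entries at distinct positions sum to zero.

    Sorting costs O(n log n); for each first element the two pointers
    sweep the rest once, giving O(n^2) time overall.
    """
    nums = sorted(values)
    n = len(nums)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = nums[i] + nums[lo] + nums[hi]
            if total == 0:
                return True
            if total < 0:
                lo += 1   # need a larger sum
            else:
                hi -= 1   # need a smaller sum
    return False
```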
The Algorithms
MSMBP
• Bit-parallel implementation
• O(nm) time
• O(n) space with very low constants
• Cross-correlation implemented via bit-sets
MSMFT
• FFT based implementation
• O(n*log(m)) time
• O(n) space
• Cross-correlation implemented via Fast Fourier Transforms
The Structure
1. Randomly project the pattern and the text into 1D
2. “Length reduce” the data to decrease sparsity
3. Perform a cross-correlation at each alignment of the length reduced pattern and text
4. Find the shift in the length reduced pattern that gave the largest value in the cross-correlation
5. Using the “improved estimate”, infer the shift in the original data.
6. Return the size of the match with this shift.
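The slides do not say how the random projection in step 1 is carried out. One possible sketch, offered purely as an assumption, is a dot product with random integer weights: because this map is linear, a shift in the original space becomes a shift in 1D, so every true match survives the projection. The names `make_weights` and `random_projection` and the `bound` parameter are hypothetical.

```python
import random

def make_weights(dim, bound, seed=None):
    """Draw one random positive integer weight per coordinate (assumed scheme)."""
    rng = random.Random(seed)
    return [rng.randint(1, bound) for _ in range(dim)]

def random_projection(points, weights):
    """Project integer points to 1D by a dot product with fixed weights."""
    return [sum(w * c for w, c in zip(weights, pt)) for pt in points]

if __name__ == "__main__":
    weights = make_weights(dim=2, bound=1000, seed=1)
    p, v, t = (1, 2), (5, 5), (6, 7)          # p + v = t in 2D
    pp, pv, pt = random_projection([p, v, t], weights)
    print(pp + pv == pt)                       # linearity keeps the match
```

Distinct points can collide after projection, which is one reason the later steps treat the reduced correlation only as an estimate.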
(a) Randomised Projection and (b) Length Reduction
• Projected pattern points are mapped to h(x) in the pattern binary array
• Projected text points are mapped to h(x) and h2(x) in the text binary array
• Both arrays are of length r*n, where n is the number of text points
Using hash functions:
g(x) = ax mod q, h(x) = g(x) mod s and h2(x) = (g(x) + q) mod s
Where:
q = a random prime in [2N,…,4N]
(N is the maximum of the projected values of P’ and T’)
a = a random integer in [1,…,q-1]
(See Cole and Hariharan [3])
s = r*n, where r > 1 is a constant
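The hash functions and the two binary arrays can be sketched as below. This is a minimal illustration, assuming the projected values are non-negative integers; the trial-division primality test, the default r = 2 and the function names are choices made for the example, not details taken from the slides.

```python
import random

def is_prime(k):
    """Trial division; adequate for a small illustration."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def choose_parameters(max_value, n, r=2, rng=random):
    """Pick q (random prime in [2N, 4N]), a (in [1, q-1]) and s = r*n."""
    N = max(max_value, 1)
    while True:
        q = rng.randint(2 * N, 4 * N)
        if is_prime(q):
            break
    a = rng.randint(1, q - 1)
    return q, a, r * n

def length_reduce(pattern_1d, text_1d, q, a, s):
    """Build the pattern and text binary arrays of length s."""
    def g(x):
        return (a * x) % q
    pattern_bits = [0] * s
    text_bits = [0] * s
    for x in pattern_1d:
        pattern_bits[g(x) % s] = 1            # h(x)
    for x in text_1d:
        text_bits[g(x) % s] = 1               # h(x)
        text_bits[(g(x) + q) % s] = 1         # h2(x)
    return pattern_bits, text_bits
```

For example, with q = 23, a = 5 and s = 8, the projected pattern [1, 4] sets positions 5 and 4, and the projected text [3, 6, 10] sets positions 3, 4, 6 and 7.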
Why does this work?
Lemma 1:
$$(h(x) + h(y)) \bmod s = \begin{cases} h(x + y) & \text{if } g(x) + g(y) < q, \\ h_2(x + y) & \text{otherwise} \end{cases}$$
Significance:
• If some point matches so that p + v = t then (h(p) + h(v)) mod s matches either h(t) or h2(t)
• By counting the number of 1’s in common at each alignment we can estimate the true subset match in the original data
Proof:
(h(x) + h(y)) mod s = (g(x) mod s + g(y) mod s) mod s
(As h(x) = g(x) mod s)
= (g(x) + g(y)) mod s.
If g(x) + g(y) < q, then g(x + y) = g(x) + g(y), so the result is g(x + y) mod s = h(x + y).
If g(x) + g(y) ≥ q, then g(x + y) = g(x) + g(y) − q, so the result is (g(x + y) + q) mod s = h2(x + y).
(As g(x) = ax mod q)
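Lemma 1 is easy to check numerically. The sketch below draws random non-negative x and y and compares both sides of the identity; the parameter values in the usage lines are arbitrary examples, and `check_lemma` is an illustrative name.

```python
import random

def check_lemma(q, a, s, trials=1000, seed=0):
    """Return True if Lemma 1 holds on `trials` random pairs (x, y)."""
    rng = random.Random(seed)

    def g(x):
        return (a * x) % q

    def h(x):
        return g(x) % s

    def h2(x):
        return (g(x) + q) % s

    for _ in range(trials):
        x = rng.randrange(0, 10**6)
        y = rng.randrange(0, 10**6)
        lhs = (h(x) + h(y)) % s
        rhs = h(x + y) if g(x) + g(y) < q else h2(x + y)
        if lhs != rhs:
            return False
    return True

if __name__ == "__main__":
    print(check_lemma(q=23, a=5, s=8))
    print(check_lemma(q=10007, a=1234, s=64))
```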
Estimating the Size of the Largest Subset Match
• Estimation based on projected and length reduced matches: high variance which grows linearly as the number of true matches decreases (discussed in paper)
• An improved estimate:
1. Find the best match of the length reduced pattern in the text.
2. Determine in O(m) time which points in the reduced pattern match the text at that shift.
3. Look up, by the use of a precalculated hash table, where each of the matching points was mapped from in the 1D projections, P’ and T’.
4. Now we have a shift for each pair of points in P’ and T’. This may have rare inconsistencies due to collisions. We therefore perform a count and take the most frequent shift.
5. Finally we return the size of the match at this shift.
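Steps 2 to 5 of the improved estimate might look like the sketch below. It assumes the best reduced alignment from step 1 is already known and is interpreted cyclically (position h(p) in the pattern array lines up with position (h(p) + reduced_shift) mod s in the text array), which is an assumption suggested by Lemma 1 rather than something the slides state. The function name and the `origin` table are illustrative.

```python
from collections import Counter, defaultdict

def improved_estimate(pattern_1d, text_1d, q, a, s, reduced_shift):
    """Recover the most frequent 1D shift and the size of its match."""
    def g(x):
        return (a * x) % q

    # Precalculated table: reduced text position -> projected text values
    # that were hashed there, by h or by h2.
    origin = defaultdict(list)
    for t in text_1d:
        origin[g(t) % s].append(t)
        origin[(g(t) + q) % s].append(t)

    # Each reduced match proposes candidate shifts t - p; collisions can
    # add wrong candidates, so we count and keep the most frequent one.
    votes = Counter()
    for p in pattern_1d:
        slot = (g(p) % s + reduced_shift) % s
        for t in set(origin.get(slot, ())):
            votes[t - p] += 1
    if not votes:
        return 0, None
    shift = votes.most_common(1)[0][0]

    # Final answer: how many projected pattern points match at this shift.
    text_set = set(text_1d)
    size = sum(1 for p in pattern_1d if p + shift in text_set)
    return size, shift
```

With pattern [1, 4, 9], text [11, 14, 30, 50], q = 101, a = 7, s = 8 and reduced shift 6 (which is h(10)), the vote settles on the shift 10, under which two pattern points match.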
When does this work?
Bit-Parallel Cross-correlation (MSMBP)
We store the reduced pattern and text arrays as bitsets and perform a bit-parallel correlation using ANDs and counts:
– The correlation of two machine words can be found in constant time using an AND followed by a count of the number of 1’s in the result
– The count is implemented by use of a look-up table
– Each reduced array is of size r*n, so each bitset has O(n) words and each correlation takes O(n) time
– We need to find the correlation at each shift
– To shift the text we must shift every word in the text, which takes O(n) time again
$$\big(\underbrace{O(n)}_{\text{correlation}} + \underbrace{O(n)}_{\text{shift}}\big) \cdot \underbrace{O(n)}_{\text{alignments}} = O(n^2)$$
Therefore, naively, this method takes O(n^2) time
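The naive scheme above can be sketched with 8-bit words and a 256-entry count table. This illustration assumes a non-cyclic correlation (bits shifted past the end are dropped and zeros are shifted in) and packs bits most significant first; real implementations would use full machine words. All names are illustrative.

```python
# Look-up table: number of 1 bits in every possible 8-bit word.
POPCOUNT = [bin(w).count("1") for w in range(256)]

def pack(bits):
    """Pack a 0/1 list into 8-bit words, most significant bit first."""
    words = []
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        chunk = chunk + [0] * (8 - len(chunk))   # zero-pad the last word
        word = 0
        for b in chunk:
            word = (word << 1) | b
        words.append(word)
    return words

def shift_left_one(words):
    """Shift the whole bitset left by one bit, touching every word."""
    out = []
    for i, w in enumerate(words):
        carry = words[i + 1] >> 7 if i + 1 < len(words) else 0
        out.append(((w << 1) & 0xFF) | carry)
    return out

def correlate_words(p_words, t_words):
    """AND each pair of words and count the 1 bits via the table."""
    return sum(POPCOUNT[p & t] for p, t in zip(p_words, t_words))

def naive_bit_parallel(pattern_bits, text_bits):
    """Return c with c[i] = sum_j pattern[j] * text[i + j]."""
    p_words = pack(pattern_bits)
    t_words = pack(text_bits)
    results = []
    for _ in range(len(text_bits)):
        results.append(correlate_words(p_words, t_words))
        t_words = shift_left_one(t_words)        # O(n) work per alignment
    return results
```

Each of the n alignments pays for one full correlation and one full shift, which is where the quadratic bound comes from.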
Bit-Parallel Cross-correlation (MSMBP)
We reduce this complexity by taking advantage of the sparseness of the reduced pattern array when m << n:
– p has O(n) words but only O(m) non-zero values:
• we only store these, at worst m words
• this reduces each correlation computation to O(m) time
However, we also need to reduce the number of shifts required:
|01010010|01000100|01011011|10000100|10100100|10010010|…
By use of pointer arithmetic, we can align the data to any constant*b alignment (where b is the byte-size) in constant time
|10100100|10001000|10110111|00001001|01001001|00100100|…
A single full shift of t gives us access to alignments c*b + 1 for any c
So by calculating the correlations out of order, we need to perform only b shifts
This results in an O(nm) time complexity algorithm
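The two savings described above, storing only the non-zero pattern words and visiting alignments out of order, can be sketched as follows. As an illustration it uses 8-bit words (b = 8), list indexing in place of pointer arithmetic, and a non-cyclic correlation with zero fill; the helper and function names are illustrative.

```python
POPCOUNT = [bin(w).count("1") for w in range(256)]

def pack(bits):
    """Pack a 0/1 list into 8-bit words, most significant bit first."""
    padded = bits + [0] * (-len(bits) % 8)
    return [int("".join(map(str, padded[i:i + 8])), 2)
            for i in range(0, len(padded), 8)]

def shift_left_one(words):
    """Shift the whole bitset left by one bit (zero fill at the end)."""
    return [((w << 1) & 0xFF) | (words[i + 1] >> 7 if i + 1 < len(words) else 0)
            for i, w in enumerate(words)]

def sparse_bit_parallel(pattern_bits, text_bits):
    """Return c with c[i] = sum_j pattern[j] * text[i + j]."""
    # Keep only the non-zero pattern words: at most m of them.
    nonzero = [(k, w) for k, w in enumerate(pack(pattern_bits)) if w]
    t_words = pack(text_bits)
    n_bits = len(text_bits)
    results = [0] * n_bits
    for offset in range(8):                      # only b = 8 full shifts
        # After `offset` shifts, word c of the text starts at bit c*8 + offset,
        # so every alignment of that form is reachable by indexing alone.
        for c in range(len(t_words)):
            alignment = c * 8 + offset
            if alignment >= n_bits:
                break
            total = 0
            for k, w in nonzero:                 # O(m) work per alignment
                if c + k < len(t_words):
                    total += POPCOUNT[w & t_words[c + k]]
            results[alignment] = total
        t_words = shift_left_one(t_words)
    return results
```

The results are filled in the order 0, 8, 16, …, then 1, 9, 17, … and so on, rather than left to right.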
FFT Cross-correlation (MSMFT)
Uses the same steps as MSMBP except the cross-correlation step is
implemented using FFTs (Fast Fourier Transforms):
This uses the property of the FFT that for numerical strings:
$$p \cdot t^{(i)} \stackrel{\text{def}}{=} \sum_{j=1}^{m} p_j \, t_{(i+j-1)}, \quad 1 \le i \le n$$
(where $t^{(i)}$ is the length-m substring of t beginning at position i)
This can be calculated accurately and efficiently in O(n*log(m)) time
(thanks to the FFTW team for the implementation used, see [5])
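The same correlation can be sketched with a small pure-Python FFT. This is not the FFTW-based implementation the authors used: it is a textbook recursive radix-2 transform, it uses 0-based indices, it treats positions past the end of the text as zero, and it transforms the whole text at once, so it runs in O((n + m) log(n + m)) time. Reaching O(n*log(m)) would additionally require splitting the text into overlapping blocks of length proportional to m, which is omitted here.

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def cross_correlate(pattern, text):
    """Return c with c[i] = sum_j pattern[j] * text[i + j] for integer inputs."""
    m, n = len(pattern), len(text)
    size = 1
    while size < n + m:
        size *= 2
    # Correlation equals convolution of the text with the reversed pattern.
    a = fft(list(text) + [0] * (size - n))
    b = fft(list(reversed(pattern)) + [0] * (size - m))
    conv = fft([x * y for x, y in zip(a, b)], invert=True)
    # The unnormalised inverse needs a division by size; c[i] sits at
    # index i + m - 1 of the convolution.
    return [round(conv[i + m - 1].real / size) for i in range(n)]

if __name__ == "__main__":
    print(cross_correlate([1, 0, 1], [0, 1, 0, 1, 1, 0, 1]))
```

Rounding is safe here because the inputs are small integers, so floating-point error stays well below one half.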
Speed Comparisons (1)
Increasing Text size with proportional Pattern size (25%,75%)
(P3 is the queue-based method of Ukkonen et al. [7], with complexity O(n*m*log(m)))
Speed Comparisons (2)
Increasing Text size with fixed Pattern size (40 points)
Constant Text size (960000 points) with increasing Pattern size
Accuracy Tests
Match % – The percentage of the pattern that existed in the text
Actual – The sizes of the actual best matches
Run 1,2,3 – The sizes of the matches found by the algorithm in each test
Avr. Diff – The average percentage of the largest present match that was returned
The text used was 4000 points in both cases
Match % | Actual | Run 1 | Run 2 | Run 3 | Avr. Diff
90%     | 180    | 180   | 180   | 180   | 100%
75%     | 150    | 150   | 150   | 150   | 100%
25%     | 50     | 50    | 50    | 50    | 100%
10%     | 20     | 4     | 5     | 5     | 23%
Match % (1st, 2nd) | Actual (1st, 2nd) | Run 1 | Run 2 | Run 3 | Avr. Diff
100%, 10%          | 200, 20           | 200   | 200   | 200   | 100%
100%, 50%          | 200, 100          | 200   | 200   | 200   | 100%
100%, 90%          | 200, 180          | 200   | 200   | 200   | 100%
100%, 99%          | 200, 198          | 200   | 200   | 200   | 100%
75%, 10%           | 150, 20           | 150   | 150   | 150   | 100%
75%, 65%           | 150, 130          | 150   | 150   | 150   | 100%
75%, 70%           | 150, 140          | 150   | 150   | 140   | 98%
75%, 73%           | 150, 146          | 150   | 150   | 150   | 100%
50%, 10%           | 100, 20           | 100   | 100   | 100   | 100%
50%, 40%           | 100, 80           | 100   | 100   | 100   | 100%
50%, 45%           | 100, 90           | 100   | 100   | 90    | 97%
25%, 5%            | 50, 10            | 50    | 50    | 50    | 100%
25%, 15%           | 50, 30            | 50    | 50    | 50    | 100%
25%, 20%           | 50, 40            | 40    | 50    | 50    | 93%
Only MSMBP was used for accuracy testing
as the two algorithms differ only in performance
Conclusions
• We have presented two algorithms, MSMBP with O(nm) and MSMFT with O(n*log(m)) time complexity, both with O(n) space
• We have shown that these are efficient on large random point sets
• We have also shown that the accuracy is very high, even in situations theorised in the paper to have a lower probability of success
• We have shown experimentally speed-ups of several orders of magnitude in some cases without a significant decrease in accuracy
The Authors would like to thank Manolis Christodoulakis for the original
implementation of the MSMFT algorithm and the EPSRC for the funding of
the second author.
Questions?