[presentation file]

Embedding and Similarity Search for Point
Sets under Translation
Minkyoung Cho and David M. Mount
University of Maryland
SoCG 2008
1
Point Pattern Matching
Given two point sets P, Q, find Q' ⊆ Q to minimize
Dist(P, Q') = min_t dist(t(P), Q'),
where t is a geometric transformation (e.g., translation, rotation, …).
[figure: point set P matched against point set Q]
2
Point Pattern Similarity Search
A collection of point sets S = {P1, P2, …, PN} has been preprocessed.
Given a query set Q, find the (approximate) nearest Pi with respect to
a distance function and transformation group.
[figure: query Q compared against the collection S = {P1, P2, …, PN}]
3
Results
• Geometric Hashing [Wolfson & Rigoutsos 97]: transformations: Translation, Rotation, Affine, …; space complexity: O(N n^{k+1}) (k: frame size); index: YES
• Embedding EMD into L1 [Indyk & Thaper 03]: transformation: None; space complexity: O(N n); index: YES
• EMD under transformation sets [Cohen & Guibas 99]: transformations: Translation, Scaling; space complexity: O(N n); index: NO; note: brute-force / heuristic (EMDM into Euclidean space)
• Ours, embedding SD into L1: transformation: Translation; space complexity: O(N n log² n); index: YES
EMD: Earth Mover's Distance; SD: Symmetric Difference Distance
4
Problem Definition
Point Pattern Similarity Searching:
• Distance Measure: Symmetric Difference Distance
  P Δ Q = (P \ Q) ∪ (Q \ P)
  e.g., P = {p1, p2, p3, p4}, Q = {p1, p2, p5, p6}:
  P Δ Q = {p3, p4} ∪ {p5, p6}, so |P Δ Q| = 4
• Error Model: Outliers (but No Noise)
• Transformation: Translation
• Restriction: Coordinates are integers
Example under translation (a brute-force sketch follows this slide):
  P = {0, 12, 14, 23, 35, 54, 59, 64}, Q = {15, 17, 20, 26, 38, 57, 65, 67}
  With t = 3, Q − t = {12, 14, 17, 23, 35, 54, 62, 64} and P ∩ (Q − t) = {12, 14, 23, 35, 54, 64},
  so |(P + t) Δ Q| = 4.
5
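For concreteness, here is a minimal brute-force sketch of the distance just defined (my own illustration, not the paper's algorithm; the function name is hypothetical). It only tries translations that align some point of P with some point of Q.

```python
def sym_diff_under_translation(P, Q):
    """Exact minimum symmetric difference of integer point sets under translation."""
    P, Q = set(P), set(Q)
    # Only translations that align some p with some q can reduce the distance.
    candidates = {q - p for p in P for q in Q}
    best = len(P) + len(Q)                      # a translation aligning nothing
    for t in candidates:
        shifted = {p + t for p in P}
        best = min(best, len(shifted ^ Q))      # ^ is symmetric difference
    return best

if __name__ == "__main__":
    P = [0, 12, 14, 23, 35, 54, 59, 64]
    Q = [15, 17, 20, 26, 38, 57, 65, 67]
    print(sym_diff_under_translation(P, Q))     # 4, achieved at t = 3
```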
Motivation: Sources of Complexity
• The source of difficulty is the combination of translation and outliers.
• Translation only:
  - Translate each point set so that its leftmost point lies at the origin.
  - Matching is then trivial.
• Outliers only:
  - Reduces to nearest neighbor search in the Hamming cube
    (by hashing or random sampling).
6
Intuition
[figure: the embedding f maps the query Q and each point set P1, P2, …, PN into a common metric space]
7
Embedding: Basic Definitions
Given metric spaces (X, d) and (X', d'), a map f: X → X' is called an embedding.
The contraction of f is the maximum factor by which distances are shrunk:
  max_{x, y ∈ X} d(x, y) / d'(f(x), f(y))
The expansion (or stretch) of f is the maximum factor by which distances are stretched:
  max_{x, y ∈ X} d'(f(x), f(y)) / d(x, y)
The distortion of f is the product of the contraction and expansion.
8
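A toy computational illustration of these definitions (my own sketch, not from the slides; all names are illustrative):

```python
import itertools

def distortion(points, d, d_prime, f):
    """Return (contraction, expansion, distortion) of f over all point pairs."""
    contraction = expansion = 0.0
    for x, y in itertools.combinations(points, 2):
        orig, img = d(x, y), d_prime(f(x), f(y))
        contraction = max(contraction, orig / img)   # factor by which f shrinks this pair
        expansion = max(expansion, img / orig)       # factor by which f stretches this pair
    return contraction, expansion, contraction * expansion

if __name__ == "__main__":
    # Points on the integer line, embedded into the plane by f(x) = (x, x),
    # with L1 distance in the plane; every distance is exactly doubled.
    pts = [0, 1, 3, 7]
    d = lambda x, y: abs(x - y)
    d1 = lambda u, v: abs(u[0] - v[0]) + abs(u[1] - v[1])
    print(distortion(pts, d, d1, lambda x: (x, x)))   # (0.5, 2.0, 1.0)
```

As expected from the definitions, a map that uniformly scales all distances has distortion 1.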
Main Result: Preliminaries
• Main result: There exists a randomized embedding that maps a point set, under the
  symmetric difference distance with respect to translation, into the metric space L1
  with distortion O(log² n).
• Assumptions:
  – Each point set has at most n elements and lies in dimension d.
  – Coordinates are integers of magnitude polynomial in n.
• Distance Function: Symmetric Difference with respect to translation
  <P Δ Q> = min_t |(P + t) Δ Q|
• Target Metric: L1
  For x, y ∈ R^d:  ||x − y||_1 = Σ_{i=1}^{d} |x_i − y_i|
9
Outline of Algorithm
[figure: example pipeline: {3,6,10,14,22} → bit vector 1 0 0 1 0 0 1 0 0 0 1 of size O(n log n)
 → translation-invariant patterns {101010, …, 010100, …, 11101} → histogram 3 0 0 2 0 0 1 0]
1. Transform d-dimensional points into 1-dimensional points. (Distortion: 1)
2. Reduce the domain size using a linear hash function. (Distortion: O(1))
3. Make the representation invariant under translation. (Distortion: O(log² n))
4. Reduce the target domain size using a universal hash function. (Distortion: O(1))
10
Translation Invariant
[figure: the hashed bit vector P = 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 of length s, shown
 cyclically extended; with probe size ρ = 4, reading the probe positions for each translation
 yields the multiset of patterns {1101, 0000, 0010, 1100, 0001, …, 1010}]
11
Intuition
hP = 1 0 0 1 0 0 1 0 0 0 0
hQ = 1 0 0 0 1 0 1 0 0 1 0
Φ2P = {10, 01, 00, 10, 01, 00, 10, 00, 00, 01, 00}
Φ2Q = {10, 00, 01, 00, 11, 00, 10, 01, 00, 11, 00}
Φ4P = {1101, 0000, 0010, 1100, 0000, 0001, 1000, 0010, 0101, 0000, 0010}
Φ4Q = {1011, 0100, 0010, 0101, 1000, 0011, 1100, 0010, 0100, 1001, 0000}
If one of the probes hits a mismatched position, then the generated bit patterns may differ.
The probability that some probe hits a mismatched position increases with the probe size
(a small calculation follows this slide).
12
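To quantify this intuition (my own sketch, not from the slides): if m of the s positions are mismatched, a probe of ρ distinct positions hits at least one mismatch with probability 1 − C(s−m, ρ)/C(s, ρ), which grows with ρ.

```python
from math import comb

def prob_probe_hits_mismatch(s, m, rho):
    """Probability that rho distinct probe positions include a mismatched one."""
    return 1 - comb(s - m, rho) / comb(s, rho)

if __name__ == "__main__":
    s, m = 11, 3                          # hP and hQ above differ in 3 positions
    for rho in (1, 2, 4, 8):
        print(rho, round(prob_probe_hits_mismatch(s, m, rho), 3))
```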
Relationship between ρ (probe size) and δ*
ρ = ⌈s / (2δ)⌉, where δ is the estimated (guessed) distance and δ* is the original distance.
[plot: distance of invariants |Φρ P Δ Φρ Q| (vertical axis: 1 up to 2s, with marks at s and s/2^i)
 versus the guess δ (horizontal axis: 1 up to 2n, with δ* and δ*/O(ln s) marked); labels:
 "Expectation", "Upper bound > 2s − 2", "Unknown", "ρ increases"]
13
Embedding
δ: estimated distance, δ*: original distance
[plot: distance of invariants |Φρ P Δ Φρ Q| versus the guessed distance δ = 2^0, 2^1, 2^2, …, 2^L
 (with δ* marked); marks on the vertical axis at 2s and δ* / O(log s)]
The embedding Ψ_P concatenates the translation invariants Φ_i P over the guess levels
δ = 2^i for i = 0, 1, …, H, where 2^H = 2^{log 2n} = 2n, each level scaled by 1/2^i.
Up to constant factors, the expected distance satisfies
  δ* / O(log s)  ≤  E[ ||Ψ_P − Ψ_Q||_1 ]  ≤  O(log n) · δ*
14
Build Time
The expensive operations are building the translation invariant and hashing over a large domain.
Building the invariant: (# of probes) × (# of translations)
  Trivial: O(s) × s = O(n log n) × O(n log n) = O(n² log² n)
Universal hash function: (# of elements) × (matrix operation)
  = (# of elements) × (input size) × (output size)
  Trivial: O(s) × O(s) × O(log s) = O(s² log s) = O(n² log³ n)
We can improve this to O(n log³ n) by merging the two operations.
Surprise!!!
15
Merge Two Operations
[figure: the hashed bit vector P = 1 0 0 0 1 0 1 0 0 1 0, the probe mask f, and the hash rows
 r_0, …, r_{log s}; for each row r_i, the s values y_0, y_1, …, y_{s−1} are obtained together
 as the convolution Conv(r_i ∘ f, P)]
Convolution can be computed in O(n log n) time, where n is the size of the array
(see the FFT sketch following this slide).
16
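As a quick illustration of the O(n log n) convolution claim, here is a small NumPy sketch (my own, not the paper's code) computing a cyclic convolution via the FFT; the probe mask is a made-up example.

```python
import numpy as np

def cyclic_convolution(a, b):
    """Return c with c[t] = sum_j a[j] * b[(t - j) mod s], via the FFT in O(s log s)."""
    fa, fb = np.fft.rfft(a), np.fft.rfft(b)
    return np.rint(np.fft.irfft(fa * fb, n=len(a))).astype(int)

if __name__ == "__main__":
    hP = np.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0])    # hashed bit vector from the slide
    mask = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])  # a hypothetical probe/row indicator
    print(cyclic_convolution(hP, mask))
```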
Main Result: Formal Statement
Given failure probability β, there exists a randomized embedding from a point set P
into a vector Ψ_P of dimension O(n (log² n) log(1/β)) such that for any P, Q:
(i)  ||Ψ_P − Ψ_Q||_1 ≤ 2 log n · <P Δ Q>
(ii) ||Ψ_P − Ψ_Q||_1 ≥ (1 / (17 log n)) · <P Δ Q>  with probability at least 1 − β
This embedding can be computed in time O(n (log⁴ n) log(1/β)).
17
Open Problems
• Q1. Can we improve the distortion bound (currently O(log² n))?
Cormode & Muthukrishnan show how to embed a string
under edit distance with moves into L1 with O(log n log* n)
distortion.
• Q2. Can we derandomize the algorithm?
Cormode & Muthukrishnan’s algorithm is deterministic.
• Q3. Can we improve space/time complexities?
18
Other Extensions
• Q1. Can we support a distance measure (e.g., Hausdorff distance) that is robust to noisy data?
• Q2. Can we handle other transformation groups?
  - integer scaling?
  - integer scaling + translation?
  - affine transformations over finite vector spaces?
Point Pattern Similarity Searching (recap):
• Distance Measure: Symmetric Difference Distance
• Error Model: Outliers (but No Noise)
• Transformation: Translation
• Restriction: Coordinates are integral
19
Thank You!
20
Translation Invariant
P = {3,6,10,14,22}
h(x) = x mod s
hP = 1 0 0 1 0 0 1 0 0 0 1   (e.g., s = 11; shown cyclically extended in the figure)
With probe size ρ = 4, reading the probe positions for each translation gives the patterns
{1101, 0000, 0010, 1100, 0001, …, 1010}, i.e., Φρ P = {13, 0, 2, 12, 1, …, 10}.
Applying h'(x) (for simplicity, x mod 10) buckets these values into a histogram over bins 0–9.
[figure: histogram of h'(Φρ P) over buckets 0, 1, …, 9]
(A code sketch of this construction follows this slide.)
21
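Here is a small Python sketch of the invariant construction on the example above; the helper name and the random choice of probe positions are my own assumptions, not the paper's code.

```python
import random
from collections import Counter

def build_invariant(P, s, rho, seed=0):
    """Return the multiset (as a Counter) of rho-bit probe patterns of hP over all
    s cyclic translations, where hP[i] = 1 iff some p in P has p = i (mod s)."""
    rng = random.Random(seed)
    hP = [0] * s
    for p in P:
        hP[p % s] = 1                          # linear hash h(x) = x mod s
    probe = rng.sample(range(s), rho)          # random probe positions Pi
    patterns = Counter()
    for t in range(s):                         # every cyclic translation of hP
        bits = [hP[(pos - t) % s] for pos in probe]
        patterns[int("".join(map(str, bits)), 2)] += 1
    return patterns

if __name__ == "__main__":
    P = [3, 6, 10, 14, 22]                     # example from the slide, s = 11
    print(build_invariant(P, s=11, rho=4))
    # Translating P leaves the multiset unchanged:
    print(build_invariant([p + 7 for p in P], s=11, rho=4) == build_invariant(P, s=11, rho=4))
```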
Trial 1: Geometric Hashing for Translation
• Naïve Version:
  - Space complexity is O(N n²), since the frame size is 1.
  - With outliers in the query, the number of queries will increase.
• Adaptive Version:
  - To reduce space complexity, if we store only c transformed sets, then the number of queries will increase.
• Outliers may lead to false matches, increasing the probability of false positives.
22
Geometric Hashing with Outliers (delete)
Given r outliers and frame size k, the number of queries must increase to guarantee a correct result.
Method 1. Pr[choose a valid frame set] = (1 − r/n)^k
Method 2. (r + 1) different trials (deterministic)
Method 3. Pigeonhole principle: Pr[choose a valid frame set] = 1 − r/(n/k)
[Grimson & Huttenlocher 90]: Outliers lead to false matches and increase the probability of false positives.
23
d-Dimension → 1-Dimension
Let u be the maximum coordinate value of each point. Then we can map a d-dimensional point set
to a 1-dimensional point set with coordinates of size at most (3u)^d, without changing the
symmetric difference distance under translation (a sketch follows this slide).
[figure: a 2-d example with points (1,1) and (5,3); the grid rows 0 1 0 0 0 / 0 0 1 0 0 /
 0 1 0 0 1 are unrolled, with padding, into one bit string 0 1 0 0 1 … 0 0 1 0 0 … 0 1 0 0 0 …,
 with index ranges [6,15], [21,30] and position 35 marked]
24
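A sketch of this reduction (my own illustration, assuming the base-(3u) digit encoding suggested by the figure): a point x maps to Σ_i x_i (3u)^i, so a d-dimensional translation corresponds to a translation of the 1-dimensional images, and the brute-force distances agree on a small example.

```python
def to_1d(points, u, d):
    base = 3 * u
    return [sum(x[i] * base**i for i in range(d)) for x in points]

def sd_under_translation_dd(P, Q, d):
    """Brute-force min over integer d-dim translations t of |(P + t) symmetric-difference Q|."""
    P, Q = set(map(tuple, P)), set(map(tuple, Q))
    cands = {tuple(q[i] - p[i] for i in range(d)) for p in P for q in Q}
    return min(len({tuple(p[i] + t[i] for i in range(d)) for p in P} ^ Q) for t in cands)

def sd_under_translation_1d(P, Q):
    P, Q = set(P), set(Q)
    return min(len({p + t for p in P} ^ Q) for t in {q - p for p in P for q in Q})

if __name__ == "__main__":
    u, d = 6, 2
    P = [(1, 1), (5, 3), (2, 4)]
    Q = [(2, 2), (6, 4), (0, 0)]          # P shifted by (1,1), with one point swapped out
    print(sd_under_translation_dd(P, Q, d))                         # 2
    print(sd_under_translation_1d(to_1d(P, u, d), to_1d(Q, u, d)))  # 2 as well
```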
# of Primes & Collision Prob.
• Collision Probability
  h(x) = x mod s, where s is a prime number in Θ(n log n) chosen uniformly at random.
  For x ≠ y:
    Pr[h(x) = h(y)] = Pr[(x mod s) = (y mod s)] = Pr[(x − y) mod s = 0]
  Since x, y ∈ Z_{n^c}, we have |x − y| < n^c, so x − y has at most c prime factors of this magnitude, and
    Pr[h(x) = h(y)] ≤ c / (# of primes) = 1/O(n).
• Prime Number Theorem:
  There are Θ(m / log m) prime numbers between 1 and m.
25
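A quick empirical check of the collision bound (my own sketch; the specific numbers are made up for illustration):

```python
import math
import random

def primes_up_to(m):
    sieve = [True] * (m + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(m**0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return [i for i, is_p in enumerate(sieve) if is_p]

def collision_rate(x, y, n, trials=2000, seed=0):
    """Fraction of random primes s in roughly [n log n, 2 n log n] with x = y (mod s)."""
    rng = random.Random(seed)
    lo = int(n * math.log(n))
    candidates = [p for p in primes_up_to(2 * lo) if p >= lo]
    hits = sum(1 for _ in range(trials) if (x - y) % rng.choice(candidates) == 0)
    return hits / trials

if __name__ == "__main__":
    n = 200
    x, y = 3**15, 3**15 - 7 * 1061   # differ by a multiple of one candidate prime (1061)
    print(collision_rate(x, y, n))   # small, roughly on the order of 1/n
```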
Distance Distortion by Hashing
We can achieve o(1) distortion with a hash function for which the probability of collision is 1/O(n).
Note that the distance can only be contracted, due to collisions.
26
Linear Hash Function (X)
• h(x) = x mod s, where s is a prime number in Θ(n log n)
• Linearity: h(x + t) = h(x) + h(t) (mod s)
• Example: P = {3, 6, 10, 14, 22} hashes to the bit vector 1 0 0 1 0 0 1 0 0 0 1
• Under translation, the invariant is unchanged: Φρ P = Φρ (P + t)
27
Universal Hash Function for large domain
Since the maximum probe size is O(n log n), the input domain of the hash function has size
2^{O(n log n)}. However, it contains only Θ(n log n) elements.
• H: {0,1}^s → {0,1}^k  (a code sketch follows this slide)
  H(x) = Rx + b (mod 2, componentwise)
  R: a random k × s binary matrix
  b: a random k-bit vector
• Time Complexity:
  Computing one value: O(ks) = O((log n) · n log n) = O(n log² n)
  For all s (= O(n log n)) values, the time is O(n² log³ n).
29
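A minimal sketch of such a hash (illustrative, not the paper's code):

```python
import numpy as np

def make_universal_hash(s, k, seed=0):
    """Return H(x) = R x + b (mod 2), mapping s-bit vectors down to k bits."""
    rng = np.random.default_rng(seed)
    R = rng.integers(0, 2, size=(k, s))        # random k x s binary matrix
    b = rng.integers(0, 2, size=k)             # random k-bit offset
    def H(x):
        return (R @ np.asarray(x) + b) % 2     # componentwise mod 2
    return H

if __name__ == "__main__":
    s, k = 16, 4
    H = make_universal_hash(s, k)
    x = np.zeros(s, dtype=int); x[[1, 5, 11]] = 1
    print(H(x))                                # a k-bit hash value
```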
Relationship between ρ and δ*
ρ = ⌈s / (2δ)⌉, where δ is a guessed distance and δ* is the optimal distance.
[plot: distance of invariants |Φρ P Δ Φρ Q| (vertical axis: 1 up to 2s, with marks at s and s/2^i)
 versus the guess δ (horizontal axis: 1 up to 2n, with δ* and δ*/O(ln s) marked); labels:
 "Expectation", "Upper bound > 2s − 2"]
30
Effect of Hash Functions
ρ = ⌈s / (2δ)⌉
[plot: distance of invariants |Φρ P Δ Φρ Q| versus the guess δ (1 up to 2n, with δ* marked),
 showing the effect of the hash functions h and h'; marks at s, 2s, and δ*/O(log s)]
31
Merge Two Operations using FFT & Convolution
Π = random_probe(ρ, s)
For t = 1, …, s:  x(t) = (hP + t)[Π]                          // make the invariant
For t = 1, …, s:  x'(t) = H·x(t) + b (mod 2);  Φρ P[x'(t)]++  // H: an O(log s) × ρ matrix
Time complexity: O(s) × O(matrix multiplication) = O(s) × O(s log s)
-----------------------------------------------------------------------
H = [r1, r2, …, r_{O(log s)}]'                 // ri: a binary row vector
H·x(t) = [r1·x(t), r2·x(t), …, r_{O(log s)}·x(t)]'
ri·x(t) = ri · (hP + t)[Π] = Σ (hP + t)[Π[ri]]            // Π[ri]: probe positions selected by ri
[ri·x(1), ri·x(2), …, ri·x(s)] = fliplr(hP) ⊛ indicator(Π[ri])   // ⊛: convolution
Time complexity: O(log s) × O(convolution) = O(log s) × O(s log s)
(A code sketch follows this slide.)
32
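Here is a Python sketch of the merged construction (my own illustration with hypothetical helper names): each hash row r_i is handled by one FFT-based cyclic convolution, giving all s inner products r_i·x(t) at once instead of an explicit matrix multiplication per translation.

```python
import numpy as np

def build_hashed_invariant(P, s, rho, k, seed=0):
    """Histogram over k-bit hash values of the probe patterns, over all s translations."""
    rng = np.random.default_rng(seed)
    hP = np.zeros(s, dtype=int)
    hP[np.mod(P, s)] = 1                      # linear hash h(x) = x mod s
    probe = rng.choice(s, size=rho, replace=False)   # probe positions Pi
    R = rng.integers(0, 2, size=(k, rho))     # universal hash rows r_1..r_k
    b = rng.integers(0, 2, size=k)

    rev_hP = np.roll(hP[::-1], 1)             # rev[j] = hP[(-j) mod s]
    F_rev = np.fft.rfft(rev_hP)
    bits = np.zeros((k, s), dtype=int)
    for i in range(k):                        # one convolution per hash row
        g = np.zeros(s, dtype=int)
        g[probe[R[i] == 1]] = 1               # indicator of Pi restricted to r_i
        conv = np.rint(np.fft.irfft(np.fft.rfft(g) * F_rev, n=s)).astype(int)
        bits[i] = (conv + b[i]) % 2           # r_i . x(t) + b_i (mod 2), for all t

    codes = (bits.T * (1 << np.arange(k))).sum(axis=1)   # k-bit value per translation
    return np.bincount(codes, minlength=1 << k)          # histogram Phi_rho P

if __name__ == "__main__":
    P = np.array([3, 6, 10, 14, 22])
    hist1 = build_hashed_invariant(P, s=11, rho=4, k=3)
    hist2 = build_hashed_invariant(P + 5, s=11, rho=4, k=3)  # translated copy
    print(np.array_equal(hist1, hist2))       # True: the histogram is translation invariant
```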
Build Time
Step: trivial running time vs. ours
• d-dimension → 1-dimension: trivial O(dn); ours O(dn)
• Linear hashing: trivial O(n); ours O(n)
• Invariant under translation: trivial O(n² log² n)
• Universal hashing (due to the domain size, this requires matrix multiplication): trivial O(n² log⁴ n)
• Invariant + universal hashing, merged (ours): O(n log³ n)
33