CS 361A
(Advanced Data Structures and Algorithms)
Lecture 19 (Dec 5, 2005)
Nearest Neighbors:
Dimensionality Reduction and
Locality-Sensitive Hashing
Rajeev Motwani
Metric Space
• Metric Space (M,D)
– For points p,q in M, D(p,q) is distance from p to q
– the only reasonable model for high-dimensional geometric spaces
• Defining Properties
– Reflexive: D(p,q) = 0 if and only if p=q
– Symmetric: D(p,q) = D(q,p)
– Triangle Inequality: D(p,q) is at most D(p,r)+D(r,q)
• Interesting Cases
– M → points in d-dimensional space
– D → Hamming distance or Lp-norms (e.g., Euclidean); see the sketch below
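Both metrics are easy to state in code. A minimal Python sketch (not from the lecture; values are illustrative) of Hamming and Lp distances, with a spot-check of the triangle inequality:

```python
import random

def hamming(p, q):
    """Hamming distance: number of coordinates where bit-vectors differ."""
    return sum(a != b for a, b in zip(p, q))

def lp(x, y, p=2.0):
    """L_p distance in R^d (p=2 gives the Euclidean metric)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Spot-check the triangle inequality D(p,q) <= D(p,r) + D(r,q).
u, v, w = ([random.randint(0, 1) for _ in range(8)] for _ in range(3))
assert hamming(u, v) <= hamming(u, w) + hamming(w, v)
assert lp(u, v) <= lp(u, w) + lp(w, v) + 1e-9
```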
High-Dimensional Near Neighbors
• Nearest Neighbors Data Structure
– Given – N points P={p1, …, pN} in metric space (M,D)
– Queries – “Which point p ∈ P is closest to point q?”
– Complexity – Tradeoff preprocessing space with query time
• Applications
– vector quantization
– multimedia databases
– data mining
– machine learning
– …
Known Results
Query Time       Storage          Technique          Paper
dN               dN               Brute force        –
2^d log N        N^(2^(d+1))      Voronoi diagram    Dobkin-Lipton 76
d^(d/2) log N    N^(d/2)          Random sampling    Clarkson 88
d^5 log N        N^d              Combination        Meiser 93
log^(d-1) N      N log^(d-1) N    Parametric search  Agarwal-Matousek 92
• Some expressions are approximate
• Bottom-line – exponential dependence on d
Approximate Nearest Neighbor
• Exact Algorithms
– Benchmark – brute-force needs space O(dN), query time O(dN)
– Known Results – exponential dependence on dimension
– Theory/Practice – no better than brute-force search
• Approximate Near-Neighbors
– Given – N points P={p1, …, pN} in metric space (M,D)
– Given – error parameter ε > 0
– Goal – for query q and nearest neighbor p, return point p′ such that
  D(q, p′) ≤ (1+ε)·D(q, p)
• Justification
– Mapping objects to metric space is heuristic anyway
– Get tremendous performance improvement
Results for Approximate NN
Query Time              Storage                Technique                      Paper
d^d ε^(-d)              dN                     Balanced trees                 Arya et al 94
d^2 polylog(N,d)        N^(2d)                 Random projection              Kleinberg 97
N                       dN polylog(N,d)        Random projection              Kleinberg 97
log^3 N                 N^(1/ε^2)              Search trees + dim. reduction  Indyk-Motwani 98
d N^(1/(1+ε)) log^2 N   N^(1+1/(1+ε)) log N    Locality-sensitive hashing     Indyk-Motwani 98
(ext. memory)           (ext. memory)          Locality-sensitive hashing     Gionis-Indyk-Motwani 99
• Will show main ideas of last 3 results
• Some expressions are approximate
Approximate r-Near Neighbors
• Given – N points P={p1,…,pN} in metric space (M,D)
• Given – error parameter ε > 0, distance threshold r > 0
• Query
– If no point p with D(q,p)<r, return FAILURE
– Else, return any p′ with D(q,p′) < (1+ε)r
• Application
– Solving Approximate Nearest Neighbor
– Assume maximum distance is R
– Run in parallel for
r  1, (1  ε), (1  ε)2 , (1  ε)3 ,, R
– Time/space – O(log R) overhead
– [Indyk-Motwani] – reduce to O(polylog N) overhead
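As a concrete illustration of the geometric radius schedule (a minimal Python sketch, not from the lecture):

```python
def radius_schedule(eps, R):
    """Radii 1, (1+eps), (1+eps)^2, ... capped at R; one approximate
    r-near-neighbor structure is built per radius."""
    radii = [1.0]
    while radii[-1] < R:
        radii.append(min(radii[-1] * (1 + eps), R))
    return radii

# Overhead = number of structures, about log(R)/log(1+eps).
print(len(radius_schedule(0.5, 1000.0)))  # ~18 structures for eps=0.5
```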
Hamming Metric
• Hamming Space
– Points in M: bit-vectors {0,1}^d (can generalize to {0,1,2,…,q}^d)
– Hamming Distance: D(p,q) = # of positions where p,q differ
• Remarks
– Simplest high-dimensional setting
– Still useful in practice
– In theory, as hard (or easy) as Euclidean space
– Trivial in low dimensions
• Example
– Hypercube in d=3 dimensions
– {000, 001, 010, 011, 100, 101, 110, 111}
Dimensionality Reduction
• Overall Idea
– Map from high to low dimensions
– Preserve distances approximately
– Solve Nearest Neighbors in new space
– Performance improvement at cost of approximation error
• Mapping?
– Hash function family H = {H1, …, Hm}
– Each Hi: {0,1}^d → {0,1}^t with t ≪ d
– Pick HR from H uniformly at random
– Map each point in P using same HR
– Solve NN problem on HR(P) = {HR(p1), …, HR(pN)}
Reduction for Hamming Spaces
Theorem: For any r and small ε > 0, there is a hash family H such that, for any p,q and random HR ∈ H,
  D(p,q) ≤ r        ⟹ D(HR(p), HR(q)) ≤ (c + ε/20)·t
  D(p,q) ≥ (1+ε)r   ⟹ D(HR(p), HR(q)) ≥ (c + ε/10)·t
with probability > 1−δ, provided, for some constant C,
  t ≥ C·log(2/δ) / ε²
[Figure: points a,b at distance ≤ r map within Hamming distance (c+ε/20)t of each other; points at distance ≥ (1+ε)r map beyond (c+ε/10)t]
Remarks
• For fixed threshold r, can distinguish between
– Near: D(p,q) < r
– Far: D(p,q) > (1+ε)r
• For N points, need δ ≤ N^(−2) (union bound over all pairs)
• Yet, can reduce to O(log N)-dimensional space,
while approximately preserving distances
• Works even if points not known in advance
Hash Family
• Projection Function
– Let S be an ordered multiset of s indexes from {1,…,d}
– p|S: {0,1}^d → {0,1}^s projects p onto the s sampled coordinates
– Example
• d=5, p=01100
• s=3, S={2,2,4} ⟹ p|S = 110
• Choosing hash function HR in H
– Repeat for i=1,…,t
• Pick multiset Si of s indexes randomly (with replacement) from {1,…,d}
• Pick random hash function fi: {0,1}^s → {0,1}
• hi(p) = fi(p|Si)
– HR(p) = (h1(p), h2(p),…,ht(p))
• Remark – note similarity to Bloom Filters
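A minimal Python sketch of this construction (not from the lecture; parameter values are illustrative, and the random fi are realized lazily as dictionaries):

```python
import random

def make_hash(d, s, t, seed=0):
    """Sample H_R: t index-multisets S_i (size s, drawn with replacement)
    and t random functions f_i: {0,1}^s -> {0,1}, realized lazily."""
    rng = random.Random(seed)
    S = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
    F = [{} for _ in range(t)]  # lazy tables for the random f_i

    def H(p):  # p: sequence of d bits
        out = []
        for i in range(t):
            proj = tuple(p[j] for j in S[i])   # p|S_i
            if proj not in F[i]:
                F[i][proj] = rng.randrange(2)  # fresh random bit for f_i
            out.append(F[i][proj])             # h_i(p) = f_i(p|S_i)
        return tuple(out)

    return H

H = make_hash(d=20, s=5, t=8)  # illustrative sizes
p = tuple(random.randint(0, 1) for _ in range(20))
print(H(p))  # p mapped from {0,1}^20 down to {0,1}^8
```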
Illustration of Hashing
[Figure: a point p ∈ {0,1}^d is projected to p|S1, …, p|St ∈ {0,1}^s; random functions f1,…,ft map these projections to bits h1(p),…,ht(p), which together form HR(p) = (h1(p),…,ht(p))]
Analysis I
• Choose random index-set S
• Claim: For any p,q
  Pr[ p|S = q|S ] = (1 − D(p,q)/d)^s
• Why?
– p,q differ in D(p,q) bit positions
– Need all s indexes of S to avoid these positions
– Sampling with replacement from {1, …,d}
Analysis II
• Choose s=d/r
• Since 1−x < e^(−x) for |x| < 1, we obtain
  Pr[ p|S = q|S ] = (1 − D(p,q)/d)^s ≈ e^(−D(p,q)/r)
• Thus
  D(p,q) ≤ r        ⟹ Pr[ p|S = q|S ] ≥ e^(−1)
  D(p,q) ≥ (1+ε)r   ⟹ Pr[ p|S = q|S ] ≤ e^(−1) − ε/3
Analysis III
• Recall hi(p)=fi(p|Si)
• Thus
  Pr[ hi(p) ≠ hi(q) ] = (1 − Pr[ p|Si = q|Si ]) · 1/2 + Pr[ p|Si = q|Si ] · 0
                      = (1 − Pr[ p|Si = q|Si ]) / 2
• Choosing c = (1 − e^(−1)) / 2
  D(p,q) ≤ r        ⟹ Pr[ hi(p) ≠ hi(q) ] ≤ (1 − e^(−1)) · 1/2 = c
  D(p,q) ≥ (1+ε)r   ⟹ Pr[ hi(p) ≠ hi(q) ] ≥ (1 − e^(−1) + ε/3) · 1/2 = c + ε/6
Analysis IV
• Recall HR(p)=(h1(p),h2(p),…,ht(p))
• D(HR(p),HR(q)) = number of i’s where hi(p), hi(q) differ
• By linearity of expectations
  E[ D(HR(p), HR(q)) ] = Σi Pr[ hi(p) ≠ hi(q) ] = t · Pr[ hi(p) ≠ hi(q) ]
• Theorem almost proved
  D(p,q) ≤ r        ⟹ E[ D(HR(p), HR(q)) ] ≤ c·t
  D(p,q) ≥ (1+ε)r   ⟹ E[ D(HR(p), HR(q)) ] ≥ (c + ε/6)·t
• For high probability bound, need Chernoff Bound
Chernoff Bound
• Consider Bernoulli random variables X1,X2, …, Xn
– Values are 0-1
– Pr[Xi=1] = x and Pr[Xi=0] = 1-x
• Define X = X1+X2+…+Xn with E[X]=nx
• Theorem: For independent X1,…,Xn and any 0 < β < 1,
  Pr[ |X − nx| ≥ β·nx ] ≤ 2e^(−β²nx/3)
Analysis V
• Define
– Xi=0 if hi(p)=hi(q), and 1 otherwise
– n=t
– Then X = X1+X2+…+Xt = D(HR(p),HR(q))
• Case 1 [D(p,q) < r ⟹ x ≤ c]
  Pr[ X ≥ (c + ε/20)t ] ≤ Pr[ |X − tx| ≥ εtc/20 ] ≤ 2e^(−(ε/20)²tc/3)
• Case 2 [D(p,q) > (1+ε)r ⟹ x ≥ c + ε/6]
  Pr[ X ≤ (c + ε/10)t ] ≤ Pr[ |X − tx| ≥ εtc/20 ] ≤ 2e^(−(ε/20)²tc/3)
• Observe – sloppy bounding of constants in Case 2
Putting it all together
• Recall
  t ≥ C·log(2/δ) / ε²
• Thus, error probability
  2e^(−(ε/20)²tc/3) ≤ 2e^(−(cC/1200)·log(2/δ))
• Choosing C = 1200/c
  2e^(−(cC/1200)·log(2/δ)) = 2e^(−log(2/δ)) = δ
• Theorem is proved!!
Algorithm I
• Set error probability δ = 1/poly(N) ⟹ t = O(ε^(−2)·log N)
• Select hash HR and map points p → HR(p)
• Processing query q
– Compute HR(q)
– Find nearest neighbor HR(p) for HR(q)
– If D(p,q) ≤ (1+ε)r then return p, else FAILURE
• Remarks
– Brute-force search for HR(p) implies query time O(ε^(−2)·N·log N)
– Need another approach for lower dimensions
Algorithm II
• Fact – Exact nearest neighbors in {0,1}^t requires
– Space O(2^t)
– Query time O(t)
• How?
– Precompute/store answers to all queries
– Number of possible queries is 2^t
• Since t = O(ε^(−2)·log N)
• Theorem – In Hamming space {0,1}^d, can solve approximate nearest neighbor with:
– Space – N^O(1/ε²)
– Query time – O(ε^(−2)·log N)
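A small Python sketch of the table-lookup idea (not from the lecture; assumes points are already hashed into {0,1}^t): precomputing the nearest hashed point for every possible query makes each query a single lookup.

```python
import itertools, random

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def build_table(hashed_points, t):
    """For each of the 2^t possible hashed queries, precompute the index of
    the nearest hashed point: O(2^t) space, O(t)-time queries afterwards.
    (Preprocessing as written costs O(2^t * N * t) time -- a sketch only.)"""
    return {u: min(range(len(hashed_points)),
                   key=lambda i: hamming(u, hashed_points[i]))
            for u in itertools.product((0, 1), repeat=t)}

t = 8  # illustrative; t = O(log N / eps^2) in the analysis
pts = [tuple(random.randint(0, 1) for _ in range(t)) for _ in range(50)]
table = build_table(pts, t)
q = tuple(random.randint(0, 1) for _ in range(t))
print(pts[table[q]])  # nearest hashed point via one lookup
```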
Different Metric
• Many applications have “sparse” points
– Many dimensions but few 1’s
– Example – points ↔ documents, dimensions ↔ words
– Better to view as “sets”
• Previous approach would require large s
• For sets A,B, define sim(A,B) = |A ∩ B| / |A ∪ B|
• Observe
– A = B ⟹ sim(A,B) = 1
– A,B disjoint ⟹ sim(A,B) = 0
• Question – Handling D(A,B)=1-sim(A,B) ?
Min-Hash
• Random permutations π1,…,πt of the universe (dimensions)
• Define mapping hj(A) = min_{a ∈ A} πj(a)
• Fact: Pr[hj(A)= hj(B)] = sim(A,B)
• Proof? – already seen!!
• Overall hash-function
HR(A) = (h1(A), h2(A),…,ht(A))
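A minimal Python sketch of Min-Hash (not from the lecture; using the same seed reuses the same t permutations for every set, which is what makes signatures comparable):

```python
import random

def minhash_signature(A, universe_size, t, seed=0):
    """t min-hashes of set A under t shared random permutations."""
    rng = random.Random(seed)
    sig = []
    for _ in range(t):
        perm = list(range(universe_size))
        rng.shuffle(perm)                    # random permutation pi_j
        sig.append(min(perm[a] for a in A))  # h_j(A) = min over a in A of pi_j(a)
    return sig

def estimated_sim(sig_a, sig_b):
    """Fraction of agreeing coordinates estimates sim(A,B) = |A∩B|/|A∪B|."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A, B = {1, 2, 3, 4}, {2, 3, 4, 5}
sA = minhash_signature(A, universe_size=10, t=200)
sB = minhash_signature(B, universe_size=10, t=200)
print(estimated_sim(sA, sB))  # close to sim(A,B) = 3/5
```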
Min-Hash Analysis
• Select
  t ≥ C·log(1/δ) / ε²
• Hamming Distance
– D(HR(A), HR(B)) = number of j’s such that hj(A) ≠ hj(B)
• Theorem: For any A,B,
  Pr[ |D(HR(A), HR(B)) − (1 − sim(A,B))·t| ≥ εt ] ≤ δ
• Proof? – Exercise (apply Chernoff Bound)
• Obtain – ANN algorithm similar to earlier result
Generalization
• Goal
– abstract technique used for Hamming space
– enable application to other metric spaces
– handle Dynamic ANN
• Dynamic Approximate r-Near Neighbors
– Fix – threshold r
– Query – if any point within distance r of q, return any point within distance (1+ε)r
– Allow insertions/deletions of points in P
• Recall – earlier method required preprocessing all
possible queries in hash-range-space…
Locality-Sensitive Hashing
• Fix – metric space (M,D), threshold r, error ε > 0
• Choose – probability parameters Q1 > Q2 > 0
• Definition – Hash family H = {h: M → S} for (M,D) is called (r, ε, Q1, Q2)-sensitive if, for random h and for any p,q in M,
  D(p,q) ≤ r        ⟹ Pr[ h(q) = h(p) ] ≥ Q1
  D(p,q) ≥ (1+ε)r   ⟹ Pr[ h(q) = h(p) ] ≤ Q2
• Intuition
– p,q are near ⟹ likely to collide
– p,q are far ⟹ unlikely to collide
Examples
• Hamming Space M = {0,1}^d
– point p=b1…bd
– H = {hi(b1…bd)=bi, for i=1…d}
– sampling one bit at random
– Pr[hi(q)=hi(p)] = 1 – D(p,q)/d
• Set Similarity D(A,B) = 1 − sim(A,B)
– Recall sim(A,B) = |A ∩ B| / |A ∪ B|
– H = { hπ : hπ(A) = min_{a ∈ A} π(a) }
– Pr[ hπ(A) = hπ(B) ] = 1 − D(A,B)
Multi-Index Hashing
• Overall Idea
– Fix LSH family H
– Boost Q1, Q2 gap by defining G = H^k
– Using G, each point hashes into l buckets
• Intuition
– r-near neighbors likely to collide
– few non-near pairs in any bucket
• Define
– G = { g | g(p) = h1(p)h2(p)…hk(p) }
– Hamming metric ⟹ sample k random bits
Example (l=4)
[Figure: four hash tables g1,…,g4, each gj concatenating h1,…,hk; points p and q at distance ≤ r collide in at least one of the four tables]
Overall Scheme
• Preprocessing
– Prepare hash table for range of G
– Select l hash functions g1, g2, …, gl
• Insert(p) – add p to buckets g1(p), g2(p), …, gl(p)
• Delete(p) – remove p from buckets g1(p), g2(p), …, gl(p)
• Query(q)
– Check buckets g1(q), g2(q), …, gl(q)
– Report nearest of (say) first 3l points
• Complexity
– Assume – computing D(p,q) needs O(d) time
– Assume – storing p needs O(d) space
– Insert/Delete/Query Time – O(dlk)
– Preprocessing/Storage – O(dN+Nlk)
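A compact Python sketch of the whole scheme for the Hamming case (not from the lecture; bit-sampling LSH, with illustrative class and parameter names):

```python
import random
from collections import defaultdict

class HammingLSH:
    """l hash tables; each g_j concatenates k random bit-samples."""
    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        self.G = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _keys(self, p):
        return [tuple(p[i] for i in g) for g in self.G]

    def insert(self, p):
        for table, key in zip(self.tables, self._keys(p)):
            table[key].append(p)

    def delete(self, p):
        for table, key in zip(self.tables, self._keys(p)):
            if p in table[key]:
                table[key].remove(p)

    def query(self, q, limit_factor=3):
        # Check buckets g_1(q),...,g_l(q); examine at most 3l candidates.
        candidates, limit = [], limit_factor * len(self.G)
        for table, key in zip(self.tables, self._keys(q)):
            candidates.extend(table[key])
            if len(candidates) >= limit:
                break
        if not candidates:
            return None
        return min(candidates, key=lambda p: sum(a != b for a, b in zip(p, q)))

index = HammingLSH(d=16, k=6, l=4)
index.insert((0,) * 16)
print(index.query((0,) * 15 + (1,)))  # usually finds the 1-bit-away point
```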
Collision Probability vs. Distance
[Figure: collision probability as a function of D(p,q), dropping from Q1 at distance r to Q2 at distance (1+ε)r for a single h; for the k,l scheme the collision probability becomes
  Pcoll = 1 − (1 − Q^k)^l
which sharpens the transition around r]
Multi-Index versus Error
• Set l = N^z where z = log(1/Q1) / log(1/Q2)
Theorem: For l = N^z, any query returns an r-near neighbor correctly with probability at least 1/6.
• Consequently (ignoring k=O(log N) factors)
– Time O(d·N^z)
– Space O(N^(1+z))
– Hamming Metric ⟹ z ≤ 1/(1+ε)
– Boost Probability – use several parallel hash-tables
Analysis
• Define (for fixed query q)
– p* – any point with D(q,p*) < r
– FAR(q) – all p with D(q,p) > (1+ ε )r
– BUCKET(q,j) – all p with gj(p) = gj(q)
– Event Esize: Σ_{j=1..l} |FAR(q) ∩ BUCKET(q,j)| < 3l
  (query cost bounded by O(dl))
– Event ENN: gj(p*) = gj(q) for some j
  (nearest point in the l buckets is an r-near neighbor)
• Analysis
– Show: Pr[Esize] = x > 2/3 and Pr[ENN] = y > 1/2
– Thus: Pr[not(Esize & ENN)] < (1-x) + (1-y) < 5/6
Analysis – Bad Collisions
• Choose k = log_{1/Q2} N
• Fact: p ∈ FAR(q) ⟹ Pr[ p ∈ BUCKET(q,j) ] ≤ Q2^k = 1/N
• Clearly
  E[ |FAR(q) ∩ BUCKET(q,j)| ] ≤ N · (1/N) = 1
  ⟹ E[ Σ_{j=1..l} |FAR(q) ∩ BUCKET(q,j)| ] ≤ l
• Markov Inequality – Pr[ X > r·E[X] ] < 1/r, for X > 0
• Lemma 1: Pr[ Σ_{j=1..l} |FAR(q) ∩ BUCKET(q,j)| ≥ 3l ] ≤ 1/3, hence Pr[Esize] ≥ 2/3
Analysis – Good Collisions


• Observe
  Pr[ gj(p*) = gj(q) ] ≥ Q1^k = Q1^(log_{1/Q2} N) = N^(−log(1/Q1)/log(1/Q2)) = N^(−z)
• Since l = N^z
  Pr[ ¬ENN ] ≤ (1 − Pr[ gj(p*) = gj(q) ])^l ≤ (1 − N^(−z))^(N^z) ≤ 1/e
• Lemma 2 Pr[ENN] >1/2
Euclidean Norms
• Recall
– x = (x1, x2, …, xd) and y = (y1, y2, …, yd) in R^d
– L1-norm
  ‖x − y‖₁ = Σ_{i=1..d} |xi − yi|
– Lp-norm (for p > 1)
  ‖x − y‖p = ( Σ_{i=1..d} |xi − yi|^p )^(1/p)
Extension to L1-Norm
• Round coordinates to {1,…,M}
• Embed L1-{1,…,M}^d into Hamming-{0,1}^(dM)
• Unary Mapping
  (x1, …, xd) → ( 1…1 0…0 , … , 1…1 0…0 ), where the i-th block has xi ones followed by M − xi zeros
  (y1, …, yd) is mapped the same way
– Hamming distance between the images is Σi |xi − yi|, i.e., exactly the L1 distance
• Apply algorithm for Hamming Spaces
– Error due to rounding of 1/M ⟹ M = Ω(1/ε)
– Space-Time Overhead due to mapping of d → dM
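A small Python sketch of the unary embedding (not from the lecture), checking that the Hamming distance of the images equals the L1 distance:

```python
def unary_embed(x, M):
    """Map (x_1,...,x_d) in {1,...,M}^d to {0,1}^(dM): each coordinate
    x_i becomes x_i ones followed by M - x_i zeros."""
    bits = []
    for xi in x:
        bits.extend([1] * xi + [0] * (M - xi))
    return bits

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

x, y, M = [3, 1, 4], [2, 5, 4], 6  # illustrative values
# Hamming distance of the embeddings equals the L1 distance exactly.
assert hamming(unary_embed(x, M), unary_embed(y, M)) == l1(x, y)
```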
Extension to L2-Norm
• Observe
– Little difference in L1-norm and L2-norm for high d
– Additional error is small
• More generally – Lp, for 1 ≤ p ≤ 2
– [Figiel et al 1977, Johnson-Schechtman 1982]
– Can embed Lp into L1
– Dimensions d → O(d)
– Distances preserved within factor (1+ε)
– Key Idea – random rotation of space
Improved Bounds
• [Indyk-Motwani 1998]
– For any Lp-norm
– Query Time – O(log³ N)
– Space – N^O(1/ε²)
• Problem – impractical
• Today – only a high-level sketch
Better Reduction
• Recall
– Reduced Approximate Nearest Neighbors to
Approximate r-Near Neighbors
– Space/Time Overhead – O(log R)
– R = max distance in metric space
• Ring-Cover Trees
– Removed dependence on R
– Reduced overhead to O(polylog N)
Approximate r-Near Neighbors
• Idea
– Impose a regular grid on R^d
– Decompose into cubes of side length s
– Label cubes with points at distance <r
• Data Structure
– Query q – determine cube containing q
– Cube labels – candidate r-near neighbors
• Goals
– Small s ⟹ lower error
– Fewer cubes ⟹ smaller storage
[Figure: a regular grid laid over points p1, p2, p3; each cube is labeled with the points within distance r of it]
Grid Analysis
• Assume r=1
• Choose s = ε/√d
• Cube Diameter = √(d·s²) = ε
• Number of cubes = Vol_d(√d/ε) = O(1/ε)^d
Theorem – For any Lp-norm, can solve Approximate r-Near Neighbor using
– Space – O(dN·ε^(−d))
– Time – O(d)
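A toy Python sketch of the grid structure (not from the lecture; d=2 for readability, and labeling cells by centers within r+ε is one concrete reading of “label cubes with points at distance < r”, so the tolerances are illustrative):

```python
import itertools, math
from collections import defaultdict

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def build_grid(points, r=1.0, eps=0.5):
    """Cells of side s = eps/sqrt(d), so each cell has diameter eps.
    Label every cell whose center lies within r + eps of some point;
    this stores O(1/eps)^d cells per point."""
    d = len(points[0])
    s = eps / math.sqrt(d)
    reach = int(math.ceil((r + eps) / s)) + 1
    grid = defaultdict(list)
    for p in points:
        base = tuple(int(math.floor(c / s)) for c in p)
        for off in itertools.product(range(-reach, reach + 1), repeat=d):
            cell = tuple(b + o for b, o in zip(base, off))
            center = tuple((c + 0.5) * s for c in cell)
            if dist(center, p) <= r + eps:
                grid[cell].append(p)
    return grid, s

def query(grid, s, q):
    """O(d) lookup: candidates labeled on q's cell."""
    return grid.get(tuple(int(math.floor(c / s)) for c in q), [])

pts = [(0.0, 0.0), (2.0, 2.0)]
grid, s = build_grid(pts)
print(query(grid, s, (0.3, -0.2)))  # -> [(0.0, 0.0)]
```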
Dimensionality Reduction
[Johnson-Lindenstrauss 84, Frankl-Maehara 88]
For p  [1,2], can map points in P into
subspace of dimension O(ε 2logN) while
preserving all inter-point distances to within a
factor 1 ε
• Proof idea – project onto random lines
• Result for NN
1/ε 2
– Space – O(dN
)
– Time – O(polylog N)
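A minimal Python sketch of the random-projection idea behind the theorem (not from the lecture; Gaussian projections are one standard instantiation, and the constant 4 in the target dimension is an illustrative choice):

```python
import math, random

def jl_project(points, eps, seed=0):
    """Project d-dimensional points onto t = ceil(4 ln N / eps^2) random
    Gaussian directions, scaled by 1/sqrt(t); inter-point L2 distances
    are then preserved within a (1 +/- eps) factor with high probability."""
    rng = random.Random(seed)
    N, d = len(points), len(points[0])
    t = max(1, int(math.ceil(4 * math.log(N) / eps ** 2)))
    R = [[rng.gauss(0, 1) / math.sqrt(t) for _ in range(d)] for _ in range(t)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in R] for p in points]

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(1)
pts = [[random.gauss(0, 1) for _ in range(500)] for _ in range(10)]
proj = jl_project(pts, eps=0.25)
print(l2(pts[0], pts[1]), l2(proj[0], proj[1]))  # close, within ~25%
```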
References
• Approximate Nearest Neighbors: Towards
Removing the Curse of Dimensionality
P. Indyk and R. Motwani
STOC 1998
• Similarity Search in High Dimensions via Hashing
A. Gionis, P. Indyk, and R. Motwani
VLDB 1999