CS 361A (Advanced Data Structures and Algorithms)
Lecture 19 (Dec 5, 2005)
Nearest Neighbors: Dimensionality Reduction and Locality-Sensitive Hashing
Rajeev Motwani

Metric Space
• Metric Space (M,D)
– For points p,q in M, D(p,q) is the distance from p to q
– The only reasonable model for high-dimensional geometric space
• Defining Properties
– Reflexive: D(p,q) = 0 if and only if p = q
– Symmetric: D(p,q) = D(q,p)
– Triangle Inequality: D(p,q) ≤ D(p,r) + D(r,q)
• Interesting Cases
– M – points in d-dimensional space
– D – Hamming or Euclidean (Lp) norms

High-Dimensional Near Neighbors
• Nearest Neighbors Data Structure
– Given – N points P = {p1, …, pN} in metric space (M,D)
– Queries – "Which point p ∈ P is closest to point q?"
– Complexity – trade off preprocessing space against query time
• Applications
– vector quantization
– multimedia databases
– data mining
– machine learning
– …

Known Results
  Query Time           Storage            Technique             Paper
  dN                   dN                 Brute Force           –
  2^d · log N          N^(2^(d+1))        Voronoi Diagram       Dobkin-Lipton 76
  d^(d/2) · log N      N^(d/2)            Random Sampling       Clarkson 88
  d^5 · log N          N^d                Combination           Meiser 93
  log^(d-1) N          N · log^(d-1) N    Parametric Search     Agarwal-Matousek 92
• Some expressions are approximate
• Bottom line – exponential dependence on d

Approximate Nearest Neighbor
• Exact Algorithms
– Benchmark – brute force needs space O(N), query time O(N)
– Known Results – exponential dependence on dimension
– Theory/Practice – no better than brute-force search
• Approximate Near-Neighbors
– Given – N points P = {p1, …, pN} in metric space (M,D)
– Given – error parameter ε > 0
– Goal – for query q and nearest neighbor p, return r such that D(q,r) ≤ (1+ε)·D(q,p)
• Justification
– Mapping objects to a metric space is heuristic anyway
– Get tremendous performance improvement

Results for Approximate NN
  Query Time             Storage             Technique                             Paper
  d^d · ε^(-d)           dN                  Balanced Trees                        Arya et al. 94
  d^2 · polylog(N,d)     N^(2d)              Random Projection                     Kleinberg 97
  N + dN·polylog(N,d)    dN·polylog(N,d)     Random Projection                     Kleinberg 97
  log^3 N                N^(1/ε^2)           Search Trees + Dimension Reduction    Indyk-Motwani 98
  dN^(1/(1+ε))           N^(1+1/(1+ε))       Locality-Sensitive Hashing            Indyk-Motwani 98
  External Memory        External Memory     Locality-Sensitive Hashing            Gionis-Indyk-Motwani 99
• Will show main ideas of the last 3 results
• Some expressions are approximate

Approximate r-Near Neighbors
• Given – N points P = {p1, …, pN} in metric space (M,D)
• Given – error parameter ε > 0, distance threshold r > 0
• Query
– If no point p with D(q,p) < r, return FAILURE
– Else, return any p' with D(q,p') < (1+ε)r
• Application – solving Approximate Nearest Neighbor
– Assume maximum distance is R
– Run in parallel for r = 1, (1+ε), (1+ε)^2, (1+ε)^3, …, R
– Time/space – O(log R) overhead
– [Indyk-Motwani] – reduce to O(polylog N) overhead

Hamming Metric
• Hamming Space
– Points in M: bit-vectors {0,1}^d (can generalize to {0,1,2,…,q}^d)
– Hamming Distance: D(p,q) = number of positions where p and q differ
• Remarks
– Simplest high-dimensional setting
– Still useful in practice
– In theory, as hard (or easy) as Euclidean space
– Trivial in low dimensions
• Example – hypercube in d = 3 dimensions
– {000, 001, 010, 011, 100, 101, 110, 111}
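As a concrete baseline for the Hamming metric, here is a minimal sketch in Python (the function names are illustrative, not from the slides) of the brute-force benchmark: store the points as-is and answer a query by a linear scan, i.e., O(dN) space and O(dN) query time.

```python
def hamming(p: str, q: str) -> int:
    """Hamming distance D(p, q): number of positions where the bit-vectors differ."""
    return sum(a != b for a, b in zip(p, q))

def brute_force_nn(points: list[str], q: str) -> str:
    """Exact nearest neighbor by linear scan - the benchmark the lecture improves on."""
    return min(points, key=lambda p: hamming(p, q))

# Example on the d = 3 hypercube from the slide.
P = ["000", "011", "101", "110"]
print(brute_force_nn(P, "111"))  # "011" (distance 1; ties broken by order in P)
```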
Dimensionality Reduction
• Overall Idea
– Map from high to low dimensions
– Preserve distances approximately
– Solve Nearest Neighbors in the new space
– Performance improvement at cost of approximation error
• Mapping?
– Hash function family H = {H1, …, Hm}
– Each Hi: {0,1}^d → {0,1}^t with t << d
– Pick HR from H uniformly at random
– Map each point in P using the same HR
– Solve the NN problem on HR(P) = {HR(p1), …, HR(pN)}

Reduction for Hamming Spaces
Theorem: For any r and small ε > 0, there is a hash family H such that for any p,q and random HR ∈ H
  D(p,q) ≤ r       ⟹  D(HR(p), HR(q)) ≤ (c + ε/20)·t
  D(p,q) ≥ (1+ε)r  ⟹  D(HR(p), HR(q)) ≥ (c + ε/10)·t
with probability > 1-δ, provided t ≥ C·log(2/δ)/ε^2 for some constant C.
[Figure: points within distance r map to images within distance (c + ε/20)·t, while points at distance at least (1+ε)r map to images at distance at least (c + ε/10)·t]

Remarks
• For fixed threshold r, can distinguish between
– Near: D(p,q) < r
– Far: D(p,q) > (1+ε)r
• For N points, need δ ≈ 1/N^2 (to union-bound over all pairs)
• Yet, can reduce to O(log N)-dimensional space while approximately preserving distances
• Works even if points are not known in advance

Hash Family
• Projection Function
– Let S be an ordered multiset of s indexes from {1,…,d}
– p|S: {0,1}^d → {0,1}^s projects p onto an s-dimensional subspace
– Example
• d = 5, p = 01100
• s = 3, S = {2,2,4} ⟹ p|S = 110
• Choosing hash function HR in H
– Repeat for i = 1,…,t
• Pick Si randomly (with replacement) from {1,…,d}
• Pick a random hash function fi: {0,1}^s → {0,1}
• hi(p) = fi(p|Si)
– HR(p) = (h1(p), h2(p), …, ht(p))
• Remark – note similarity to Bloom Filters

Illustration of Hashing
[Figure: a d-bit point p is projected to p|S1, …, p|St; each projection is fed through fi to produce the bits h1(p), …, ht(p) of HR(p)]

Analysis I
• Choose a random index-set S
• Claim: For any p,q
  Pr[p|S = q|S] = (1 - D(p,q)/d)^s
• Why?
– p,q differ in D(p,q) bit positions
– Need all s indexes of S to avoid these positions
– Sampling with replacement from {1,…,d}

Analysis II
• Choose s = d/r
• Since 1-x < e^(-x) for |x| < 1, we obtain
  Pr[p|S = q|S] = (1 - D(p,q)/d)^s ≈ e^(-D(p,q)/r)
• Thus
  D(p,q) ≤ r       ⟹  Pr[p|S = q|S] ≥ e^(-1)
  D(p,q) ≥ (1+ε)r  ⟹  Pr[p|S = q|S] ≤ e^(-1) - ε/3

Analysis III
• Recall hi(p) = fi(p|Si)
• Thus
  Pr[hi(p) ≠ hi(q)] = (1 - Pr[p|Si = q|Si])·(1/2) + Pr[p|Si = q|Si]·0 = (1 - Pr[p|Si = q|Si])/2
• Choosing c = (1/2)(1 - e^(-1))
  D(p,q) ≤ r       ⟹  Pr[hi(p) ≠ hi(q)] ≤ (1 - e^(-1))·(1/2) = c
  D(p,q) ≥ (1+ε)r  ⟹  Pr[hi(p) ≠ hi(q)] ≥ (1 - e^(-1) + ε/3)·(1/2) = c + ε/6

Analysis IV
• Recall HR(p) = (h1(p), h2(p), …, ht(p))
• D(HR(p), HR(q)) = number of i's where hi(p), hi(q) differ
• By linearity of expectation
  E[D(HR(p), HR(q))] = Σi Pr[hi(p) ≠ hi(q)] = t·Pr[hi(p) ≠ hi(q)]
• Theorem almost proved
  D(p,q) ≤ r       ⟹  E[D(HR(p), HR(q))] ≤ c·t
  D(p,q) ≥ (1+ε)r  ⟹  E[D(HR(p), HR(q))] ≥ (c + ε/6)·t
• For a high-probability bound, need the Chernoff Bound

Chernoff Bound
• Consider Bernoulli random variables X1, X2, …, Xn
– Values are 0-1
– Pr[Xi=1] = x and Pr[Xi=0] = 1-x
• Define X = X1+X2+…+Xn with E[X] = nx
• Theorem: For independent X1,…,Xn, for any 0 < β < 1,
  Pr[ |X - nx| ≥ β·nx ] ≤ 2e^(-β^2·nx/3)

Analysis V
• Define
– Xi = 0 if hi(p) = hi(q), and 1 otherwise
– n = t
– Then X = X1+X2+…+Xt = D(HR(p), HR(q))
• Case 1 [D(p,q) ≤ r ⟹ x ≤ c]
  Pr[X ≥ (c + ε/20)·t] ≤ Pr[ |X - tx| ≥ ε·t·c/20 ] ≤ 2e^(-(ε/20)^2·tc/3)
• Case 2 [D(p,q) ≥ (1+ε)r ⟹ x ≥ c + ε/6]
  Pr[X ≤ (c + ε/10)·t] ≤ Pr[ |X - tx| ≥ ε·t·c/20 ] ≤ 2e^(-(ε/20)^2·tc/3)
• Observe – sloppy bounding of constants in Case 2

Putting it all together
• Recall t ≥ C·log(2/δ)/ε^2
• Thus, error probability ≤ 2e^(-(ε/20)^2·tc/3) ≤ 2e^(-(cC/1200)·log(2/δ))
• Choosing C = 1200/c: 2e^(-(cC/1200)·log(2/δ)) = 2e^(-log(2/δ)) = δ
• Theorem is proved!!
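As a concrete illustration of the construction just analyzed, here is a minimal Python sketch of the hash family HR from the "Hash Family" slide (the function names and the parameters in the usage lines are illustrative). The one liberty taken is that each random function fi: {0,1}^s → {0,1} is realized lazily, assigning a fresh random bit the first time a projected pattern p|Si is seen; this behaves like a truly random fi on the patterns that actually occur.

```python
import random

def make_hr(d: int, r: float, t: int, seed: int = 0):
    """Sample one hash function H_R: {0,1}^d -> {0,1}^t.

    For each output coordinate i, pick S_i, a multiset of s = d/r indexes drawn
    with replacement from {0,...,d-1}, and a random function f_i; then
    h_i(p) = f_i(p|S_i) and H_R(p) = (h_1(p), ..., h_t(p)).
    """
    rng = random.Random(seed)
    s = max(1, round(d / r))
    index_sets = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
    tables = [{} for _ in range(t)]  # lazily realized random functions f_i

    def hr(p):  # p is a tuple of d bits
        out = []
        for i in range(t):
            pattern = tuple(p[j] for j in index_sets[i])   # p | S_i
            if pattern not in tables[i]:
                tables[i][pattern] = rng.randrange(2)       # f_i(p|S_i)
            out.append(tables[i][pattern])
        return tuple(out)

    return hr

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# Usage: a near pair should map to nearby t-bit images, a far pair to images
# differing in roughly t/2 coordinates.
d, r, t = 200, 10, 128
hr = make_hr(d, r, t)
p = tuple(random.Random(1).randrange(2) for _ in range(d))
q_near = tuple(b ^ 1 if j == 0 else b for j, b in enumerate(p))   # D(p, q_near) = 1
q_far = tuple(b ^ 1 for b in p)                                   # D(p, q_far) = d
print(hamming(hr(p), hr(q_near)), hamming(hr(p), hr(q_far)))
```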
Algorithm I
• Set error probability δ = 1/poly(N) ⟹ t = O(ε^(-2)·log N)
• Select hash HR and map points p → HR(p)
• Processing query q
– Compute HR(q)
– Find the nearest neighbor HR(p) of HR(q)
– If D(p,q) ≤ (1+ε)r then return p, else FAILURE
• Remarks
– Brute force for finding HR(p) implies query time O(ε^(-2)·N·log N)
– Need another approach for lower dimensions

Algorithm II
• Fact – exact nearest neighbors in {0,1}^t requires
– Space O(2^t)
– Query time O(t)
• How?
– Precompute/store answers to all queries
– Number of possible queries is 2^t
• Since t = O(ε^(-2)·log N) …
• Theorem – In Hamming space {0,1}^d, can solve approximate nearest neighbor with:
– Space N^(O(1/ε^2))
– Query time O(ε^(-2)·log N)

Different Metric
• Many applications have "sparse" points
– Many dimensions but few 1's
– Example – points ↔ documents, dimensions ↔ words
– Better to view as "sets"
• Previous approach would require large s
• For sets A,B, define sim(A,B) = |A ∩ B| / |A ∪ B|
• Observe
– A = B ⟹ sim(A,B) = 1
– A,B disjoint ⟹ sim(A,B) = 0
• Question – handling D(A,B) = 1 - sim(A,B)?

Min-Hash
• Random permutations π1,…,πt of the universe (dimensions)
• Define mapping hj(A) = min over a ∈ A of πj(a)
• Fact: Pr[hj(A) = hj(B)] = sim(A,B)
• Proof? – already seen!!
• Overall hash-function HR(A) = (h1(A), h2(A), …, ht(A))

Min-Hash Analysis
• Select t ≥ C·log(1/δ)/ε^2
• Hamming Distance – D(HR(A), HR(B)) = number of j's such that hj(A) ≠ hj(B)
• Theorem: For any A,B,
  Pr[ |D(HR(A), HR(B)) - (1 - sim(A,B))·t| ≥ ε·t ] ≤ δ
• Proof? – exercise (apply the Chernoff Bound)
• Obtain – ANN algorithm similar to the earlier result

Generalization
• Goal
– abstract the technique used for Hamming space
– enable application to other metric spaces
– handle Dynamic ANN
• Dynamic Approximate r-Near Neighbors
– Fix – threshold r
– Query – if any point is within distance r of q, return any point within distance (1+ε)r
– Allow insertions/deletions of points in P
• Recall – the earlier method required preprocessing all possible queries in the hash-range space…

Locality-Sensitive Hashing
• Fix – metric space (M,D), threshold r, error ε > 0
• Choose – probability parameters Q1 > Q2 > 0
• Definition – Hash family H = {h: M → S} for (M,D) is called (r, ε, Q1, Q2)-sensitive if, for random h and for any p,q in M
  D(p,q) ≤ r       ⟹  Pr[h(q) = h(p)] ≥ Q1
  D(p,q) ≥ (1+ε)r  ⟹  Pr[h(q) = h(p)] ≤ Q2
• Intuition
– p,q are near ⟹ likely to collide
– p,q are far ⟹ unlikely to collide

Examples
• Hamming Space M = {0,1}^d
– point p = b1…bd
– H = {hi(b1…bd) = bi, for i = 1…d} – sampling one bit at random
– Pr[hi(q) = hi(p)] = 1 - D(p,q)/d
• Set Similarity D(A,B) = 1 - sim(A,B)
– Recall sim(A,B) = |A ∩ B| / |A ∪ B|
– H = { hπ : hπ(A) = min over a ∈ A of π(a) }
– Pr[h(A) = h(B)] = 1 - D(A,B)

Multi-Index Hashing
• Overall Idea
– Fix LSH family H
– Boost the Q1, Q2 gap by defining G = H^k
– Using G, each point hashes into l buckets
• Intuition
– r-near neighbors likely to collide
– few non-near pairs in any bucket
• Define
– G = { g | g(p) = h1(p)h2(p)…hk(p) }
– Hamming metric ⟹ sample k random bits

Example (l=4)
[Figure: a point p and a query q within distance r are hashed by g1, g2, g3, g4, each the concatenation of h1,…,hk]

Overall Scheme
• Preprocessing
– Prepare hash tables for the range of G
– Select l hash functions g1, g2, …, gl
• Insert(p) – add p to buckets g1(p), g2(p), …, gl(p)
• Delete(p) – remove p from buckets g1(p), g2(p), …, gl(p)
• Query(q)
– Check buckets g1(q), g2(q), …, gl(q)
– Report the nearest of (say) the first 3l points
• Complexity
– Assume – computing D(p,q) needs O(d) time
– Assume – storing p needs O(d) space
– Insert/Delete/Query time – O(dlk)
– Preprocessing/Storage – O(dN + Nlk)
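Here is a minimal Python sketch of this scheme, instantiated with the bit-sampling LSH family for the Hamming metric described above (the class and method names are illustrative; points are tuples of d bits). Choosing k and l as in the following slides makes an r-near neighbor likely to land in at least one of the l buckets examined, while keeping the number of far points encountered small.

```python
import random

class HammingLSH:
    """Multi-index LSH over {0,1}^d with bit-sampling hash functions.

    Each g_j concatenates k randomly chosen coordinates (G = H^k); a point is
    stored in the l buckets g_1(p), ..., g_l(p), and a query inspects those
    same l buckets, reporting the nearest of the first 3l candidates seen.
    """

    def __init__(self, d: int, k: int, l: int, seed: int = 0):
        rng = random.Random(seed)
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [{} for _ in range(l)]

    def _keys(self, p):
        return [tuple(p[i] for i in coords) for coords in self.coords]

    def insert(self, p):
        for table, key in zip(self.tables, self._keys(p)):
            table.setdefault(key, set()).add(p)

    def delete(self, p):
        for table, key in zip(self.tables, self._keys(p)):
            table.get(key, set()).discard(p)

    def query(self, q):
        budget = 3 * len(self.tables)          # examine at most 3l candidates
        best, best_dist = None, None
        for table, key in zip(self.tables, self._keys(q)):
            for p in table.get(key, ()):
                dist = sum(a != b for a, b in zip(p, q))
                if best_dist is None or dist < best_dist:
                    best, best_dist = p, dist
                budget -= 1
                if budget == 0:
                    return best
        return best

# Usage sketch: index = HammingLSH(d=128, k=16, l=20); index.insert(p); index.query(q)
```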
Collision Probability vs. Distance
[Figure: collision probability as a function of distance D(p,q) – for a single hash h, Pcoll = Q drops from Q1 at distance r to Q2 at distance (1+ε)r; for the (k,l) construction, Pcoll = 1 - (1 - Q^k)^l, which sharpens the transition around r]

Multi-Index versus Error
• Set l = N^z where z = log(1/Q1) / log(1/Q2)
• Theorem: For l = N^z, any query returns an r-near neighbor correctly with probability at least 1/6.
• Consequently (ignoring k = O(log N) factors)
– Time O(dN^z)
– Space O(N^(1+z))
– Hamming metric ⟹ z ≤ 1/(1+ε)
– Boost probability – use several parallel hash tables

Analysis
• Define (for fixed query q)
– p* – any point with D(q,p*) ≤ r
– FAR(q) – all p with D(q,p) > (1+ε)r
– BUCKET(q,j) – all p with gj(p) = gj(q)
– Event Esize: Σ_{j=1..l} |FAR(q) ∩ BUCKET(q,j)| ≤ 3l (query cost bounded by O(dl))
– Event ENN: gj(p*) = gj(q) for some j (nearest point in the l buckets is an r-near neighbor)
• Analysis
– Show: Pr[Esize] = x > 2/3 and Pr[ENN] = y > 1/2
– Thus: Pr[not(Esize & ENN)] < (1-x) + (1-y) < 5/6

Analysis – Bad Collisions
• Choose k = log_{1/Q2} N
• Fact: p ∈ FAR(q) ⟹ Pr[p ∈ BUCKET(q,j)] ≤ Q2^k = 1/N
• Clearly
  E[ |FAR(q) ∩ BUCKET(q,j)| ] ≤ N·(1/N) = 1
  E[ Σ_{j=1..l} |FAR(q) ∩ BUCKET(q,j)| ] ≤ l
• Markov Inequality – Pr[X > r·E[X]] < 1/r, for X > 0
• Lemma 1:
  Pr[not Esize] = Pr[ Σ_{j=1..l} |FAR(q) ∩ BUCKET(q,j)| > 3l ] ≤ 1/3

Analysis – Good Collisions
• Observe
  Pr[gj(p*) = gj(q)] ≥ Q1^k = Q1^(log_{1/Q2} N) = N^(-log(1/Q1)/log(1/Q2)) = N^(-z)
• Since l = N^z
  Pr[ENN] ≥ 1 - (1 - Pr[gj(p*) = gj(q)])^l ≥ 1 - (1 - N^(-z))^(N^z) ≥ 1 - 1/e
• Lemma 2: Pr[ENN] > 1/2

Euclidean Norms
• Recall
– x = (x1, x2, …, xd) and y = (y1, y2, …, yd) in R^d
– L1-norm: ||x - y||_1 = Σ_{i=1..d} |xi - yi|
– Lp-norm (for p > 1): ||x - y||_p = ( Σ_{i=1..d} |xi - yi|^p )^(1/p)

Extension to L1-Norm
• Round coordinates to {1,…,M}
• Embed L1-{1,…,M}^d into Hamming-{0,1}^(dM)
• Unary Mapping
  (x1, …, xd) → ( 1…1 0…0 | … | 1…1 0…0 ), where block i consists of xi ones followed by M - xi zeros
  Then the Hamming distance between the images of x and y is exactly ||x - y||_1
• Apply the algorithm for Hamming spaces
– Error due to rounding is 1/M ⟹ take M = Ω(1/ε)
– Space/time overhead due to the mapping d → dM

Extension to L2-Norm
• Observe
– Little difference between L1-norm and L2-norm for high d
– Additional error is small
• More generally – Lp, for 1 ≤ p ≤ 2
– [Figiel et al 1977, Johnson-Schechtman 1982]
– Can embed Lp into L1
– Dimensions d → O(d)
– Distances preserved within factor (1+α)
– Key idea – random rotation of the space

Improved Bounds
• [Indyk-Motwani 1998]
– For any Lp-norm
– Query time – O(log^3 N)
– Space – N^(O(1/ε^2))
• Problem – impractical
• Today – only a high-level sketch

Better Reduction
• Recall
– Reduced Approximate Nearest Neighbors to Approximate r-Near Neighbors
– Space/time overhead – O(log R)
– R = max distance in the metric space
• Ring-Cover Trees
– Removed dependence on R
– Reduced overhead to O(polylog N)

Approximate r-Near Neighbors
• Idea
– Impose a regular grid on R^d
– Decompose into cubes of side length s
– Label cubes with the points at distance < r
• Data Structure
– Query q – determine the cube containing q
– Cube labels – candidate r-near neighbors
• Goals
– Small s ⟹ lower error
– Fewer cubes ⟹ smaller storage
[Figure: grid cells within distance r of the points p1, p2, p3 are labeled with those points]

Grid Analysis
• Assume r = 1
• Choose s = ε/√d
• Cube diameter = s·√d = ε
• Number of cubes ≈ Vol_d(√d/ε) = ε^(-O(d))
• Theorem – For any Lp-norm, can solve Approximate r-Near Neighbor using
– Space – O(dN·ε^(-d))
– Time – O(d)

Dimensionality Reduction
• [Johnson-Lindenstrauss 84, Frankl-Maehara 88]
  For p ∈ [1,2], can map the points in P into a subspace of dimension O(ε^(-2)·log N) while preserving all inter-point distances to within a factor 1+ε
• Proof idea – project onto random lines
• Result for NN
– Space – O(dN^(1/ε^2))
– Time – O(polylog N)
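To make the projection step concrete, here is a minimal sketch in Python/NumPy of the random-projection idea behind the Johnson-Lindenstrauss lemma; the Gaussian projection matrix, the 1/√t scaling, and the constant 8 in the target dimension are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def jl_project(points: np.ndarray, eps: float, seed: int = 0) -> np.ndarray:
    """Project N points in R^d onto t = O(eps^-2 log N) random directions.

    Each column of R is a scaled random Gaussian direction; with high
    probability all pairwise distances are preserved within a 1 +/- eps factor.
    """
    rng = np.random.default_rng(seed)
    n, d = points.shape
    t = int(np.ceil(8 * np.log(n) / eps ** 2))   # target dimension (constant 8 is illustrative)
    R = rng.normal(size=(d, t)) / np.sqrt(t)     # random projection matrix
    return points @ R

# Usage: pairwise distances before and after projection agree up to ~(1 + eps).
pts = np.random.default_rng(1).normal(size=(100, 10000))
proj = jl_project(pts, eps=0.25)
print(np.linalg.norm(pts[0] - pts[1]), np.linalg.norm(proj[0] - proj[1]))
```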
References
• P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
• A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.