Optimal Lower Bounds for Locality Sensitive Hashing (Except When q is Tiny)

Ryan O'Donnell (CMU), Yi Wu (CMU, IBM), Yuan Zhou (CMU)

Locality Sensitive Hashing [Indyk–Motwani '98]

H: a family of hash functions h : objects → sketches, s.t. "similar" objects collide w/ high prob. and "dissimilar" objects collide w/ low prob.

Abbreviated history

- Min-wise hash functions [Broder '97]. For sets A, B, the Jaccard similarity is |A ∩ B| / |A ∪ B|. Broder invented a simple H s.t. Pr[h(A) = h(B)] = |A ∩ B| / |A ∪ B|.
- Indyk–Motwani '98. Defined LSH. Invented a very simple H good for {0,1}^d under Hamming distance. Showed that good LSH implies good nearest-neighbor-search data structures.
- Charikar '02, STOC. Proposed an alternate H ("simhash") for Jaccard similarity. Patented by Google.

Many papers about LSH

Practice: free code base [AI '04]; sequence comparison in bioinformatics; association-rule finding in data mining; collaborative filtering; clustering nouns by meaning in NLP; pose estimation in vision; ...

Theory: [Broder '97], [Indyk–Motwani '98], [Gionis–Indyk–Motwani '98], [Charikar '02], [Datar–Immorlica–Indyk–Mirrokni '04], [Motwani–Naor–Panigrahy '06], [Andoni–Indyk '06], [Terasawa–Tanaka '07], [Andoni–Indyk '08, CACM], [Neylon '10], ...

Definition of LSH

Given: a distance space (X, dist), a "radius" r > 0, and an "approximation factor" c > 1.

Goal: a family H of functions X → S (S can be any finite set) s.t. ∀ x, y ∈ X:

- dist(x, y) ≤ r  ⟹  Pr_{h∼H}[h(x) = h(y)] ≥ p = q^ρ
- dist(x, y) ≥ cr ⟹  Pr_{h∼H}[h(x) = h(y)] ≤ q

Theorem [IM '98, GIM '98]. Given an LSH family for (X, dist), can solve "(r, cr)-near-neighbor search" for n points with a data structure of size O(n^{1+ρ}) and query time Õ(n^ρ) hash-function evaluations.

Example

X = {0,1}^d, dist = Hamming, r = εd, c = 5; we must distinguish dist ≤ εd from dist ≥ 5εd. [IM '98]: take H = {h_1, h_2, ..., h_d}, h_i(x) = x_i, i.e., "output a random coordinate."

Analysis: Pr_{h∼H}[h(x) = h(y)] = 1 − dist(x, y)/d, so

- dist(x, y) ≤ εd  ⟹  Pr_{h∼H}[h(x) = h(y)] ≥ 1 − ε = q^ρ
- dist(x, y) ≥ 5εd ⟹  Pr_{h∼H}[h(x) = h(y)] ≤ 1 − 5ε = q

Since (1 − 5ε)^{1/5} ≤ 1 − ε, this gives ρ ≤ 1/5. In general this family achieves ρ ≤ 1/c, ∀c (∀r).

An "optimal" upper bound

For ({0,1}^d, Ham), let S ≝ {0,1}^d ∪ {✔} and H ≝ {h_ab : dist(a, b) ≤ r}, where h_ab(x) = ✔ if x = a or x = b, and h_ab(x) = x otherwise. Then:

- dist(x, y) ≤ r  ⟹  Pr_{h∼H}[h(x) = h(y)] > 0 (positive, though tiny)
- dist(x, y) ≥ cr ⟹  Pr_{h∼H}[h(x) = h(y)] = 0

So ρ = 0 is "achievable." The End. Any questions?

Wait, what?

Theorem [IM '98, GIM '98]. Given an LSH family for (X, dist), can solve "(r, cr)-near-neighbor search" for n points with a data structure of size O(n^{1+ρ}) and query time Õ(n^ρ) hash-function evaluations — provided q ≥ n^{−o(1)} ("not tiny"). The family above only "achieves" ρ = 0 by driving its collision probabilities exponentially small in d, which makes the data structure useless.

More results

- For R^d with ℓ_p-distance: ρ ≤ 1/c^p, achieved for p = 1 [IM '98], for 0 < p < 1 [DIIM '04], and for p = 2 [AI '06].
- For Jaccard similarity: ρ ≤ 1/c [Bro '98].
- Lower bound, for {0,1}^d with Hamming distance: ρ ≥ .462/c − o_d(1) (assuming q ≥ 2^{−o(d)}) [MNP '06]; immediately gives ρ ≥ .462/c^p for ℓ_p-distance.

Our Theorem

For {0,1}^d with Hamming distance (∃ r s.t.): ρ ≥ 1/c − o_d(1) (assuming q ≥ 2^{−o(d)}); immediately gives ρ ≥ 1/c^p for ℓ_p-distance. The proof also yields ρ ≥ 1/c for Jaccard.

Proof

Proof: noise stability is log-convex. More precisely: a definition, and two lemmas.

Definition (noise stability). Fix any function h : {0,1}^d → S. Pick x ∈ {0,1}^d uniformly at random, and form y by flipping each bit of x independently with probability (1 − e^{−2τ})/2; write x ∼_τ y. Define

K_h(τ) ≝ Pr_{x ∼_τ y}[h(x) = h(y)].
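To make the definition concrete, here is a minimal Monte Carlo sketch in Python (illustrative only; names like sample_correlated_pair and estimate_K are ours, not from the talk). It estimates K_h(τ) for the [IM '98] bit-sampling hash h(x) = x_0, whose exact noise stability is (1 + e^{−2τ})/2, since h(x) = h(y) exactly when coordinate 0 is not flipped.

    import math
    import random

    def sample_correlated_pair(d, tau):
        # x uniform in {0,1}^d; y flips each bit of x independently
        # with probability (1 - e^(-2*tau))/2, i.e. x ~_tau y.
        flip_p = (1.0 - math.exp(-2.0 * tau)) / 2.0
        x = [random.randint(0, 1) for _ in range(d)]
        y = [b ^ (random.random() < flip_p) for b in x]
        return x, y

    def estimate_K(h, d, tau, trials=200_000):
        # Monte Carlo estimate of K_h(tau) = Pr_{x ~_tau y}[h(x) = h(y)].
        hits = 0
        for _ in range(trials):
            x, y = sample_correlated_pair(d, tau)
            hits += h(x) == h(y)
        return hits / trials

    # Bit-sampling hash h(x) = x_0; exact K_h(tau) = (1 + e^(-2*tau))/2.
    d, tau = 24, 0.1
    print(estimate_K(lambda x: x[0], d, tau))  # ~0.909, up to sampling noise
    print((1 + math.exp(-2 * tau)) / 2)        # 0.90936...

Applying the same estimator with a fresh random coordinate per trial would estimate K_H(τ) for the whole family, which is the quantity the proof below works with.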
Lemma 1. For x ∼_τ y, dist(x, y) = (1 − e^{−2τ})d/2 ± o(d) w.v.h.p.; this is ≈ τd when τ ≪ 1. Proof: Chernoff bound and Taylor expansion.

Lemma 2. K_h(τ) is a log-convex function of τ, for any h. (Figure: plot of log K_h(τ) against τ.) Proof: Fourier analysis of Boolean functions.

Theorem. LSH for {0,1}^d requires ρ ≥ 1/c − o_d(1).

Proof. Say H is an LSH family for {0,1}^d with parameters (εd + o(d), cεd − o(d), q^ρ, q), i.e., radius r ≈ εd and approximation factor c − o(1). Define

K_H(τ) ≝ E_{h∼H}[K_h(τ)] = E_{h∼H}[Pr_{x∼_τ y}[h(x) = h(y)]] = E_{x∼_τ y}[Pr_{h∼H}[h(x) = h(y)]].

(K_H is a non-negative linear combination of log-convex functions, so K_H(τ) is also log-convex.)

By Lemma 1, w.v.h.p. a pair x ∼_τ y has dist(x, y) ≈ (1 − e^{−2τ})d/2 ≈ τd. Taking τ = ε puts the pair within the radius, and taking τ = cε puts it beyond it, so the LSH guarantees give:

- ln K_H(ε) ≳ ρ ln q
- ln K_H(cε) ≲ ln q
- ln K_H(0) = ln 1 = 0

(The assumption q ≥ 2^{−o(d)} enters here: the 2^{−Ω(d)} failure probability from Lemma 1 must be negligible compared with q.)

(Figure: ln K_H(τ) against τ; the chord from (0, 0) to (cε, ln q) lies on or above the convex curve ln K_H at τ = ε.)

Since ln K_H(τ) is convex, ln K_H(ε) ≤ (1 − 1/c) · ln K_H(0) + (1/c) · ln K_H(cε) ≤ (1/c) · ln q.

∴ ρ ln q ≤ (1/c) · ln q (up to o(1) terms), and dividing by ln q < 0 gives ρ ≥ 1/c − o_d(1).

The End. Any questions?
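For readers who want to poke at Lemma 2 numerically, here is a small self-contained check (illustrative only; the hash h below is an arbitrary hypothetical choice). It computes K_h(τ) exactly by enumerating all pairs, which keeps Monte Carlo noise out of the test, and then verifies midpoint convexity of ln K_h on an evenly spaced grid of τ values.

    import math
    from itertools import product

    def K_exact(h, d, tau):
        # Exact K_h(tau): average over x uniform in {0,1}^d, weighting each
        # y by flip_p^dist(x,y) * (1 - flip_p)^(d - dist(x,y)).
        p = (1.0 - math.exp(-2.0 * tau)) / 2.0
        total = 0.0
        for x in product((0, 1), repeat=d):
            for y in product((0, 1), repeat=d):
                dist = sum(a != b for a, b in zip(x, y))
                total += p**dist * (1.0 - p)**(d - dist) * (h(x) == h(y))
        return total / 2**d

    # Arbitrary example hash: first coordinate together with overall parity.
    d = 6
    h = lambda x: (x[0], sum(x) % 2)

    taus = [0.1, 0.2, 0.3, 0.4, 0.5]
    logK = [math.log(K_exact(h, d, t)) for t in taus]
    for i in range(1, len(taus) - 1):
        # Midpoint convexity: ln K_h at the midpoint <= average of neighbors.
        assert 2 * logK[i] <= logK[i - 1] + logK[i + 1] + 1e-9
    print("ln K_h is midpoint-convex on the grid, as Lemma 2 predicts")

Lemma 2 asserts log-convexity for every h, so any other small hash substituted above should pass the same check.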