Nearest Neighbor Search in high-dimensional spaces
Alexandr Andoni (Microsoft Research)

Nearest Neighbor Search (NNS)
- Preprocess: a set D of points in R^d.
- Query: given a new point q, report a point p ∈ D with the smallest distance to q.

Motivation
- Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure.
- Application areas: machine learning, data mining, speech recognition, image/video/music clustering, bioinformatics, etc.
- Distance can be: Euclidean, Hamming, ℓ∞, edit distance, Ulam, Earth-mover distance, etc.
- Primitive for other problems: finding the closest pair in a set D, MST, clustering, ...

Plan for today
1. NNS for "basic" distances: LSH
2. NNS for "advanced" distances: embeddings

2D case
- Compute the Voronoi diagram.
- Given a query q, perform point location.
- Performance: space O(n), query time O(log n).

High-dimensional case
- All exact algorithms degrade rapidly with the dimension d:

  Algorithm                  | Query time  | Space
  Full indexing              | O(d·log n)  | n^O(d) (Voronoi diagram size)
  No indexing (linear scan)  | O(dn)       | O(dn)

- When d is high, the state of the art is unsatisfactory: even in practice, query time tends to be linear in n.

Approximate NNS
- r-near neighbor: given a new point q, report a point p ∈ D with ||p−q|| ≤ cr, as long as there exists a point at distance ≤ r from q.

Approximation Algorithms for NNS
- A vast literature:
  - with exp(d) space or Ω(n) time: [Arya-Mount et al.], [Kleinberg'97], [Har-Peled'02], ...
  - with poly(n) space and o(n) time: [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A-Indyk'06], ...

The landscape: algorithms

  Space           | Time        | Comment          | Reference
  (space: poly(n); query: logarithmic)
  n^(4/ε²) + nd   | O(d·log n)  | c = 1 + ε        | [KOR'98, IM'98]
  (space: small poly, close to linear; query: poly, sublinear)
  n^(1+ρ) + nd    | dn^ρ        | ρ ≈ 1/c          | [IM'98, Cha'02, DIIM'04]
                  |             | ρ = 1/c² + o(1)  | [AI'06]
  (space: near-linear; query: poly, sublinear)
  nd·log n        | dn^ρ        | ρ = 2.09/c       | [Ind'01, Pan'06]
                  |             | ρ = O(1/c²)      | [AI'06]

Locality-Sensitive Hashing [Indyk-Motwani'98]
- Random hash function g: R^d → Z such that for any points p, q:
  - if ||p−q|| ≤ r, then Pr[g(p)=g(q)] is "high" (at least P1, "not-so-small");
  - if ||p−q|| > cr, then Pr[g(p)=g(q)] is "small" (at most P2).
- Use several hash tables: n^ρ of them, where ρ = log(1/P1) / log(1/P2).

Example of hash functions: grids [Datar-Immorlica-Indyk-Mirrokni'04]
- Pick a regular grid; shift and rotate it randomly.
- Hash function: g(p) = index of the cell containing p.
- Gives ρ ≈ 1/c. (A small code sketch of this template appears after the tables below.)

Near-Optimal LSH [A-Indyk'06]
- Regular grid → grid of balls: p can hit empty space, so take more such grids until p falls inside a ball.
- This would need (too) many grids of balls, so start by reducing the dimension to t.
- The analysis gives ρ = 1/c² + o(1).
- Choice of the reduced dimension t: a tradeoff between the number of hash tables, n^ρ, and the time to hash, t^O(t).
- Total query time: dn^(1/c² + o(1)).

Proof idea
- Claim: ρ = log(1/P(1)) / log(1/P(c)) → 1/c², where P(r) = probability of collision when ||p−q|| = r.
- Intuitive proof (ignoring the effects of reducing the dimension):
  - P(r) = intersection / union of the two balls around p and q;
  - P(r) ≈ probability that a random point u in the ball lies beyond the dashed line separating p from q;
  - the x-coordinate of u has a nearly Gaussian distribution → P(r) ≈ exp(−A·r²).

The landscape: lower bounds

  Space           | Time                 | Comment                  | Reference
  (space: poly(n); query: logarithmic)
  n^(4/ε²) + nd   | O(d·log n)           | c = 1 + ε                | [KOR'98, IM'98]
  n^(o(1/ε²))     | ω(1) memory lookups  | lower bound              | [AIP'06]
  (space: small poly, close to linear; query: poly, sublinear)
  n^(1+ρ) + nd    | dn^ρ                 | ρ ≈ 1/c                  | [IM'98, Cha'02, DIIM'04]
                  |                      | ρ = 1/c² + o(1)          | [AI'06]
  n^(1+o(1/c²))   | ω(1) memory lookups  | ρ ≥ 1/c² (lower bound)   | [MNP'06, OWZ'10], [PTW'08, PTW'10]
  (space: near-linear; query: poly, sublinear)
  nd·log n        | dn^ρ                 | ρ = 2.09/c               | [Ind'01, Pan'06]
                  |                      | ρ = O(1/c²)              | [AI'06]
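The grid-based LSH template above can be made concrete with a short sketch. The Python code below is a minimal illustration, not the exact [DIIM'04] or [AI'06] constructions: the class names and the parameters k (projected coordinates), w (cell width), and L (number of tables) are illustrative choices. Points are hashed by a randomly projected and shifted grid, several hash tables are built, and a query is compared only against the points that collide with it.

```python
import numpy as np

class GridLSH:
    """One randomly shifted (and randomly projected) grid over R^d.

    A minimal sketch of the grid-hash template from the slides; k and w are
    illustrative parameters, not values from the talk.
    """

    def __init__(self, d, k=8, w=4.0, rng=None):
        rng = np.random.default_rng(rng)
        self.A = rng.normal(size=(k, d))      # random directions ("rotate")
        self.b = rng.uniform(0.0, w, size=k)  # random shift of the grid
        self.w = w

    def hash(self, p):
        # g(p) = index of the grid cell containing the projected, shifted point
        return tuple(np.floor((self.A @ p + self.b) / self.w).astype(int))


class LSHIndex:
    """Several hash tables; a query is compared only against colliding points."""

    def __init__(self, points, L=10, **grid_params):
        self.points = np.asarray(points, dtype=float)
        self.tables = []
        for _ in range(L):
            g = GridLSH(self.points.shape[1], **grid_params)
            buckets = {}
            for i, p in enumerate(self.points):
                buckets.setdefault(g.hash(p), []).append(i)
            self.tables.append((g, buckets))

    def query(self, q):
        q = np.asarray(q, dtype=float)
        candidates = set()
        for g, buckets in self.tables:
            candidates.update(buckets.get(g.hash(q), []))
        if not candidates:
            return None                        # no collision in any table
        cand = list(candidates)
        dists = np.linalg.norm(self.points[cand] - q, axis=1)
        return cand[int(np.argmin(dists))]     # index of the closest candidate


# Illustrative usage: index 1,000 random points in R^20 and query near one of them.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 20))
index = LSHIndex(data, L=10, k=8, w=4.0)
print(index.query(data[0] + 0.01 * rng.normal(size=20)))
```

With suitably chosen w, k, and L, a point within distance r of the query collides with it in at least one table with good probability, while far points rarely do; tuning these parameters is exactly where the exponent ρ of the tables above comes from.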
Open Question #1
- Design a space partitioning of R^t that is:
  - efficient: point location in poly(t) time;
  - qualitative: regions are "sphere-like", i.e. [Prob. a needle of length 1 is cut] ≤ (1/c²)·[Prob. a needle of length c is cut].

LSH beyond NNS [A-Indyk'07, Rahimi-Recht'07, A'09]
- Approximating kernel spaces (obliviously).
- Problem: for x, y ∈ R^d, one can define the inner product K(x,y) = e^(−||x−y||). Implicitly, K(x,y) = ϕ(x)·ϕ(y).
- Can we obtain an explicit and efficient ϕ? Approximately? (It cannot be done exactly.)
- Yes, for some kernels, via LSH. E.g., map ϕ'(x) = (r1(g1(x)), r2(g2(x)), ...), where the gi's are LSH functions on R^d and the ri's map buckets to random ±1. (A small code sketch of this map appears at the end of this part.)
- This gives a ±ε approximation in O(ε^(−2)·log n) dimensions.

Sketching (≈ dimensionality reduction in a computational space) [KOR'98, ...]

Plan for today
1. NNS for basic distances
2. NNS for advanced distances: embeddings

NNS beyond LSH: distances so far
- LSH is good for Hamming and Euclidean:

  Metric          | Space         | Time  | Comment   | Reference
  Hamming (ℓ1)    | n^(1+ρ) + nd  | dn^ρ  | ρ = 1/c   | [IM'98, Cha'02, DIIM'04]
                  |               |       | ρ ≥ 1/c   | [MNP'06, OWZ'10, PTW'08, '10]
  Euclidean (ℓ2)  | n^(1+ρ) + nd  | dn^ρ  | ρ ≈ 1/c²  | [AI'06]
                  |               |       | ρ ≥ 1/c²  | [MNP'06, OWZ'10, PTW'08, '10]

- How about other distances (not ℓp's)?

Earth-Mover Distance (EMD)
- Given two sets A, B of points, EMD(A,B) = the minimum cost of a bipartite matching between A and B.
- The points can live in the plane, in ℓ2^d, ...
- Applications: image search. (Images in the original slides courtesy of Kristen Grauman, UT Austin.)

Reductions via embeddings
- For each X ∈ M, associate a vector f(X) such that for all X, Y ∈ M, ||f(X) − f(Y)||₂ approximates the original distance between X and Y, up to some distortion (approximation factor).
- Then NNS for Euclidean space can be used. One can also consider other "easy" distances between f(X) and f(Y).
- The most popular host is ℓ1 ≡ Hamming: the real space with distance ||x−y||₁ = ∑_i |x_i − y_i|.

Earth-Mover Distance over 2D into ℓ1 [Cha'02, IT'03]
- Sets of size s in a [1…s]×[1…s] box.
- Embedding of a set A: impose a randomly shifted grid; each grid cell c gives a coordinate f(A)_c = number of points of A in cell c.
- Subpartition the grid recursively and assign new coordinates to each new cell (on all levels).
- Distortion: O(log s). (A small code sketch of this embedding also appears at the end of this part.)

Embeddings of various metrics into ℓ1

  Metric                                                      | Upper bound               | Lower bound
  Earth-mover distance (s-sized sets in the 2D plane)         | O(log s) [Cha'02, IT'03]  | Ω(√log s) [NS'07]
  Earth-mover distance (s-sized sets in {0,1}^d)              | O(log s·log d) [AIK'08]   | Ω(log s)
  Edit distance over {0,1}^d (= # indels to transform x → y)  | 2^Õ(√log d) [OR'05]       | Ω(log d) [KN'05, KR'06]
  Ulam (edit distance between non-repetitive strings)         | O(log d) [CK'06]          | Ω̃(log d) [AK'07]
  Block edit distance                                         | Õ(log d) [MS'00, CM'07]   | 4/3 [Cor'03]

Open Question #3
- Improve the distortion of embedding EMD, W2, edit distance into ℓ1.

Really beyond LSH: NNS for ℓ∞

  Space     | Time        | Comment          | Reference
  n^(1+ρ)   | O(d·log n)  | c ≈ log_ρ log d  | [I'98]

- Achieved via decision trees; one cannot do better via (deterministic) decision trees.
- NNS for mixed norms, e.g. ℓ2(ℓ1): [I'04, AIK'09, A'09].

Embedding into mixed norms [AIK'09]
- Ulam O(1)-embeds into ℓ2²(ℓ∞(ℓ1)) of small dimension, which yields NNS with O(log log d) approximation.
- In contrast, the distortion would be Ω̃(log d) if one embedded into each of these norms separately!

Open Question #4
- Embed EMD and edit distance into mixed norms?
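Returning to the "LSH beyond NNS" slide, the map ϕ'(x) = (r1(g1(x)), r2(g2(x)), ...) can be sketched directly. In the Python sketch below, the class name and the parameters m (number of features) and w (cell width) are illustrative choices, and simple one-dimensional grid hashes stand in for whichever LSH family fits the kernel at hand: each coordinate applies an LSH function and then a random ±1 sign to its bucket, so ϕ'(x)·ϕ'(y)/m estimates the collision probability Pr[g(x)=g(y)], which decays with ||x−y||.

```python
import numpy as np

class LSHKernelFeatures:
    """Explicit feature map phi'(x) = (r_1(g_1(x)), ..., r_m(g_m(x))).

    A sketch of the "LSH + random signs" map: each g_i is a one-dimensional
    randomly shifted grid hash, and r_i assigns an independent random +/-1
    sign to every bucket it encounters.  phi'(x) . phi'(y) / m then estimates
    Pr[g(x) = g(y)].  The parameters m and w are illustrative choices.
    """

    def __init__(self, d, m=512, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(m, d))      # random direction per feature
        self.b = rng.uniform(0.0, w, size=m)  # random grid shift per feature
        self.w = w
        self.rng = rng
        self.signs = {}                       # (feature index, bucket) -> +/-1

    def _sign(self, i, bucket):
        key = (i, int(bucket))
        if key not in self.signs:
            self.signs[key] = float(self.rng.choice((-1.0, 1.0)))
        return self.signs[key]

    def transform(self, x):
        buckets = np.floor((self.a @ np.asarray(x, dtype=float) + self.b) / self.w)
        return np.array([self._sign(i, c) for i, c in enumerate(buckets)])


# Illustrative usage: the same object (hence the same hashes and signs) must be
# used for both points; the dot product estimates their collision probability.
fmap = LSHKernelFeatures(d=3)
x, y = np.array([0.0, 0.0, 0.0]), np.array([0.5, 0.5, 0.0])
estimate = fmap.transform(x) @ fmap.transform(y) / fmap.a.shape[0]
```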
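The recursive-grid embedding of EMD into ℓ1 from the [Cha'02, IT'03] slide can likewise be written out. The helper below is only a sketch under stated assumptions: the function names, the sparse-dictionary representation, and the level count are choices made here for illustration. It imposes one randomly shifted grid per level, records for each cell a coordinate equal to the number of points in the cell scaled by the cell side, and compares two sets by the ℓ1 distance between these sparse vectors; the same random shift must be shared by the two sets being compared.

```python
import numpy as np
from collections import Counter

def emd_grid_embedding(points, s, seed=0):
    """Embed a point set in the [0, s) x [0, s) box into l_1 (a sketch).

    One randomly shifted grid per level, with cell side s / 2^level; each cell
    contributes a coordinate equal to (number of points in the cell) * (cell
    side), following the recursive-grid idea in the slides.
    """
    rng = np.random.default_rng(seed)
    shift = rng.uniform(0.0, s, size=2)          # shared by the two sets compared
    pts = np.asarray(points, dtype=float) + shift
    coords = Counter()
    levels = int(np.ceil(np.log2(s))) + 1
    for level in range(levels):
        side = s / (2 ** level)                  # cell side length at this level
        cells = np.floor(pts / side).astype(int)
        for cx, cy in cells:
            coords[(level, cx, cy)] += side      # count weighted by the cell side
    return coords

def l1_distance(f, g):
    """l_1 distance between two sparse embeddings (dictionaries of coordinates)."""
    keys = set(f) | set(g)
    return sum(abs(f.get(k, 0.0) - g.get(k, 0.0)) for k in keys)

# Illustrative usage: the same seed (i.e. the same grid shifts) must be used
# for both sets; the result approximates EMD(A, B) up to O(log s) distortion.
A = [(1.0, 1.0), (3.0, 4.0)]
B = [(1.0, 2.0), (3.0, 3.0)]
approx_emd = l1_distance(emd_grid_embedding(A, s=8), emd_grid_embedding(B, s=8))
```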
Summary: high-dimensional NNS
- Locality-sensitive hashing:
  - for Hamming and Euclidean spaces;
  - provably (near) optimal NNS in some regimes;
  - applications beyond NNS: kernels, sketches.
- Beyond LSH:
  - non-normed distances, via embeddings into ℓ1;
  - algorithms for ℓp and mixed norms (of ℓp's).
- Some open questions:
  - design a qualitative, efficient LSH / space partitioning (in Euclidean space);
  - embed "harder" distances (like EMD, edit distance) into ℓ1, or mixed norms (of ℓp's)?
  - is there an LSH for ℓ∞?
  - NNS for any norm, e.g. the trace norm?