Spectral Approaches to Nearest Neighbor Search
arXiv:1408.0751
Robert Krauthgamer (Weizmann Institute)
Joint with: Amirali Abdullah, Alexandr Andoni, Ravi Kannan
Les Houches, January 2015

Nearest Neighbor Search (NNS)
- Preprocess: a set P of n points in ℝ^d
- Query: given a query point q, report a point p* ∈ P with the smallest distance to q

Motivation
- Generic setup:
  - points model objects (e.g. images)
  - distance models a (dis)similarity measure
- Application areas:
  - machine learning: k-NN rule
  - signal processing, vector quantization, bioinformatics, etc.
- Distance can be: Hamming, Euclidean, edit distance, earth-mover distance, …

Curse of Dimensionality
- All exact algorithms degrade rapidly with the dimension d

  Algorithm                  | Query time   | Space
  Full indexing              | O(d · log n) | n^{O(d)} (Voronoi diagram size)
  No indexing (linear scan)  | O(d · n)     | O(d · n)

Approximate NNS
- Given a query point q, report p′ ∈ P s.t. ||p′ − q|| ≤ c · min_{p* ∈ P} ||p* − q||
  - c ≥ 1: approximation factor
  - randomized: return such a p′ with probability ≥ 90%
- Heuristic perspective: gives a (hopefully small) set of candidates

NNS algorithms
- It's all about space partitions!
- Low-dimensional: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …
- High-dimensional: [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [Andoni-Indyk'06], [Andoni-Indyk-Nguyen-Razenshteyn'14], [Andoni-Razenshteyn'15]

Low-dimensional
- kd-trees, …
- c = 1 + ε
- runtime: ε^{−O(d)} · log n

High-dimensional
- Locality-Sensitive Hashing
- Crucial use of random projections
  - Johnson-Lindenstrauss Lemma: project to a random subspace of dimension O(ε^{−2} log n) for 1 + ε approximation
- Runtime: n^{1/c} for c-approximation

Practice
- Data-aware partitions: optimize the partition to your dataset
  - PCA-tree [Sproull'91, McNames'01, Verma-Kpotufe-Dasgupta'09]
  - randomized kd-trees [Silpa-Anan-Hartley'08, Muja-Lowe'09]
  - spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus'08, Wang-Kumar-Chang'09, Salakhutdinov-Hinton'09, Yagnik-Strelow-Ross-Lin'11]

Practice vs Theory
- Data-aware projections often outperform (vanilla) random-projection methods
- But: no guarantees (correctness or performance)
- JL is generally optimal [Alon'03, Jayram-Woodruff'11]
  - even for some NNS setups! [Andoni-Indyk-Patrascu'06]
- Why do data-aware projections outperform random projections?
- Is there an algorithmic framework to study this phenomenon?

Plan for the rest
- Model
- Two spectral algorithms
- Conclusion

Our model
- "low-dimensional signal + large noise", inside a high-dimensional space
- Signal: P ⊂ U for a subspace U ⊂ ℝ^d of dimension k ≪ d
- Data: each point is perturbed by full-dimensional Gaussian noise N_d(0, σ²·I_d)

Model properties
- Data: P̃ = P + G, with noise entries N(0, σ²)
- Query: q̃ = q + g_q s.t.:
  - points in P have at least unit norm
  - ||q − p*|| ≤ 1 for the "nearest neighbor" p*
  - ||q − p|| ≥ 1 + ε for everybody else
- σ ≈ 1/d^{1/4}, up to a factor poly(ε^{−1} k log n)
  - Claim: the exact nearest neighbor is still the same
- Noise is large:
  - it has magnitude σ√d ≈ d^{1/4} ≫ 1
  - the k signal dimensions carry only a sub-constant fraction of a point's mass
  - JL would not work: after the noise, the gap is very close to 1
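To make the model concrete, here is a minimal NumPy sketch that generates such an instance. The parameter values, the planting of the query, and the name make_instance are my own illustrative choices; in particular the separation ||q − p|| ≥ 1 + ε for non-neighbors is only typical here, not enforced.

```python
# Minimal sketch of the "low-dimensional signal + large noise" model above.
# Illustrative parameters only; the talk's exact separation conditions are
# not enforced, just typical for this random construction.
import numpy as np

def make_instance(n=500, d=2500, k=10, seed=0):
    rng = np.random.default_rng(seed)

    # Random k-dimensional signal subspace U of R^d (orthonormal columns).
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))

    # Signal points live inside U; with k = 10 their norms concentrate around
    # sqrt(k) >= 1 and typical pairwise gaps are of order 1.
    coords = rng.standard_normal((n, k))
    P = coords @ U.T                              # n x d, rank <= k

    # Query planted within distance ~1 of point 0, which is then (typically,
    # though not guaranteed here) its exact nearest neighbor.
    offset = rng.standard_normal(d)
    q = P[0] + 0.9 * offset / np.linalg.norm(offset)

    # Full-dimensional Gaussian noise with sigma ~ d^{-1/4}: each noise vector
    # has norm ~ sigma * sqrt(d) ~ d^{1/4} >> 1, far larger than the signal gaps.
    sigma = d ** -0.25
    P_noisy = P + sigma * rng.standard_normal((n, d))
    q_noisy = q + sigma * rng.standard_normal(d)
    return P_noisy, q_noisy, U, 0                 # data, query, signal subspace, answer
```

With σ ≈ d^{−1/4} the noise dwarfs the unit-scale signal distances, which is exactly the regime the slides target.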
Algorithms via PCA
- Find the "signal subspace" U?
  - then we could project everything onto U and solve NNS there
- Use Principal Component Analysis (PCA)?
  - ≈ extract the top direction(s) from the SVD
  - e.g., the k-dimensional space S that minimizes Σ_{p∈P} d²(p, S)
- If PCA removes the noise "perfectly", we are done:
  - S = U
  - can reduce to k-dimensional NNS

NNS performance as if we were in k dimensions, for the full model?
- Best we can hope for: the dataset contains a "worst-case" k-dimensional instance
- Reduction from dimension d to k
- Spoiler: Yes

PCA under noise fails
- Does PCA find the "signal subspace" U under noise? No!
- PCA minimizes Σ_{p∈P} d²(p, S)
  - good only on "average", not in the "worst case"
  - weak signal directions are overpowered by noise directions
  - a typical noise direction contributes Σ_{i=1}^{n} g_i² = Θ(nσ²)

1st Algorithm: intuition
- Extract "well-captured points"
  - points whose signal is mostly inside the top PCA space
  - should work for a large fraction of the points
- Iterate on the rest

Iterative PCA
  • Find the top PCA subspace S
  • C = points well-captured by S
  • Build an NNS data structure on {C projected onto S}
  • Iterate on the remaining points, P ∖ C
  • Query: query each NNS data structure separately
- To make this work:
  - nearly no noise in S: ensure S is close to U
    - S determined by heavy-enough spectral directions (its dimension may be less than k)
  - capture only points whose signal is fully inside S
    - well-captured: distance to S explained by the noise only

Simpler model
- Assume: small noise
  - p̃_i = p_i + α_i, where ||α_i|| ≪ ε
  - can even be adversarial
- Algorithm:
  • Find the top-k PCA subspace S
  • C = points well-captured by S (well-captured if d(p̃, S) ≤ 2α)
  • Build NNS on {C projected onto S}
  • Iterate on the remaining points, P ∖ C
  • Query: query each NNS separately
- Claim 1: if p* is captured by C, we will find it in the NNS
  - for any captured p̃: ||p̃_S − q̃_S|| = ||p̃ − q̃|| ± 4α = ||p − q|| ± 5α
- Claim 2: the number of iterations is O(log n)
  - Σ_{p∈P} d²(p̃, S) ≤ Σ_{p∈P} d²(p̃, U) ≤ n · α²
  - so for at most a 1/4-fraction of the points, d²(p̃, S) ≥ 4α²
  - hence a constant fraction is captured in each iteration

Analysis of the general model
- Need to use the randomness of the noise
- Want to say that the "signal" is stronger than the "noise" (on average)
- Use random matrix theory:
  - P̃ = P + G
  - G is a random n × d matrix with entries N(0, σ²)
    - all its singular values satisfy λ² ≤ σ²n ≈ n/√d
  - P has rank ≤ k and (Frobenius norm)² ≥ n
    - important directions have λ² ≥ Ω(n/k)
    - can ignore directions with λ² ≪ εn/k
  - Important signal directions are stronger than the noise!

Closeness of subspaces?
- Trickier than singular values
  - the top singular vector is not stable under perturbation!
  - it is only stable if the second singular value is much smaller
- How to even define "closeness" of subspaces?
- To the rescue: Wedin's sin-theta theorem
  - sin θ(S, U) = max_{x∈S, ||x||=1} min_{y∈U} ||x − y||

Wedin's sin-theta theorem
- Developed by [Davis-Kahan'70], [Wedin'72]
- Theorem: consider P̃ = P + G, where
  - S is the top-k subspace of P̃
  - U is the k-dimensional space containing P
  then sin θ(S, U) ≤ ||G|| / λ_k(P)
- Another way to see why we need to take only directions with sufficiently heavy singular values
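Before the remaining analysis issues and the second algorithm, here is a minimal NumPy sketch of the iterative-PCA loop described above. The capture threshold capture_dist, the stalling guard, and the brute-force search inside each projected set are illustrative stand-ins of mine (the talk sets the threshold from the noise level and builds a proper low-dimensional NNS data structure), and the function names are not from the paper.

```python
# Sketch of the Iterative PCA indexing loop from the slides above.
# Thresholds and the inner brute-force search are illustrative stand-ins.
import numpy as np

def build_iterative_pca(points, k, capture_dist):
    """Partition the rows of `points` into rounds; each round keeps its own
    top-k PCA subspace, the captured points projected onto it, and their indices."""
    structures = []
    remaining = np.arange(len(points))
    while len(remaining) > 0:
        X = points[remaining]
        # Top-k right singular vectors = top PCA subspace S
        # (the k-dimensional S minimizing the sum of d^2(p, S)).
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        S = Vt[:k].T                                   # d x k orthonormal basis
        # Well-captured points: distance to S small enough to be "noise only".
        proj = X @ S
        residual = np.linalg.norm(X - proj @ S.T, axis=1)
        captured = residual <= capture_dist
        if not captured.any():                         # guard against stalling
            captured[np.argmin(residual)] = True
        structures.append((S, proj[captured], remaining[captured]))
        remaining = remaining[~captured]               # iterate on the rest
    return structures

def query_iterative_pca(structures, q):
    """Query each per-round structure separately; return the best original index."""
    best, best_dist = None, np.inf
    for S, projected, idx in structures:
        qs = q @ S                                     # project the query onto S
        dists = np.linalg.norm(projected - qs, axis=1) # brute-force NNS inside S
        j = int(np.argmin(dists))
        if dists[j] < best_dist:
            best, best_dist = int(idx[j]), float(dists[j])
    return best
```

On instances from the make_instance sketch earlier, one expects most points to be captured within the first few rounds; the analysis below bounds the number of rounds and the resulting query time in the general model.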
Additional issue: Conditioning
- After an iteration, the noise is not random anymore!
  - the non-captured points might be "biased" by the capturing criterion
- Fix: estimate the top PCA subspace from a small sample of the data
  - this might be needed purely for the analysis
  - but it does not sound like a bad idea in practice either

Performance of Iterative PCA
- Can prove there are O(√(d log n)) iterations
- In each, we run NNS in a space of dimension ≤ k
- Overall query time: O((1/ε)^{O(k)} · d · √d · log^{3/2} n)
- Reduced to O(√(d log n)) instances of k-dimensional NNS!

2nd Algorithm: PCA-tree
- Closer to algorithms used in practice
  • Find the top PCA direction v
  • Partition into slabs ⊥ v (slab width ≈ ε/k)
  • Snap points to the ⊥ hyperplanes
  • Recurse on each slab
- Query: follow all tree paths that may contain p*

2 algorithmic modifications
- Centering:
  - need to use centered PCA (subtract the average)
  - otherwise errors from the perturbations accumulate
- Sparsification:
  - need to sparsify the set of points in each node of the tree
  - otherwise we can get a "dense" cluster:
    - not enough variance in the signal
    - lots of noise

Analysis
- An "extreme" version of the Iterative PCA algorithm:
  - just use the top PCA direction: guaranteed to have signal!
- Main lemma: the tree depth is ≤ 2k
  - because each discovered direction is close to U
  - snapping: like orthogonalizing with respect to each one
  - and there cannot be too many such directions
- Query runtime: (k/ε)^{O(k)} · d
- Overall performs like O(k · log k)-dimensional NNS!

Wrap-up
- Why do data-aware projections outperform random projections?
- Is there an algorithmic framework to study this phenomenon?
- Here:
  - Model: "low-dimensional signal + large noise"
  - becomes like NNS in a low-dimensional space, via the "right" adaptation of PCA
- Immediate questions:
  - other, less-structured signal/noise models?
  - algorithms with runtime dependent on the spectrum?
- Broader question: analysis that explains the empirical success?
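To complement the sketch of the first algorithm, here is a minimal NumPy sketch of the PCA-tree from the "2nd Algorithm" slides, with the centering modification included. The sparsification step is omitted, and the slab width, leaf size, and the query's "follow nearby slabs" rule are illustrative assumptions of mine rather than the paper's exact choices.

```python
# Sketch of the PCA-tree: split along the top centered PCA direction into
# slabs, snap each slab's points onto its mid-hyperplane, and recurse.
# Parameters are illustrative; sparsification is not implemented.
import numpy as np

class PCATreeNode:
    def __init__(self, X, idx, slab_width, leaf_size=32):
        self.idx, self.children = idx, None
        if len(idx) <= leaf_size:
            return
        self.mean = X.mean(axis=0)                        # centered PCA (modification 1)
        _, _, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.v, self.w = Vt[0], slab_width                # top PCA direction
        t = (X - self.mean) @ self.v
        slab_of = np.floor(t / slab_width).astype(int)    # partition into slabs ⊥ v
        if len(np.unique(slab_of)) <= 1:
            return                                        # no progress; stay a leaf
        self.children = {}
        for s in np.unique(slab_of):
            mask = slab_of == s
            mid = (s + 0.5) * slab_width                  # snap onto the slab's mid-hyperplane
            Xs = X[mask] - np.outer(t[mask] - mid, self.v)
            self.children[s] = PCATreeNode(Xs, idx[mask], slab_width, leaf_size)

    def candidates(self, q, radius=1.0):
        """Follow every path whose slab could contain a point within `radius` of q."""
        if self.children is None:
            return list(self.idx)
        tq = (q - self.mean) @ self.v
        out = []
        for s, child in self.children.items():
            if s * self.w - radius <= tq <= (s + 1) * self.w + radius:
                out.extend(child.candidates(q, radius))
        return out

# Example (with the make_instance sketch from earlier; values are arbitrary):
#   data, q, _, ans = make_instance()
#   root = PCATreeNode(data, np.arange(len(data)), slab_width=0.25)
#   cand = root.candidates(q, radius=2.0)
#   best = min(cand, key=lambda i: np.linalg.norm(data[i] - q))
```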