Search k-Nearest Neighbors in High Dimensions
Tomer Peled, Dan Kushnir
"Tell me who your neighbors are, and I'll know who you are."

Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse
• Locality Sensitive Hashing (approximate solutions for high dimensions)
• l2 extension
• Applications (Dan)

Nearest Neighbor Search – problem definition
• Given: a set P of n points in R^d, under some distance metric
• Find: the nearest neighbor p ∈ P of a query point q

Applications
• Classification
• Clustering
• Segmentation
• Indexing
• Dimension reduction (e.g. LLE)
[Figure: a query q among samples in a 2D feature space (weight vs. color)]

Naïve solution
• No preprocessing
• Given a query point q, go over all n points
  – Do the comparison in R^d
  – Query time = O(nd)
• Keep this baseline in mind

Common solution
• Use a data structure for acceleration
• Scalability with n and with d is important

When to use nearest neighbors – high-level view
• Parametric methods: probability-distribution estimation, complex models
• Non-parametric methods: density estimation, nearest neighbors – suited to sparse data and high dimensions
• Nearest neighbors assume no prior knowledge about the underlying probability structure

Nearest Neighbor (exact)
• Return argmin_{pi ∈ P} dist(q, pi)

r, ε – Nearest Neighbor (approximate)
• If some point p1 satisfies dist(q, p1) ≤ r, return any point p2 with dist(q, p2) ≤ (1 + ε)·r
• r2 = (1 + ε)·r1

The simplest solution
• "Lion in the desert": repeatedly bisect the space

Quadtree
• Split the first dimension into 2
• Repeat iteratively
• Stop when each cell holds no more than 1 data point

Quadtree – structure
[Figure: recursive split at (X1, Y1) into four children: P<X1 ∧ P<Y1, P<X1 ∧ P≥Y1, P≥X1 ∧ P<Y1, P≥X1 ∧ P≥Y1]

Query – Quadtree
• Descend to the query's cell and search the neighboring cells
• In many cases this works

Pitfall 1 – Quadtree
• In some cases it doesn't: the nearest neighbor may sit in a distant cell
• In some cases nothing works and many cells must be inspected

Pitfall 2 – Quadtree
• Query time can be exponential in the number of dimensions: O(2^d)

Space-partition-based algorithms
• Can be improved – see "Multidimensional access methods", Volker Gaede & Oliver Günther

Curse of dimensionality
• Exact data structures need query time or space O(min(nd, n^d)); the naive scan takes O(nd)
• For d > 10..20 they are worse than a sequential scan for most geometric distributions of the data
• Techniques specific to high dimensions are needed
• Proved in theory and in practice by Barkol & Rabani 2000 and Beame & Vee 2002

Curse of dimensionality – some intuition
• The number of cells grows as 2, 2^2, 2^3, …, 2^d

Preview
• General solution – Locality Sensitive Hashing
• Implementation for Hamming space
• Generalization to l1 & l2

Hash function
• Data_Item → Hash function → Key → Bin/Bucket
• Example: X = a number in the range 0..n; X modulo 3 gives a storage address in 0..2
• Usually we would like related data items to be stored in the same bin

Recall: r, ε – Nearest Neighbor
• dist(q, p1) ≤ r, dist(q, p2) ≤ (1 + ε)·r, r2 = (1 + ε)·r1

Locality sensitive hashing
• A hash family I is (r, ε, P1, P2)-sensitive if
  – P1 ≡ Pr[I(p) = I(q)] is "high" when p is "close" to q (dist ≤ r)
  – P2 ≡ Pr[I(p) = I(q)] is "low" when p is "far" from q (dist ≥ r2 = (1 + ε)·r)

Hamming space
• Hamming space = the 2^N binary strings of length N
• Hamming distance = number of differing digits, a.k.a. signal distance (Richard Hamming)
• Example: 010100001111 vs. 010010000011 → distance = 4
• dist(X1, X2) = SUM(X1 XOR X2)

L1 to Hamming space embedding
• Unary code with C = 11: the value 2 maps to 11000000000, the value 8 maps to 11111111000
• A d-dimensional vector maps to a binary string of length d' = C·d (concatenate the unary codes), so that L1 distance becomes Hamming distance

Hash function (bit sampling)
• p ∈ H^{d'}
• G_j(p) = p|I_j for j = 1..L: sample k digits of p (e.g. k = 3)
• Store p in the bucket indexed by p|I_j; each table has 2^k buckets

Construction
• Insert every point p into its bucket in each of the L tables

Query
• Hash q with the same G_1..G_L and inspect the L buckets it falls into

Alternative intuition: random projections
• Sampling k digits of the unary code amounts to cutting the space with k random axis-parallel hyperplanes
• For k = 3 each point falls into one of 2^3 = 8 cells (buckets 000..111); nearby points tend to share a cell
• Repeating L times gives L independent partitions

Secondary hashing
• The 2^k bucket labels are mapped by a simple hash into M buckets of size B, with M·B = α·n (e.g. α = 2)
• Supports tuning dataset size vs. storage volume

The above hashing is locality-sensitive
• Pr[p and q fall in the same bucket] = (1 − Ham(p, q)/d')^k, where d' is the number of dimensions
• The collision probability decays with distance, and a larger k makes the decay sharper (compare the curves for k = 1 and k = 2)
• Adapted from Piotr Indyk's slides
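Below is a minimal sketch of this bit-sampling scheme in Python. It is illustrative only: the class name BitSamplingLSH, the helper hamming, and the toy parameters (d' = 12, k = 3, L = 10) are our own choices, not something from the slides.

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """Illustrative bit-sampling LSH for Hamming space: G_j(p) = p restricted to the index set I_j."""
    def __init__(self, d_prime, k, L, seed=0):
        rng = random.Random(seed)
        # L index sets, each holding k sampled bit positions
        self.index_sets = [rng.sample(range(d_prime), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, p, j):
        return tuple(p[i] for i in self.index_sets[j])   # p | I_j

    def insert(self, p):
        for j in range(len(self.tables)):
            self.tables[j][self._key(p, j)].append(p)

    def query(self, q):
        # union of the L buckets q falls into = candidate near neighbors
        candidates = set()
        for j in range(len(self.tables)):
            candidates.update(self.tables[j][self._key(q, j)])
        return candidates

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# toy usage: 12-bit strings, k = 3 sampled bits per table, L = 10 tables
lsh = BitSamplingLSH(d_prime=12, k=3, L=10)
for p in ["010100001111", "010010000011", "111111000000"]:
    lsh.insert(p)
cands = lsh.query("010100001011")
nearest = min(cands, key=lambda p: hamming(p, "010100001011")) if cands else None
```

The candidates returned by query are then verified with the exact Hamming distance, which is the standard two-stage pattern the slides describe.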
Preview
• Next: generalization to l2

Direct L2 solution
• A new hashing function, still based on sampling
• Uses a mathematical trick: a p-stable distribution for the Lp distance
• Gaussian distribution for the L2 distance

Central limit theorem
• v1·X1 + v2·X2 + … + vn·Xn ≈ (Σi |vi|²)^{1/2} · X
• v1..vn are real numbers; X1..Xn are independent, identically distributed (i.i.d.) Gaussians, and so is X
• A weighted sum of Gaussians is again a Gaussian, scaled by the norm of the weights

From dot products to distances
• Σi ui·Xi − Σi vi·Xi ≈ (Σi |ui − vi|²)^{1/2} · X
• The difference between the projections of two feature vectors u and v is a Gaussian whose scale is exactly their L2 distance

The full hashing
• h_{a,b}(v) = ⌊(a·v + b) / w⌋
  – a: a vector of d i.i.d. samples from a p-stable distribution (Gaussian for L2)
  – a·v: the projection of the features vector onto a random direction
  – b: a random phase, drawn uniformly from [0, w]
  – w: the discretization step
• Example: a·v = 7944, phase b = 34, w = 100 → 7978 falls into the bucket [7900, 8000)

Generalization: p-stable distributions
• Works for Lp with p ∈ (0, 2]
• L2: Central Limit Theorem → Gaussian (normal) distribution
• General p: Generalized Central Limit Theorem → p-stable distribution (e.g. Cauchy for L1)

P-stable summary
• Solves the r, ε – Nearest Neighbor problem
• Generalizes to 0 < p ≤ 2
• Improves query time
• Latest results (reported by e-mail by Alexander Andoni): query time improved from O(d·n^{1/(1+ε)}·log n) to O(d·n^{1/(1+ε)²}·log n)
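Below is a minimal sketch of the L2 (Gaussian, 2-stable) hashing just described: each of the L tables concatenates k functions h_{a,b}(v) = ⌊(a·v + b)/w⌋ with a drawn from a standard Gaussian and b uniform in [0, w]. The class name L2LSH and all parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np
from collections import defaultdict

class L2LSH:
    """Illustrative p-stable LSH for L2: k projections per table, L tables."""
    def __init__(self, d, k, L, w, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.a = rng.standard_normal((L, k, d))      # Gaussian (2-stable) directions
        self.b = rng.uniform(0.0, w, size=(L, k))    # random phases in [0, w)
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, v):
        # h_{a,b}(v) = floor((a.v + b) / w), concatenated over the k functions of each table
        proj = np.floor((self.a @ v + self.b) / self.w).astype(int)   # shape (L, k)
        return [tuple(row) for row in proj]

    def insert(self, idx, v):
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(idx)

    def query(self, q):
        cands = set()
        for table, key in zip(self.tables, self._keys(q)):
            cands.update(table[key])
        return cands

# toy usage on random data
rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 20))
lsh = L2LSH(d=20, k=4, L=8, w=4.0)
for i, v in enumerate(data):
    lsh.insert(i, v)
candidates = lsh.query(data[0] + 0.01 * rng.standard_normal(20))
# verify candidates by exact distance (linear scan only over the candidate set)
best = min(candidates, key=lambda i: np.linalg.norm(data[i] - data[0]), default=None)
```

How k and L should be chosen is the subject of the parameter-selection slides that follow.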
Parameters selection
• Goal: e.g. 90% success probability at the best query-time performance
• A single projection hits an ε-nearest neighbor with Pr = p1
• k concatenated projections hit it with Pr = p1^k
• All L hashings fail to collide with Pr = (1 − p1^k)^L
• To ensure a collision with probability 1 − δ (e.g. 1 − δ ≥ 90%): require 1 − (1 − p1^k)^L ≥ 1 − δ
• For Euclidean space this gives L ≥ log(δ) / log(1 − p1^k)
• Query time = candidates extraction + candidates verification; as k grows, verification work shrinks but extraction work (more tables) grows, so choose k at the minimum of the sum
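A quick numeric check of the formula above (a sketch; the values of p1, k and δ are illustrative assumptions, not from the slides):

```python
import math

def tables_needed(p1, k, delta):
    """Smallest L with 1 - (1 - p1**k)**L >= 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

# e.g. single-projection collision probability p1 = 0.9, k = 18 bits, 90% success (delta = 0.1)
L = tables_needed(p1=0.9, k=18, delta=0.1)        # -> 15 tables for these illustrative values
collision = 1.0 - (1.0 - 0.9 ** 18) ** L          # achieved collision probability, >= 0.9
```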
Pros & cons
Pros:
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sets (sub-linear dependence)
• Predictable running time
Cons:
• Extra storage overhead
• Inefficient for data whose pairwise distances concentrate around the average
• Works best for Hamming distance (although it can be generalized to Euclidean space)
• In secondary storage, a linear scan is pretty much all we can do (for high dimensions)
• Requires the radius r to be fixed in advance
(From Piotr Indyk's slides)

Conclusion
• ...but in the end, everything depends on your data set
• Try it at home:
  – Visit http://web.mit.edu/andoni/www/LSH/index.html
  – Email Alex Andoni: Andoni@mit.edu
  – Test it on your own data (C code under Red Hat Linux)

LSH – applications
• Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun)
• Searching image databases (see the following)
• Image segmentation (see the following)
• Image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani)
• Texture classification (see the following)
• Clustering (see the following)
• Embedding and manifold learning (LLE and many others)
• Compression – vector quantization
• Search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan)
• Genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler)
• In short: whenever k-nearest neighbors (KNN) are needed

Motivation
• A variety of procedures in learning require KNN computation
• KNN search is a computational bottleneck
• LSH provides a fast approximate solution to the problem
• LSH requires hash-function construction and parameter tuning

Outline
• Fast Pose Estimation with Parameter Sensitive Hashing – G. Shakhnarovich, P. Viola, T. Darrell
  – Finding sensitive hash functions
• Mean Shift Based Clustering in High Dimensions: A Texture Classification Example – B. Georgescu, I. Shimshoni, P. Meer
  – Tuning LSH parameters
  – The LSH data structure is used for algorithm speedups

Fast Pose Estimation with Parameter Sensitive Hashing (G. Shakhnarovich, P. Viola, T. Darrell)
• The problem: given an image x, what are the parameters θ in this image, i.e. the angles of the joints, the orientation of the body, etc.?

Ingredients
• Input query image with unknown angles (parameters)
• Database of human poses with known angles
• Image feature extractor – edge detector
• Distance metric in feature space: d_x
• Distance metric in angle space: d_θ(θ1, θ2) = Σ_{i=1..m} (1 − cos(θ1i − θ2i))

Example-based learning
• Construct a database of example images with their known angles
• Given a query image, run your favorite feature extractor
• Compute the KNN from the database
• Use these KNNs to compute the average angles of the query
• Input: query → find KNN in the database of examples → output: average angles of the KNN

The algorithm flow
• Input query → features extraction → PSH → LWR → output match (against the database of examples)

The image features
• Multiscale edge direction histograms, with direction bins at {0, π/4, π/2, 3π/4}

PSH: the basic assumption
• There are two metric spaces here: the feature space (d_x) and the parameter space (d_θ)
• We want similarity to be measured in the angle space, whereas LSH works on the feature space
• Assumption: the feature space is closely related to the parameter space

Insight: manifolds
• A manifold is a space in which every point has a neighborhood resembling a Euclidean space
• But the global structure may be complicated: curved
• For example: lines are 1D manifolds, planes are 2D manifolds, etc.
[Figure: a query q and its neighborhood mapped between feature space and parameter (angle) space – "is this magic?"]

Parameter Sensitive Hashing (PSH)
• The trick: estimate the performance of different hash functions on examples, and select those that are sensitive to d_θ
• The hash functions are applied in feature space, but the KNN are valid in angle space

PSH as a classification problem
• Label pairs of examples with similar / non-similar angles
• Define hash functions h on the feature space
• Predict the labeling of similar / non-similar pairs using h and compare it to the true labeling
• If the labeling by h is good, accept h; else change h

Pair labeling
• A pair of examples (xi, θi), (xj, θj) is labeled
  y_ij = +1 if d_θ(θi, θj) ≤ r
  y_ij = −1 if d_θ(θi, θj) ≥ r(1 + ε)
• Example labels (r = 0.25): +1, +1, −1, −1

A binary hash function
• h_{φ,T}(x) = +1 if the feature φ(x) is above the threshold T, and −1 otherwise
• Predicted label: ŷ_h(xi, xj) = +1 if h_{φ,T}(xi) = h_{φ,T}(xj), −1 otherwise
• h_{φ,T} will place both examples in the same bin or separate them
• Find the best T* that predicts the true labeling, subject to the probability constraints (on P1, P2)
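Below is a minimal sketch of the pair-labeling and threshold-selection step described above. The synthetic poses, the candidate feature, and the agreement score used to rank thresholds are our own illustrative assumptions, not the paper's exact training procedure.

```python
import numpy as np

def angle_dist(t1, t2):
    """d_theta(theta1, theta2) = sum_i (1 - cos(theta1_i - theta2_i))"""
    return np.sum(1.0 - np.cos(t1 - t2))

def label_pair(t1, t2, r=0.25, eps=1.0):
    """+1 for angle-similar pairs, -1 for angle-dissimilar pairs, 0 for the in-between band."""
    d = angle_dist(t1, t2)
    if d <= r:
        return +1
    if d >= r * (1.0 + eps):
        return -1
    return 0

def stump_agreement(phi_vals, T, pairs, labels):
    """Fraction of labeled pairs whose same-bin / different-bin prediction by the
    threshold hash h_{phi,T} matches the angle-space label y_ij."""
    h = np.where(phi_vals >= T, 1, -1)
    hits = sum((1 if h[i] == h[j] else -1) == y for (i, j), y in zip(pairs, labels))
    return hits / len(labels)

# toy data: 200 pose examples in 40 clusters of 5, with 5 angles each
rng = np.random.default_rng(0)
base = rng.uniform(0, np.pi, size=(40, 5))
thetas = np.repeat(base, 5, axis=0) + 0.05 * rng.standard_normal((200, 5))
phi_vals = thetas[:, 0] + 0.1 * rng.standard_normal(200)   # a feature correlated with one angle

pairs = [(i, j) for i in range(200) for j in range(i + 1, min(i + 4, 200))]
labeled = [((i, j), label_pair(thetas[i], thetas[j])) for i, j in pairs]
labeled = [(p, y) for p, y in labeled if y != 0]
pairs, labels = zip(*labeled)

# pick the threshold T* that best predicts the true (angle-space) labeling
candidate_Ts = np.quantile(phi_vals, np.linspace(0.1, 0.9, 17))
best_T = max(candidate_Ts, key=lambda T: stump_agreement(phi_vals, T, pairs, labels))
```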
Local Weighted Regression (LWR)
• Given a query image x0, PSH returns its KNN
• LWR uses the KNN to compute a weighted average of the estimated angles of the query:
  θ*(x0) = argmin_β Σ_{xi ∈ N(x0)} d_θ(g(xi, β), θi) · K(d_X(xi, x0))
  where K(·) is the distance-based weight and N(x0) is the neighborhood returned by PSH

Results – synthetic data
• 13 angles: 1 for the rotation of the torso, 12 for the joints
• 150,000 images; nuisance parameters added: clothing, illumination, facial expression
• 1,775,000 example pairs
• Selected 137 out of 5,123 meaningful features (how? recall that P1 is the probability of a positive hash, P2 the probability of a bad hash, and B is the maximum number of points in a bucket)
• 18-bit hash functions (k), 150 hash tables (L); without the feature selection, 40 bits and 1,000 hash tables would have been needed
• Test on 1,000 synthetic examples: PSH searched only 3.4% of the data per query

Results – real data
• 800 images, processed by a segmentation algorithm
• 1.3% of the data were searched
• Some interesting mismatches occur

Fast pose estimation – summary
• A fast way to compute the angles of a human body figure
• Moving from one representation space to another
• Training a sensitive hash function
• KNN smart averaging

Food for thought
• The basic assumption may be problematic (distance metric, representations)
• The training set should be dense
• Texture and clutter
• In general: some features are more important than others and should be weighted

Food for thought: Point Location in Different Spheres (PLDS)
• Given: n spheres in R^d, centered at P = {p1,…,pn} with radii {r1,…,rn}
• Goal: preprocess P so that, given a query q, we can find a point pi whose sphere 'covers' q
• Courtesy of Mohamad Hegaze

Mean-Shift Based Clustering in High Dimensions: A Texture Classification Example (B. Georgescu, I. Shimshoni, P. Meer)
Motivation:
• Clustering high-dimensional data by using local density measurements (e.g. in feature space)
• Statistical curse of dimensionality: sparseness of the data
• Computational curse of dimensionality: expensive range queries
• LSH parameters should be adjusted for optimal performance

Outline
• Mean-shift in a nutshell + examples
• Our scope: mean-shift in high dimensions, using LSH
• Speedups:
  1. Finding optimal LSH parameters
  2. Data-driven partitions into buckets
  3. Additional speedup by using the LSH data structure

Mean-shift in a nutshell
• Each point is shifted to the weighted mean of the points inside its bandwidth window, and the shifts are iterated until the point converges to a mode of the density

KNN in mean-shift
• The bandwidth should be inversely proportional to the density in the region:
  – high density → small bandwidth
  – low density → large bandwidth
• The bandwidth is based on the k-th nearest neighbor of the point
• This gives adaptive mean-shift, as opposed to non-adaptive (fixed-bandwidth) mean-shift

Image segmentation algorithm
1. Input: data in 5D (3 color + 2 x,y) or 3D (1 gray + 2 x,y)
2. Resolution controlled by the bandwidths: hs (spatial), hr (color)
3. Apply filtering: each pixel takes the value of the nearest mode along its mean-shift trajectory
• Examples: original / filtered / segmented images (squirrel, baboon)
("Mean Shift: A Robust Approach Toward Feature Space Analysis", D. Comaniciu et al., TPAMI 2002)

Mean-shift in high dimensions
• Statistical curse of dimensionality: sparseness of the data → variable (adaptive) bandwidth
• Computational curse of dimensionality: expensive range queries → implemented with LSH

LSH-based data structure
• Choose L random partitions; each partition consists of K pairs (d_k, v_k) – a coordinate and a cut value
• For each point we check whether x_{i,d_k} ≤ v_k; the K inequalities of one partition define a cell
• Together, the partitions divide the data into cells

Choosing the optimal K and L
• For a query q, distances are computed only to the points in its buckets; we want this candidate set small, yet not missing true neighbors
• As L increases, the union of q's cells grows while their intersection shrinks; the intersection determines the resolution of the data structure
• A large K gives a smaller number of points per cell
• If L is too small, points might be missed; if L is too big, the candidate set includes extra points

Choosing the optimal K and L – procedure
• Determine the exact KNN distances (bandwidths) for m randomly selected data points
• Choose an error threshold ε; the optimal K and L should keep the approximate (LSH) distances within it
• For each K, estimate the error; in one run over all L's, find the minimal L satisfying the constraint: L(K)
• Minimize the running time t(K, L(K)) over K
[Figures: approximation error for K, L; L(K) for ε = 0.05; running time t[K, L(K)]]

Data-driven partitions
• In the original LSH, cut values are chosen uniformly at random over the range of the data
• Suggestion: randomly select a point from the data and use one of its coordinates as the cut value
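Below is a minimal sketch of the partition structure just described, with data-driven cut values taken from the coordinates of randomly chosen data points, as suggested above. The class name PartitionLSH and all parameter values are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

class PartitionLSH:
    """Illustrative LSH structure: L partitions, each defined by K (coordinate, cut-value) pairs."""
    def __init__(self, data, K, L, seed=0):
        rng = np.random.default_rng(seed)
        n, d = data.shape
        self.cuts = []
        for _ in range(L):
            dims = rng.integers(0, d, size=K)
            # data-driven cut values: coordinates of randomly selected data points
            vals = data[rng.integers(0, n, size=K), dims]
            self.cuts.append((dims, vals))
        self.tables = [defaultdict(list) for _ in range(L)]
        for i, x in enumerate(data):
            for table, key in zip(self.tables, self._keys(x)):
                table[key].append(i)
        self.data = data

    def _keys(self, x):
        # each partition maps x to a K-bit cell label: bit_k = [x[d_k] <= v_k]
        return [tuple((x[dims] <= vals).astype(int)) for dims, vals in self.cuts]

    def neighbors(self, q):
        """Union of q's cells over the L partitions: the candidate set for a range query."""
        cand = set()
        for table, key in zip(self.tables, self._keys(q)):
            cand.update(table[key])
        return cand

# toy usage: candidate neighbors around one point, verified by exact distances
rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 10))
lsh = PartitionLSH(data, K=8, L=6)
cand = lsh.neighbors(data[0])
dists = np.linalg.norm(data[list(cand)] - data[0], axis=1)
```

In the paper's setting the candidate set feeds the adaptive-bandwidth mean-shift iterations; here we only verify it with exact distances.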
[Figure: points-per-bucket distribution for uniform vs. data-driven cut values]

Additional speedup
• Assume that all the points in a cell C will converge to the same mode (C acts as a kind of aggregate)

Speedup results
• 65,536 points, 1,638 points sampled, k = 100
[Figure: speedup results]

Food for thought
• Low dimension vs. high dimension

A thought for food…
• Choose K, L by sample learning, or take the traditional values
• Can one estimate K, L without sampling?
• Does it help to know the data dimensionality or the data manifold?
• Intuitively: the dimensionality implies the number of hash functions needed
• The catch: efficient dimensionality learning itself requires KNN
• 15:30 – cookies…

Summary
• LSH trades some accuracy for a large gain in complexity
• Applications that involve massive data in high dimensions need LSH's fast performance
• Extensions of LSH to different spaces (PSH)
• Learning the LSH parameters and hash functions for different applications

Conclusion
• ...but in the end, everything depends on your data set
• Try it at home:
  – Visit http://web.mit.edu/andoni/www/LSH/index.html
  – Email Alex Andoni: Andoni@mit.edu
  – Test it on your own data (C code under Red Hat Linux)

Thanks
• Ilan Shimshoni (Haifa)
• Mohamad Hegaze (Weizmann)
• Alex Andoni (MIT)
• Mica and Denis