Nearest Neighbor Retrieval Using Distance-Based Hashing
Michalis Potamias and Panagiotis Papapetrou, supervised by Prof. George Kollios
Database Group

Abstract

A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.

Problem

NEAREST NEIGHBOR: Given a database S and a distance function D, our task is: for a previously unseen query Q, locate a point P of the database such that the distance between Q and every point O of the database is greater than or equal to the distance between P and Q.

PROBLEM DEFINITION: Define an index structure that answers nearest neighbor queries efficiently.

A SOLUTION: Brute force! Try them all and get the exact answer. [Figure: example pairwise distance matrix]

OUR SOLUTION: Are we willing to trade accuracy for efficiency? ...with statistical arguments.

ACCURACY vs. EFFICIENCY: How often is the actual NN retrieved? How much time does NN retrieval take?

[Figure: system overview. The TRAINING PHASE takes the desired accuracy and produces a DBH index structure; a previously unseen query is then answered through that structure, returning its NN.]

Hash-Based Indexing

Idea:
1. Come up with hash functions that hash similar objects to similar buckets.
2. Hash every database object to some buckets.
3. At query time, apply the same hash functions to the query.
4. Filter: retrieve the collisions. The rest of the database is pruned.
5. Refine: compute actual distances. Return the object with the smallest distance as the NN.
(An end-to-end sketch of this filter-and-refine loop is given at the end of this document.)

Locality Sensitive Hashing

A family H of hash functions is locality sensitive with parameters $(r_1, r_2, p_1, p_2)$, $r_1 < r_2$, $p_1 > p_2$, if for all $x_1, x_2$:

$D(x_1, x_2) \le r_1 \Rightarrow \Pr_{h \in H}[h(x_1) = h(x_2)] \ge p_1$
$D(x_1, x_2) \ge r_2 \Rightarrow \Pr_{h \in H}[h(x_1) = h(x_2)] \le p_2$

Amplify the gap between $p_1$ and $p_2$: randomly pick $l$ hash vectors of $k$ functions each. The probability of collision in at least one of the $l$ hash tables is then:

$\Pr[\text{collision} \mid D(x_1,x_2) \le r_1] \ge 1 - (1 - p_1^k)^l$
$\Pr[\text{collision} \mid D(x_1,x_2) \ge r_2] \le 1 - (1 - p_2^k)^l$

Distance-Based Hashing (DBH)

Define a line projection function that maps an arbitrary space into the real line $\mathbb{R}$:

$F_{x_1,x_2}(x) = \dfrac{D(x,x_1)^2 + D(x_1,x_2)^2 - D(x,x_2)^2}{2\,D(x_1,x_2)}$

This function is real valued; a discrete-valued (binary) version thresholds it on an interval $[t_1, t_2]$:

$F_{x_1,x_2}^{t_1,t_2}(x) = \begin{cases} 0 & \text{if } F_{x_1,x_2}(x) \in [t_1, t_2], \\ 1 & \text{otherwise.} \end{cases}$

Hash tables should be balanced. Thus $(t_1, t_2)$ is chosen from the set

$V(x_1, x_2) = \{(t_1, t_2) : \Pr_{x \in X}[F_{x_1,x_2}^{t_1,t_2}(x) = 0] = 0.5\}$

DBH works on an arbitrary space, but it is not locality sensitive! Hence the LSH analysis above does not apply, and a different argument is needed.

Analysis

Probability of collision between any two objects:
$C(x_1, x_2) = \Pr_{h \in H_{\mathrm{DBH}}}[h(x_1) = h(x_2)]$

The same probability on a k-bit hash table:
$C_k(x_1, x_2) = C(x_1, x_2)^k$

Probability of collision in at least one of the l hash tables:
$C_{k,l}(x_1, x_2) = 1 - (1 - C(x_1, x_2)^k)^l$

Accuracy, i.e. the probability over all queries Q that we will retrieve the nearest neighbor N(Q):
$\mathrm{Accuracy}_{k,l} = \int_{Q \in X} C_{k,l}(Q, N(Q)) \Pr(Q)\, dQ$

COST MODEL: Minimize the number of distance computations.

LookupCost: the expected number of objects that collide with Q in at least one of the l hash tables:
$\mathrm{LookupCost}_{k,l}(Q) = \sum_{x \in U} C_{k,l}(Q, x)$

HashCost: the number of distance computations needed to evaluate the hash functions (two per binary function, with $kl$ functions per query):
$\mathrm{HashCost}_{k,l} = 2kl$

Total cost per query:
$\mathrm{Cost}_{k,l}(Q) = \mathrm{LookupCost}_{k,l}(Q) + \mathrm{HashCost}_{k,l}$

Efficiency (over all queries):
$\mathrm{Cost}_{k,l} = \int_{Q \in X} \mathrm{Cost}_{k,l}(Q) \Pr(Q)\, dQ$

Use sampling to estimate Accuracy and Efficiency:
1. Sample queries.
2. Sample database objects.
3. Sample hash functions.
4. Compute the integrals as sample averages.

Finding optimal k and l: given a required accuracy (say 90%), for k = 1, 2, ... compute the smallest l that yields the required accuracy. Typically, the optimal k is the last k for which efficiency improves.

[Figure: histogram of C(Q, N(Q)) over the sample queries. Number of Queries (0 to 800) vs. C(Q, N(Q)) ranging from 0.5 to 1.]
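To make the DBH construction above concrete, here is a minimal Python sketch, assuming D is an arbitrary black-box distance over the database objects. The helper names (line_projection, make_binary_hash, sample_hash_family) are ours, not the poster's, and the quartile-based choice of $(t_1, t_2)$ is just one convenient member of $V(x_1, x_2)$.

```python
import random

def line_projection(D, x1, x2):
    """Return F_{x1,x2}: the projection of an arbitrary space onto the
    real line, defined purely via the black-box distance D."""
    d12 = D(x1, x2)
    def F(x):
        return (D(x, x1) ** 2 + d12 ** 2 - D(x, x2) ** 2) / (2.0 * d12)
    return F

def make_binary_hash(D, x1, x2, sample):
    """Build the discrete-valued hash F^{t1,t2}_{x1,x2}.  The interval
    [t1, t2] is chosen so that ~50% of a sample of database objects map
    to 0, which keeps the hash tables balanced (the set V above)."""
    F = line_projection(D, x1, x2)
    values = sorted(F(x) for x in sample)
    n = len(values)
    # One simple choice from V: an interval covering the middle half of
    # the sampled projections (any interval holding ~50% would do).
    t1, t2 = values[n // 4], values[3 * n // 4]
    return lambda x: 0 if t1 <= F(x) <= t2 else 1

def sample_hash_family(D, objects, size):
    """Draw `size` random binary hash functions by picking random pivot
    pairs (x1, x2) from the database objects."""
    family = []
    while len(family) < size:
        x1, x2 = random.sample(objects, 2)
        if D(x1, x2) > 0:  # pivots must be distinct under D
            family.append(make_binary_hash(D, x1, x2, objects))
    return family
```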
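The sampling recipe and the search for the optimal (k, l) can likewise be sketched. This is an illustrative reading of the procedure above, not the authors' code: the function names are invented, the nearest neighbor of each sample query is approximated within the database sample, and each $C(x_1, x_2)$ is estimated as an agreement rate over the sampled hash functions.

```python
def collision_rate(h_sample, x1, x2):
    """Estimate C(x1, x2): the fraction of sampled binary hash
    functions on which x1 and x2 agree."""
    return sum(h(x1) == h(x2) for h in h_sample) / len(h_sample)

def c_kl(c, k, l):
    """C_{k,l}: probability of colliding in at least one of l k-bit
    tables, given per-function collision probability c."""
    return 1.0 - (1.0 - c ** k) ** l

def precompute_rates(D, queries, db_sample, h_sample):
    """For each sample query q, estimate C(q, N(q)) and C(q, x) for
    every sampled database object x (N(q) is taken within the sample)."""
    stats = []
    for q in queries:
        nn = min(db_sample, key=lambda x: D(q, x))
        stats.append((collision_rate(h_sample, q, nn),
                      [collision_rate(h_sample, q, x) for x in db_sample]))
    return stats

def estimate(stats, k, l, db_size, sample_size):
    """Turn sampled collision rates into estimated accuracy and expected
    per-query cost (in distance computations) for a given (k, l)."""
    acc = sum(c_kl(c_nn, k, l) for c_nn, _ in stats) / len(stats)
    lookups = sum(sum(c_kl(c, k, l) for c in c_db)
                  for _, c_db in stats) / len(stats)
    # Scale LookupCost from the sample to the full database; add 2kl.
    return acc, lookups * db_size / sample_size + 2 * k * l

def tune_k_l(stats, db_size, sample_size, target=0.9, max_k=12, max_l=1000):
    """For each k, find the smallest l reaching the target accuracy;
    keep the (k, l) pair with the lowest estimated cost."""
    best = None
    for k in range(1, max_k + 1):
        for l in range(1, max_l + 1):
            acc, cost = estimate(stats, k, l, db_size, sample_size)
            if acc >= target:
                if best is None or cost < best[2]:
                    best = (k, l, cost)
                break
    return best
```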
Additional Optimizations

Hierarchical DBH (HDBH): Rank queries according to D(Q, N(Q)). Divide the space into disjoint subsets (equi-height). Train separate indices for each subset. [Figure: distribution of D(Q, N(Q)) split at thresholds d1, d2, d3 into subsets A and B.]

Reduce HashCost: use a small number of "pseudoline" points, so that the distances from the query to these points are computed once and shared across hash functions (HDBH using pseudoline projections).

Experiments

Computing D may be very expensive:
- Dynamic Time Warping for time series.
- Edit Distance variants for DNA alignment.

Hands dataset: the actual performance was different from the simulation; the training set was not representative!

Conclusion

Pros:
- General purpose: the distance is a black box; no metric properties are required.
- Statistical analysis is possible.
- Even when the NN is not returned, a very close neighbor is returned. For many applications that's fine!

Cons:
- Not sublinear in the size of the database.
- Guarantees are statistical (not probabilistic).
- Needs "representative" sample sets.
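Finally, the end-to-end filter-and-refine loop promised in the Hash-Based Indexing section. This is a minimal sketch assuming a hash family built as in the earlier snippet (e.g. via sample_hash_family); the class name DBHIndex and its parameters are ours, not the poster's.

```python
import random
from collections import defaultdict

class DBHIndex:
    """Minimal filter-and-refine index: l hash tables, each keyed by a
    k-bit code built from k binary DBH functions (steps 1-5 above)."""

    def __init__(self, D, objects, k, l, hash_family):
        self.D = D
        self.objects = list(objects)
        self.tables = []
        for _ in range(l):
            # Step 1: a hash vector of k binary functions for this table.
            funcs = random.sample(hash_family, k)
            table = defaultdict(list)
            # Step 2: hash every database object into its bucket.
            for i, obj in enumerate(self.objects):
                table[tuple(h(obj) for h in funcs)].append(i)
            self.tables.append((funcs, dict(table)))

    def query(self, q):
        # Step 3: apply the same hash functions to the query.
        # Step 4 (filter): keep objects colliding in at least one table;
        # the rest of the database is pruned.
        candidates = set()
        for funcs, table in self.tables:
            candidates.update(table.get(tuple(h(q) for h in funcs), ()))
        # Step 5 (refine): exact distances on the candidates only.
        best = min(candidates, key=lambda i: self.D(q, self.objects[i]),
                   default=None)
        return None if best is None else self.objects[best]
```

Usage would be along the lines of index = DBHIndex(D, db, k=3, l=10, hash_family=sample_hash_family(D, db, 200)) followed by index.query(q); the parameter values here are placeholders, since in the pipeline above k and l come from the tuning step.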