APPROXIMATE NEAREST NEIGHBOR QUERIES WITH A TINY INDEX
Wei Wang, University of New South Wales
NICTA Machine Learning Research Group Seminar, 6/26/14

Outline
- Overview of our research
- SRS: c-approximate nearest neighbor search with a tiny index [PVLDB 2015]
- Conclusions

Research Projects
- Similarity query processing
- Keyword search on (semi-)structured data
- Graph
- Succinct data structures

NN and c-ANN Queries
Definitions:
- A set of points, D = ∪_{i=1..n} {O_i}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point, O*, in D
- Relaxed version: return a c-ANN point, i.e., a point whose distance to q is at most c·Dist(O*, q); also known as a (1+ε)-ANN point
- May return a c-ANN point with at least constant probability
(Figure: example with D = {a, b, x}; the NN of q is a, and x is a c-ANN point.)

Applications and Challenges
Applications:
- Feature vectors: data mining, multimedia databases
- Fundamental geometric problem: the "post-office problem"
- Quantization in coding/compression
- ...
Challenges:
- Curse of dimensionality / concentration of measure
- Hard to find algorithms sub-linear in n and polynomial in d
- Large data size: 1KB for a single point with 256 dimensions

Existing Solutions
NN: linear scan is (practically) the best approach using linear space and time
- O(d^5·log n) query time, O(n^{2d+δ}) space
- O(d·n^{1-ε(d)}) query time, O(dn) space
- Linear scan: O(dn/B) I/Os, O(dn) space
(1+ε)-ANN: LSH is the best approach using sub-quadratic space
- O(log n + 1/ε^{(d-1)/2}) query time, O(n·log(1/ε)) space
- Probabilistic tests remove the exponential dependency on d
- Fast JLT: O(d·log d + ε^{-3}·log^2 n) query time, O(n^{max(2, ε^{-2})}) space
- LSH-based: Õ(d·n^{ρ+o(1)}) query time, Õ(n^{1+ρ+o(1)} + nd) space, where ρ = 1/(1+ε) + o_c(1)

Approximate NN for Multimedia Retrieval
- Cover-tree
- Spill-tree
- Reduce to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)

Locality Sensitive Hashing (LSH)
- Equality search
  - Index: store o into bucket h(o)
  - Query: retrieve every o in the bucket h(q); verify whether o = q
- LSH: h: R^d → Z, and ∀h ∈ LSH-family, Pr[h(q) = h(o)] ∝ 1/Dist(q, o) (technically, dependent on r)
- "Near-by" points (blue) have more chance of colliding with q than "faraway" points (red)

LSH: Indexing & Query Processing
- Index (for a fixed r): sig(o) = ⟨h_1(o), h_2(o), …, h_k(o)⟩; store o into bucket sig(o)
- Query (constant success probability): search with a fixed r, i.e., retrieve and "verify" the points in bucket sig(q); repeat this L times (boosting)
- Reducing query cost: iteratively increase r, using galloping search to find the first good r; this incurs additional cost and gives only a c^2 quality guarantee
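As a concrete reference for the bucketing scheme above, here is a minimal Python sketch assuming E2LSH-style hash functions h(o) = ⌊(a·o + b)/r⌋ built from 2-stable (Gaussian) vectors; the class and parameter names are illustrative, not from any particular library.

import numpy as np

# Minimal LSH sketch for a fixed radius r, assuming E2LSH-style hash
# functions h(o) = floor((a . o + b) / r) with 2-stable (Gaussian) a.
# Names and parameter defaults are illustrative only.

class LSHTable:
    def __init__(self, d, k, r, rng):
        self.a = rng.standard_normal((k, d))  # one projection per hash h_i
        self.b = rng.uniform(0.0, r, size=k)  # random offsets
        self.r = r
        self.buckets = {}

    def signature(self, o):
        # composite hash key sig(o) = <h_1(o), ..., h_k(o)>
        return tuple(np.floor((self.a @ o + self.b) / self.r).astype(int))

    def insert(self, idx, o):
        self.buckets.setdefault(self.signature(o), []).append(idx)

    def query(self, q):
        return self.buckets.get(self.signature(q), [])

def lsh_nn(data, q, k=8, L=10, r=4.0, seed=0):
    """Return the best verified candidate from L boosted tables."""
    rng = np.random.default_rng(seed)
    tables = [LSHTable(data.shape[1], k, r, rng) for _ in range(L)]
    for table in tables:
        for i, o in enumerate(data):
            table.insert(i, o)
    candidates = set()
    for table in tables:              # repeat L times (boosting)
        candidates.update(table.query(q))
    # "verify": compute exact distances only for colliding points
    return min(candidates, key=lambda i: np.linalg.norm(data[i] - q),
               default=None)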
Locality Sensitive Hashing (LSH)
- Standard LSH: reduces c^2-ANN to binary search over a sequence of R(c^i, c^{i+1})-NN problems
- LSB-forest [SIGMOD'09, TODS'10] (LSH on external memory): a different reduction from c^2-ANN to a R(c^i, c^{i+1})-NN problem; O((dn/B)^{0.5}) query, O((dn/B)^{1.5}) space
- C2LSH [SIGMOD'12]: does not use composite hash keys; performs fine-granular counting of the number of collisions in m LSH projections; O(n·log(n)/B) query, O(n·log(n)/B) space
- SRS (ours): O(n/B) query, O(n/B) space

Weakness of External-Memory LSH Methods
- Existing methods use super-linear space: thousands (or more) of hash tables are needed if rigorous, so people resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval; valuable information is lost (due to quantization)
- Can only handle c = x^2, for integer x ≥ 2, to enable reusing the hash tables (merging buckets)
- Updates? (changes to n and c)
- Index sizes on the Audio dataset (40 MB): LSB-forest 1500 MB; C2LSH 127 MB; SRS (ours) 2 MB

SRS: Our Proposed Method
- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability; the constants hidden in O(·) are very small
- The early-termination condition is provably effective
- Advantages: small index; rich functionality; simple
- Central idea: reduce a c-ANN query in d dimensions to a kNN query in m dimensions with filtering, by modeling the distribution of m "stable random projections"

2-Stable Random Projection
- Let D be the 2-stable distribution, i.e., the standard normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D and B ~ D: x·A + y·B ~ (x^2 + y^2)^{1/2}·D
(Figure: a vector V projected onto two random directions r_1 and r_2.)

Dist(O) and ProjDist(O) and Their Relationship
- Let V = O - Q; take m 2-stable random projections r_1, …, r_m and let z_i ≔ ⟨V, r_i⟩ ~ N(0, ||V||)
- Then z_1^2 + … + z_m^2 ~ ||V||^2·χ^2_m, i.e., a scaled chi-squared distribution with m degrees of freedom
- O in d dimensions maps to (z_1, …, z_m) in m dimensions; Dist(O) maps to ProjDist(O)
- Ψ_m(x): the cdf of the standard χ^2_m distribution

LSH-like Property
- Intuitive idea: if Dist(O_1) ≪ Dist(O_2), then ProjDist(O_1) < ProjDist(O_2) with high probability
- But the inverse is NOT true: with few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-ANN objects
- But we can bound the expected number of such cases (say, by T)!
- Solution: perform incremental kNN search in the projected space until T objects have been accessed, plus an early-termination test

Indexing
- Finding the minimum m. Input: n, c, and T ≔ the maximum number of points to be accessed by the algorithm. Output: m, the number of 2-stable random projections, and T' ≤ T, a better bound on T
- m = O(n/T); we use T = O(n), so m = O(1), to achieve a linear-space index
- Generate m 2-stable random projections, giving n projected points in an m-dimensional space
- Index these projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m·n) = O(n)

SRS-αβ(T, c, p_τ)
    Compute Proj(Q)
    Perform incremental kNN search from Proj(Q):
    for k = 1 to T:                                      // stopping condition α
        compute Dist(O_k)
        maintain O_min = argmin_{1≤i≤k} Dist(O_i)
        if the early-termination test (c, p_τ) is TRUE:  // stopping condition β
            break
    return O_min
- Early-termination test: Ψ_m(c^2·ProjDist^2(O_k) / Dist^2(O_min)) > p_τ
- Example setting: c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, p_τ = 0.18 gives Index = 0.0059n, Query = 0.0084n, success probability = 0.13
- Main Theorem: SRS-αβ returns a c-ANN point with probability p_τ - f(m, c), with O(n) I/O cost

Variations of SRS-αβ(T, c, p_τ)
(The same framework as above, varying the stopping conditions used.)
1. SRS-α: better quality; query cost is O(T)
2. SRS-β: best quality; query cost bounded by O(n); handles c = 1
3. SRS-αβ(T, c', p_τ): better quality; query cost bounded by O(T)
All with success probability at least p_τ.
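To make the last few slides concrete, here is a compact Python sketch of the whole pipeline: m 2-stable projections, an index over the projected points, and the SRS-αβ loop with both stopping conditions. A brute-force sort over projected distances stands in for the R-tree's incremental kNN search, and every parameter value is illustrative rather than the paper's tuned setting.

import numpy as np
from scipy.stats import chi2

# Sketch of SRS indexing and SRS-alpha-beta querying as described above.
# A brute-force sort over projected distances stands in for the R-tree's
# incremental kNN search; parameters are illustrative, not tuned values.

class SRSIndex:
    def __init__(self, data, m=6, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data                                  # n x d points
        self.A = rng.standard_normal((m, data.shape[1]))  # m 2-stable projections
        self.proj = data @ self.A.T                       # n x m projected points
        self.m = m

    def query(self, q, T, c=4.0, p_tau=0.18):
        pq = self.A @ q
        proj_dist = np.linalg.norm(self.proj - pq, axis=1)
        order = np.argsort(proj_dist)           # incremental kNN stand-in
        o_min, d_min = None, np.inf
        for k in range(min(T, len(order))):     # stopping condition alpha
            i = order[k]
            dist = np.linalg.norm(self.data[i] - q)
            if dist < d_min:
                o_min, d_min = i, dist
            # stopping condition beta:
            # Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau
            if chi2.cdf((c * proj_dist[i] / d_min) ** 2, df=self.m) > p_tau:
                break
        return o_min, d_min

# Usage on random data, mirroring the example setting T = 0.00242n:
rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 128))
index = SRSIndex(X, m=6)
q = rng.standard_normal(128)
o, dist = index.query(q, T=max(1, int(0.00242 * len(X))), c=4.0, p_tau=0.18)
print(o, dist)

In the real system the projected points sit in an external-memory R-tree, so the argsort is replaced by a best-first traversal that touches only the pages the query actually needs.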
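The analysis that follows leans entirely on the fact that ProjDist^2(o)/Dist^2(o) follows the standard χ^2_m distribution. Here is a quick empirical check of that claim, with illustrative choices of d, m, and trial count:

import numpy as np
from scipy.stats import chi2

# Empirical check that ProjDist^2(O) / Dist^2(O) ~ chi^2_m:
# z_i = <V, r_i> with r_i ~ N(0, I_d), so sum_i z_i^2 ~ ||V||^2 * chi^2_m.
# d, m, and the number of trials are arbitrary illustrative choices.

rng = np.random.default_rng(42)
d, m, trials = 256, 6, 5_000

V = rng.standard_normal(d)                # a fixed vector V = O - Q
R = rng.standard_normal((trials, m, d))   # fresh projections per trial
Z = R @ V                                 # z_i = <V, r_i>, shape (trials, m)
ratio = (Z ** 2).sum(axis=1) / (V @ V)    # ProjDist^2 / Dist^2

# Compare the empirical CDF with Psi_m at a few points.
for x in (1.0, 3.0, 6.0, 12.0):
    print(f"x = {x:4.1f}   empirical = {np.mean(ratio <= x):.4f}"
          f"   Psi_m(x) = {chi2.cdf(x, df=m):.4f}")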
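The κ required by Stopping Condition α can be found numerically: scan κ and maximize the guaranteed probability P1 + P2 - 1 = Ψ_m(κ^2) - Ψ_m(κ^2/c^2)·(n/T). A sketch, with m, c, and T taken from the example setting earlier and an illustrative n:

import numpy as np
from scipy.stats import chi2

# Numerically choosing kappa for stopping condition alpha:
#   P1  = Psi_m(kappa^2)                  (NN projects before kappa*r)
#   P2  > 1 - Psi_m(kappa^2/c^2) * (n/T)  (Markov bound on far points)
# so the guaranteed success probability is P1 + P2 - 1.
# m, c, T mirror the SRS-alpha-beta example setting; n is illustrative.

m, c = 6, 4.0
n = 1_000_000
T = max(1, int(0.00242 * n))

kappa = np.linspace(0.5, 10.0, 400)
bound = chi2.cdf(kappa**2, df=m) - chi2.cdf(kappa**2 / c**2, df=m) * (n / T)
best = bound.argmax()
print(f"kappa = {kappa[best]:.3f}   P1 + P2 - 1 >= {bound[best]:.3f}")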
Other Results
- Easily extended to support top-k c-ANN queries (k > 1)
- No previously known guarantee on the correctness of returned results
- We guarantee correctness with probability at least p_τ if SRS-αβ stops due to the early-termination condition: ≈100% in practice (97% in theory)

Analysis

Stopping Condition α
- "Near" point: the NN point; denote its distance by r
- "Far" points: points whose distance is > c·r
- Then for any κ > 0 and any o:
  - Pr[ProjDist(o) ≤ κ·r | o is a near point] ≥ Ψ_m(κ^2)
  - Pr[ProjDist(o) ≤ κ·r | o is a far point] ≤ Ψ_m(κ^2/c^2)
  - Both because ProjDist^2(o)/Dist^2(o) ~ χ^2_m
- P1: Pr[the NN point is projected before κ·r] ≥ Ψ_m(κ^2)
- P2: Pr[fewer than T far points are projected before κ·r] > 1 - Ψ_m(κ^2/c^2)·(n/T), by Markov's inequality on the expected number of such far points
- Choose κ such that P1 + P2 - 1 > 0; feasible due to the good concentration bounds for χ^2_m

Choosing κ
(Figure: densities of ProjDist^2(o) for the near point (blue, scale 4) and a far point (red, scale 4·c^2 = 64), with c = 4; the mode of χ^2_m is m - 2; the threshold κ·r separates the two.)

ProjDist(O_T): Case I
- Consider the cases where both conditions above hold (re. near and far points); this happens with probability ≥ P1 + P2 - 1
- Case I: the NN point is projected before ProjDist(O_T), so it is among the T accessed points, and O_min = the NN point
(Figure: ProjDist(o) in m dimensions, with the NN point projected before ProjDist(O_T).)

ProjDist(O_T): Case II
- Case II: otherwise all T accessed points are projected before κ·r; since fewer than T far points are projected before κ·r, at least one accessed point is not far, and O_min = a c-ANN point
(Figure: ProjDist(o) in m dimensions, with ProjDist(O_T) before κ·r.)

Early-Termination Condition (β)
- We omit the proof here; it also relies on the fact that the sum of the m squared projected distances follows a scaled χ^2_m distribution
- Key to handling the case where c = 1: returns the NN point with guaranteed probability, which is impossible for LSH-based methods
- Guarantees the correctness of the top-k c-ANN points returned when stopped by this condition; no such guarantee exists for any previous method

Experiment Setup
- Algorithms: LSB-forest [SIGMOD'09, TODS'10], C2LSH [SIGMOD'12], SRS-* [PVLDB 2015]
- Data
- Measures: index size, query cost, result quality, success probability

Datasets
(Figure: dataset statistics; the sizes 16GB, 369GB, and 5.6PB appear on the slide.)

Tiny Image Dataset (8M points, 384 dimensions)
- Fastest: SRS-αβ; slowest: C2LSH; quality is the other way around
- SRS-α has quality comparable to C2LSH yet much lower cost
- SRS-* dominates LSB-forest

Approximate Nearest Neighbor
- Empirically better than the theoretical guarantee
- With 15% of the I/Os of a linear scan, returns the NN with probability 71%
- With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%

Large Dataset (0.45 Billion Points)

Summary
- Reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in low-dimensional space
- Our index size is approximately m/d of the size of the data file
- Opens up a new direction for c-ANN queries in high-dimensional space: finding efficient solutions to the kNN problem in 6-10 dimensional space

Q&A
Similarity Query Processing project homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html