Efficient Sketches for Earth-Mover Distance, with Applications

David Woodruff, IBM Almaden
Joint work with Alexandr Andoni, Khanh Do Ba, and Piotr Indyk


(Planar) Earth-Mover Distance

• For multisets A, B of points in [∆]² with |A| = |B| = N,

      EMD(A, B) = min_{π: A → B} ∑_{a ∈ A} ||a − π(a)||,

  where the minimum is over bijections π, i.e., the min cost of a perfect matching between A and B.
• Example (pictured): EMD(A, B) = 6 + 3√2.


Geometric Representation of EMD

• Map A, B to k-dimensional vectors F(A), F(B)
  – Image space of F "simple," e.g., k small
  – Can estimate EMD(A, B) from F(A), F(B) via some efficient recovery algorithm
• (Diagram: F maps point sets into R^k; a recovery algorithm outputs an estimate E ≈ EMD(A, B) from F(A), F(B).)


Geometric Representation of EMD: Motivation

• Visual search and recognition:
  – Approximate nearest neighbor under EMD
    • Reduces to approximate NN under simpler distances
    • Has been applied to fast image search and recognition in large collections of images [Indyk-Thaper'03, Grauman-Darrell'05, Lazebnik-Schmid-Ponce'06]
• Data streaming computation:
  – Estimating the EMD between two point sets given as a stream
    • Need the mapping F to be linear: adding a new point a to A translates to adding F(a) to F(A)
    • Important open problem in streaming ["Kanpur List '06"]


Prior and New Results

Main Theorem: For any ε ∈ (0, 1), there exists a distribution over linear mappings F: R^{∆²} → R^{∆^ε} such that for multisets A, B ⊆ [∆]² of equal size, we can produce an O(1/ε)-approximation to EMD(A, B) from F(A), F(B) with probability 2/3.

  Paper                            Target space   Dimension   Approx.
  [Charikar'02, Indyk-Thaper'03]   ℓ1             O(∆²)       O(log ∆)
  [Naor-Schechtman'06]             ℓ1             any         Ω(log^{1/2} ∆) (lower bound)
  Our result                       non-norm       O(∆^ε)      O(1/ε)


Implications

• Streaming:

    Paper        Space                   Approximation
    [Indyk'04]   log^{O(1)}(∆N)          O(log ∆)
    Our result   ∆^ε · log^{O(1)}(∆N)    O(1/ε)

  (N = number of points)

• Approximate nearest neighbor:

    Paper                           Space                    Query time             Approximation
    [Andoni-Indyk-Krauthgamer'09]   s^{2+ε} · 2^{∆^{1/α}}    ∆^{O(1)} · s^ε         O((α/ε) · log log s)
    Our result                      2^{∆^ε} · log(s∆N)       (∆ log(s∆N))^{O(1)}    O(1/ε)

  (s = number of data points (multisets) to preprocess; α > 1 is a free parameter)


Proof Outline

• Old [Agarwal-Varadarajan'04, Indyk'07]:
  – Extend EMD to EEMD, which:
    • Handles sets of unequal size |A| ≤ |B| in a grid of side-length k:
      EEMD(A, B) = min_{S ⊆ B, |S| = |A|} EMD(A, S) + k·|B \ S|
    • Is induced by a norm ||·||_EEMD, i.e., EEMD(A, B) = ||χ(A) − χ(B)||_EEMD, where χ(A) ∈ R^{∆²} is the characteristic vector of A
  – Decompose EEMD into a weighted sum of small EEMDs, with O(1/ε) distortion:
      EMD over [∆]²  ≈  EEMD over [∆^ε]² + EEMD over [∆^ε]² + … + EEMD over [∆^ε]²   (∆^{O(1)} terms)
• New:
  – Linear sketching of "sum-norms"


Old Idea [Indyk '07]

• One level of the decomposition:
      EMD over [∆]²  ≈  EEMD over [∆^{1/2}]² + EEMD over [∆^{1/2}]² + …
• Overlay ∆ cells on the [∆]² grid and solve an EEMD problem, each in [∆^{1/2}]², inside every cell.
• Then solve one additional EEMD problem in [∆^{1/2}]² on the coarse grid; edge lengths should also be scaled by ∆^{1/2}.
• Total cost is the sum of the two phases.
• The algorithm outputs a matching, so its cost is at least the EMD cost.
• Indyk shows that if we place a random shift of the [∆^{1/2}]² grid on top of the [∆]² grid, the algorithm's cost is at most a constant factor times the true EMD cost.
• Recursive application gives multiple [∆^ε]² grids on top of each other, and results in an O(1/ε)-approximation. (A toy rendition of the two-phase scheme is sketched below.)
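To make the two-phase scheme concrete, here is a minimal toy rendition in Python. It is NOT the paper's implementation: the per-cell EEMD matchings are solved exactly with SciPy's Hungarian algorithm, and phase 2 matches leftover points directly rather than snapping them to the coarse grid with scaled edge lengths. The helper names (partial_match, two_phase_cost) are ours, invented for this example.

import math
import random
from collections import defaultdict

import numpy as np
from scipy.optimize import linear_sum_assignment


def partial_match(A, B):
    """Min-cost matching of the smaller multiset into the larger one
    (the matching part of EEMD); returns (cost, unmatched points)."""
    if len(A) > len(B):
        A, B = B, A
    if not A:
        return 0.0, list(B)
    cost = np.array([[math.dist(a, b) for b in B] for a in A])
    rows, cols = linear_sum_assignment(cost)
    unmatched = [b for j, b in enumerate(B) if j not in set(cols)]
    return float(cost[rows, cols].sum()), unmatched


def two_phase_cost(A, B, cell):
    """One level of the decomposition: match within randomly shifted
    cells of side `cell` (phase 1), then match the leftovers (phase 2)."""
    shift = (random.uniform(0, cell), random.uniform(0, cell))
    buckets = defaultdict(lambda: ([], []))
    for side, points in ((0, A), (1, B)):
        for p in points:
            key = (int((p[0] + shift[0]) // cell), int((p[1] + shift[1]) // cell))
            buckets[key][side].append(p)
    total, left_A, left_B = 0.0, [], []
    for a_pts, b_pts in buckets.values():
        c, unmatched = partial_match(a_pts, b_pts)
        total += c
        (left_A if len(a_pts) > len(b_pts) else left_B).extend(unmatched)
    # |A| = |B| globally, so the leftovers balance out across cells.
    phase2, _ = partial_match(left_A, left_B)
    return total + phase2


# Example: one level with ∆ = 64 and cell side ∆^{1/2} = 8.
A = [(random.randrange(64), random.randrange(64)) for _ in range(50)]
B = [(random.randrange(64), random.randrange(64)) for _ in range(50)]
print(two_phase_cost(A, B, cell=8))

As on the slide, the output corresponds to a valid matching, so it upper-bounds EMD(A, B); the random shift is what keeps the expected cost within a constant factor of the truth, and recursing with cells of side ∆^ε gives the O(1/ε)-approximation.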
Main New Technical Theorem

For a normed space X = (R^t, ||·||_X) and M ∈ X^n, denote

      ||M||_{1,X} = ∑_i ||M_i||_X = ||M_1||_X + ||M_2||_X + … + ||M_n||_X.

Given C > 0 and λ > 0, if C/λ ≤ ||M||_{1,X} ≤ C, then there is a distribution over linear mappings μ: X^n → X^{(λ log n)^{O(1)}} such that we can produce an O(1)-approximation to ||M||_{1,X} from μ(M) w.h.p.


Proof Outline: Sum of Norms

• First attempt:
  – Sample (uniformly) a few M_i's and compute ||M_i||_X
  – Problem: the sum could be concentrated in one block (e.g., of M_1, M_2, M_3, …, M_n, the block M_2 contains most of the mass)
• Second attempt:
  – Sample M_i with probability proportional to ||M_i||_X [Indyk'07]
  – Problem: how to do this online?
  – Techniques from [JW09, MW10]?
    • Need to sample/retrieve blocks, not just individual coordinates


Proof Outline: Sum of Norms (cont.)

• Our approach:
  – Split M = (M_1, M_2, …, M_n) into exponential levels:
    • Assume ||M||_{1,X} ≤ C
    • S_k = {i ∈ [n] : ||M_i||_X ∈ (T_k, 2T_k]}, where T_k = C/2^k
    • (Diagram: e.g., S_1 = {M_2}, S_2 = {M_4, M_7}, S_3 = {M_1, M_3, M_8, M_9}, …, S_ℓ = {M_5, M_10, M_n}; subsample M and test E_k: Y/N.)
  – It suffices to estimate |S_k| for each level k. How?
    • For each level k, subsample from [n] at a rate such that the event E_k ("isolation" of level k) holds with probability proportional to |S_k|
    • Repeat the experiment several times and count the number of successes


Proof Outline: Event E_k

• E_k ↔ "isolation" of level k:
  – Exactly one i ∈ S_k gets subsampled
  – Nothing from S_{k'} for k' < k gets subsampled
• Verification of trial success/failure:
  – Hash the subsampled elements (e.g., M_1, M_4, M_5, M_6, M_9, M_11, M_{n−1}) into a table; each cell maintains the vector sum of the subsampled M_i's that hash there
  – E_k holds roughly (we "accept") when:
    • one cell has X-norm in (0.9·T_k, 2.1·T_k], and
    • all other cells have X-norm ≤ 0.9·T_k
  – The check fails only if:
    • elements from lighter levels contribute a lot to one cell, or
    • elements from heavier levels are subsampled and collide
  – Both are unlikely if the hash table is big enough
  – This under-estimates |S_k|; if |S_k| > 2^k/polylog(n), it gives an O(1)-approximation
  – Remark: the triangle inequality of the norm gives control over the impact of collisions


Sketch and Recovery Algorithm

Sketch:
– For every level k, create t hash tables
– For each hash table:
  • Subsample [n], including each i ∈ [n] w.p. p_k = 2^{−k}
  • Each cell maintains the sum of the M_i's that hash to it

Recovery algorithm:
– For each level k, count the number c_k of "accepting" hash tables; the estimator (c_k/t)·(1/p_k) estimates |S_k|, and if |S_k| > 2^k/polylog(n) it is Θ(|S_k|)
– Return ∑_k T_k · (c_k/t) · (1/p_k)


EMD Wrapup

• We achieve a linear embedding of EMD:
  – with constant distortion, namely O(1/ε),
  – into a space of strongly sublinear dimension, namely ∆^ε.
• Open problems:
  – Getting a (1+ε)-approximation / proving impossibility
  – Reducing the dimension to log^{O(1)} ∆ / proving a lower bound


What We Did

• We showed that in a data stream, one can sketch ||M||_{1,X} = ∑_i ||M_i||_X with space roughly the space complexity of computing (or sketching) ||·||_X
• This quantity is known as a cascaded norm, written L1(X)
• Cascaded norms have many applications [CM, JW]
• Can we generalize this? E.g., what about L2(X), i.e., (∑_i ||M_i||²_X)^{1/2}?


Cascaded Norms [JW09]

• No!
• L2(L1), i.e., (∑_i ||M_i||²_1)^{1/2}, requires Ω(n^{1/2}) space, where n is the number of different i, yet the sketching complexity of L1 is only O(log n)
• More generally, for p ≥ 1, Lp(L1), i.e., (∑_i ||M_i||^p_1)^{1/p}, takes Θ(n^{1−1/p}) space
• So L1(X) is very special


Thank You!
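Backup: Toy Code for the L1(X) Sketch and Recovery

As a concrete, heavily simplified illustration of the "Sketch and Recovery Algorithm" slide, here is a toy Python rendition assuming X = ℓ1 on R^d. The function names (sketch, recover) and all parameter values are ours, not from the paper; fixing the subsampling and hash randomness with a seed, before seeing M, is what makes the map a linear function of M and hence streaming-friendly. The acceptance thresholds 0.9·T_k and 2.1·T_k are the ones from the event E_k slide.

import random

import numpy as np


def sketch(M, levels, t, width, seed=0):
    """Linear sketch of M = (M_1, ..., M_n): for each level k and each of t
    hash tables, keep block i w.p. p_k = 2^{-k} and add it into a hashed cell."""
    rng = random.Random(seed)  # shared randomness, fixed before seeing M
    n, d = len(M), len(M[0])
    keep = [[[rng.random() < 2.0 ** -k for _ in range(n)] for _ in range(t)]
            for k in range(levels)]
    h = [[[rng.randrange(width) for _ in range(n)] for _ in range(t)]
         for k in range(levels)]
    cells = np.zeros((levels, t, width, d))
    for k in range(levels):
        for j in range(t):
            for i in range(n):
                if keep[k][j][i]:
                    cells[k, j, h[k][j][i]] += M[i]  # cells store vector sums
    return cells


def recover(cells, C, x_norm=lambda v: float(np.abs(v).sum())):
    """Estimate ||M||_{1,X} (here X = l_1) from the sketch, given an upper
    bound C >= ||M||_{1,X}: count accepting tables per level and rescale."""
    levels, t, width, _ = cells.shape
    estimate = 0.0
    for k in range(levels):
        Tk = C / 2.0 ** k
        ck = 0  # number of "accepting" hash tables at level k
        for j in range(t):
            norms = sorted(x_norm(cells[k, j, c]) for c in range(width))
            # Accept iff exactly one cell has norm in (0.9 Tk, 2.1 Tk]
            # and every other cell has norm <= 0.9 Tk.
            if 0.9 * Tk < norms[-1] <= 2.1 * Tk \
                    and (width == 1 or norms[-2] <= 0.9 * Tk):
                ck += 1
        estimate += Tk * (ck / t) * 2.0 ** k  # T_k * (c_k/t) * (1/p_k)
    return estimate


# Example run with made-up parameters; quality is rough, this is a toy.
M = [np.random.randn(8) for _ in range(200)]
C = sum(float(np.abs(m).sum()) for m in M)  # a valid upper bound on ||M||_{1,1}
print(C, recover(sketch(M, levels=12, t=40, width=64), C))

The real algorithm chooses t, the number of levels, and the table width as polylog(n) functions dictated by the analysis; the toy values above are only for demonstration.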