Efficient Sketches for Earth-Mover Distance, with Applications

David Woodruff
IBM Almaden
Joint work with Alexandr Andoni, Khanh Do Ba, and Piotr Indyk
(Planar) Earth-Mover Distance
• For multisets A, B of points in [∆]², |A| = |B| = N,
  EMD(A, B) = min over bijections π: A → B of ∑_{a ∈ A} ||a − π(a)||
  i.e., the minimum cost of a perfect matching between A and B
• Example (two point sets pictured): EMD(·, ·) = 6 + 3√2
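For intuition, the matching definition can be evaluated directly on tiny inputs; a minimal brute-force sketch (the `emd` helper is illustrative only, not the algorithm discussed in this talk):

```python
import itertools
import math

def emd(A, B):
    """Exact planar EMD for small equal-size multisets, by brute force
    over all perfect matchings (bijections A -> B)."""
    assert len(A) == len(B)
    return min(
        sum(math.dist(a, b) for a, b in zip(A, perm))
        for perm in itertools.permutations(B)
    )

# Each point of A is matched to the point of B directly above it, cost 1 each.
A = [(0, 0), (1, 0), (2, 0)]
B = [(0, 1), (1, 1), (2, 1)]
print(emd(A, B))  # 3.0
```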
Geometric Representation of EMD
• Map A, B to k-dimensional vectors F(A), F(B)
– Image space of F “simple,” e.g., k small
– Can estimate EMD(A,B) from F(A), F(B) via some
efficient recovery algorithm E
[Figure: A, B ⊆ [∆]² are mapped by F into R^k, and E(F(A), F(B)) ≈ EMD(A, B)]
Geometric Representation of EMD:
Motivation
• Visual search and recognition:
– Approximate nearest neighbor under EMD
• Reduces to approximate NN under simpler distances
• Has been applied to fast image search and recognition in large
collections of images [Indyk-Thaper’03, Grauman-Darrell’05, Lazebnik-Schmid-Ponce’06]
• Data streaming computation:
– Estimating the EMD between two point sets given as a
stream
• Need mapping F to be linear: adding new point a to A
translates to adding F(a) to F(A)
• Important open problem in streaming [“Kanpur List ’06”]
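Linearity is what makes the streaming application work: the sketch of A plus a new point a is just F(A) + F(a). A toy illustration using an assumed CountSketch-style linear map of the characteristic vector (not the F constructed in this talk):

```python
import random

# Hypothetical linear sketch of a multiset A ⊆ [∆]²: hash each grid point
# to one of K counters with a random ±1 sign. Illustrative only.
DELTA, K = 8, 16
rng = random.Random(0)
grid = [(x, y) for x in range(DELTA) for y in range(DELTA)]
bucket = {p: rng.randrange(K) for p in grid}
sign = {p: rng.choice([-1, 1]) for p in grid}

def F(points):
    """Linear map: F(A) = sum over a in A of F({a})."""
    sk = [0] * K
    for p in points:
        sk[bucket[p]] += sign[p]
    return sk

A = [(0, 0), (3, 4), (3, 4)]
a_new = (7, 7)
batch = F(A + [a_new])
streamed = [x + y for x, y in zip(F(A), F([a_new]))]
print(batch == streamed)  # True — adding a point just adds its sketch
```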
Prior and New Results
Main Theorem (geometric representation of EMD):
For any ε ∈ (0,1), there exists a distribution over linear mappings
F: R^{∆²} → R^{∆^ε} such that, for multisets A, B ⊆ [∆]² of equal size, we can
produce an O(1/ε)-approximation to EMD(A,B) from F(A), F(B) with probability 2/3.

| Paper                          | Recovery | Dimension | Approx.        |
| [Charikar’02, Indyk-Thaper’03] | ℓ1       | O(∆²)     | O(log ∆)       |
| [Naor-Schechtman’06]           | ℓ1       | Any       | Ω(log^{1/2} ∆) |
| Our result                     | Non-norm | O(∆^ε)    | O(1/ε)         |
Implications
• Streaming:

| Paper      | Space              | Approximation |
| [Indyk’04] | log^{O(1)}(∆N)     | O(log ∆)      |
| Our result | ∆^ε log^{O(1)}(∆N) | O(1/ε)        |

  * N = number of points
• Approximate nearest neighbor:

| Paper                         | Space                    | Query time          | Approximation      |
| [Andoni-Indyk-Krauthgamer’09] | s^{2+1/α} · 2^{∆^{O(1)}} | ∆^{O(1)} · s^ε      | O((α/ε) log log s) |
| Our result                    | s · 2^{∆^ε} · log(s∆N)   | (∆ log(s∆N))^{O(1)} | O(1/ε)             |

  * s = number of data points (multisets) to preprocess; α > 1 is a free parameter
Proof Outline
• Old [Agarwal-Varadarajan’04, Indyk’07]:
  – Extend EMD to EEMD, which:
    • Handles sets of unequal size |A| ≤ |B| in a grid of side-length k
    • EEMD(A, B) = min over S ⊆ B with |S| = |A| of EMD(A, S) + k·|B \ S|
    • Is induced by a norm ||·||_EEMD, i.e., EEMD(A, B) = ||χ(A) − χ(B)||_EEMD,
      where χ(A) ∈ R^{∆²} is the characteristic vector of A
  – Decomposition of EEMD into a weighted sum of small EEMD’s, with O(1/ε) distortion:
    EMD over [∆]² ≈ EEMD over [∆^ε]² + EEMD over [∆^ε]² + … + EEMD over [∆^ε]² (∆^{O(1)} terms)
• New:
  – Linear sketching of “sum-norms”
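The EEMD definition can be checked by brute force on tiny multisets; a small sketch, with `emd` and `eemd` as illustrative helpers (assumes |A| ≤ |B|):

```python
import itertools
import math

def emd(A, B):
    """Exact EMD between equal-size multisets (brute force over matchings)."""
    return min(
        sum(math.dist(a, b) for a, b in zip(A, p))
        for p in itertools.permutations(B)
    )

def eemd(A, B, k):
    """EEMD for |A| <= |B| in a grid of side-length k:
    min over S subset of B with |S| = |A| of EMD(A, S) + k * |B minus S|."""
    return min(
        emd(A, list(S)) + k * (len(B) - len(A))
        for S in itertools.combinations(B, len(A))
    )

A = [(0, 0)]
B = [(0, 0), (3, 3)]
# Matching (0,0) to (0,0) costs 0, and the one leftover point pays k = 4:
print(eemd(A, B, k=4))  # 4.0
```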
Old Idea [Indyk ’07]
• One level of the decomposition: EMD over [∆]² ≈ a sum of EEMD problems over [∆^{1/2}]²
• Impose a [∆^{1/2}]² grid on the [∆]² grid, and solve EEMD in each of the ∆ cells,
  each a problem in [∆^{1/2}]²
• Solve one additional EEMD problem in [∆^{1/2}]² over the cells themselves,
  with edge lengths scaled by ∆^{1/2}
Old Idea [Indyk ’07]
• Total cost is the sum of the two phases
• The algorithm outputs a matching, so its cost is at least the EMD cost
• Indyk shows that if we put a random shift of the [∆^{1/2}]² grid on top of the
  [∆]² grid, the algorithm’s cost is at most a constant factor times the true EMD cost
• Recursive application gives multiple [∆^ε]² grids on top of each other, and
  results in an O(1/ε)-approximation
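The bookkeeping of one decomposition level can be sketched as follows, with assumed details (cell side W standing in for ∆^{1/2}, and the random shift applied before bucketing):

```python
import random

# One level of the grid decomposition (illustrative bookkeeping only):
# impose a randomly shifted coarse grid of cell side W on [∆]²; each
# occupied cell becomes a local subproblem in [W]², plus one coarse
# problem over the cell labels.
DELTA, W = 16, 4  # W plays the role of ∆^{1/2}

def decompose(points, shift):
    local = {}   # cell label -> points in local [W]² coordinates
    coarse = []  # coarse representative (cell label) per input point
    for (x, y) in points:
        sx, sy = x + shift[0], y + shift[1]
        cell = (sx // W, sy // W)
        local.setdefault(cell, []).append((sx % W, sy % W))
        coarse.append(cell)
    return local, coarse

rng = random.Random(1)
shift = (rng.randrange(W), rng.randrange(W))
A = [(0, 0), (5, 5), (13, 2)]
local, coarse = decompose(A, shift)
# Every input point lands in exactly one local subproblem.
print(sum(len(v) for v in local.values()) == len(A))  # True
```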
Main New Technical Theorem
For a normed space X = (R^t, ||·||_X) and M = (M1, …, Mn) ∈ X^n, denote
||M||_{1,X} = ∑i ||Mi||_X = ||M1||_X + ||M2||_X + … + ||Mn||_X.
Given C > 0 and λ > 0, if C/λ ≤ ||M||_{1,X} ≤ C, there is a distribution over linear
mappings
  μ: X^n → X^{(λ log n)^{O(1)}}
such that we can produce an O(1)-approximation to ||M||_{1,X} from μ(M) w.h.p.
Proof Outline: Sum of Norms
• First attempt:
– Sample (uniformly) a few Mi’s to compute ||Mi||X
– Problem: the sum could be concentrated in a single block
  (e.g., M2 contains most of the mass of M1, M2, M3, …, Mn)
• Second attempt:
– Sample Mi with probability proportional to ||Mi||X [Indyk’07]
– Problem: how to do this online?
– Techniques from [JW09, MW10]?
• Need to sample/retrieve blocks, not just individual coordinates
Proof Outline: Sum of Norms (cont.)
• Our approach:
  – Split M = (M1, M2, …, Mn) into exponential levels:
    • Assume ||M||_{1,X} ≤ C
    • Sk = {i ∈ [n] s.t. ||Mi||_X ∈ (Tk, 2Tk]}, where Tk = C/2^k
      (e.g., S2 = {M4, M7}, S3 = {M1, M3, M8, M9}, …, Sℓ = {M5, M10, Mn})
  – Suffices to estimate |Sk| for each level k. How?
    • For each level k, subsample from [n] at a rate such that event Ek
      (“isolation” of level k) holds with probability proportional to |Sk|
    • Repeat the experiment several times, and count the number of successes
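The level decomposition can be checked numerically: summing |Sk|·Tk over the levels recovers ∑i ||Mi||X up to a factor of 2. A small sketch with X = ℓ2 as an assumed example norm (and every block nonzero):

```python
import math

# Level sets for M with ||M||_{1,X} <= C: S_k collects the blocks whose
# norm lies in (T_k, 2*T_k], where T_k = C / 2^k.
def norm(v):
    return math.hypot(*v)  # X = l2 on R², an assumed illustrative choice

M = [(4.0, 0.0), (0.0, 3.0), (1.0, 0.0), (0.5, 0.0)]
C = 16.0
levels = {}
for i, Mi in enumerate(M):
    k = 1
    while not (C / 2**k < norm(Mi) <= 2 * C / 2**k):
        k += 1  # terminates since 0 < ||Mi|| <= C
    levels.setdefault(k, []).append(i)

# Estimating every |S_k| recovers ||M||_{1,X} up to a factor of 2:
approx = sum((C / 2**k) * len(idx) for k, idx in levels.items())
total = sum(norm(Mi) for Mi in M)
print(total / 2 <= approx <= total)  # True
```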
Proof Outline: Event Ek
• Ek ↔ “isolation” of level k:
  – Exactly one i ∈ Sk gets subsampled
  – Nothing from Sk’ for k’ < k
• Verification of trial success/failure:
  – Hash the subsampled elements into a table
  – Each cell maintains the vector sum of the subsampled Mi’s that hash there
  – Ek holds roughly (we “accept”) when:
    • one cell has X-norm in (0.9Tk, 2.1Tk], and
    • all other cells have X-norm ≤ 0.9Tk
  – The check fails only if:
    • elements from lighter levels contribute a lot to one cell, or
    • elements from heavier levels are subsampled and collide
  – Both are unlikely if the hash table is big enough
  – This under-estimates |Sk|; if |Sk| > 2^k/polylog(n), it gives an O(1)-approximation
  – Remark: the triangle inequality of the norm gives control over the impact of collisions
Sketch and Recovery Algorithm
Sketch:
– For every level k, create t hash tables
– For each hash table:
  • Subsample from [n], including each i ∈ [n] w.p. pk = 2^{-k}
  • Each cell maintains the sum of the Mi’s that hash to it
Recovery algorithm:
– For each level k, count the number ck of “accepting” hash tables
– Return ∑k Tk · (ck/t) · (1/pk)
Note: for level k, (ck/t) · (1/pk) estimates |Sk|;
if |Sk| > 2^k/polylog n, this estimator is Θ(|Sk|).
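Putting the pieces together, a runnable toy version of the sketch-and-recover procedure, with X = ℓ2 as an assumed norm and illustrative (untuned) constants; as in the theorem, it only promises a constant-factor estimate:

```python
import math
import random

# Toy level-sampling estimator for ||M||_{1,X}: for each level k, run t
# trials; in each trial subsample blocks at rate p_k = 2^{-k}, hash the
# survivors into cells holding vector sums, and "accept" when exactly one
# cell has norm in (0.9*T_k, 2.1*T_k] and the rest are <= 0.9*T_k.
def estimate_sum_norm(M, C, t=300, width=64, seed=0):
    rng = random.Random(seed)
    n, dim = len(M), len(M[0])
    K = math.ceil(math.log2(n)) + 2  # levels 1..K
    total = 0.0
    for k in range(1, K + 1):
        Tk, pk = C / 2**k, 2.0**-k
        ck = 0
        for _ in range(t):
            cells = [[0.0] * dim for _ in range(width)]
            for i in range(n):
                if rng.random() < pk:          # subsample block i
                    c = rng.randrange(width)   # hash it to a cell
                    cells[c] = [u + v for u, v in zip(cells[c], M[i])]
            norms = [math.hypot(*cell) for cell in cells]
            big = [x for x in norms if x > 0.9 * Tk]
            if len(big) == 1 and big[0] <= 2.1 * Tk:  # "accept" this trial
                ck += 1
        total += Tk * (ck / t) / pk
    return total

# 64 unit-norm blocks: the true value of ||M||_{1,X} is 64.
M = [(1.0, 0.0)] * 64
print(round(estimate_sum_norm(M, C=64.0), 1))  # within a constant factor of 64
```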
EMD Wrapup
• We achieve a linear embedding of EMD
– with constant distortion, namely O(1/ε),
– into a space of strongly sublinear dimension, namely ∆^ε.
• Open problems:
– Getting (1+ε)-approximation / proving impossibility
– Reducing dimension to log^{O(1)} ∆ / proving a lower bound
What We Did
• We showed that in a data stream, one can sketch ||M||_{1,X} = ∑i ||Mi||_X
with space about the space complexity of computing (or sketching) ||·||_X
• This quantity is known as a cascaded norm, written
as L1(X)
• Cascaded norms have many applications [CM, JW]
• Can we generalize this? E.g., what about L2(X), i.e., (∑i ||Mi||_X²)^{1/2}?
Cascaded Norms [JW09]
• No!
• L2(L1), i.e., (∑i ||Mi||_1²)^{1/2}, requires Ω(n^{1/2}) space, where n is
the number of different i, but the sketching complexity of L1 is O(log n)
• More generally, for p ≥ 1, Lp(L1), i.e., (∑i ||Mi||_1^p)^{1/p}, requires Θ(n^{1−1/p}) space
• So, L1(X) is very special
Thank You!