Lower Bounds for Edit Distance Estimation

Polylogarithmic Approximation
for Edit Distance (and the
Asymmetric Query Complexity)
Robert Krauthgamer [Weizmann Institute]
Joint with: Alexandr Andoni [Microsoft SVC]
Krzysztof Onak [CMU]
Edit Distance (Levenshtein distance)
Given two strings x, y ∈ Σ^n:
ed(x,y) = minimum number of character operations
(insertion/deletion/substitution) that transform x to y.
ed(banana, ananas) = 2
Applications:
• Computational Biology
• Text processing
• Web search
Polylog. Approx. for ED and the Asymmetric Query Complexity
Basic task

Compute ed(x,y) for input x, y ∈ Σ^n
O(n^2) time [WF’74]

Faster algorithms?
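D(i,j) = ed( x[1:i], y[1:j] )

D(i,j) = min { D(i-1, j-1) + [x[i] ≠ y[j]],
               D(i-1, j) + 1,
               D(i, j-1) + 1 }
where [x[i] ≠ y[j]] is 1 if the characters differ (a substitution) and 0 if they match.

Table for ananas (rows) against banana (columns):
       b  a  n  a  n  a
   a   1  1  2  3  4  5
   n   2  2  1  2  3  4
   a   3  2  2  1  2  3
   n   4  3  2  2  1  2
   a   5  4  3  2  2  1
   s   6  5  4  3  3  2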
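A minimal Python sketch of the quadratic-time dynamic program, assuming unit cost for insertion, deletion and substitution. The function name edit_distance is chosen here for illustration; it keeps only two rows of the table D at a time.

```python
def edit_distance(x: str, y: str) -> int:
    """Quadratic-time dynamic program for ed(x, y) with unit-cost operations."""
    n, m = len(x), len(y)
    # prev[j] holds D(i-1, j); row 0 is D(0, j) = j (j insertions).
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        # cur[0] = D(i, 0) = i (i deletions).
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + mismatch,  # match or substitution
                prev[j] + 1,             # delete x[i]
                cur[j - 1] + 1,          # insert y[j]
            )
        prev = cur
    return prev[m]


print(edit_distance("banana", "ananas"))  # 2
```

The bottom-right entry of the table is the answer, so filling all n·m entries gives the O(n^2) running time.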
Faster Algorithms?

Compute ed(x,y) for given x, y ∈ Σ^n

• O(n^2) time [WF’74]
• O(n^2 / log^2 n) time [MP’80]
• Linear time (or near-linear)?
   • Specific cases (average, smoothed, restricted input) and variants (block edit dist etc.) [U’83, LV’85, M’86, GG’88, GP’89, UW’90, CL’90, CH’98, LMS’98, U’85, CL’92, N’99, CPSV’00, MS’00, CM’02, AK’08, BF’08…]
   • 2^{Õ(√log n)}-approximation [OR’05, AO’09], improving earlier n^c-approximation [BEKMRRS’03, BJKK’04, BES’06]

Same “barrier” of 2^{Õ(√log n)}-approximation also for related tasks:
• Nearest neighbor search (text indexing), embedding into normed spaces, sketching [OR’05]
Results I

Theorem 1: Can approximate ed(x,y) within (log n)^{O(1/ε)}
factor in time n^{1+ε} (for any ε>0).

Exponential improvement over previous factor 2^{Õ(√log n)}

Fallout from the study of asymmetric query model …
Approach: asymmetric query model

• “Compress” one string, x, to n^ε information
   • Use dynamic programming to compute ed(x,y) in n^{1+ε} time
• How to compress?
   • Carefully subsample x…
• Focus on sample-size (number of queried positions) in x, for fixed y?
   • Obtain near-tight bounds
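A small Python sketch of how query accounting in the asymmetric model could look: access to y is free, while every distinct position read from x is counted. The class name QueryOracle and its attributes are hypothetical names introduced for illustration only.

```python
class QueryOracle:
    """Wraps the string x and counts the distinct positions that are read."""

    def __init__(self, x: str):
        self._x = x
        self.queried = set()

    def __len__(self) -> int:
        return len(self._x)

    def __getitem__(self, i: int) -> str:
        # Each distinct index counts once, however often it is re-read.
        self.queried.add(i)
        return self._x[i]

    @property
    def num_queries(self) -> int:
        return len(self.queried)


x = QueryOracle("banana")
y = "ananas"  # y is fully known; reading it costs nothing in this model
probe = [x[0], x[3], x[0]]
print(x.num_queries)  # 2
```

An algorithm in this model would be charged x.num_queries, regardless of how much of y it inspects.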
Results II: Asymmetric Query Complexity


Problem: Decide ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/A
Complexity = #queries into x (unlimited access to y)
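Approximation A             | # queries (upper bound) | # queries (lower bound)
(log n)^{O(1/ε)}            | O(n^ε)                  | Ω(n^{ε/loglog n})
[n^{1/(t+1)}, n^{1/t-ε}]    | O(log^t n)              | Ω(log^t n)

[Figure: # queries as a function of the approximation A. Labels on the curve: Θ(log n), Θ(log^2 n), Θ(log^3 n), …, Θ(log^t n). Ticks on the A axis: n^{1/(t+1)}, n^{1/t-ε}, n^{1/4}, n^{1/3}, n^{1/2-ε}, n^{1/2}, n^{1-ε}, n.]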
Upper bound


Theorem 2: Can distinguish ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/A for
A = (log n)^{O(1/ε)} approximation with n^ε queries into x (for any ε>0).
Proof structure:
1. Characterize edit by “tree-distance” T_{xy}
   • Parameter b≥2 (degree)
   • T_{xy} ≈ ed(x,y) up to 6b·log n factor
2. Prune the tree to subsample x
[Figure: tree of degree b over the positions x_1, x_2, …, x_n, with the sampled positions in x marked]
Step 1: Tree distance
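Partition x into b blocks, recursively, for h = log_b n levels

[Figure: for b=3, x[1:n] splits into x[1:⅓n], x[⅓n:⅔n], x[⅔n:n], and so on down to single characters x[1], x[2], x[3], …; a block of x is matched against a substring y[u:u+⅓n] of y[1:n]]

T_i(s,u) = T-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i], where ℓ_i is the block-length at level i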
Tree distance: recursive definition



Recall T_i(s,u) = distance between x[s:s+ℓ_i] and y[u:u+ℓ_i]
Base case: T_h(s,u) = Hamming(x[s], y[u])
Output: T_{xy} = T_0(s=1, u=1)
[Figure: block x[s:s+ℓ_i] of x and block y[u:u+ℓ_i] of y, with an offset labeled r_0]
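The recursive step itself is not spelled out in the text of this slide. The Python sketch below therefore rests on an assumption: each of the b child blocks of x may be matched to y at its own shift r, paying |r| for the shift, and the parent's distance is the sum of the children's best costs. The base case and the output follow the slide. The name tree_distance, the 0-based indexing, the treatment of positions outside y as mismatches, and the requirement that n be a power of b are all choices made here for illustration.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """Brute-force T-distance under the ASSUMED recursive step
    T_i(s, u) = sum over children j of min over r of (T_{i+1}(child j, shifted by r) + |r|).
    """
    n = len(x)
    assert len(y) == n, "sketch assumes equal-length strings"
    m = n
    while m > 1:  # sketch assumes n is a power of b
        assert m % b == 0, "n must be a power of b"
        m //= b

    @lru_cache(maxsize=None)
    def T(length: int, s: int, u: int) -> int:
        if length == 1:
            # Base case: Hamming distance of single characters;
            # a position outside y counts as a mismatch (assumption).
            return 0 if 0 <= u < n and x[s] == y[u] else 1
        child = length // b
        total = 0
        for j in range(b):
            # A child block never costs more than its length at r = 0,
            # so shifts with |r| > child cannot be optimal.
            total += min(
                T(child, s + j * child, u + j * child + r) + abs(r)
                for r in range(-child, child + 1)
            )
        return total

    return T(n, 0, 0)


print(tree_distance("abcd", "abcd"))  # 0
print(tree_distance("ab", "ba"))      # 2
```

This exhaustive version tries every shift at every node, so it is only meant for tiny inputs; it does not attempt the pruning or sampling discussed in the talk.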
T-distance approximates edit distance

Lemma: T_{xy} ≈ ed(x,y) up to 6b·log_b n factor.

• Hierarchical decomposition inspired by earlier approaches [BEKMRRS’03, OR’05]
• All had approximation recurrence of the type
      A(n) = c·A(n/b) + b   for c≥2
   • Solves to A(n) ≥ 2^{√log n} factor for every choice of b
• Our characterization has no multiplicative loss (c=1):
      A(n) = A(n/b) + b
• Analysis inspired by algorithms for smoothed edit [AK’08]
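To see the difference between the two recurrences numerically, the following Python sketch unrolls A(n) = c·A(n/b) + b for a few values of b. The base case A(1) = 1 and the function name unroll are assumptions made here; only the shape of the recurrence comes from the slide.

```python
def unroll(n: int, b: int, c: int) -> int:
    """Unroll A(n) = c*A(n/b) + b down to the assumed base case A(1) = 1.

    Assumes n is a power of b.
    """
    a = 1
    while n > 1:
        a = c * a + b
        n //= b
    return a


n = 2 ** 16
for b in (2, 4, 16, 256):
    # c = 2: the multiplicative loss compounds over the log_b(n) levels.
    # c = 1: the loss is only additive, b per level.
    print(b, unroll(n, b, c=2), unroll(n, b, c=1))
```

With c = 1 the recurrence solves to 1 + b·log_b n, which is where a bound of the form O(b·log_b n) comes from; with c = 2 a small b gives many levels of doubling and a large b gives a large additive term, so no choice of b avoids a super-polylogarithmic factor.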
Step 2: Compute the tree distance

• For b=2, T-distance gives O(log n) approximation!
   • BUT know only how to compute T-distance in Õ(n^2) time
• Instead, for b = (log n)^{1/ε}, can prune the tree to n^{O(ε)} nodes, and get 1+ε approximation
• Pruning: subsample (log n)^{O(1)} children out of each node
   • Works only when ed(x,y) ≥ Ω(n)
   • Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma
[Figure: tree of degree b with the sampled positions in x marked]
Key tool: non-uniform sampling

Goal:
• For unknown a_1, a_2, …, a_n ∈ [0,1]
• Estimate their sum, up to an additive constant error
• Using only “weak” estimates ã_1, ã_2, …, ã_n

Game between the Sum Estimator and the Adversary:
0. Sum Estimator: fix distribution U
1. Adversary: fix a_1, a_2, …, a_n (unknown)
2. Sum Estimator: pick “precisions” u_i (our algorithm: u_i ~ U i.i.d.)
3. Adversary: provide ã_1, ã_2, …, ã_n s.t. |a_i − ã_i| < 1/u_i
4. Sum Estimator: report S̃ = S̃(ã_1, …, u_1, …) with |S̃ − ∑a_i| < 1.
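The Python sketch below does not implement the Precision Sampling estimator. It only illustrates the contract of the game and why a uniform choice of precisions is costly: if every u_i equals the same value u and the estimator simply sums the weak estimates, an adversary can push the total off by almost n/u. The function name naive_sum_error and the specific adversary (shifting each estimate up by 0.999/u) are assumptions for illustration.

```python
import math


def naive_sum_error(a, u):
    """Error of the naive estimator sum(weak estimates) when all precisions equal u.

    The adversary in this toy model shifts every estimate up by 0.999/u,
    which respects the contract |a_i - estimate_i| < 1/u.
    """
    weak = [ai + 0.999 / u for ai in a]
    return abs(sum(weak) - sum(a))


n = 1024
a = [0.5] * n
# Uniform precision of about log n: the additive error is far above 1.
print(naive_sum_error(a, u=math.log2(n)))
# Uniform precision of about n: the additive error drops just below 1.
print(naive_sum_error(a, u=n))
```

This is the gap that non-uniform precisions are meant to close: the goal is constant error while paying far less than precision n on average.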
Precision Sampling




Goal: estimate ∑a_i from {ã_i} s.t. |a_i − ã_i| < 1/u_i.
Precision Sampling Lemma: Can achieve WHP
• additive error 1 and multiplicative error 1.5
• with expected precision E_{u_i~U}[u_i] = O(log n).
Inspired by a technique from [IW’05] for streaming (F_k moments)
• In fact, PSL gives simple & improved algorithms for F_k moments, cascaded (mixed) norms, ℓ_p-sampling problems [AKO’10]
Also distant relative of Priority Sampling [DLT’07]
Precision Sampling for Edit Distance


Apply Precision Sampling to the tree from the characterization
recursively at each node
If a node has very weak precision, can trim the entire sub-tree
Lower Bound Theorem

Theorem 3: Achieving approximation A = O(log^7 n) for edit distance
requires asymmetric query complexity n^{Ω(1/loglog n)}.
• I.e., distinguishing ed(x,y) > n/10 vs ed(x,y) < n/(10A)
Implications:
• First lower bound to expose hardness from repetitiveness in edit distance
• Contrast with edit on non-repetitive strings (Ulam’s distance)
   • Empirically easier (better algorithms are known for it)
   • Yet, all previous lower bounds essentially equivalent for the two variants [BEKMRRS’03, AN’10, KN’05, KR’06, AK’07, AJP’10]
   • But asymmetric query complexity:
      • Ulam: 2-approx. with O(log n) queries [ACCL’04, SS’10]
      • Edit: requires n^{Ω(1/loglog n)} queries
Lower Bound Techniques

Core gadget: σ(·) = cyclic shift operation
• Observation: ed(x, σ^j(x)) ≤ 2j
Lower bound outline:
• Exhibit lower bound via shifts
• Amplification by “composing” the hard instance recursively
We will see here:
• Theorem 4: Asymmetric query complexity of approximation n^{1/2} to edit distance is Ω(log^2 n)
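The observation about cyclic shifts can be checked directly on a small example: deleting the first j characters and re-inserting them at the end turns x into its shift, which costs 2j operations. The Python sketch below assumes a left rotation for the shift and uses the textbook quadratic dynamic program; the function names and the test string are chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook quadratic dynamic program with unit-cost operations."""
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(prev[j - 1] + mismatch, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[len(y)]


def cyclic_shift(x: str, j: int) -> str:
    """Rotate x to the left by j positions (assumed direction of the shift)."""
    j %= len(x)
    return x[j:] + x[:j]


x = "0010111010011"
for j in range(len(x)):
    # Delete the first j characters, then insert them at the end: at most 2j operations.
    assert edit_distance(x, cyclic_shift(x, j)) <= 2 * j
print(cyclic_shift("00101", 2))  # 10100
```

The bound is what makes shifts useful as a gadget: a small shift keeps two strings close in edit distance while moving every position an algorithm might query.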
The Shift Gadget


Lemma: Ω(log n) query lower bound for approximation A = n^{0.5}.
Hard distribution (x,y):
• Fix specific z_1, z_2 ∈ {0,1}^n (random-looking)
• Set:
     y = 00101
     x = σ^j( 00101 )  ⇒  ed(x,y) ≤ 2n^{0.5}  [close]
  or x = σ^j( 01101 )  ⇒  ed(x,y) ≥ n/10  [far]
• Formally: y = z_1 and x = σ^j(z), where z is either z_1 or z_2, for random j ∈ [n^{0.5}]
An algorithm is a set of queried positions: Q ⊂ [n], |Q| << log n
• It “reads” z (either z_1 or z_2) at positions Q+j
• Claim: Both z_1|_{Q+j} and z_2|_{Q+j} are close to the uniform distribution on {0,1}^{|Q|}
   • up to ~2^{|Q|}/n^{0.5} statistical distance
Hence |Q| ≥ Ω(log n), even for approximation A = n^{0.99}
Amplification via Substitution Product


Ω(log^2 n) lower bound by amplification: “compose” two shift instances
Hard distribution (x,y):
• Fix z_1, z_2 ∈ {0,1}^{√n}, w_0, w_1 ∈ {0,1}^{√n} and y = z_1(w_0,w_1) (substitution)
• Choose either z = z_1 (close) or z = z_2 (far)
• x = z(w_0,w_1) but with random shifts j ∈ [n^{1/3}] inside each block and between blocks
Intuition: must distinguish z = z_1 from z = z_2
• Must “learn” Ω(log n) positions i of z, and each requires reading Ω(log n) further positions in the corresponding blocks w_{z[i]}
Example (shown without the random shifts):
z_1 = 00101
w_0 = 11011
w_1 = 00111
x = 11011 11011 00111 11011 00111
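The substitution step in the example can be reproduced with a few lines of Python: every bit of z is replaced by the block w_0 or w_1. The helper names substitute and shifted_blocks are hypothetical, and shifted_blocks applies only a caller-supplied shift inside each block; the shifts between blocks mentioned above are left out of this sketch.

```python
def substitute(z: str, w0: str, w1: str) -> str:
    """Replace each bit of z by the block w0 (for '0') or w1 (for '1')."""
    return "".join(w1 if bit == "1" else w0 for bit in z)


def cyclic_shift(s: str, j: int) -> str:
    """Rotate s to the left by j positions."""
    j %= len(s)
    return s[j:] + s[:j]


def shifted_blocks(z: str, w0: str, w1: str, shifts) -> str:
    """Like substitute, but block i is cyclically shifted by shifts[i].

    Only the shifts inside blocks are modeled here (an illustrative simplification).
    """
    return "".join(
        cyclic_shift(w1 if bit == "1" else w0, j) for bit, j in zip(z, shifts)
    )


y = substitute("00101", "11011", "00111")
print(y)  # 1101111011001111101100111
print(shifted_blocks("00101", "11011", "00111", [1, 0, 2, 0, 1]))
```

With all shifts set to zero, shifted_blocks returns exactly the unshifted string from the example.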
Towards the Full Theorem

For the full theorem: recursive composition

Proof overview:
1. Define α-similarity of k distributions (α ≈ information per query)
2. α-similarity ⇒ query lower bound 1/α (for adaptive algorithms)
3. Initial “Shift metric” has high α-similarity (induction basis)
4. α-similarity amplified under substitution product (inductive step)
5. Prove edit distance concentrates well (requires large alphabet)
6. Can reduce large alphabet to binary (lossy, but done once)
Conclusion

We compute ed(x,y) up to (log n)^{O(1/ε)} approximation in n^{1+ε} time
• Via Asymmetric Query Complexity (new model)
Open questions:
• Faster algorithms / limitations:
   • E.g. O(log^2 n) approximation in n^{1+o(1)} time?
• Use these insights for related problems:
   • Nearest Neighbor Search?
   • Sublinear-time algorithms (symmetric queries)?
   • Embeddings? Communication complexity?
Further thoughts:
• Practical ramifications?
• Asymmetric queries model?
• Paradigm for “fast dynamic programming”?