Approximating Edit Distance in Near-Linear Time
Alexandr Andoni (MIT)
Joint work with Krzysztof Onak (MIT)

Edit Distance
- For two strings x, y ∈ Σ^n, ed(x,y) = minimum number of edit operations needed to transform x into y.
- Edit operations = insertion / deletion / substitution.
- Example: ed(0101010, 1010101) = 2.
- Important in computational biology, text processing, etc.

Computing Edit Distance
- Problem: compute ed(x,y) for given x, y ∈ {0,1}^n.
- Exactly: O(n^2) [Levenshtein'65]; O(n^2 / log^2 n) for |Σ| = O(1) [Masek-Paterson'80].
- Approximately, in n^(1+o(1)) time: n^(1/3+o(1)) approximation [Batu-Ergun-Sahinalp'06], improving over [Myers'86, Bar-Yossef-Jayram-Krauthgamer-Kumar'04].
- Sublinear time: distinguish ed(x,y) ≤ n^(1-ε) from ed(x,y) ≥ n/100 in n^(1-2ε) time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami'03].

Computing via embedding into ℓ_1
- Embedding: f: {0,1}^n → ℓ_1 such that ed(x,y) ≈ ||f(x) − f(y)||_1 up to some distortion (= approximation).
- Then ed(x,y) can be computed in the time needed to compute f(x).
- Best embedding, by [Ostrovsky-Rabani'05]: distortion 2^Õ(√log n).
- Computation time: ~n^2, randomized (and similar dimension).
- Helps for nearest neighbor search and sketching, but not for computation…

Our result
- Theorem: ed(x,y) can be computed in n · 2^Õ(√log n) time with 2^Õ(√log n) approximation.
- While the algorithm uses some ideas of the [OR'05] embedding, it is not an algorithm for computing the [OR'05] embedding.

Sketcher's hat
Two examples of "sketches" obtained from embeddings:
- [Johnson-Lindenstrauss]: pick a random k-dimensional subspace of R^n; then for any q_1, …, q_n ∈ R^n, if q̃_i is the projection of q_i, then w.h.p. ||q_i − q_j||_2 ≈ ||q̃_i − q̃_j||_2 up to O(1) distortion, for k = O(log n).
- [Bourgain]: given n vectors q_i, one can construct n vectors q̃_i of dimension k = O(log^2 n) such that ||q_i − q_j||_1 ≈ ||q̃_i − q̃_j||_1 up to O(log n) distortion.

Our Algorithm
[Figure: z denotes the concatenation of x and y; z[i:i+m] is the length-m substring starting at position i.]
- For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ_1 such that ||v_i^m − v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m]).
- The dimension of v_i^m is only O(log^2 n).
- The vectors {v_i^m} are computed recursively from the {v_i^k} corresponding to shorter substrings (smaller k ∈ L).
- Output: ed(x,y) ≈ ||v_1^(n/2) − v_(n/2+1)^(n/2)||_1 (i.e., for m = n/2 = |x| = |y|).

Idea: intuition
- Goal: ||v_i^m − v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m]). How to compute {v_i^m} from {v_i^k} for k << m?
- [OR] show how to compute some vectors {w_i^m} with the same property, but of very high dimension (~m).
- Apply [Bourgain] to the vectors {w_i^m} to obtain vectors {v_i^m} of polylogarithmic dimension.
- This incurs "only" O(log n) distortion at this step of the recursion (which turns out to be OK).
- Challenge: how to do this in Õ(n) time?!

Key step: embeddings of shorter substrings → embeddings of longer substrings*
- Main Lemma: fix n vectors v_i ∈ ℓ_1^k of dimension k = O(log^2 n), and let s < n. Define A_i = {v_i, v_(i+1), …, v_(i+s-1)}. Then we can compute vectors q_i ∈ ℓ_1^k such that ||q_i − q_j||_1 ≈ EMD(A_i, A_j) up to distortion log^O(1) n. Computing the q_i's takes Õ(n) time.
- EMD(A,B) = min-cost bipartite matching*
(* cheating…)

Proof of Main Lemma
Chain of embeddings:
EMD over n sets A_i → (O(log^2 n)) → min_low ℓ_1^high → (O(1)) → min_low ℓ_1^low → (O(log n)) → min_low tree-metric → (O(log^3 n)) → sparse graph-metric → (O(log n), [Bourgain], efficient) → ℓ_1^low
- "low" = log^O(1) n.
- Graph-metric: shortest-path metric of a weighted graph. Sparse: Õ(n) edges.
- min_k M is the semi-metric on M^k with "distance" d_min,M(x,y) = min_{i=1..k} d_M(x_i, y_i).

Step 1: EMD over n sets A_i → (O(log^2 n)) → min_low ℓ_1^high
- q.e.d.

Step 2: min_low ℓ_1^high → (O(1)) → min_low ℓ_1^low
- Lemma 2: an n-point set in ℓ_1^H can be embedded into min_O(log n) ℓ_1^k, for k = log^3 n, with O(1) distortion.
- Uses weak dimensionality reduction in ℓ_1.
- Thm [Indyk'06]: Let A be a random* matrix of size H by k = log^3 n. Then for any x, y, letting x̃ = Ax, ỹ = Ay:
  - no contraction: ||x̃ − ỹ||_1 ≥ ||x − y||_1 (w.h.p.);
  - 5-expansion: ||x̃ − ỹ||_1 ≤ 5·||x − y||_1 (with probability 0.01).
- Just use O(log n) such embeddings: their min is an O(1) approximation to ||x − y||_1, w.h.p.

Efficiency of Steps 1+2
- Steps 1+2 give some embedding f(·) of the sets A_i = {v_i, v_(i+1), …, v_(i+s-1)} into min_low ℓ_1^low.
- Naively it would take Ω(n·s) = Ω(n^2) time to compute all f(A_i).
- Save time using linearity of the sketches: f(·) is linear, i.e., f(A) = Σ_{a∈A} f(a).
- Then f(A_i) = f(A_(i-1)) − f(v_(i-1)) + f(v_(i+s-1)).
- Computing the f(A_i) in order takes Õ(n) time in total (see the sketch below).
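To make the linearity trick concrete, here is a minimal Python sketch. The matrix R below is only a hypothetical stand-in for the actual embedding f(·) produced by Steps 1+2 (which is not implemented here); the point is that for any linear f, the sliding-window recurrence computes all f(A_i) with O(n) sketch operations instead of Ω(n·s).

```python
import numpy as np

# Toy illustration of the sliding-window trick from "Efficiency of Steps 1+2".
# Assumption: R is a stand-in linear sketch; the real f() from Steps 1+2 is a
# different (and more involved) linear map, not implemented here.

rng = np.random.default_rng(0)

n, k, d, s = 1000, 20, 8, 50     # n vectors v_i of dimension k; sketch dim d; window s
V = rng.random((n, k))           # the vectors v_1..v_n (rows)
R = rng.standard_normal((k, d))  # hypothetical linear sketch: f(a) = a @ R

def sketch(a):
    """Linear sketch of a single vector: f(a) = a @ R."""
    return a @ R

# f(A_i) = sum_{j=i..i+s-1} f(v_j).  Naively this costs ~n*s sketch additions;
# linearity gives f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}), i.e. O(n) total.
def window_sketches(V, s):
    n = len(V)
    out = np.empty((n - s + 1, R.shape[1]))
    cur = sum(sketch(V[j]) for j in range(s))   # f(A_1)
    out[0] = cur
    for i in range(1, n - s + 1):
        cur = cur - sketch(V[i - 1]) + sketch(V[i + s - 1])
        out[i] = cur
    return out

# Sanity check against the naive computation.
W = window_sketches(V, s)
naive = np.array([sum(sketch(V[j]) for j in range(i, i + s)) for i in range(n - s + 1)])
assert np.allclose(W, naive)
```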
Step 3: min_low ℓ_1^low → (O(log n)) → min_low tree-metric
- Lemma 3: ℓ_1 over {0..M}^p can be embedded into min_low tree-metrics with O(log n) distortion.
- For each Δ = a power of 2, take O(log n) random grids; each grid gives one min-coordinate.
[Figure: randomly shifted grid of cell size Δ.]

Step 4: min_low tree-metric → (O(log^3 n)) → sparse graph-metric
- Lemma 4: suppose we have n points in a min_low tree-metric which approximates a metric up to distortion D. Then we can embed them into a graph-metric of size Õ(n) with distortion D.

Step 5: sparse graph-metric → (O(log n)) → ℓ_1^low
- Lemma 5: given a graph with m edges, its graph-metric can be embedded into ℓ_1^low with O(log n) distortion in Õ(m) time.
- Just implement [Bourgain]'s embedding (a concrete sketch is given after the Conclusion):
  - choose O(log^2 n) sets B_i;
  - compute the distance from each node to each B_i;
  - for each B_i, the distances to all nodes can be computed with Dijkstra's algorithm in Õ(m) time.

Summary of Main Lemma
EMD over n sets A_i → (O(log^2 n)) → min_low ℓ_1^high → (O(1), oblivious) → min_low ℓ_1^low → (O(log n)) → min_low tree-metric → (O(log^3 n)) → sparse graph-metric (non-oblivious) → (O(log n)) → ℓ_1^low
- The min-product helps to get low dimension (~small-size sketch): it bypasses the impossibility of dimension reduction in ℓ_1.
- It is OK that the min-product is not a metric, as long as it is close to a metric.

Conclusion
- Theorem: ed(x,y) can be computed in n · 2^Õ(√log n) time with 2^Õ(√log n) approximation.
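As a concrete illustration of Step 5, here is a minimal Python sketch of a Bourgain-style embedding of a sparse graph metric: each coordinate is the distance to a random node set B, computed by one multi-source Dijkstra run over the m edges. This is only an illustrative rendering under assumptions (the toy adjacency list, the exact number of sets, constants, and the final scaling are not from the talk), not the authors' implementation.

```python
import heapq, random
from math import log2, ceil

# Minimal sketch of Step 5: embed a sparse graph metric into low-dimensional l_1
# via a Bourgain-style construction -- coordinates are distances to random node
# sets B, each computed by one multi-source Dijkstra over the m edges.
# Illustrative only; constants, scaling, and the choice of sets are glossed over.

def multi_source_dijkstra(adj, sources):
    """Distance from every node to the nearest node in `sources`."""
    dist = {v: float("inf") for v in adj}
    for s in sources:
        dist[s] = 0.0
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def bourgain_embedding(adj, rng=random.Random(0)):
    """Map each node to the vector of distances d(v, B_{t,j}) (O(log^2 n) coordinates)."""
    nodes = list(adj)
    n = len(nodes)
    T = max(1, ceil(log2(n)))      # sampling rates 2^-1 .. 2^-T
    reps = ceil(log2(n))           # O(log n) random sets per rate
    emb = {v: [] for v in nodes}
    for t in range(1, T + 1):
        p = 2.0 ** (-t)
        for _ in range(reps):
            B = [v for v in nodes if rng.random() < p] or [rng.choice(nodes)]
            dist = multi_source_dijkstra(adj, B)
            for v in nodes:
                emb[v].append(dist[v])
    return emb

# Tiny usage example on an assumed weighted graph given as an adjacency list.
adj = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("a", 1.0), ("c", 2.0), ("d", 5.0)],
    "c": [("a", 4.0), ("b", 2.0), ("d", 1.0)],
    "d": [("b", 5.0), ("c", 1.0)],
}
emb = bourgain_embedding(adj)
l1 = sum(abs(x - y) for x, y in zip(emb["a"], emb["d"]))
print(l1)  # correlates with d(a, d) = 4 up to the O(log n) distortion (after scaling)
```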