An Extension of the String-to-String Correction Problem
Roy Lowrance and Robert A. Wagner
Journal of the ACM, vol. 22, No. 2, April 1975, pp. 177-183.
Speaker: 吳展碩
Edit Distance
• Three edit operations:
– Substitution
• abcd -> aacd (change b to a)
– Insertion
• abcd -> abacd (insert an a)
– Deletion
• abcd -> abd (delete c)
• Given two strings T and P, the problem is to determine
the minimum number of edit operations needed to transform T
into P.
Note: For clarity, we assume that all edit operations have the same cost.
d[i, j] = min( d[i-1, j] + 1,
               d[i, j-1] + 1,
               d[i-1, j-1] + cost(A[i]->B[j])
             )
where cost(A[i]->B[j]) is 0 if A[i] = B[j] and 1 otherwise.
      s  a  t  u  r  d  a  y
   0  1  2  3  4  5  6  7  8
s  1  0  1  2  3  4  5  6  7
u  2  1  1  2  2  3  4  5  6
n  3  2  2  2  3  3  4  5  6
d  4  3  3  3  3  4  3  4  5
a  5  4  3  4  4  4  4  3  4
y  6  5  4  4  5  5  5  4  3
This example is copied from Wikipedia
[Figure: "s a t u r d a y" aligned above "s u n d a y"]
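The table above can be reproduced with a short dynamic program. This is a minimal sketch assuming unit costs for all three operations; the function name `edit_distance_table` is chosen here for illustration and does not come from the slides.

```python
def edit_distance_table(a: str, b: str) -> list[list[int]]:
    """Return the full table d, where d[i][j] is the edit distance
    between the first i characters of a and the first j characters of b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d


table = edit_distance_table("sunday", "saturday")
print(table[-1][-1])  # 3, the bottom-right entry of the table
```

Each row of `table` corresponds to one row of the matrix shown on the slide, so the last row is the "y" row.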
The Problem
• This paper extends the set of edit
operations to include the operation of
interchanging two adjacent characters.
– Swap
• Example:
T: a b c d
P: c d a
a b c d -> a c d (delete b) -> c a d (swap a, c) -> c d a (swap a, d)
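The example can be replayed step by step. The helpers `delete` and `swap` below are hypothetical names introduced only for this sketch; they apply one deletion and one adjacent interchange to a Python string.

```python
def delete(s: str, i: int) -> str:
    """Remove the character at index i."""
    return s[:i] + s[i + 1:]


def swap(s: str, i: int) -> str:
    """Interchange the adjacent characters at indices i and i + 1."""
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]


steps = ["abcd"]
steps.append(delete(steps[-1], 1))  # delete b      -> "acd"
steps.append(swap(steps[-1], 0))    # swap a and c  -> "cad"
steps.append(swap(steps[-1], 1))    # swap a and d  -> "cda"
print(" -> ".join(steps))
```

Three operations are used: one deletion and two swaps.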
Trace
• A trace is a graphical specification of how
edit operations apply to each character in
the two strings.
• Example (a-a, c-c and d-d are joined; b is untouched, so it is deleted):
T: a b c d
P: c d a
Important Properties
• In the following cases, the edit operations can be
replaced by other edit operations.
– Case 1:
T: a b c
P: b c a
2 swaps -> insertion + deletion
– Case 2:
T: a ... b
P: b ... c
swap + substitution, or deletion + substitution -> 2 substitutions
– Case 3:
T: a ... a ... b
P: b ... ... ... a
swap + K deletions + L insertions -> a trace with lower cost
(K and L are the gap lengths marked in the original figure)
The Algorithm
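                i'    i
A: ............ a ... b
B: ........ b ...... a
            j'       j
d[i, j] = min( d[i-1, j] + 1,
               d[i, j-1] + 1,
               d[i-1, j-1] + cost(A[i]->B[j]),
               d[i'-1, j'-1] + (i-i'-1) + (j-j'-1) + 1
             )
where i' is the last position before i with A[i'] = B[j], and
j' is the last position before j with B[j'] = A[i].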
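The recurrence above can be sketched in Python as follows. This is a minimal sketch assuming unit costs; the function name `extended_distance`, the dictionary `last_row`, and the extra sentinel row and column are illustration choices, not details taken from the paper.

```python
def extended_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution and
    adjacent swap, all at unit cost."""
    n, m = len(a), len(b)
    inf = n + m  # larger than any achievable distance
    # d[i + 1][j + 1] is the distance between a[:i] and b[:j].
    # Row 0 and column 0 are sentinels that stay at inf, so the swap
    # term is ignored when no earlier matching character exists.
    d = [[inf] * (m + 2) for _ in range(n + 2)]
    for i in range(n + 1):
        d[i + 1][1] = i
    for j in range(m + 1):
        d[1][j + 1] = j

    last_row = {}  # character -> last 1-based position in a seen so far
    for i in range(1, n + 1):
        last_col = 0  # last 1-based position in b (before j) equal to a[i-1]
        for j in range(1, m + 1):
            i1 = last_row.get(b[j - 1], 0)  # i' in the recurrence
            j1 = last_col                   # j' in the recurrence
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j + 1] + 1,   # deletion
                d[i + 1][j] + 1,   # insertion
                d[i][j] + cost,    # substitution or match
                # swap: delete the characters between i' and i,
                # interchange, insert the characters between j' and j
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[a[i - 1]] = i
    return d[n + 1][m + 1]


print(extended_distance("abcd", "cda"))  # 3
print(extended_distance("ca", "abc"))    # 2: ca -> ac -> abc
```

The second call shows the effect of the fourth term: without swaps, transforming "ca" into "abc" needs three operations.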
Summary
• After simple preprocessing of T and P, the
problem can be solved by dynamic
programming in O(|T||P|) time.
• If we allow the edit operations to have different costs
– Insertion (cost WI)
– Deletion (cost WD)
– Swap (cost WS)
– Substitution (cost WC)
then the algorithm works if 2 × WS ≥ WI + WD.
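A weighted version of the dynamic program can be sketched as below. It assumes that the swap term charges WD for every skipped character of T and WI for every skipped character of P, which is how the unit-cost counts generalise; the function name `weighted_distance`, its keyword arguments, and the choice to raise `ValueError` when the condition fails are all illustration choices, not part of the paper.

```python
def weighted_distance(t: str, p: str, wi=1, wd=1, ws=1, wc=1):
    """Cost of transforming t into p with weighted insertion (wi),
    deletion (wd), adjacent swap (ws) and substitution (wc)."""
    if 2 * ws < wi + wd:
        # Outside the stated condition the recurrence is not guaranteed
        # to be optimal, so this sketch refuses to run.
        raise ValueError("requires 2 * WS >= WI + WD")
    n, m = len(t), len(p)
    inf = float("inf")
    # h[i + 1][j + 1] is the cost of transforming t[:i] into p[:j];
    # row 0 and column 0 are sentinels that stay at infinity.
    h = [[inf] * (m + 2) for _ in range(n + 2)]
    for i in range(n + 1):
        h[i + 1][1] = i * wd
    for j in range(m + 1):
        h[1][j + 1] = j * wi

    last_row = {}  # character -> last 1-based position in t seen so far
    for i in range(1, n + 1):
        last_col = 0  # last 1-based position in p (before j) equal to t[i-1]
        for j in range(1, m + 1):
            i1 = last_row.get(p[j - 1], 0)
            j1 = last_col
            if t[i - 1] == p[j - 1]:
                sub = 0
                last_col = j
            else:
                sub = wc
            h[i + 1][j + 1] = min(
                h[i][j + 1] + wd,  # delete t[i-1]
                h[i + 1][j] + wi,  # insert p[j-1]
                h[i][j] + sub,     # substitute or match
                h[i1][j1] + (i - i1 - 1) * wd + ws + (j - j1 - 1) * wi,
            )
        last_row[t[i - 1]] = i
    return h[n + 1][m + 1]


print(weighted_distance("ca", "abc"))                          # 2
print(weighted_distance("ca", "abc", wi=3, wd=1, ws=2, wc=5))  # 5
```

In the second call the cheapest route is one swap (cost 2) followed by one insertion (cost 3).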