String Edit Distance

• String edit distance measures the similarity between two strings.
• Similarity is defined as the minimum number of edits to transform one string into another.
• The edit operations are:
– Insert, delete, and substitute
Edit Operations
• Each edit operation has an associated cost.
• We assume all edits have a cost of 1.
• For example, the distance between “Hello” and “Help” is 2.
– One substitute (the second “l” becomes “p”) and one delete (the final “o”).
• The fewer the edits, the more similar the strings.
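The “Hello” to “Help” example can be checked by hand with a small sketch. The helper names insert, delete, and substitute below are illustrative assumptions, not part of the slides; each call counts as one edit of cost 1.

```python
def insert(s: str, i: int, ch: str) -> str:
    """Return s with ch inserted before position i (one edit)."""
    return s[:i] + ch + s[i:]


def delete(s: str, i: int) -> str:
    """Return s with the character at position i removed (one edit)."""
    return s[:i] + s[i + 1:]


def substitute(s: str, i: int, ch: str) -> str:
    """Return s with the character at position i replaced by ch (one edit)."""
    return s[:i] + ch + s[i + 1:]


# Two edits turn "Hello" into "Help", matching the distance of 2.
step1 = substitute("Hello", 3, "p")  # second "l" becomes "p"
step2 = delete(step1, 4)             # drop the trailing "o"
print(step1, step2)
```

This only shows that two edits are enough. Showing that no single edit works is what the matrix procedure in the following steps establishes.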
Many Applications
• Spell checking.
• Aligning gene sequences.
• Computer virus detection.
• And much more.
Step1: Initialization
        H   E   L   L   O
    0   1   2   3   4   5
H   1
E   2
L   3
P   4
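As a minimal sketch of the initialization step, assuming the matrix is stored as a list of lists with one extra row and column for the empty prefix: the first row counts 0 to the length of the column string, and the first column counts 0 to the length of the row string. The function name init_matrix is an assumption made for illustration.

```python
def init_matrix(first: str, second: str) -> list[list[int]]:
    """Build a (len(first)+1) x (len(second)+1) matrix with borders filled in."""
    rows, cols = len(first) + 1, len(second) + 1
    m = [[0] * cols for _ in range(rows)]  # interior cells are placeholders
    for i in range(rows):
        m[i][0] = i  # i deletions turn a prefix of `first` into ""
    for j in range(cols):
        m[0][j] = j  # j insertions turn "" into a prefix of `second`
    return m


# Rows follow "HELP", columns follow "HELLO", as in the slide.
for row in init_matrix("HELP", "HELLO"):
    print(row)
```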
Step2: Fill in Matrix
• Fill in each cell of the matrix using the following equation:
– m[i][j] = min ( m[i][j-1] + 1,
                  m[i-1][j] + 1,
                  m[i-1][j-1] + cost )
• Cost is 0 if the ith character of the first string equals the jth character of the second string.
• Cost is 1 if they are not the same.
• The distance is in the bottom right cell.
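The two steps can be combined into one function. This is a sketch under the slide's assumptions (every edit costs 1), and the name edit_distance is chosen here for illustration. Note that Python strings are 0-indexed, so the ith character of the first string is first[i - 1].

```python
def edit_distance(first: str, second: str) -> int:
    """Minimum number of unit-cost inserts, deletes, and substitutes."""
    rows, cols = len(first) + 1, len(second) + 1
    m = [[0] * cols for _ in range(rows)]

    # Step 1: initialize the first column and first row.
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j

    # Step 2: fill in every remaining cell from its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if first[i - 1] == second[j - 1] else 1
            m[i][j] = min(
                m[i][j - 1] + 1,         # left neighbour plus one edit
                m[i - 1][j] + 1,         # upper neighbour plus one edit
                m[i - 1][j - 1] + cost,  # diagonal: match or substitute
            )

    # The distance is in the bottom right cell.
    return m[-1][-1]


print(edit_distance("Hello", "Help"))  # 2
```

The function fills the matrix row by row. Any order works as long as the left, upper, and diagonal neighbours of a cell are filled before the cell itself.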
Step2: Example
        H   E   L   L   O
    0   1   2   3   4   5
H   1   0
E   2   1
L   3   2
P   4   3
Step2: Example (cont.)
        H   E   L   L   O
    0   1   2   3   4   5
H   1   0   1   2   3   4
E   2   1   0   1   2   3
L   3   2   1   0   1   2
P   4   3   2   1   1   2
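To reproduce the worked example, the sketch below returns the whole matrix instead of only the bottom right cell and prints it with row and column labels. The names distance_matrix and format_matrix are assumptions made for illustration, and the fixed cell width of 3 is only meant for short strings.

```python
def distance_matrix(first: str, second: str) -> list[list[int]]:
    """Return the filled edit-distance matrix (rows: first, columns: second)."""
    rows, cols = len(first) + 1, len(second) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if first[i - 1] == second[j - 1] else 1
            m[i][j] = min(m[i][j - 1] + 1, m[i - 1][j] + 1, m[i - 1][j - 1] + cost)
    return m


def format_matrix(first: str, second: str) -> str:
    """Render the matrix as text with one label per row and per column."""
    m = distance_matrix(first, second)
    # Header: blank label column, blank corner cell, then the column letters.
    lines = [" " + "".join(f"{ch:>3}" for ch in " " + second)]
    # Row 0 belongs to the empty prefix, so its label is a space.
    for label, row in zip(" " + first, m):
        lines.append(label + "".join(f"{value:>3}" for value in row))
    return "\n".join(lines)


print(format_matrix("HELP", "HELLO"))
```

The last value of the last row is the distance between the two strings, which is 2 for this pair.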