String Edit Distance

• String edit distance measures how similar two strings are.
• The distance is defined as the minimum number of edits needed to transform one string into the other.
• The edit operations are:
  – Insert, delete, and substitute

Edit Operations

• Each edit operation has an associated cost.
• We assume all edits have a cost of 1.
• For example, the distance between “Hello” and “Help” is 2.
  – One substitute (the second “l” becomes “p”) and one delete (the “o”).
• The fewer the edits, the more similar the strings.

Many Applications

• Spell checking.
• Aligning gene sequences.
• Computer virus detection.
• And much more.

Step 1: Initialization

• Build a matrix with one row per character of the first string (HELP) and one column per character of the second string (HELLO), plus an extra leading row and column.
• Fill the leading row with 0 to 5 and the leading column with 0 to 4 (. = not yet filled).

           H  E  L  L  O
        0  1  2  3  4  5
    H   1  .  .  .  .  .
    E   2  .  .  .  .  .
    L   3  .  .  .  .  .
    P   4  .  .  .  .  .

Step 2: Fill in Matrix

• Fill in each cell of the matrix using the following equation:
  – m[i][j] = min( m[i][j-1] + 1, m[i-1][j] + 1, m[i-1][j-1] + cost )
• Cost is 0 if the ith character of the first string equals the jth character of the second string.
• Cost is 1 if they are not the same.
• The distance is in the bottom right cell.

Step 2: Example

• After filling in the H column:

           H  E  L  L  O
        0  1  2  3  4  5
    H   1  0  .  .  .  .
    E   2  1  .  .  .  .
    L   3  2  .  .  .  .
    P   4  3  .  .  .  .

Step 2: Example (cont.)

• The completed matrix. The bottom right cell gives the distance, 2.

           H  E  L  L  O
        0  1  2  3  4  5
    H   1  0  1  2  3  4
    E   2  1  0  1  2  3
    L   3  2  1  0  1  2
    P   4  3  2  1  1  2
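
The two steps above (initialize the leading row and column, then fill each cell with the min rule) can be sketched in Python. This is a minimal illustration, not code from the slides: the function names `edit_matrix` and `edit_distance` are hypothetical, and it assumes the slides' unit cost of 1 for every insert, delete, and substitute.

```python
def edit_matrix(first: str, second: str) -> list[list[int]]:
    """Build the full edit-distance matrix (hypothetical helper name).

    Rows correspond to characters of `first`, columns to characters of
    `second`, with one extra leading row and column for the empty prefix.
    """
    rows, cols = len(first) + 1, len(second) + 1
    m = [[0] * cols for _ in range(rows)]

    # Step 1: initialization of the leading column and leading row.
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j

    # Step 2: fill every remaining cell with the min rule from the slides.
    for i in range(1, rows):
        for j in range(1, cols):
            # Cost is 0 when the ith character of `first` equals the
            # jth character of `second`, otherwise 1.
            cost = 0 if first[i - 1] == second[j - 1] else 1
            m[i][j] = min(
                m[i][j - 1] + 1,         # left neighbor plus one edit
                m[i - 1][j] + 1,         # upper neighbor plus one edit
                m[i - 1][j - 1] + cost,  # diagonal neighbor plus cost
            )
    return m


def edit_distance(first: str, second: str) -> int:
    """Return the value in the bottom right cell of the matrix."""
    return edit_matrix(first, second)[-1][-1]


if __name__ == "__main__":
    # Print the matrix for the slides' example, one row per line.
    for row in edit_matrix("HELP", "HELLO"):
        print(row)
    print(edit_distance("Hello", "Help"))  # prints 2
```

Keeping the whole matrix makes it easy to compare against the worked example. If only the distance is needed, storing just the previous row would be enough, since each cell depends only on its left, upper, and diagonal neighbors.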