Chapter 6: Edit Distance

Dynamic Programming:
Edit Distance
Outline
1. DNA Sequence Comparison and CF
2. Change Problem
3. Manhattan Tourist Problem
4. Longest Paths in Graphs
5. Sequence Alignment
6. Edit Distance
Section 1:
DNA Sequence
Comparison and CF
DNA Sequence Comparison: First Success Story
• Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene's function.
• In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the gene for a normal growth factor (PDGF).
Russell Doolittle
http://biology.ucsd.edu/faculty/doolittle.html
Cystic Fibrosis
• Cystic fibrosis (CF): A chronic and
frequently fatal disease which
produces an abnormally large
amount of mucus.
• Mucus is a slimy material that
coats many epithelial surfaces
and is secreted into fluids such as
saliva.
• CF primarily affects the respiratory
systems of children.
http://www.eradimaging.com/site/article.cfm?ID=327
Cystic Fibrosis: Inheritance
• In the early 1980s biologists
hypothesized that CF is a genetic
disorder caused by mutations in an
unidentified gene.
• Heterozygous carriers are
asymptomatic.
• Therefore a person must be
homozygously recessive in the CF
gene in order to be diagnosed with
CF.
http://www.discern-genetics.org/diagram61.php
Cystic Fibrosis: Connection to Other Proteins
• Adenosine Triphosphate (ATP): The energy source of all
cell processes.
• ATP binding proteins are present on the cell membrane and act
as transport channels.
• In 1989, biologists found a similarity between the cystic
fibrosis gene and ATP binding proteins.
• This connection was plausible, given the fact that CF involves sweat secretion with abnormally high sodium levels.
Cystic Fibrosis: Mutation Analysis
• If a high percentage of CF patients have a given mutation in
the gene and the normal patients do not, then this could be an
indicator of a mutation related to CF.
• A certain mutation was in fact found in 70% of CF patients, providing convincing evidence that it is a predominant genetic marker for diagnosing CF.
Cystic Fibrosis and the CFTR Protein
• CFTR Protein: A protein of 1480 amino acids that regulates a
chloride ion channel.
• CFTR adjusts the “wateriness” of fluids secreted by the cell.
• Those with cystic fibrosis are missing a single amino acid in their CFTR protein.
Section 2: The Change
Problem
Bring in the Bioinformaticians
• Similarities between a gene with known function and a gene
with unknown function allow biologists to infer the function of
the gene with unknown function.
• We would like to compute a similarity score between two
genes to tell how likely it is that they have similar functions.
• Dynamic programming is a computing technique for revealing
similarities between sequences.
• The Change Problem is a good problem to introduce the idea
of dynamic programming.
Motivating Example: The Change Problem
• Say we want to provide change totaling 97 cents.
• We could do this in a large number of ways, but the way that uses the fewest coins is:
• Three quarters = 75 cents
• Two dimes = 20 cents
• Two pennies = 2 cents
• Question 1: How do we know that this uses the fewest coins?
• Question 2: Can we generalize to arbitrary denominations?
The Change Problem: Formal Statement
• Goal: Convert some amount of money M into given
denominations, using the fewest possible number of coins.
• Input: An amount of money M, and an array of d
denominations c = (c1, c2, …, cd), in decreasing order of
value (c1 > c2 > … > cd).
• Output: A list of d integers i1, i2, …, id such that
c1i1 + c2i2 + … + cdid = M
and i1 + i2 + … + id is minimal.
The Change Problem: Another Example
• Given the denominations 1, 3, and 5, what is the minimum number of coins needed to make change for a given value?

  Value:          1  2  3  4  5  6  7  8  9  10
  Min # of coins: 1  2  1  2  1  2  3  2  3  2

• Only one coin is needed to make change for the values 1, 3, and 5.
• However, two coins are needed to make change for the values 2, 4, 6, 8, and 10.
• Lastly, three coins are needed to make change for 7 and 9.
The Change Problem: Recurrence
• This example expresses the following recurrence relation:
minNumCoinsM 1  1

minNumCoinsM   min minNumCoinsM  3  1

minNumCoinsM  5  1
The Change Problem: Recurrence
• In general, given the denominations c: c1, c2, …, cd, the
recurrence relation is:
minNumCoinsM  c   1
1

minNumCoinsM  c 2   1
minNumCoinsM   min 


minNumCoinsM  c d   1
The Change Problem: Pseudocode
RecursiveChange(M, c, d)
  if M = 0
    return 0
  bestNumCoins ← infinity
  for i ← 1 to d
    if M ≥ ci
      numCoins ← RecursiveChange(M - ci, c, d)
      if numCoins + 1 < bestNumCoins
        bestNumCoins ← numCoins + 1
  return bestNumCoins
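As a sketch, the pseudocode above translates almost line for line into Python; the function name and the example call below are illustrative rather than part of the original slides.

def recursive_change(money, coins):
    """Minimum number of coins from `coins` summing to `money` (naive recursion)."""
    if money == 0:
        return 0
    best_num_coins = float("inf")
    for c in coins:
        if money >= c:
            num_coins = recursive_change(money - c, coins)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins

# Example (illustrative): recursive_change(9, (1, 3, 7)) returns 3, but the
# running time blows up as `money` grows, because the same sub-amounts are
# recomputed over and over, as the next slides show.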
The RecursiveChange Tree: Example
• We now will provide the tree of recursive calls if M = 77 and
the denominations are 1, 3, and 7.
• We will outline all the occurrences of 70 cents to demonstrate
how often it is called.
The RecursiveChange Tree
[figure: the tree of recursive calls starting from 77 with denominations (1, 3, 7); each node branches into the amounts 1, 3, and 7 cents smaller, and the amount 70 appears again and again in different branches]
RecursiveChange: Inefficiencies
• As we can see, RecursiveChange recalculates the optimal coin combination for a given amount of money repeatedly.
• For our example of M = 77, c = (1,3,7):
• The optimal coin combination for 70 cents is computed 9 times!
• The optimal coin combination for 50 cents is computed billions of times!
• Imagine how many times the optimal coin combination for 3 cents would be calculated…
RecursiveChange: Suggested Improvements
• We’re re-computing values in our algorithm more than once.
• Instead, let’s save results of each computation for all amounts
from 0 to M. This way, we can do a reference call to find an
already computed value, instead of re-computing each time.
• The new algorithm will have running time O(M·d), where M is the amount of money and d is the number of denominations.
• This is an example of the method of dynamic programming.
The Change Problem: Dynamic Programming
DPChange(M, c, d)
  bestNumCoins[0] ← 0
  for m ← 1 to M
    bestNumCoins[m] ← infinity
    for i ← 1 to d
      if m ≥ ci
        if bestNumCoins[m - ci] + 1 < bestNumCoins[m]
          bestNumCoins[m] ← bestNumCoins[m - ci] + 1
  return bestNumCoins[M]
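A minimal Python sketch of DPChange, assuming the same inputs as above (an amount `money` and a tuple of denominations `coins`); the names are illustrative.

def dp_change(money, coins):
    """Fill best_num_coins[0..money] from left to right, reusing earlier entries."""
    best_num_coins = [0] + [float("inf")] * money
    for m in range(1, money + 1):
        for c in coins:
            if m >= c and best_num_coins[m - c] + 1 < best_num_coins[m]:
                best_num_coins[m] = best_num_coins[m - c] + 1
    return best_num_coins[money]

# dp_change(9, (1, 3, 7)) == 3 and dp_change(77, (1, 3, 7)) == 11,
# each computed in O(M * d) time instead of exponential time.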
DPChange: Example
• For example, let us take c = (1,3,7) and M = 9. The array bestNumCoins fills from left to right:

  m:               0  1  2  3  4  5  6  7  8  9
  bestNumCoins[m]: 0  1  2  1  2  3  2  1  2  3

• So the minimum number of coins needed to make change for 9 cents with denominations (1,3,7) is 3.
Quick Note on Dynamic Programming
• You may have noticed that the dynamic programming
algorithm provided somewhat resembles the recursive
algorithm we already had.
• The difference is that with recursion, we had constant
repetition since we proceeded from more complicated sums
“down the tree” to less complicated ones.
• With DPChange, we always “build up” from easier problem
instances to the desired one, and in so doing avoid repetition.
Section 3:
Manhattan Tourist
Problem
Additional Example: Manhattan Tourist Problem
• Imagine that you are a tourist in Manhattan, whose streets are represented by a grid. [figure: a street grid with the Hotel at the top-left corner, the Station at the bottom-right corner, and attractions marked by *]
• You are leaving town, and you want to see as many attractions (represented by *) as possible.
• Your time is limited: you only have time to travel east and south.
• What is the best path through town?
Manhattan Tourist Problem (MTP): Formulation
• Goal: Find the longest path in a weighted grid.
• Input: A weighted grid G with two distinct vertices, one
labeled “source” and the other labeled “sink.”
• Output: A longest path in G from “source” to “sink.”
MTP Greedy Algorithm
• Our first try at solving the MTP will use a greedy algorithm.
• Main Idea: At each node (intersection), choose the edge
(street) departing that node which has the greatest weight.
MTP Greedy Algorithm: Example
[figure: a weighted grid with the source at the top-left and the sink at the bottom-right; at each intersection the greedy algorithm follows the heaviest outgoing edge, accumulating a running score that reaches 23 at the sink]
MTP Greedy Algorithm Is Not Optimal
[figure: a small weighted grid on which the greedy path (total weight 18) differs from the optimal path (total weight 22); a single heavy edge weight heavily influences the choice of optimal path, but the greedy algorithm commits to its early choices before it can take that edge into account]
MTP: Simple Recursive Program
MT(n, m)
  if n = 0 and m = 0
    return 0
  if n = 0
    return MT(0, m-1) + length of the edge from (0, m-1) to (0, m)
  if m = 0
    return MT(n-1, 0) + length of the edge from (n-1, 0) to (n, 0)
  x ← MT(n-1, m) + length of the edge from (n-1, m) to (n, m)
  y ← MT(n, m-1) + length of the edge from (n, m-1) to (n, m)
  return max{x, y}
• What's wrong with this approach?
MTP: Dynamic Programming
• We calculate the optimal path score for each vertex in the graph.
• A given vertex's score is the maximum, over all incoming edges, of the prior vertex's score (along that incoming edge) plus the weight of that edge.
• Gold edges represent the incoming edge selected by the algorithm as maximal at each vertex.
• [figure: the grid is filled in one vertex at a time, e.g. S0,1 = 1, S1,0 = 5, S1,1 = 4, ..., until the sink is reached with S3,3 = 16]
• Once we reach the sink, we backtrack along gold edges to the source to find the optimal (green) path.
MTP: Running Time with Dynamic Programming
• The score s(i, j) for a point (i, j) is given by the recurrence:
  s(i, j) = max { s(i-1, j) + weight of the edge between (i-1, j) and (i, j),
                  s(i, j-1) + weight of the edge between (i, j-1) and (i, j) }
• The running time is O(n · m) for an n by m grid.
• (n = # of rows, m = # of columns)
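A small Python sketch of this recurrence; the grid representation (separate matrices `down` and `right` of edge weights) and the function name are assumptions made for the example, not a format given in the slides.

def manhattan_tourist(down, right):
    """Longest-path score from the source (0,0) to the sink (n,m) in a grid.

    down[i][j]  - weight of the edge from (i, j) down to (i+1, j)
    right[i][j] - weight of the edge from (i, j) right to (i, j+1)
    """
    n = len(down)          # rows of vertical edges
    m = len(right[0])      # columns of horizontal edges
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: the only way in is from above
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: the only way in is from the left
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]

# Tiny illustrative 1-by-1 grid: manhattan_tourist([[2, 1]], [[3], [4]]) == 6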
Traversing the Manhattan Grid
• So that we don’t repeat
ourselves, we need a strategy for
actually traversing through the
Manhattan grid to calculate the
distances.
• Three common strategies:
a) Column by column
b) Row by row
c) Along diagonals
Manhattan Is Not a Rectangular Grid
• What about diagonals? Suppose a vertex B has three predecessors A1, A2, and A3.
• The score at point B is given by the recurrence:
  sB = max { sA1 + weight of edge (A1, B),
             sA2 + weight of edge (A2, B),
             sA3 + weight of edge (A3, B) }
Section 4:
Longest Path in a Graph
Recursion for an Arbitrary Graph
• We would like to compute the score for point x in an
arbitrary graph.
• Let Predecessors(x) be the set of vertices with edges leading
into x. Then the recurrence is given by:
  sx = max over y in Predecessors(x) of { sy + weight of edge (y, x) }
• The running time for a graph with E edges is O(E), since
each edge is evaluated once.
Recursion for an Arbitrary Graph: Problem
• The only hitch is that we must decide on the order in which we
visit the vertices.
• By the time the vertex x is analyzed, the values sy for all its
predecessors y should already be computed.
• If the graph has a cycle, we will get stuck in the pattern of
going over and over the same cycle.
• In the Manhattan graph, we escaped this problem by requiring
that we could only move east or south. This is what we would
like to generalize…
Some Graph Theory Terminology
• Directed Acyclic Graph (DAG): A graph in which each edge is
provided an orientation, and which has no cycles.
• We represent the edges of a DAG with directed arrows.
• In the following example, we can move along the edge from B to
C, but not from C to B.
• Note that BCE does not form
a cycle, since we cannot travel
from B to C to E and back to B.
http://commons.wikimedia.org/wiki/File:Directed_acyclic_graph.svg
Some Graph Theory Terminology
• Topological Ordering: A labeling of the vertices of a DAG
(from 1 to n, say) such that every edge of the DAG connects a
vertex with a smaller label to a vertex with a larger label.
• In other words, if vertices are positioned on a line in an
increasing order, then all edges go from left to right.
• Theorem: Every DAG has a topological ordering.
• What this means: Every DAG has at least one source node (a vertex with no incoming edges, labeled 1) and at least one sink node (a vertex with no outgoing edges, labeled n).
Topological Ordering: Example
• A superhero’s costume can be represented by a DAG: he can’t
put his boots on before his tights!
• He also would like an understandable representation of the
graph, so that he can dress quickly.
Topological Ordering: Example
• Here are two different topological orderings of his DAG:
Longest Path in a DAG: Formulation
• Goal: Find a longest path between two vertices in a weighted
DAG.
• Input: A weighted DAG G with source and sink vertices.
• Output: A longest path in G from source to sink.
• Note: Now we know that we can apply a topological ordering
to G, and then use dynamic programming to find the longest
path in G.
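As a sketch of this formulation: topologically order the vertices, then relax each edge once. The adjacency-list representation and the function name below are assumptions for the example, not part of the slides.

from collections import defaultdict, deque

def longest_path_dag(edges, source, sink):
    """edges maps each vertex to a list of (successor, edge weight) pairs."""
    # Kahn's algorithm for a topological ordering.
    indegree = defaultdict(int)
    vertices = set(edges)
    for u in edges:
        for v, _ in edges[u]:
            indegree[v] += 1
            vertices.add(v)
    order = []
    queue = deque(v for v in vertices if indegree[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in edges.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Dynamic programming: s[x] = max over predecessors y of s[y] + weight(y, x).
    score = {v: float("-inf") for v in vertices}
    score[source] = 0
    for u in order:
        if score[u] == float("-inf"):
            continue                      # vertex not reachable from the source
        for v, w in edges.get(u, []):
            score[v] = max(score[v], score[u] + w)
    return score[sink]

# longest_path_dag({"s": [("a", 1), ("b", 5)], "a": [("t", 10)], "b": [("t", 1)]},
#                  "s", "t") == 11   (path s -> a -> t)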
Section 5:
Sequence Alignment
Back to Biology: Sequence Alignment
• Recall that our original problem was to compute a similarity score for two DNA sequences:
• To do this we will use what is called an alignment matrix.
Alignment Matrix: Example
• Given 2 DNA sequences v and w of length m and n:
  v: ATCTGAT   (m = 7)
  w: TGCATA    (n = 6)
• An alignment of two sequences is a 2 × k matrix (k ≥ m, n): one row holds the letters of v and the other the letters of w, each interspersed with gap symbols (--).
• Example of an alignment (this particular example aligns ATGTTAT with ATCGTAC, the pair used again in the edit graph example later):
  letters of v:  A  T  --  G  T  T   A  T   --
  letters of w:  A  T  C   G  T  --  A  --  C
  5 matches, 2 insertions (gaps in the row of v), 2 deletions (gaps in the row of w)
Common Subsequence
• Given two sequences v = v1 v2 … vm and w = w1 w2 … wn, a common subsequence of v and w is given by a sequence of positions in v, 1 ≤ i1 < i2 < … < it ≤ m, and a sequence of positions in w, 1 ≤ j1 < j2 < … < jt ≤ n, such that the letters of v at positions i1, …, it spell the same string as the letters of w at positions j1, …, jt.
• Example: v = ATGCCAT, w = TCGGGCTATC. Then take:
• i1 = 2, i2 = 3, i3 = 6, i4 = 7
• j1 = 1, j2 = 3, j3 = 8, j4 = 9
• This gives us the common subsequence TGAT.
Longest Common Subsequence
• Given two sequences v = v1 v2 … vm and w = w1 w2 … wn, the Longest Common Subsequence (LCS) of v and w is a common subsequence (positions 1 ≤ i1 < i2 < … < iT ≤ m in v and 1 ≤ j1 < j2 < … < jT ≤ n in w, with equal letters at corresponding positions) whose length T is maximal.
• Example: v = ATGCCAT, w = TCGGGCTATC.
• Before, we found that TGAT is a common subsequence.
• The longest common subsequence is TGCAT.
• But… how do we find the LCS of two sequences?
Edit Graph for LCS Problem
• Assign one sequence to the rows, and one to the columns. [figure: an edit graph whose columns are labeled A T C T G A T C and whose rows are labeled T G C A T A C]
• Every diagonal edge represents a match of elements.
• Therefore, in a path from source to sink, the diagonal edges spell out a common subsequence (the highlighted path in the figure spells out the common subsequence TGAT).
• LCS Problem: Find a path with the maximum number of diagonal edges.
Computing the LCS: Dynamic Programming
• Let vi = prefix of v of length i: v1 … vi
• and wj = prefix of w of length j: w1 … wj
• The length of LCS(vi,wj) is computed by:
  s(i, j) = max { s(i-1, j),
                  s(i, j-1),
                  s(i-1, j-1) + 1, if vi = wj }
Section 6:
Edit Distance
Hamming Distance
• The Hamming Distance dH(v, w) between two DNA
sequences v and w of the same length is equal to the number of
places in which the two sequences differ.
• Example: Given as follows, dH(v, w) = 8:
v: ATATATAT
w: TATATATA
• However, note that these sequences are still very similar.
• Hamming Distance is therefore not an ideal similarity score,
because it ignores insertions and deletions.
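For comparison, Hamming distance is only a few lines of Python; the function name is illustrative.

def hamming_distance(v, w):
    """Number of positions at which two equal-length sequences differ."""
    assert len(v) == len(w), "Hamming distance is only defined for equal lengths"
    return sum(1 for a, b in zip(v, w) if a != b)

# hamming_distance("ATATATAT", "TATATATA") == 8, even though the two
# sequences differ only by a shift of one position.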
Edit Distance
• Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other:
  d(v, w) = minimum number of elementary operations to transform v into w
Edit Distance: Example 1
• So in our previous example, we can shift w one nucleotide to the right and see that w is obtained from v by one insertion and one deletion:
  v: ATATATAT-
  w: -TATATATA
• Hence the edit distance is d(v, w) = 2.
• Note: In order to find this distance, we had to "fiddle" with the sequences; the Hamming distance was easier to compute.
Edit Distance: Example 2
• We can transform TGCATAT → ATCCGAT in 5 steps:
  TGCATAT
  TGCATA     (delete last T)
  TGCAT      (delete last A)
  ATGCAT     (insert A at the front)
  ATCCAT     (substitute C for G)
  ATCCGAT    (insert G before the last A)
• Note: This only allows us to conclude that the edit distance is at most 5.
Edit Distance: Example 2
• Now we transform TGCATAT → ATCCGAT in only 4 steps:
  TGCATAT
  ATGCATAT   (insert A at the front)
  ATGCAAT    (delete the second T)
  ATGCGAT    (substitute G for A)
  ATCCGAT    (substitute C for G)
• Can we do even better? 3 steps? 2 steps? How can we know?
Key Result
• Theorem: Given two sequences v and w of length m and n, and allowing only insertions and deletions as operations, the edit distance is d(v,w) = m + n - 2·s(v,w), where s(v,w) is the length of the longest common subsequence of v and w.
• This is great news, because it means that solving the LCS problem for v and w is equivalent to finding the (indel) edit distance between them.
• Therefore, we will solve the LCS problem instead in the
following slides.
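A small sketch checking this relationship on the earlier example, under the indel-only reading of the theorem; it simply reuses the LCS recurrence from Section 5, and the function names are illustrative.

def lcs_length(v, w):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1],
                          s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else 0)
    return s[n][m]

def indel_edit_distance(v, w):
    """Edit distance when only insertions and deletions are allowed."""
    return len(v) + len(w) - 2 * lcs_length(v, w)

# indel_edit_distance("ATATATAT", "TATATATA") == 2: the LCS has length 7,
# so 8 + 8 - 2*7 = 2, matching the shift-by-one alignment seen earlier.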
Return to the Edit Graph
• Every alignment corresponds to a path from source to sink.
• Horizontal and vertical edges correspond to indels (deletions and insertions).
• Diagonal edges correspond to matches and mismatches.
Alignment as a Path in the Edit Graph: Example
• Suppose that our sequences are ATCGTAC and ATGTTAT.
• One possible alignment is (with cumulative positions written above and below each column):
  0 1 2 2 3 4 5 6 7 7
  A T _ G T T A T _
  A T C G T _ A _ C
  0 1 2 3 4 5 5 6 6 7
• Edit graph path: (0,0)→(1,1)→(2,2)→(2,3)→(3,4)→(4,5)→(5,5)→(6,6)→(7,6)→(7,7)
Alignment with Dynamic Programming
• Initialize the 0th row and 0th column to be all zeroes.
• Use the following recursive formula to calculate si,j for each i, j:
  si,j = max { si-1,j ,
               si,j-1 ,
               si-1,j-1 + 1, if vi = wj }
Dynamic Programming Example
• Note that the placement of the arrows shows where a given score originated from:
  ↑ if from the top
  ← if from the left
  ↖ if vi = wj (from the diagonal)
Dynamic Programming Example
• Continuing with the dynamic programming algorithm fills the table.
• We first look for matches, highlighted in red, and then we fill in the rest of that row and column.
• As we can see, we do not simply add 1 each time.
Dynamic Alignment: Pseudocode
LCS(v, w)
  for i ← 1 to n
    si,0 ← 0
  for j ← 1 to m
    s0,j ← 0
  for i ← 1 to n
    for j ← 1 to m
      si,j ← max { si-1,j , si,j-1 , si-1,j-1 + 1 (if vi = wj) }
      bi,j ← "↑"  if si,j = si-1,j
             "←"  if si,j = si,j-1
             "↖"  if si,j = si-1,j-1 + 1
  return (sn,m, b)
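A Python sketch of LCS(v, w) that fills both the score matrix s and a backtracking matrix b; the strings "diag", "up", and "left" stand in for the arrows ↖, ↑, and ←, and the function name is illustrative.

def lcs_table(v, w):
    """Return (s, b): the LCS score matrix and the backtracking pointers."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1],
                          s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else 0)
            if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
                b[i][j] = "diag"        # corresponds to the ↖ arrow
            elif s[i][j] == s[i - 1][j]:
                b[i][j] = "up"          # corresponds to the ↑ arrow
            else:
                b[i][j] = "left"        # corresponds to the ← arrow
    return s, b

# s, b = lcs_table("ATGTTAT", "ATCGTAC"); s[7][7] == 5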
Finding an Optimal Alignment
• LCS(v,w) created the alignment grid.
• Follow the arrows backwards from the sink to the source to obtain the path corresponding to an optimal alignment. Column by column, the backtracking recovers:
  0 1 2 2 3 4 5 6 6 7
  A T _ G T T A _ T
  A T C G T _ A C _
  0 1 2 3 4 5 5 6 7 7
Printing LCS: Backtracking
PrintLCS(b, v, i, j)
  if i = 0 or j = 0
    return
  if bi,j = "↖"
    PrintLCS(b, v, i-1, j-1)
    print vi
  else
    if bi,j = "↑"
      PrintLCS(b, v, i-1, j)
    else
      PrintLCS(b, v, i, j-1)
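The same backtracking in Python, assuming the b matrix built by the lcs_table sketch shown after the LCS pseudocode (with "diag"/"up"/"left" standing in for the arrows):

def print_lcs(b, v, i, j):
    """Print one longest common subsequence by following the pointers in b."""
    if i == 0 or j == 0:
        return
    if b[i][j] == "diag":
        print_lcs(b, v, i - 1, j - 1)
        print(v[i - 1], end="")
    elif b[i][j] == "up":
        print_lcs(b, v, i - 1, j)
    else:
        print_lcs(b, v, i, j - 1)

# s, b = lcs_table("ATGTTAT", "ATCGTAC")
# print_lcs(b, "ATGTTAT", 7, 7)   # prints ATGTA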
LCS: Runtime
• It takes O(nm) time to fill in the n × m dynamic programming matrix: the pseudocode consists of one "for" loop nested inside another "for" loop.
• This is a very positive result for a problem that appeared much
more difficult initially.