Chapter 6: Edit Distance

Dynamic Programming:
Edit Distance
Outline
1. DNA Sequence Comparison and CF
2. Change Problem
3. Manhattan Tourist Problem
4. Longest Paths in Graphs
5. Sequence Alignment
6. Edit Distance
Section 1:
DNA Sequence
Comparison and CF
DNA Sequence Comparison: First Success Story
• Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene's function.
• In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the gene for a normal growth factor (PDGF).
Russell Doolittle
http://biology.ucsd.edu/faculty/doolittle.html
Cystic Fibrosis
• Cystic fibrosis (CF): A chronic and
frequently fatal disease which
produces an abnormally large
amount of mucus.
• Mucus is a slimy material that
coats many epithelial surfaces
and is secreted into fluids such as
saliva.
• CF primarily affects the respiratory
systems of children.
http://www.eradimaging.com/site/article.cfm?ID=327
Cystic Fibrosis: Inheritance
• In the early 1980s biologists
hypothesized that CF is a genetic
disorder caused by mutations in an
unidentified gene.
• Heterozygous carriers are
asymptomatic.
• Therefore a person must be
homozygously recessive in the CF
gene in order to be diagnosed with
CF.
http://www.discern-genetics.org/diagram61.php
Cystic Fibrosis: Connection to Other Proteins
• Adenosine Triphosphate (ATP): The energy source of all
cell processes.
• ATP binding proteins are present on the cell membrane and act
as transport channels.
• In 1989, biologists found a similarity between the cystic
fibrosis gene and ATP binding proteins.
• This connection was plausible, given the fact that CF involves sweat secretion with abnormally high sodium levels.
Cystic Fibrosis: Mutation Analysis
• If a high percentage of CF patients have a given mutation in
the gene and the normal patients do not, then this could be an
indicator of a mutation related to CF.
• A certain mutation was in fact found in 70% of CF patients, providing convincing evidence that it is a predominant genetic marker for diagnosing CF.
Cystic Fibrosis and the CFTR Protein
• CFTR Protein: A protein of 1480 amino acids that regulates a
chloride ion channel.
• CFTR adjusts the “wateriness” of fluids secreted by the cell.
• Those with cystic fibrosis are missing a single amino acid in their CFTR protein.
Section 2: The Change
Problem
Bring in the Bioinformaticians
• Similarities between a gene with known function and a gene
with unknown function allow biologists to infer the function of
the gene with unknown function.
• We would like to compute a similarity score between two
genes to tell how likely it is that they have similar functions.
• Dynamic programming is a computing technique for revealing
similarities between sequences.
• The Change Problem is a good problem to introduce the idea
of dynamic programming.
Motivating Example: The Change Problem
• Say we want to provide change totaling 97 cents.
• We could do this in a large number of ways, but the way that uses the fewest coins is:
• Three quarters = 75 cents
• Two dimes = 20 cents
• Two pennies = 2 cents
• Question 1: How do we know that this uses the fewest coins?
• Question 2: Can we generalize to arbitrary denominations?
The Change Problem: Formal Statement
• Goal: Convert some amount of money M into given
denominations, using the fewest possible number of coins.
• Input: An amount of money M, and an array of d
denominations c = (c1, c2, …, cd), in decreasing order of
value (c1 > c2 > … > cd).
• Output: A list of d integers i1, i2, …, id such that
c1i1 + c2i2 + … + cdid = M
and i1 + i2 + … + id is minimal.
The Change Problem: Another Example
• Given the denominations 1, 3, and 5, what is the minimum number of coins needed to make change for a given value?

  Value:          1  2  3  4  5  6  7  8  9  10
  Min # of coins: 1  2  1  2  1  2  3  2  3  2

• Only one coin is needed to make change for the values 1, 3, and 5.
• However, two coins are needed to make change for the values 2, 4, 6, 8, and 10.
• Lastly, three coins are needed to make change for 7 and 9.
The Change Problem: Recurrence
• This example expresses the following recurrence relation:
minNumCoinsM 1  1

minNumCoinsM   min minNumCoinsM  3  1

minNumCoinsM  5  1
The Change Problem: Recurrence
• In general, given the denominations c: c1, c2, …, cd, the
recurrence relation is:
minNumCoinsM  c   1
1

minNumCoinsM  c 2   1
minNumCoinsM   min 


minNumCoinsM  c d   1
The Change Problem: Pseudocode
RecursiveChange(M, c, d)
  if M = 0
    return 0
  bestNumCoins ← infinity
  for i ← 1 to d
    if M ≥ ci
      numCoins ← RecursiveChange(M - ci, c, d)
      if numCoins + 1 < bestNumCoins
        bestNumCoins ← numCoins + 1
  return bestNumCoins
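As a sketch, the pseudocode above translates almost line for line into Python; the function name and the example call below are illustrative rather than part of the original slides.

def recursive_change(money, coins):
    """Minimum number of coins from `coins` summing to `money` (naive recursion)."""
    if money == 0:
        return 0
    best_num_coins = float("inf")
    for c in coins:
        if money >= c:
            num_coins = recursive_change(money - c, coins)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins

# Example (illustrative): recursive_change(9, (1, 3, 7)) returns 3, but the
# running time blows up as `money` grows, because the same sub-amounts are
# recomputed over and over, as the next slides show.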
The RecursiveChange Tree: Example
• We now will provide the tree of recursive calls if M = 77 and
the denominations are 1, 3, and 7.
• We will outline all the occurrences of 70 cents to demonstrate
how often it is called.
The RecursiveChange Tree
[figure: the tree of recursive calls starting from 77 with denominations (1, 3, 7); each node branches into the amounts 1, 3, and 7 cents smaller, and the amount 70 appears again and again in different branches]
RecursiveChange: Inefficiencies
• As we can see, RecursiveChange recalculates the optimal coin combination for a given amount of money repeatedly.
• For our example of M = 77, c = (1,3,7):
• The optimal coin combination for 70 cents is computed 9 times!
• The optimal coin combination for 50 cents is computed billions of times!
• Imagine how many times the optimal coin combination for 3 cents would be calculated…
RecursiveChange: Suggested Improvements
• We’re re-computing values in our algorithm more than once.
• Instead, let’s save results of each computation for all amounts
from 0 to M. This way, we can do a reference call to find an
already computed value, instead of re-computing each time.
• The new algorithm will have running time O(M·d), where M is the amount of money and d is the number of denominations.
• This is an example of the method of dynamic programming.
The Change Problem: Dynamic Programming
DPChange(M, c, d)
  bestNumCoins[0] ← 0
  for m ← 1 to M
    bestNumCoins[m] ← infinity
    for i ← 1 to d
      if m ≥ ci
        if bestNumCoins[m - ci] + 1 < bestNumCoins[m]
          bestNumCoins[m] ← bestNumCoins[m - ci] + 1
  return bestNumCoins[M]
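A minimal Python sketch of DPChange, assuming the same inputs as above (an amount `money` and a tuple of denominations `coins`); the names are illustrative.

def dp_change(money, coins):
    """Fill best_num_coins[0..money] from left to right, reusing earlier entries."""
    best_num_coins = [0] + [float("inf")] * money
    for m in range(1, money + 1):
        for c in coins:
            if m >= c and best_num_coins[m - c] + 1 < best_num_coins[m]:
                best_num_coins[m] = best_num_coins[m - c] + 1
    return best_num_coins[money]

# dp_change(9, (1, 3, 7)) == 3 and dp_change(77, (1, 3, 7)) == 11,
# each computed in O(M * d) time instead of exponential time.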
DPChange: Example
• For example, let us take c = (1,3,7) and M = 9. The array bestNumCoins fills from left to right:

  m:               0  1  2  3  4  5  6  7  8  9
  bestNumCoins[m]: 0  1  2  1  2  3  2  1  2  3

• So the minimum number of coins needed to make change for 9 cents with denominations (1,3,7) is 3.
Quick Note on Dynamic Programming
• You may have noticed that the dynamic programming
algorithm provided somewhat resembles the recursive
algorithm we already had.
• The difference is that with recursion, we had constant
repetition since we proceeded from more complicated sums
“down the tree” to less complicated ones.
• With DPChange, we always “build up” from easier problem
instances to the desired one, and in so doing avoid repetition.
Section 3:
Manhattan Tourist
Problem
Additional Example: Manhattan Tourist Problem
• Imagine that you are a tourist in Manhattan, whose streets are represented by a grid. [figure: a street grid with the Hotel at the top-left corner, the Station at the bottom-right corner, and attractions marked by *]
• You are leaving town, and you want to see as many attractions (represented by *) as possible.
• Your time is limited: you only have time to travel east and south.
• What is the best path through town?
Manhattan Tourist Problem (MTP): Formulation
• Goal: Find the longest path in a weighted grid.
• Input: A weighted grid G with two distinct vertices, one
labeled “source” and the other labeled “sink.”
• Output: A longest path in G from “source” to “sink.”
MTP Greedy Algorithm
• Our first try at solving the MTP will use a greedy algorithm.
• Main Idea: At each node (intersection), choose the edge
(street) departing that node which has the greatest weight.
MTP Greedy Algorithm: Example
[figure: a weighted grid with the source at the top-left and the sink at the bottom-right; at each intersection the greedy algorithm follows the heaviest outgoing edge, accumulating a running score that reaches 23 at the sink]
MTP Greedy Algorithm Is Not Optimal
[figure: a small weighted grid on which the greedy path (total weight 18) differs from the optimal path (total weight 22); a single heavy edge weight heavily influences the choice of optimal path, but the greedy algorithm commits to its early choices before it can take that edge into account]
MTP: Simple Recursive Program
MT(n, m)
  if n = 0 and m = 0
    return 0
  if n = 0
    return MT(0, m-1) + length of the edge from (0, m-1) to (0, m)
  if m = 0
    return MT(n-1, 0) + length of the edge from (n-1, 0) to (n, 0)
  x ← MT(n-1, m) + length of the edge from (n-1, m) to (n, m)
  y ← MT(n, m-1) + length of the edge from (n, m-1) to (n, m)
  return max{x, y}
• What's wrong with this approach?
MTP: Dynamic Programming
• We calculate the optimal path score for each vertex in the graph.
• A given vertex's score is the maximum, over all incoming edges, of the prior vertex's score (along that incoming edge) plus the weight of that edge.
• Gold edges represent the incoming edge selected by the algorithm as maximal at each vertex.
• [figure: the grid is filled in one vertex at a time, e.g. S0,1 = 1, S1,0 = 5, S1,1 = 4, ..., until the sink is reached with S3,3 = 16]
• Once we reach the sink, we backtrack along gold edges to the source to find the optimal (green) path.
MTP: Running Time with Dynamic Programming
• The score s(i, j) for a point (i, j) is given by the recurrence:
  s(i, j) = max { s(i-1, j) + weight of the edge between (i-1, j) and (i, j),
                  s(i, j-1) + weight of the edge between (i, j-1) and (i, j) }
• The running time is O(n · m) for an n by m grid.
• (n = # of rows, m = # of columns)
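A small Python sketch of this recurrence; the grid representation (separate matrices `down` and `right` of edge weights) and the function name are assumptions made for the example, not a format given in the slides.

def manhattan_tourist(down, right):
    """Longest-path score from the source (0,0) to the sink (n,m) in a grid.

    down[i][j]  - weight of the edge from (i, j) down to (i+1, j)
    right[i][j] - weight of the edge from (i, j) right to (i, j+1)
    """
    n = len(down)          # rows of vertical edges
    m = len(right[0])      # columns of horizontal edges
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: the only way in is from above
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: the only way in is from the left
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]

# Tiny illustrative 1-by-1 grid: manhattan_tourist([[2, 1]], [[3], [4]]) == 6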
Traversing the Manhattan Grid
• So that we don’t repeat
ourselves, we need a strategy for
actually traversing through the
Manhattan grid to calculate the
distances.
• Three common strategies:
a) Column by column
b) Row by row
c) Along diagonals
Manhattan Is Not a Rectangular Grid
• What about diagonals? Suppose a vertex B has three predecessors A1, A2, and A3.
• The score at point B is given by the recurrence:
  sB = max { sA1 + weight of edge (A1, B),
             sA2 + weight of edge (A2, B),
             sA3 + weight of edge (A3, B) }
Section 4:
Longest Path in a Graph
Recursion for an Arbitrary Graph
• We would like to compute the score for point x in an
arbitrary graph.
• Let Predecessors(x) be the set of vertices with edges leading
into x. Then the recurrence is given by:
  sx = max over y in Predecessors(x) of { sy + weight of edge (y, x) }
• The running time for a graph with E edges is O(E), since
each edge is evaluated once.
Recursion for an Arbitrary Graph: Problem
• The only hitch is that we must decide on the order in which we
visit the vertices.
• By the time the vertex x is analyzed, the values sy for all its
predecessors y should already be computed.
• If the graph has a cycle, we will get stuck in the pattern of
going over and over the same cycle.
• In the Manhattan graph, we escaped this problem by requiring
that we could only move east or south. This is what we would
like to generalize…
Some Graph Theory Terminology
• Directed Acyclic Graph (DAG): A graph in which each edge is
provided an orientation, and which has no cycles.
• We represent the edges of a DAG with directed arrows.
• In the following example, we can move along the edge from B to
C, but not from C to B.
• Note that BCE does not form
a cycle, since we cannot travel
from B to C to E and back to B.
http://commons.wikimedia.org/wiki/File:Directed_acyclic_graph.svg
Some Graph Theory Terminology
• Topological Ordering: A labeling of the vertices of a DAG
(from 1 to n, say) such that every edge of the DAG connects a
vertex with a smaller label to a vertex with a larger label.
• In other words, if vertices are positioned on a line in an
increasing order, then all edges go from left to right.
• Theorem: Every DAG has a topological ordering.
• What this means: Every DAG has at least one source node (a vertex with no incoming edges, labeled 1) and at least one sink node (a vertex with no outgoing edges, labeled n).
Topological Ordering: Example
• A superhero’s costume can be represented by a DAG: he can’t
put his boots on before his tights!
• He also would like an understandable representation of the
graph, so that he can dress quickly.
Topological Ordering: Example
• Here are two different topological orderings of his DAG:
Longest Path in a DAG: Formulation
• Goal: Find a longest path between two vertices in a weighted
DAG.
• Input: A weighted DAG G with source and sink vertices.
• Output: A longest path in G from source to sink.
• Note: Now we know that we can apply a topological ordering
to G, and then use dynamic programming to find the longest
path in G.
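As a sketch of this formulation: topologically order the vertices, then relax each edge once. The adjacency-list representation and the function name below are assumptions for the example, not part of the slides.

from collections import defaultdict, deque

def longest_path_dag(edges, source, sink):
    """edges maps each vertex to a list of (successor, edge weight) pairs."""
    # Kahn's algorithm for a topological ordering.
    indegree = defaultdict(int)
    vertices = set(edges)
    for u in edges:
        for v, _ in edges[u]:
            indegree[v] += 1
            vertices.add(v)
    order = []
    queue = deque(v for v in vertices if indegree[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in edges.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Dynamic programming: s[x] = max over predecessors y of s[y] + weight(y, x).
    score = {v: float("-inf") for v in vertices}
    score[source] = 0
    for u in order:
        if score[u] == float("-inf"):
            continue                      # vertex not reachable from the source
        for v, w in edges.get(u, []):
            score[v] = max(score[v], score[u] + w)
    return score[sink]

# longest_path_dag({"s": [("a", 1), ("b", 5)], "a": [("t", 10)], "b": [("t", 1)]},
#                  "s", "t") == 11   (path s -> a -> t)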
Section 5:
Sequence Alignment
Back to Biology: Sequence Alignment
• Recall that our original problem was to compute a similarity score for two DNA sequences:
• To do this we will use what is called an alignment matrix.
Alignment Matrix: Example
• Given 2 DNA sequences v and w of length m and n:
  v: ATCTGAT   (m = 7)
  w: TGCATA    (n = 6)
• An alignment of two sequences is a 2 × k matrix (k ≥ m, n): one row holds the letters of v and the other the letters of w, each interspersed with gap symbols (--).
• Example of an alignment (this particular example aligns ATGTTAT with ATCGTAC, the pair used again in the edit graph example later):
  letters of v:  A  T  --  G  T  T   A  T   --
  letters of w:  A  T  C   G  T  --  A  --  C
  5 matches, 2 insertions (gaps in the row of v), 2 deletions (gaps in the row of w)
Common Subsequence
• Given two sequences v = v1 v2 … vm and w = w1 w2 … wn, a common subsequence of v and w is given by a sequence of positions in v, 1 ≤ i1 < i2 < … < it ≤ m, and a sequence of positions in w, 1 ≤ j1 < j2 < … < jt ≤ n, such that the letters of v at positions i1, …, it spell the same string as the letters of w at positions j1, …, jt.
• Example: v = ATGCCAT, w = TCGGGCTATC. Then take:
• i1 = 2, i2 = 3, i3 = 6, i4 = 7
• j1 = 1, j2 = 3, j3 = 8, j4 = 9
• This gives us the common subsequence TGAT.
Longest Common Subsequence
• Given two sequences v = v1 v2 … vm and w = w1 w2 … wn, the Longest Common Subsequence (LCS) of v and w is a common subsequence (positions 1 ≤ i1 < i2 < … < iT ≤ m in v and 1 ≤ j1 < j2 < … < jT ≤ n in w, with equal letters at corresponding positions) whose length T is maximal.
• Example: v = ATGCCAT, w = TCGGGCTATC.
• Before, we found that TGAT is a common subsequence.
• The longest common subsequence is TGCAT.
• But… how do we find the LCS of two sequences?
Edit Graph for LCS Problem
• Assign one sequence to the rows, and one to the columns. [figure: an edit graph whose columns are labeled A T C T G A T C and whose rows are labeled T G C A T A C]
• Every diagonal edge represents a match of elements.
• Therefore, in a path from source to sink, the diagonal edges spell out a common subsequence (the highlighted path in the figure spells out the common subsequence TGAT).
• LCS Problem: Find a path with the maximum number of diagonal edges.
Computing the LCS: Dynamic Programming
• Let vi = prefix of v of length i: v1 … vi
• and wj = prefix of w of length j: w1 … wj
• The length of LCS(vi,wj) is computed by:
  s(i, j) = max { s(i-1, j),
                  s(i, j-1),
                  s(i-1, j-1) + 1, if vi = wj }
Section 6:
Edit Distance
Hamming Distance
• The Hamming Distance dH(v, w) between two DNA
sequences v and w of the same length is equal to the number of
places in which the two sequences differ.
• Example: Given as follows, dH(v, w) = 8:
v: ATATATAT
w: TATATATA
• However, note that these sequences are still very similar.
• Hamming Distance is therefore not an ideal similarity score,
because it ignores insertions and deletions.
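For comparison, Hamming distance is only a few lines of Python; the function name is illustrative.

def hamming_distance(v, w):
    """Number of positions at which two equal-length sequences differ."""
    assert len(v) == len(w), "Hamming distance is only defined for equal lengths"
    return sum(1 for a, b in zip(v, w) if a != b)

# hamming_distance("ATATATAT", "TATATATA") == 8, even though the two
# sequences differ only by a shift of one position.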
Edit Distance
• Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other:
  d(v, w) = minimum number of elementary operations to transform v into w
Edit Distance: Example 1
• So in our previous example, we can shift w one nucleotide to the right and see that w is obtained from v by one insertion and one deletion:
  v: ATATATAT-
  w: -TATATATA
• Hence the edit distance is d(v, w) = 2.
• Note: In order to find this distance, we had to "fiddle" with the sequences; the Hamming distance was easier to compute.
Edit Distance: Example 2
• We can transform TGCATAT → ATCCGAT in 5 steps:
  TGCATAT
  TGCATA     (delete last T)
  TGCAT      (delete last A)
  ATGCAT     (insert A at the front)
  ATCCAT     (substitute C for G)
  ATCCGAT    (insert G before the last A)
• Note: This only allows us to conclude that the edit distance is at most 5.
Edit Distance: Example 2
• Now we transform TGCATAT → ATCCGAT in only 4 steps:
  TGCATAT
  ATGCATAT   (insert A at the front)
  ATGCAAT    (delete the second T)
  ATGCGAT    (substitute G for A)
  ATCCGAT    (substitute C for G)
• Can we do even better? 3 steps? 2 steps? How can we know?
Key Result
• Theorem: Given two sequences v and w of length m and n, and allowing only insertions and deletions as operations, the edit distance is d(v,w) = m + n - 2·s(v,w), where s(v,w) is the length of the longest common subsequence of v and w.
• This is great news, because it means that solving the LCS problem for v and w is equivalent to finding the (indel) edit distance between them.
• Therefore, we will solve the LCS problem instead in the
following slides.
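A small sketch checking this relationship on the earlier example, under the indel-only reading of the theorem; it simply reuses the LCS recurrence from Section 5, and the function names are illustrative.

def lcs_length(v, w):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1],
                          s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else 0)
    return s[n][m]

def indel_edit_distance(v, w):
    """Edit distance when only insertions and deletions are allowed."""
    return len(v) + len(w) - 2 * lcs_length(v, w)

# indel_edit_distance("ATATATAT", "TATATATA") == 2: the LCS has length 7,
# so 8 + 8 - 2*7 = 2, matching the shift-by-one alignment seen earlier.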
Return to the Edit Graph
• Every alignment corresponds to a path from source to sink.
• Horizontal and vertical edges correspond to indels (deletions and insertions).
• Diagonal edges correspond to matches and mismatches.
Alignment as a Path in the Edit Graph: Example
• Suppose that our sequences are ATCGTAC and ATGTTAT.
• One possible alignment is (with cumulative positions written above and below each column):
  0 1 2 2 3 4 5 6 7 7
  A T _ G T T A T _
  A T C G T _ A _ C
  0 1 2 3 4 5 5 6 6 7
• Edit graph path: (0,0)→(1,1)→(2,2)→(2,3)→(3,4)→(4,5)→(5,5)→(6,6)→(7,6)→(7,7)
Alignment with Dynamic Programming
• Initialize the 0th row and 0th column to be all zeroes.
• Use the following recursive formula to calculate si,j for each i, j:
  si,j = max { si-1,j ,
               si,j-1 ,
               si-1,j-1 + 1, if vi = wj }
Dynamic Programming Example
• Note that the placement of the arrows shows where a given score originated from:
  ↑ if from the top
  ← if from the left
  ↖ if vi = wj (from the diagonal)
Dynamic Programming Example
• Continuing with the dynamic programming algorithm fills the table.
• We first look for matches, highlighted in red, and then we fill in the rest of that row and column.
• As we can see, we do not simply add 1 each time.
Dynamic Alignment: Pseudocode
LCS(v, w)
  for i ← 1 to n
    si,0 ← 0
  for j ← 1 to m
    s0,j ← 0
  for i ← 1 to n
    for j ← 1 to m
      si,j ← max { si-1,j , si,j-1 , si-1,j-1 + 1 (if vi = wj) }
      bi,j ← "↑"  if si,j = si-1,j
             "←"  if si,j = si,j-1
             "↖"  if si,j = si-1,j-1 + 1
  return (sn,m, b)
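A Python sketch of LCS(v, w) that fills both the score matrix s and a backtracking matrix b; the strings "diag", "up", and "left" stand in for the arrows ↖, ↑, and ←, and the function name is illustrative.

def lcs_table(v, w):
    """Return (s, b): the LCS score matrix and the backtracking pointers."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1],
                          s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else 0)
            if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
                b[i][j] = "diag"        # corresponds to the ↖ arrow
            elif s[i][j] == s[i - 1][j]:
                b[i][j] = "up"          # corresponds to the ↑ arrow
            else:
                b[i][j] = "left"        # corresponds to the ← arrow
    return s, b

# s, b = lcs_table("ATGTTAT", "ATCGTAC"); s[7][7] == 5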
Finding an Optimal Alignment
• LCS(v,w) created the alignment grid.
• Follow the arrows backwards from the sink to the source to obtain the path corresponding to an optimal alignment. Column by column, the backtracking recovers:
  0 1 2 2 3 4 5 6 6 7
  A T _ G T T A _ T
  A T C G T _ A C _
  0 1 2 3 4 5 5 6 7 7
Printing LCS: Backtracking
PrintLCS(b, v, i, j)
  if i = 0 or j = 0
    return
  if bi,j = "↖"
    PrintLCS(b, v, i-1, j-1)
    print vi
  else
    if bi,j = "↑"
      PrintLCS(b, v, i-1, j)
    else
      PrintLCS(b, v, i, j-1)
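The same backtracking in Python, assuming the b matrix built by the lcs_table sketch shown after the LCS pseudocode (with "diag"/"up"/"left" standing in for the arrows):

def print_lcs(b, v, i, j):
    """Print one longest common subsequence by following the pointers in b."""
    if i == 0 or j == 0:
        return
    if b[i][j] == "diag":
        print_lcs(b, v, i - 1, j - 1)
        print(v[i - 1], end="")
    elif b[i][j] == "up":
        print_lcs(b, v, i - 1, j)
    else:
        print_lcs(b, v, i, j - 1)

# s, b = lcs_table("ATGTTAT", "ATCGTAC")
# print_lcs(b, "ATGTTAT", 7, 7)   # prints ATGTA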
LCS: Runtime
• It takes O(nm) time to fill in the n × m dynamic programming matrix: the pseudocode consists of one "for" loop nested inside another "for" loop.
• This is a very positive result for a problem that appeared much
more difficult initially.