Sequence comparison: Dynamic programming

advertisement
Sequence comparison:
Dynamic programming
Genome 559: Introduction to Statistical
and Computational Genomics
Prof. William Stafford Noble
One-minute responses
•
•
•
•
•
•
•
•
•
The pace was good until the end. One or two more minutes for each sample problem
would have been good.
The pace worked well for me. I scrambled a bit to keep up on the computer but was
able to do so.
I feel like a fish out of water, but the pace of the class was good. With this kind of
stuff it’s really easy to feel stupid if you haven’t done it before, so I like the relaxed
intro feeling.
I thought the pace and amount of material was fine.
Thought it covered the basics, didn’t move too fast.
I am very excited about this class. I enjoyed the slow and steady pacing because I
am a beginner in programming. The sample problems are excellent. I wish there
was time for more.
It seemed that the class was a bit too fast. It was hard to keep up. I was not sure
what I was doing! Perhaps prior reading would help.
Thanks for moving so slowly and being so patient. Please keep it up.
I have no background in programming, and though I recognize that today’s class was
very basic, I struggled some. I am not sure how I would have had an easier time –
perhaps if more time had been taken explaining how each component (import, math,
commas, etc.) had impact on function.
One-minute responses
•
It is still unclear to me why the “float” prefix is necessary.
–
•
I tried importing math and then using math.sum(arg1, arg2) because it seemed logical after the log
example.
–
•
•
The broader meaning is anything that is given as input to a function; e.g., 10 is the argument to
math.log(10). In general, Python functions take arguments in parentheses. (The print command is an
exception to this rule.)
I like that the instructors walk around to check on things.
It might help to explain where we can look up packages to do various functions.
–
•
•
Yes.
Is that generally what is meant by an argument, or does it have a broader meaning?
–
•
•
This was an error on my part – I should have introduced the “+” operator before that sample problem.
Is what we’re referring to as an “argument” anything that follows the program name on the
command line?
–
•
Because otherwise Python will define the argument as a string, rather than a number.
Your book describes the most common packages. Otherwise, you can search at python.org.
I think some simple terms could be defined like “terminal,” “system,” “argument.”
The programming exercises were useful, but if I didn’t already have a basic knowledge of Perl I
would be confused by the terminology (operators, variables, objects, working directory, etc.).
This was my first time programming, so I was confused by terminology. After seeing the
examples, I started to understand it better, though.
One-minute responses
•
One thing I may like is if we could have an appendix for the commands we
learned that day. I imagine maybe the book has it?
– All of the commands you learn are in the book. I encourage each of you to
maintain a summary of what we’ve learned thus far.
•
Could you give supplementary problems to work on?
– Yes, I will try to do this.
•
•
It was helpful having a printed copy of the lab task.
Can you print quotation marks? Are there other symbols that are off-bounds
to print?
– You can print anything. Unusual characters need to be preceded by a backslash:
print "\""
•
•
•
It was a little unclear at first what was the relevance of the sys.argv
command.
I probably missed it but the only thing I could think of was more advanced
warning on the text.
It was hard to follow directions without having read for myself to put into
proper context.
Sequence comparison overview
• Problem: Find the “best” alignment between a
query sequence and a target sequence.
• To solve this problem, we need
– ___________________, and
– ___________________.
• The alignment score is calculated using
– ___________________, and
– ___________________.
• The algorithm for finding the best alignment is
__________________.
Sequence comparison overview
• Problem: Find the “best” alignment between a
query sequence and a target sequence.
• To solve this problem, we need
– a method for scoring alignments, and
– an algorithm for finding the alignment with the best
score.
• The alignment score is calculated using
– a substitution matrix, and
– gap penalties.
• The algorithm for finding the best alignment is
dynamic programming.
Review
• What does the 62 in BLOSUM62 mean?
• Why does leucine and isoleucine get a
BLOSUM62 score of 2, whereas leucine
and aspartic acid get a score of -4?
• What is the difference between a gap
open and a gap extension penalty?
Review
• What does the 62 in BLOSUM62 mean?
– The sequences in the alignment used to generate the matrix
share 62% pairwise sequence identity.
• Why does leucine and isoleucine get a BLOSUM62
score of 2, whereas leucine and aspartic acid get a score
of -4?
– Leucine and isoleucine are biochemically very similar;
consequently, a substitution of one for the other is much more
likely to occur.
• What is the difference between a gap open and a gap
extension penalty?
– When we assign a score to an observed gap in an alignment, we
charge a larger penalty for the first gapped position than for
subsequent positions in the same gap.
A simple alignment problem.
• Problem: find the best pairwise alignment
of GAATC and CATAC.
• Use a linear gap penalty of -4.
• Use the following substitution matrix:
A
C
G
T
A
10
-5
0
-5
C
-5
10
-5
0
G
0
-5
10
-5
T
-5
0
-5
10
How many possibilities?
GAATC
CATAC
GAAT-C
C-ATAC
-GAAT-C
C-A-TAC
GAATCCA-TAC
GAAT-C
CA-TAC
GA-ATC
CATA-C
• How many different alignments of two
sequences of length N exist?
How many possibilities?
GAATC
CATAC
GAAT-C
C-ATAC
-GAAT-C
C-A-TAC
GAATCCA-TAC
GAAT-C
CA-TAC
GA-ATC
CATA-C
• How many different alignments of two
sequences of length n exist? Too many to
 2n  2n ! 2
  

2
n
 n  n!
2n
enumerate!
-GCAT
DP matrix
G
C
A
T
A
C
A
A
T
C
The value in position (i,j) is the score
of the best alignment of the first i
positions of the first sequence
versus the first j positions of the
second sequence.
-8
-G-A
CAT-
DP matrix
G
A
A
A
C
T
C
Moving horizontally in
the matrix introduces a
gap in the sequence
along the left edge.
C
T
A
-8
-12
-G-CATA
DP matrix
G
A
T
Moving vertically in the
matrix introduces a
gap in the sequence
along the top edge.
C
A
T
-8
A
-12
C
A
C
Initialization
G
0
C
A
T
A
C
A
A
T
C
G
-
Introducing a gap
G
0
C
A
T
A
C
-4
A
A
T
C
C
DP matrix
G
0
C
A
T
A
C
-4
-4
A
A
T
C
DP matrix
G
C
A
T
A
C
0
-4
-4
-8
A
A
T
C
G
C
DP matrix
G
C
A
T
A
C
0
-4
-4
-5
A
A
T
C
----CATAC
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
A
-8
T
-12
A
-16
C
-20
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
A
-8
?
T
-12
A
-16
C
-20
-G
CA
-4
GCA
-9
--G
CA-
DP matrix
-12
0
C
-4
A
-8
T
-12
A
-16
C
-20
0
-4
G
A
A
T
C
-4
-8
-12
-16
-20
-5
-4
-4
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
A
-8
-4
T
-12
?
A
-16
?
C
-20
?
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
A
-8
-4
T
-12
-8
A
-16
-12
C
-20
-16
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
?
A
-8
-4
?
T
-12
-8
?
A
-16
-12
?
C
-20
-16
?
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
A
-8
-4
5
T
-12
-8
1
A
-16
-12
2
C
-20
-16
-2
What is the alignment
associated with this
entry?
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
A
-8
-4
5
T
-12
-8
1
A
-16
-12
2
C
-20
-16
-2
-G-A
CATA
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
A
-8
-4
5
T
-12
-8
1
A
-16
-12
2
C
-20
-16
-2
Find the optimal
alignment, and
its score.
?
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
GA-ATC
CATA-C
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
GAAT-C
CA-TAC
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
GAAT-C
C-ATAC
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
GAAT-C
-CATAC
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Multiple solutions
GA-ATC
CATA-C
GAAT-C
CA-TAC
GAAT-C
C-ATAC
GAAT-C
-CATAC
• When a program returns a
sequence alignment, it may
not be the only best
alignment.
DP in equation form
• Align sequence x and y.
• F is the DP matrix; s is the substitution
matrix; d is the linear gap penalty.
F 0,0  0
 F i  1, j  1  s xi , y j 

F i, j   max  F i  1, j   d
 F i, j  1  d

DP in equation form
F i, j 1
F i  1, j  1
s xi , y j 
F i 1, j 
d
d
F i, j 
Dynamic programming
• Yes, it’s a weird name.
• DP is closely related to recursion and to
mathematical induction.
• We can prove that the resulting score is
optimal.
Summary
• Scoring a pairwise alignment requires a
substition matrix and gap penalties.
• Dynamic programming is an efficient
algorithm for finding the optimal alignment.
• Entry (i,j) in the DP matrix stores the score
of the best-scoring alignment up to those
positions.
• DP iteratively fills in the matrix using a
simple mathematical rule.
Reading
• Nicholas review from Biotechniques, 2000.
Problem: find the best pairwise alignment of GAATC and CATAC.
A
C
G
T
A
10
-5
0
-5
C
-5
10
-5
0
G
0
-5
10
-5
T
-5
0
-5
10
d = -4
G
0
C
A
T
A
C
A
A
T
C
Download