Traceback and local alignment - Noble Research Lab

advertisement
Traceback and local alignment
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
thabangh@gmail.com
Outline
• Responses from last class
• Sequence alignment
– Motivation
– Scoring alignments
• Python
One-minute responses
•
•
•
•
•
•
•
•
Thank you for doing the one-minute
responses and the revisions.
Liked that you let us solve the DP.
Liked going to lab in second half of
lecture.
More slowly with Python
programming.
Please explain more about string
handling in Python, especially
sys.argv.
sys.argv is now more clear.
Need more practical problems using
sys.
I know how to compute the DP score
but not how to get the alignment.
•
•
•
•
•
•
•
I still do not get how sequence
alignment works.
Can we go over DP again?
Please don’t give us a test because
our essay phase has begun already.
Please limit the number of questions
in class.
Moving too slow with previous work
and too fast with current work.
I understood about 95% of the
lecture.
Can you give us some Python
documentation or some interesting
web site to improve and learn?
Revision
• What two things are needed to score a
pairwise alignment?
– A substitution matrix and a gap penalty.
• What does entry (i,j) in the DP matrix store?
– The score of the best-scoring alignment up to
those positions.
• What are the three valid moves when filling in
the DP matrix?
– Horizontal and vertical, corresponding to gaps.
Diagonal, corresponding to a substitution.
A small example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
A
A
F i  1, j  1
sxi , y j 
F i 1, j 
d
F i, j 1
G
d
C
F i, j 
A
G
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
sxi , y j 
F i 1, j 
d
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
0
A
-5
F i, j 1
G
-10
d
C
-15
F i, j 
A
A
G
-5
-10
-15
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
sxi , y j 
F i 1, j 
d
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
A
A
G
0
-5
-10
-15
A
-5
2
-3
-8
F i, j 1
G
-10
-3
-3
-1
d
C
-15
-8
-8
-6
F i, j 
Traceback
• Start from the lower right corner and trace back to
the upper left.
• Each arrow introduces one character at the end of
each aligned sequence.
• A horizontal move puts a gap in the left sequence.
• A vertical move puts a gap in the top sequence.
• A diagonal move uses one character from each
sequence.
A simple example
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
•
•
•
•
•
Start from the lower right
corner and trace back to the
upper left.
Each arrow introduces one
character at the end of each
aligned sequence.
A horizontal move puts a gap
in the left sequence.
A vertical move puts a gap in
the top sequence.
A diagonal move uses one
character from each
sequence.
A
0
A
A
G
-5
2
-3
G
-1
C
-6
A simple example
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
•
•
•
•
•
Start from the lower right
corner and trace back to the
upper left.
Each arrow introduces one
character at the end of each
aligned sequence.
A horizontal move puts a gap
in the left sequence.
A vertical move puts a gap in
the top sequence.
A diagonal move uses one
character from each
sequence.
A
0
A
A
G
-5
2
-3
G
-1
C
-6
AAG-AGC
AAGA-GC
GA-ATC
CATA-C
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
GAAT-C
CA-TAC
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
GAAT-C
C-ATAC
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
GAAT-C
-CATAC
DP matrix
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Multiple solutions
GA-ATC
CATA-C
GAAT-C
CA-TAC
GAAT-C
C-ATAC
GAAT-C
-CATAC
• When a program returns a
sequence alignment, it may not
be the only best alignment.
Traceback problem #1
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Write down the alignment corresponding to the circled score.
GA
CA
Solution #1
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Write down the alignment corresponding to the circled score.
Traceback problem #2
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Write down three alignments corresponding to the circled score.
GAATC
CA---
Solution #2
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Write down three alignments corresponding to the circled score.
GAATC
C-A--
Solution #2
GAATC
CA---
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Write down three alignments corresponding to the circled score.
Solution #2
GAATC
-CA--
GAATC
C-A--
GAATC
CA---
G
A
A
T
C
0
-4
-8
-12
-16
-20
C
-4
-5
-9
-13
-12
-6
A
-8
-4
5
1
-3
-7
T
-12
-8
1
0
11
7
A
-16
-12
2
11
7
6
C
-20
-16
-2
7
11
17
Write down three alignments corresponding to the circled score.
Local alignment
• A protein may be homologous to a region within a
second protein.
• Usually, an alignment that spans the complete length
of both sequences is not required.
BLAST allows local alignments
Global
alignment
Local
alignment
Global alignment DP
• Align sequence x and y.
• F is the DP matrix; s is the substitution matrix;
d is the linear gap penalty.
F 0,0   0
 F i  1, j  1  s xi , y j 

F i, j   max F i  1, j   d
 F i, j  1  d

Local alignment DP
• Align sequence x and y.
• F is the DP matrix; s is the substitution matrix;
d is the linear gap penalty.
F 0,0   0

 F i  1, j  1  s xi , y j 

F i, j   max F i  1, j   d
 F i, j  1  d

0
Local DP in equation form
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
F i, j 1
d
F i, j 
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
A
A
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
F i, j 1
G
d
C
F i, j 
A
G
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
0
A
0
F i, j 1
G
0
d
C
0
F i, j 
A
A
G
0
0
0
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
0
0
A
0
F i, j 1
G
0
d
C
0
F i, j 
A
A
G
0
0
0
2
-5
-5
0
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
A
A
G
0
0
0
0
A
0
2
F i, j 1
G
0
?
d
C
0
?
F i, j 
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
A
A
G
0
0
0
0
A
0
2
?
?
F i, j 1
G
0
0
?
?
d
C
0
0
?
?
F i, j 
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
A
A
G
0
0
0
0
A
0
2
2
0
F i, j 1
G
0
0
0
4
d
C
0
0
0
0
F i, j 
Local alignment
• Two differences with respect to global
alignment:
– No score is negative.
– Traceback begins at the highest score in the matrix
and continues until you reach 0.
• Global alignment algorithm: NeedlemanWunsch.
• Local alignment algorithm: Smith-Waterman.
A simple example
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
A
A
G
0
0
0
0
A
0
2
2
0
F i, j 1
G
0
0
0
4
d
C
0
0
0
0
F i, j 
AG
AG
Local alignment
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and GAAGGC.
Use a gap penalty of d=-5.
F i, j 1
d
F i, j 
G
A
A
G
G
C
0
0
0
0
0
0
0
A
0
A
0
G
0
Local alignment
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and GAAGGC.
Use a gap penalty of d=-5.
F i, j 1
d
F i, j 
G
A
A
G
G
C
0
0
0
0
0
0
0
A
0
0
2
2
0
0
0
A
0
0
2
4
0
0
0
G
0
2
0
0
6
2
0
Local alignment
A
C
G
T
A
2
-7
-5
-7
C
-7
2
-7
-5
G
-5
-7
2
-7
T
-7
-5
-7
2
F i  1, j  1
0
sxi , y j 
F i 1, j 
d
Find the optimal local alignment of AAG and GAAGGC.
Use a gap penalty of d=-5.
AAG
AAG
F i, j 1
d
F i, j 
G
A
A
G
G
C
0
0
0
0
0
0
0
A
0
0
2
2
0
0
0
A
0
0
2
4
0
0
0
G
0
2
0
0
6
2
0
Summary
• Local alignment finds the best match between
subsequences.
• Smith-Waterman local alignment algorithm:
– No score is negative.
– Trace back from the largest score in the matrix.
Sample problem #1
• Given:
– Two letters
– A substitution matrix written as three columns (letter, letter,
value)
A C 0
A D -2
• Return:
– The value associated with the two letters
• You must store the substitution matrix in memory.
• You must account for the fact that the order of the letters
doesn’t matter.
Solution outline
• Read the substitution matrix into a dictionary.
– Each line of the file has three values.
A C 4
A D -2
– Each line is stored in a dictionary with the first two
entries as a tuple key and the last entry as the
value.
substitionMatrix[(letter1, letter2)] = value
Sample problem #2
• Given:
– A file containing two sequences of equal length (one
per line)
– A substitution matrix written as three columns (letter,
letter, value)
• Return:
– The score of the ungapped alignment between the
sequences
• Test using input files from class web page.
– Solutions: 1 = 69, 2 = 104, 3 = 153
Solution outline
• Store the substitution matrix in memory, as
before.
• Check to be sure the sequences are the same
length, and report an error if they are not.
• Use a for loop to traverse both sequences at
once.
One-minute response
At the end of each class
• Write for about one minute.
• Provide feedback about the class.
• Was part of the lecture unclear?
• What did you like about the class?
• Do you have unanswered questions?
I will begin the next class by responding to the
one-minute responses
Download