BCB 444/544 Lecture 6 #6_Aug31 Dynamic Programming

BCB 444/544
Lecture 6
Try to Finish Dynamic Programming
Global & Local Alignment
Next lecture:
Scoring Matrices
Alignment Statistics
#6_Aug31
BCB 444/544 F07 ISU Dobbs #6 - More DP: Global vs Local Alignment
8/31/07
1
Required Reading
(before lecture)
Mon Aug 27 - for Lecture #4
Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5
Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab #2:
Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6
Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49
Announcements
Fri Aug 31 - Revised notes for Lecture 5 posted online
Changes? mainly re-ordering, symbols, color "coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!
Tues Sept 4 - Lab #2 Exercise Writeup Due by 5 PM (or sooner!)
Send via email to Pete Zaback petez@iastate.edu
(HW#2 assignment will be posted online)
Fri Sept 14 - HW#2 Due by 5 PM (or sooner!)
Fri Sept 21 - Exam #1
Chp 3- Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Methods
• √ Global and Local Alignment
• √ Alignment Algorithms
• √ Dot Matrix Method
• Dynamic Programming Method - cont
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common
evolutionary ancestry
• Similar sequences - sequences that have a high percentage of
aligned residues with similar physicochemical properties
(e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
• An inference about a common ancestral relationship, drawn when
two sequences share a high enough degree of sequence similarity
• Homology is qualitative
• Sequence similarity:
• The direct result of observation from a sequence alignment
• Similarity is quantitative; can be described using percentages
Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there
is maximum correspondence between residues
• DNA
4 letter alphabet (+ gap)
TTGACAC
TTTACAC
• Proteins
20 letter alphabet (+ gap)
RKVA-GMA
RKIAVAMA
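The correspondence between residues in the two examples above can be quantified as percent identity. This is a minimal sketch, not part of the lecture: `percent_identity` is a hypothetical helper, and dividing by the full alignment length (gap columns included) is one assumed convention among several.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of alignment columns in which both rows hold the same residue.

    Assumes a and b are the two rows of one alignment (equal length, '-' = gap).
    Dividing by the full alignment length is an assumed convention.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for p, q in zip(a, b) if p == q and p != "-")
    return 100.0 * identical / len(a)


print(percent_identity("TTGACAC", "TTTACAC"))    # DNA example
print(percent_identity("RKVA-GMA", "RKIAVAMA"))  # protein example
```

For the protein pair, 5 of the 8 columns are identical.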
Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or
mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score
Avoiding Random Alignments with a
Scoring Function
• Introducing too many gaps generates nonsense alignments:
s--e-----qu---en--ce
sometimesquipsentice
• Need to distinguish between alignments that occur due to
homology and those that occur by chance
• Define a scoring function that rewards matches (+) and
penalizes mismatches (-) and gaps (-)
Scoring Function (S):
Note: I changed symbols & colors on this slide!
Match: reward (+), e.g. 1
Mismatch: penalty (-), e.g. 1
Gap: penalty (-), e.g. 0
S = (match reward × #matches) - (mismatch penalty × #mismatches) - (gap penalty × #gaps)
Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than
others (physicochemical properties are similar)
e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce
"mismatch costs" for handling different types of
substitutions
• Mismatch costs are not usually used in aligning
DNA or RNA sequences, because no substitution is
"better" than any other (in general)
Substitution Matrix
s(a,b) corresponds to score of
aligning character a with character b
Match scores are often calculated
based on frequency of mutations in
very similar sequences
(more details later)
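One way to picture s(a,b) is as a lookup table keyed by character pairs. The sketch below is an illustration only: `make_substitution_matrix` is a hypothetical helper that builds the uniform match/mismatch scheme (+10 / -2, the values used in the DP example later in this lecture). Real amino acid matrices such as PAM or BLOSUM assign a different value to each pair and are not reproduced here.

```python
from itertools import product


def make_substitution_matrix(alphabet: str, match: int, mismatch: int) -> dict:
    """Return s as a dict so that s[(a, b)] is the score for aligning a with b."""
    return {(a, b): (match if a == b else mismatch)
            for a, b in product(alphabet, repeat=2)}


# Uniform DNA scheme: every mismatch costs the same.
s = make_substitution_matrix("ACGT", match=10, mismatch=-2)
print(s[("A", "A")], s[("A", "G")])
```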
Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Global alignment
CTGTCGCTGCACG
-TG-C-C-G--TG
Local alignment
CTGTCGCTGCACG
-TGCCG-TG----
CTGTCGCTGCACG
-TGCCG-T----G
Which is better?
Global vs Local Alignment
Which should be used when?
It is critical to choose correct method!
Global Alignment
vs
Local Alignment?
Shout out the answers!! Which should we use for?
1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths?
Hmmm - we'll work on that
Excellent!
Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)
Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of
matrix
• Plot a dot each time there is a match between
an element of row sequence and an element of
column sequence
• For proteins, usually use more sophisticated
scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment;
"breaks" = gaps (indels)
    A C G C G
A   * . . . .
C   . * . * .
A   * . . . .
C   . * . * .
G   . . * . *
(* = match, . = no match)
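The plotting rule above (one dot per identical row/column pair) can be sketched in a few lines. This is an illustration under simple assumptions: `dot_plot` is a hypothetical helper, it scores identical matches only, and it applies no window or threshold filtering.

```python
def dot_plot(top: str, left: str) -> list:
    """One text row per character of `left`; '*' marks an identical match."""
    return ["".join("*" if r == c else "." for c in top) for r in left]


# Sequences from the slide: ACGCG along the top, ACACG down the left.
print("  " + " ".join("ACGCG"))
for label, row in zip("ACACG", dot_plot("ACGCG", "ACACG")):
    print(label, " ".join(row))
```

Diagonal runs of `*` correspond to the matching stretches described in the bullets.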
Interpretation of Dot Plots
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity
between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate
inversions
• What do such patterns mean when comparing
a sequence with itself (or its reverse
complement)?
• e.g.: Reverse diagonals crossing diagonals (X's) indicate
palindromes
Exploring Dot Plots
Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity
CAT-TCA-C
| |  || |
C-TCGCAGC
Global Alignment: Scoring
CTGTCGCTGCACG
-TGC-CG-TG---
Reward for matches: (+)
Mismatch penalty: (-)
Space/gap penalty: (-)
Score = (match reward × w) - (mismatch penalty × x) - (gap penalty × y)
w = #matches
x = #mismatches
y = #spaces
Note: I changed symbols
& colors on this slide!
Global Alignment: Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5
  C   T   G   T   C   G   -   C   T   G   C
  -   T   G   C   -   C   G   -   T   G   -
 -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Note: I changed symbols & colors on this slide!
Total = 11
We could have done better!!
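The column-by-column total above can be checked with a short script. This is a sketch, not the lecture's code: `score_alignment` is a hypothetical helper, gaps are written as `-`, and the defaults are the slide's values (+10 match, -2 mismatch, -5 space).

```python
def score_alignment(top: str, bottom: str,
                    match: int = 10, mismatch: int = -2, gap: int = -5) -> int:
    """Sum the per-column scores of a given (already gapped) alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # a space in either row
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))
```

Scoring a given alignment is the easy part; finding the highest-scoring one is what dynamic programming is for.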
Alignment Algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
• Gap penalty functions
• Scoring matrices
Dynamic Programming - Key Idea:
The score of the best possible alignment that ends at a
given pair of positions (i, j) is equal to:
the score of best alignment ending just previous to
those two positions (i.e., ending at i-1, j-1)
PLUS
the score for aligning xi and yj
Global Alignment:
DP Problem Formulation & Notations
Given two sequences (strings)
• X = x1x2…xN of length N (e.g., x = AGC, N = 3)
• Y = y1y2…yM of length M (e.g., y = AAAC, M = 4)
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i]=x1x2…xi with y[1..j]=y1y2…yj
Which means: Score of best alignment of a prefix of X and a prefix of Y
e.g., S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)
Dynamic Programming - 4 Steps:
1. Define score of optimum alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal
scores of subproblems, by solving smallest
subproblems first (bottom-up approach)
3. Calculate score of optimum alignment(s)
4. Trace back through matrix to recover optimum
alignment(s) that generated optimal score
1- Define Score of Optimum Alignment
using Recursion
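Define:
$x_{1..i}$ = prefix of length $i$ of $x$
$y_{1..j}$ = prefix of length $j$ of $y$
$S(i,j)$ = score of optimum alignment of $x_{1..i}$ and $y_{1..j}$
$\sigma(x_i, y_j)$ = score for aligning $x_i$ with $y_j$; $\gamma$ = penalty per space (gap)
Initial conditions:
$$S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma$$
Recursive definition:
For $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$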
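The initial conditions and recursion for global alignment can be sketched as a matrix fill. This is a minimal illustration, not the lecture's own code: `global_score_matrix` is a hypothetical name, a uniform match/mismatch score stands in for a substitution matrix, and `gap` is the signed value added for each space (-5 in the worked example of this lecture).

```python
def global_score_matrix(x: str, y: str,
                        match: int = 10, mismatch: int = -2, gap: int = -5) -> list:
    """Fill S, where S[i][j] is the best score for aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: x prefix against spaces
        S[i][0] = i * gap
    for j in range(1, m + 1):          # first row: y prefix against spaces
        S[0][j] = j * gap
    for i in range(1, n + 1):          # row by row, smallest subproblems first
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # x_i aligned with y_j
                          S[i - 1][j] + gap,       # x_i aligned with a space
                          S[i][j - 1] + gap)       # y_j aligned with a space
    return S


S = global_score_matrix("CATTCAC", "CTCGCAGC")
print(S[-1][-1])   # score of the optimum global alignment, S(N,M)
```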
2- Initialize & Fill in DP Matrix for Storing
Optimal Scores of Subproblems
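• Construct sequence vs sequence matrix:
[Figure: DP matrix indexed 0..N by 0..M, with S(0,0)=0 in the upper left corner and S(N,M) in the lower right; cell S(i,j) is computed from its three neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1)]
Recursion:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$
Initialization:
$$S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma$$
where $\sigma(x_i, y_j)$ is the score for aligning $x_i$ with $y_j$ and $\gamma$ is the penalty per space (gap)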
2- Fill in DP Matrix (cont)
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix).
[Figure: DP matrix indexed 0..N by 0..M, from S(0,0)=0 to S(N,M); cell S(i,j) depends on S(i-1,j-1), S(i-1,j) and S(i,j-1)]
3- Calculate Score S(N,M) of Optimum
Alignment - for Global Alignment
What happens in last step in alignment of x[1..i] to y[1..j]?
1 of 3 cases applies:
1. xi aligns to yj: S(i-1,j-1) + σ(xi,yj)
   x1 x2 . . . xi-1 xi
   y1 y2 . . . yj-1 yj
2. xi aligns to a gap: S(i-1,j) - γ
   x1 x2 . . . xi-1 xi
   y1 y2 . . . yj   -
3. yj aligns to a gap: S(i,j-1) - γ
   x1 x2 . . . xi   -
   y1 y2 . . . yj-1 yj
(σ(xi,yj) = score for aligning xi with yj; γ = penalty per space)
Example
Case 1: Line up xi with yj
x: C A T T C A C   (xi = C, preceded by xi-1 = A)
y: C - T T C A G   (yj = G, preceded by yj-1 = A)
Case 2: Line up xi with space
[Figure: same sequences; the last column aligns xi = C with a space (-)]
Case 3: Line up yj with space
[Figure: same sequences; the last column aligns a space (-) with yj = G]
Fill in the matrix
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5
    A  -10
    T  -15
    T  -20
    C  -25
    A  -30
    C  -35
+10 for match, -2 for mismatch, -5 for space
Calculate score of optimum alignment
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space
4- Trace back through matrix to recover
optimum alignment(s) that generated
the optimal score
How? "Repeat" alignment calculations in reverse order,
starting from the position with the highest score and
following the path, position by position, back through
the matrix
Result? Optimal alignment(s) of sequences
Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of sequence
alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
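The three traceback moves can be sketched as a recursive walk from the lower right corner that follows every move consistent with the stored scores. This is an illustration under stated assumptions: `all_optimal_global_alignments` is a hypothetical name, x is the left (row) sequence and y the top (column) sequence, and scores are recomputed during traceback instead of being read from a pointer matrix. The number of optimal alignments can grow quickly for long sequences, so this is only suitable for small examples.

```python
def all_optimal_global_alignments(x, y, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, list of (x_row, y_row)) for global alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)

    results = []

    def walk(i, j, x_row, y_row):
        if i == 0 and j == 0:                       # reached upper left corner
            results.append((x_row, y_row))
            return
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:    # diagonal: one char from each
                walk(i - 1, j - 1, x[i - 1] + x_row, y[j - 1] + y_row)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:  # vertical: gap in top sequence
            walk(i - 1, j, x[i - 1] + x_row, "-" + y_row)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:  # horizontal: gap in left sequence
            walk(i, j - 1, "-" + x_row, y[j - 1] + y_row)

    walk(n, m, "", "")
    return S[n][m], results


score, alignments = all_optimal_global_alignments("CATTCAC", "CTCGCAGC")
print(score)
for x_row, y_row in alignments:
    print(x_row)
    print(y_row)
    print()
```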
Traceback to Recover Alignment
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10*  13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33
Can have >1 optimum alignment; this example has 2
What are the 2 Alignments with
Optimum Score = 33?
1:
C A T - T C A - C
C - T C G C A G C
2:
C A T T - C A - C
C - T C G C A G C
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
• Non-coding regions (if "non-functional") are more likely to
contain mutations than coding regions
• Local alignment between two protein-encoding sequences is
likely to be between two exons
• To locate protein domains or motifs:
• Proteins with similar structures and/or similar functions but
from different species (for example), often exhibit local
sequence similarities
• Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic
genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by
RNA processing & are not translated into protein
Local Alignment:
Example
g g t c t g a g
a a a c g a
Match: +2
Mismatch or space: -1
Best local alignment:
g g t c t g a g
a a a c – g a -
Score = 5
Local Alignment: Algorithm
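• S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"
• Scores never drop below 0: the recursion takes the maximum of 0 and the three cases used for global alignment
• Optimum score is the largest S[i, j] anywhere in the matrix; traceback starts there and stops at the first 0
Recall: for Global Alignment,
• S[i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalties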
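The differences from global alignment (zero initialization, a floor of 0, best score anywhere in the matrix) can be sketched as follows. This is an illustration only: `local_alignment` is a hypothetical name, a uniform match/mismatch score is assumed, and when several cells or paths tie the code reports just one of them.

```python
def local_alignment(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best local score, x_row, y_row) for one best local alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,                      # start a fresh alignment here
                          S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    x_row, y_row = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            x_row.append(x[i - 1]); y_row.append(y[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            x_row.append(x[i - 1]); y_row.append("-"); i -= 1
        else:
            x_row.append("-"); y_row.append(y[j - 1]); j -= 1
    return best, "".join(reversed(x_row)), "".join(reversed(y_row))


# Example from this lecture: match +2, mismatch or space -1.
print(local_alignment("ggtctgag", "aaacga"))
```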
Traceback - for Local Alignment
         λ    C    T    C    G    C    A    G    C
    λ    0    0    0    0    0    0    0    0    0
    C    0    1    0    1    0    1    0    0    1
    A    0    0    0    0    0    0    2    0    0
    T    0    0    1    0    0    0    0    1    0
    T    0    0    1    0    0    0    0    0    0
    C    0    1    0    2    0    1    0    0    1
    A    0    0    0    0    1    0    2    0    0
    C    0    1    0    1    0    2    0    1    1
+1 for a match, -1 for a mismatch, -5 for a space
Some Results re: Alignment Algorithms
(for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be
solved in O(mn) time
• Space requirement can be reduced to O(m+n), while
keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn)
time, where d measures the distance between the
sequences [Landau86]
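The space saving is easiest to see for the score alone: each row of the DP matrix depends only on the row above it, so two rows suffice. The sketch below shows only that simpler idea (`global_score_two_rows` is a hypothetical name); recovering the alignment itself in linear space needs the divide-and-conquer approach cited above as [Myers88], which is not shown here.

```python
def global_score_two_rows(x, y, match=10, mismatch=-2, gap=-5):
    """Optimal global alignment score in O(len(x) * len(y)) time, keeping 2 rows."""
    if len(y) > len(x):
        x, y = y, x                      # keep the shorter sequence along the row
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]                 # first column of the current row
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # diagonal
                            prev[j] + gap,       # from the row above
                            curr[j - 1] + gap))  # from the left
        prev = curr                      # the older row is no longer needed
    return prev[-1]


print(global_score_two_rows("CATTCAC", "CTCGCAGC"))
```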
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in closely related proteins
• Model includes defined rate for each type of
sequence change
• Suffix number (n) reflects amount of "time"
passed: rate of expected mutation if n% of amino
acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)
BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity:
average % aa identity in the MSA from which the
matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences
Statistical Significance
of Sequence Alignment
Affine Gap Penalty Functions
Gap penalty = h + gk
where
k = length of gap
h = gap opening penalty
g = gap extension penalty
Can also be solved in
O(nm) time using
dynamic programming
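One common way to keep the O(nm) bound with the penalty h + gk is to track three matrices, one per "last column" state, so that extending an open gap costs only g. The sketch below assumes that standard three-matrix formulation; `affine_global_score` and the default values are hypothetical choices for illustration, with h and g given as positive penalties.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, h=2, g=1):
    """Global alignment score where a gap of length k costs h + g*k."""
    n, m = len(x), len(y)
    # M: last column pairs x_i with y_j; Ix: x_i against a space; Iy: y_j against a space
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -(h + g * i)          # one gap of length i
    for j in range(1, m + 1):
        Iy[0][j] = -(h + g * j)          # one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            Ix[i][j] = max(M[i - 1][j] - (h + g),    # open a new gap
                           Ix[i - 1][j] - g,         # extend the current gap
                           Iy[i - 1][j] - (h + g))
            Iy[i][j] = max(M[i][j - 1] - (h + g),
                           Iy[i][j - 1] - g,
                           Ix[i][j - 1] - (h + g))
    return max(M[n][m], Ix[n][m], Iy[n][m])


print(affine_global_score("ACGT", "AT"))   # one gap of length 2 between two matches
```

With h = 0 the penalty reduces to a constant cost per space, so the result should agree with the linear-gap scoring used earlier in this lecture.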