#7 Still more DP, Scoring Matrices 9/5/07 BCB 444/544 Lecture 7

Still more: Dynamic Programming
Global vs Local Alignment
Scoring Matrices & Alignment Statistics
#7_Sept5
Required Reading (before lecture)
√ Last week: - for Lectures 4-7
Pairwise Sequence Alignment, Dynamic Programming,
Global vs Local Alignment, Scoring Matrices, Statistics
• Xiong: Chp 3
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Wed Sept 5 - for Lecture 7 & Lab 3
Database Similarity Searching: BLAST
• Chp 4 - pp 51-62
Fri Sept - for Lecture 8
BLAST variations; BLAST vs FASTA
• Chp 4 - pp 51-62
BCB 444/544 F07 ISU Dobbs #7 - Still more DP, Scoring Matrices
Assignments & Announcements
√ Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM
Send via email to Pete Zaback petez@iastate.edu
(For now, no late penalty - just send ASAP)
√ Wed Sept 5 - Notes for Lecture 5 posted online
- HW#2 posted online & sent via email
& handed out in class
Fri Sept 14 - HW#2 Due by 5 PM
Fri Sept 21 - Exam #1
Chp 3- Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Methods - cont
• √ Global and Local Alignment
• √ Alignment Algorithms
  • √ Dot Matrix Method
  • Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
    • PAM
    • BLOSUM
    • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Global alignment
CTGTCGCTGCACG
Local alignment
CTGTCGCTGCACG
-TGCCG-T----G
Which is better?
Excellent!
CTGTCGCTGCACG
-TGCCG-TG----
-TG-C-C-G--TG
Global vs Local Alignment
Which should be used when?
It is critical to choose correct method!
Global Alignment vs Local Alignment?
Shout out the answers!! Which should we use for?
1. Searching for conserved motifs in DNA or protein sequences? Local
2. Aligning two closely related sequences with similar lengths? Global
3. Aligning highly divergent sequences? Local (at least initially)
4. Generating an extended alignment of closely related sequences? Global
5. Generating an extended alignment of closely related sequences
with very different lengths? Hmmm - we'll work on that
Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis √ - practice in HW2
2. Dynamic programming - more today & in HW2
3. Word or k-tuple methods (later, in Chp 4)
Dynamic Programming
Global Alignment: Scoring
For Pairwise sequence alignment
CTGTCG-CTGCACG
-TGC-CG-TG----
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity
Reward for matches: α
Mismatch penalty: β
Space/gap penalty: γ
C A T - T C A - C
|   |     | |   |
C - T C G C A G C
Score = αw – βx - γy
w = #matches
x = #mismatches
y = #spaces
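The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not course code: the function name `score_alignment` and the default values (α = 10, β = 2, γ = 5, taken from the +10 / -2 / -5 example used later in this lecture) are assumptions made for the example.

```python
def score_alignment(top, bottom, alpha=10, beta=2, gamma=5):
    """Return (score, w, x, y) for two gapped strings of equal length.

    Score = alpha*w - beta*x - gamma*y, where w, x and y count the
    matches, mismatches and spaces. '-' marks a space.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    w = x = y = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            y += 1      # column contains a space
        elif a == b:
            w += 1      # match
        else:
            x += 1      # mismatch
    return alpha * w - beta * x - gamma * y, w, x, y


# The alignment drawn on this slide: 5 matches, 1 mismatch, 3 spaces.
print(score_alignment("CAT-TCA-C", "C-TCGCAGC"))
```

With these assumed values the alignment shown on the slide scores 10·5 - 2·1 - 5·3 = 33.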
Note: I changed symbols
& colors on this slide!
Global Alignment: Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5
  C   T   G   T   C   G   -   C   T   G   C
  -   T   G   C   -   C   G   -   T   G   -
 -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total = 11
We could have done better!!
Note: I changed symbols
& colors on this slide!
Alignment Algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices
Dynamic Programming - Key Idea:
The score of the best possible alignment that ends at a
given pair of positions (i, j) is equal to:
the score of best alignment ending just previous to
those two positions (i.e., ending at i-1, j-1)
PLUS
the score for aligning xi and yj
Global Alignment:
DP Problem Formulation & Notations
Given two sequences (strings)
• X = x1x2…xN of length N
• Y = y1y2…yM of length M
x = AGC     N=3
y = AAAC    M=4
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i]=x1x2…xi with y[1..j]=y1y2…yj
Which means:
S(i,j) = Score of best alignment of
a prefix of X and a prefix of Y
[Figure: matrix with x1 x2 x3 along one axis and y1 y2 y3 y4 along the other]
S(2,3) = score of best alignment
of AG (x1x2) to AAA (y1y2y3)
Dynamic Programming - 4 Steps:
1. Define score of optimal alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal
scores of subproblems, by solving smallest
subproblems first (bottom-up approach)
3. Calculate score of optimal alignment(s)
4. Trace back through matrix to recover optimal
alignment(s) that generated optimal score
1- Define Score of Optimal Alignment
using Recursion
Define:
x1..i = Prefix of length i of x
y1..j = Prefix of length j of y
S(i, j) = Score of optimal alignment of x1..i and y1..j
α = Match Reward
β = Mismatch Penalty
γ = Gap penalty
Initial conditions:
$$S(i,0) = -i \cdot \gamma \qquad S(0,j) = -j \cdot \gamma$$
Recursive definition:
For 1 ≤ i ≤ N, 1 ≤ j ≤ M:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$
σ(xi,yj) = α if xi and yj match, or -β if they mismatch
2- Initialize & Fill in DP Matrix for
Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best
possible score for each alignment ending at residues at [i,j]
[Figure: matrix with rows 0..N and columns 0..M; S(0,0)=0 in the upper left corner, S(i,j) in the interior, S(N,M) in the lower right corner]
How do we calculate S(i,j)?
i.e., Score for alignment of x[1..i] to y[1..j]?
1 of 3 cases ⇒ optimal score for this subproblem:
xi aligns to yj
x1 x2 . . . xi-1 xi
y1 y2 . . . yj-1 yj
S(i-1,j-1) + σ(xi,yj)
xi aligns to a gap
x1 x2 . . . xi-1 xi
y1 y2 . . . yj   -
S(i-1,j) - γ
yj aligns to a gap
x1 x2 . . . xi   -
y1 y2 . . . yj-1 yj
S(i,j-1) - γ
Specific Example:
Note: I changed sequences
on this slide (to match the
rest of DP example)
Case 1: Line up xi with yj - Scoring Consequence? Match Reward or Mismatch Penalty
Case 2: Line up xi with space - Scoring Consequence? Space Penalty
Case 3: Line up yj with space - Scoring Consequence? Space Penalty
[Figure: example alignments of x and y illustrating each case, with positions i-1, i and j-1, j marked]
Ready? Fill in DP Matrix
Keep track of dependencies of scores (in a pointer matrix)
α = Match Reward
β = Mismatch Penalty
γ = Gap penalty
Initialization
$$S(i,0) = -i \cdot \gamma \qquad S(0,j) = -j \cdot \gamma$$
Recursion
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$
σ(xi,yj) = α if xi and yj match, or -β if they mismatch
[Figure: S(i,j) is computed from its three neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1); S(0,0)=0 and S(N,M) mark the corners of the matrix]
Fill in the DP matrix !!
[Figure: DP matrix with the top row (0, -5, -10, ..., -40) and leftmost column (0, -5, -10, ..., -35) initialized and the first cell S(1,1) = 10 filled in]
+10 for match, -2 for mismatch, -5 for space
3- Calculate Score S(N,M) of Optimal
Alignment - for Global Alignment
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space
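The initialization and fill steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fill_dp_matrix` is hypothetical, and the scores are passed as signed values (+10, -2, -5, as written on the slide) and added, which is equivalent to subtracting positive penalties β and γ.

```python
def fill_dp_matrix(top, left, match=10, mismatch=-2, space=-5):
    """Fill the global-alignment DP matrix S, row by row.

    Rows follow `left`, columns follow `top`; S[i][j] is the best score
    for aligning left[:i] with top[:j]. Scores are signed and added.
    """
    n, m = len(left), len(top)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # leftmost column: i spaces
        S[i][0] = i * space
    for j in range(1, m + 1):          # top row: j spaces
        S[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,   # residue vs residue
                          S[i - 1][j] + space,       # left residue vs space
                          S[i][j - 1] + space)       # top residue vs space
    return S


S = fill_dp_matrix("CTCGCAGC", "CATTCAC")
for label, row in zip("λCATTCAC", S):
    print(label, " ".join(f"{v:4d}" for v in row))
print("S(N,M) =", S[-1][-1])
```

For the lecture's example sequences the lower right cell, S(N,M), comes out as 33.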
4- Trace back through matrix to recover
optimal alignment(s) that generated
the optimal score
Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
How? "Repeat" alignment calculations in reverse order,
starting from the position with the highest score and
following the path, position by position, back through
the matrix
Each arrow introduces one character at end of alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
Result? Optimal alignment(s) of sequences
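A traceback along these lines might look like the sketch below. It is an illustration only: `global_align` is a hypothetical name, scores are signed (+10, -2, -5) and added, and instead of storing a pointer matrix it re-checks which neighbor produced each score. When several moves are possible it prefers diagonal, then horizontal, then vertical, so it returns just one of the optimal alignments.

```python
def global_align(top, left, match=10, mismatch=-2, space=-5):
    """Needleman-Wunsch sketch: return (score, aligned_top, aligned_left)."""
    n, m = len(left), len(top)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * space
    for j in range(1, m + 1):
        S[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] + space,
                          S[i][j - 1] + space)

    # Trace back from the lower right corner to the upper left.
    a_top, a_left = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:      # diagonal move
                a_top.append(top[j - 1])
                a_left.append(left[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and S[i][j] == S[i][j - 1] + space:    # horizontal move
            a_top.append(top[j - 1])
            a_left.append("-")                          # gap in left sequence
            j -= 1
        else:                                           # vertical move
            a_top.append("-")                           # gap in top sequence
            a_left.append(left[i - 1])
            i -= 1
    return S[n][m], "".join(reversed(a_top)), "".join(reversed(a_left))


score, aligned_top, aligned_left = global_align("CTCGCAGC", "CATTCAC")
print(score)
print(aligned_top)
print(aligned_left)
```

Recovering every optimal alignment would require following all tied moves rather than the first one that fits.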
Traceback to Recover Alignment
[Figure: filled DP matrix for top sequence CTCGCAGC and left sequence CATTCAC, with the traceback path from the lower right corner (score 33) to the upper left marked by arrows: d = diagonal, h = horizontal, v = vertical]
What are the 2 Global Alignments
with Optimal Score = 33?
1:
C - T C G C A G C
C A T - T C A - C
2:
C - T C G C A G C
C A T T - C A - C
Can have >1 optimal alignment; this example has 2
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to
  contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is
  likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but
  from different species (for example), often exhibit local
  sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic
genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by
RNA processing & are not translated into protein
Local Alignment: Example
G G T C T G A G
A A A C G A
Match: +2
Mismatch or space: -1
Best local alignment:
G G T C T G A G
A A A C - G A -
Score = 5
Local Alignment: Algorithm
Recall: for Global Alignment,
• S [i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalty
For Local Alignment,
• S [i, j] = Score for optimally aligning a suffix of x[1..i] with
a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"
Traceback - for Local Alignment
        λ    C    T    C    G    C    A    G    C
   λ    0    0    0    0    0    0    0    0    0
   C    0    1    0    1    0    1    0    0    1
   A    0    0    0    0    0    0    2    0    0
   T    0    0    1    0    0    0    0    1    0
   T    0    0    1    0    0    0    0    0    0
   C    0    1    0    2    0    1    0    0    1
   A    0    0    0    0    1    0    2    0    0
   C    0    1    0    1    0    2    0    1    1
+1 for a match, -1 for a mismatch, -5 for a space
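The local version can be sketched in the same style. This is an illustration under assumptions: `smith_waterman` is a hypothetical name, scores are signed and added, and the recursion uses the standard Smith-Waterman rule that a cell never drops below 0, which matches the matrix values above although the extracted slide text does not state it explicitly. Traceback starts at each highest-scoring cell and stops at the first 0.

```python
def smith_waterman(top, left, match=1, mismatch=-1, space=-5):
    """Return (best_score, alignments) for all highest-scoring cells."""
    n, m = len(left), len(top)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # top row, left column = 0
    best, best_cells = 0, []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(0,                          # start a new alignment
                          S[i - 1][j - 1] + sigma,
                          S[i - 1][j] + space,
                          S[i][j - 1] + space)
            if S[i][j] > best:
                best, best_cells = S[i][j], [(i, j)]
            elif S[i][j] == best and best > 0:
                best_cells.append((i, j))

    alignments = []
    for i, j in best_cells:
        a_top, a_left = [], []
        while S[i][j] > 0:                            # stop at the first 0
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:    # diagonal move
                a_top.append(top[j - 1])
                a_left.append(left[i - 1])
                i, j = i - 1, j - 1
            elif S[i][j] == S[i][j - 1] + space:      # horizontal move
                a_top.append(top[j - 1])
                a_left.append("-")
                j -= 1
            else:                                     # vertical move
                a_top.append("-")
                a_left.append(left[i - 1])
                i -= 1
        alignments.append(("".join(reversed(a_top)),
                           "".join(reversed(a_left))))
    return best, alignments


best, alignments = smith_waterman("CTCGCAGC", "CATTCAC")
print(best)
for aligned_top, aligned_left in alignments:
    print(aligned_top, aligned_left)
```

For the lecture's sequences the best score is 2 and it is reached in four cells of the matrix, one per optimal local alignment.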
[Figure: the four optimal local alignments (score = 2), each shown against the top sequence CTCGCAGC and numbered 1-4]
(for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be
solved in O(mn) time
• Space requirement can be reduced to O(m+n), while
keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O (dn)
time, where d measures the distance between the
sequences [Landau86]
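As a small illustration of the time and space trade-off, the optimal global score alone can be computed in O(mn) time while keeping only two rows of the matrix. This sketch is not the method of the cited papers: `nw_score_two_rows` is a hypothetical name, scores are signed (+10, -2, -5) and added, and it returns only the score. Recovering the alignment itself in linear space needs a more involved divide-and-conquer scheme, which is not attempted here.

```python
def nw_score_two_rows(x, y, match=10, mismatch=-2, space=-5):
    """Optimal global alignment score using O(min(m, n)) memory."""
    if len(y) > len(x):
        x, y = y, x                    # keep the shorter sequence as the row
    prev = [j * space for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * space] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sigma,   # diagonal
                          prev[j] + space,       # from the row above
                          curr[j - 1] + space)   # from the cell to the left
        prev = curr                    # only the previous row is kept
    return prev[-1]


print(nw_score_two_rows("CTCGCAGC", "CATTCAC"))
```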
Methods
√ Global and Local Alignment
√ Alignment Algorithms
  √ Dot Matrix Method
  √ Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
    • PAM
    • BLOSUM
    • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties
used to reflect cost differences between opening a
gap and extending an existing gap
Total Gap Penalty is a linear function of gap length:
W = γ + δ × (k - 1)
where
γ = gap opening penalty
δ = gap extension penalty
k = length of gap
Sometimes, a Constant Gap Penalty is used, but it is usually
less realistic than the Affine Gap Penalty
Can also be solved in
O(nm) time using DP
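The affine penalty W = γ + δ × (k - 1) is easy to compare with a penalty that charges every space the same amount. In this sketch the function names and the numeric values (opening penalty 10, extension penalty 1, per-space penalty 5) are hypothetical, chosen only to show the shape of the two functions.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """W = gap_open + gap_extend * (k - 1) for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


def per_space_gap_penalty(k, space):
    """Every space in the gap costs the same amount."""
    return space * k


# Hypothetical values: opening is expensive, extending is cheap.
for k in (1, 2, 5, 10):
    print(k, affine_gap_penalty(k, gap_open=10, gap_extend=1),
          per_space_gap_penalty(k, space=5))
```

With these assumed values one long gap of length 10 costs 19 under the affine scheme but 50 when each space is charged separately, so the affine scheme favors a few long gaps over many scattered ones.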
Some Results re: Alignment Algorithms
What are the 4 Local Alignments with
Optimal Score = 2?
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in closely related proteins
• Model includes defined rate for each type of
sequence change
• Suffix number (n) reflects amount of "time"
passed: rate of expected mutation if n% of amino
acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)
BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity:
average % aa identity in the MSA from which the
matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences
BLOSUM62 Substitution Matrix
s(a,b) corresponds to score of
aligning character a with
character b
Match scores are often calculated
based on frequency of mutations in
very similar sequences
(more details later)
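In a DP alignment of proteins, a substitution matrix lookup s(a,b) takes the place of the fixed match and mismatch scores. The sketch below stores only a handful of BLOSUM62 entries in a dictionary; the names `BLOSUM62_SUBSET`, `s` and `ungapped_score` are hypothetical, and a real program would load the complete 20 x 20 matrix from a file or library.

```python
# A few entries of BLOSUM62 (the matrix is symmetric, so each unordered
# pair is stored once). This subset is for illustration only.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("W", "W"): 11, ("I", "L"): 2,
    ("D", "E"): 2, ("K", "R"): 2, ("A", "W"): -3,
}


def s(a, b, matrix=BLOSUM62_SUBSET):
    """Score for aligning character a with character b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]      # raises KeyError if the pair is not stored


def ungapped_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum of s(a, b) over the columns of an ungapped alignment."""
    return sum(s(a, b, matrix) for a, b in zip(seq1, seq2))


print(s("W", "W"), s("L", "I"), s("W", "A"))
print(ungapped_score("AIK", "ALR"))   # A/A + I/L + K/R
```

Note that identical residues do not all score the same (W/W scores higher than A/A), and that conservative substitutions such as I/L receive positive scores.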