#6 - More DP: Global vs Local Alignment
BCB 444/544 F07 ISU Dobbs #6 - More DP: Global vs Local Alignment
8/31/07
Lecture 6
Pairwise Sequence Alignment
Try to Finish Dynamic Programming
Global & Local Alignment
Next lecture:
Scoring Matrices
Alignment Statistics
#6_Aug31
Required Reading (before lecture)
Mon Aug 27 - for Lecture #4
Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5
Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab #2:
Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6
Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49
Announcements
Fri Aug 31 - Revised notes for Lecture 5 posted online
Changes? mainly re-ordering, symbols, color "coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!
Tues Sept 4 - Lab #2 Exercise Writeup Due by 5 PM (or sooner!)
Send via email to Pete Zaback petez@iastate.edu
(HW#2 assignment will be posted online)
Fri Sept 14 - HW#2 Due by 5 PM (or sooner!)
Fri Sept 21 - Exam #1
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Chp 3 - Sequence Alignment
√ Evolutionary Basis
√ Sequence Homology versus Sequence Similarity
√ Sequence Similarity versus Sequence Identity
Methods
√ Global and Local Alignment
√ Alignment Algorithms
√ Dot Matrix Method
Dynamic Programming Method - cont
Scoring Matrices
Statistical Significance of Sequence Alignment
Methods - cont
Dynamic Programming Method - cont
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages
Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA
4 letter alphabet (+ gap)
TTGACAC
TTTACAC
• Proteins
20 letter alphabet (+ gap)
RKVA-GMA
RKIAVAMA
Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score
Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
s--e-----qu---en--ce
sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that rewards matches (+) and penalizes mismatches (-) and gaps (-)
Scoring Function (S):
Match: α (e.g. 1)
Mismatch: β (e.g. 1)
Gap: γ (e.g. 0)
S = α(#matches) - β(#mismatches) - γ(#gaps)
Note: I changed symbols & colors on this slide!
Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others (physicochemical properties are similar)
e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)
Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)
Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Global alignment
CTGTCGCTGCACG
-TG-C-C-G--TG
Local alignment
CTGTCGCTGCACG
-TGCCG-TG----
CTGTCGCTGCACG
-TGCCG-T----G
Which is better?
Global vs Local Alignment
Which should be used when?
It is critical to choose correct method!
Global Alignment vs Local Alignment?
Shout out the answers!! Which should we use for?
1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths?
Hmm - we'll work on that
Excellent!
Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)
Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels)
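The dot matrix steps above can be sketched in a few lines of Python. This is only an illustration, assuming the simple "identical match" rule; the function name dot_plot and the '*' / '.' symbols are choices made here, not part of the lecture.

```python
def dot_plot(top: str, left: str) -> list[str]:
    """Return one text row per character of `left`.

    Each row starts with the character from the left sequence, followed by
    '*' where it is identical to the character of `top` in that column and
    '.' where it is not.
    """
    rows = []
    for b in left:
        cells = ["*" if a == b else "." for a in top]
        rows.append(b + " " + " ".join(cells))
    return rows


if __name__ == "__main__":
    # Sequences taken from the small dot plot shown in the slides.
    top, left = "ACGCG", "ACACG"
    print("  " + " ".join(top))
    print("\n".join(dot_plot(top, left)))
```

Runs of '*' along a diagonal correspond to the "areas of match" described above; a shift from one diagonal to a neighbouring one corresponds to a gap.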
Interpretation of Dot Plots
A C G C G
A
C
A
C
G
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate inversions
Exploring Dot Plots
• What do such patterns mean when comparing a sequence with itself (or its reverse complement)?
• e.g.: Reverse diagonals crossing diagonals (X's) indicate palindromes
Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with spaces inserted in both to reveal similarity
C A T - T C A - C
|   |     | |   |
C - T C G C A G C
Global Alignment: Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches: α
Mismatch penalty: β
Space/gap penalty: γ
Score = αw - βx - γy
w = #matches
x = #mismatches
y = #spaces
Note: I changed symbols & colors on this slide!
Global Alignment: Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5
 C  T  G  T  C  G  -  C  T  G  C
 -  T  G  C  -  C  G  -  T  G  -
-5 10 10 -2 -5 -2 -5 -5 10 10 -5
Total = 11
We could have done better!!
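As a rough check of the scoring formula, the sketch below scores an alignment that has already been written out with gaps. It assumes "-" is the gap character and uses the slide's values (10, 2, 5) as defaults for α, β and γ; the function name score_alignment is made up for this example.

```python
def score_alignment(row1: str, row2: str, alpha=10, beta=2, gamma=5) -> int:
    """Score = alpha*w - beta*x - gamma*y for two equal-length gapped rows.

    w = number of matches, x = number of mismatches, y = number of columns
    that contain a space ('-').
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    w = x = y = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            y += 1          # space/gap column
        elif a == b:
            w += 1          # match
        else:
            x += 1          # mismatch
    return alpha * w - beta * x - gamma * y


# The 11 scored columns from the slide: 4 matches, 2 mismatches, 5 spaces.
print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 40 - 4 - 25 = 11
```

Scoring an alignment this way is cheap; the hard part, which dynamic programming addresses, is finding the alignment with the best score.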
Alignment Algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices
Dynamic Programming - Key Idea:
The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to:
the score of best alignment ending just previous to those two positions (i.e., ending at i-1, j-1)
PLUS
the score for aligning xi and yj
Dynamic Programming - 4 Steps:
1. Define score of optimum alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimum alignment(s)
4. Trace back through matrix to recover optimum alignment(s) that generated optimal score
Global Alignment: DP Problem Formulation & Notations
Given two sequences (strings)
• X = x1x2…xN of length N
• Y = y1y2…yM of length M
x = AGC, N = 3
y = AAAC, M = 4
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj
Which means: Score of best alignment of a prefix of X and a prefix of Y
S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)
[Figure: matrix with one axis labeled x1 x2 x3 and the other labeled y1 y2 y3 y4]
1- Define Score of Optimum Alignment using Recursion
Define:
x1..i = Prefix of length i of x
y1..j = Prefix of length j of y
S(i, j) = Score of optimum alignment of x1..i and y1..j
Initial conditions:
$$S(i,0) = -i\gamma, \qquad S(0,j) = -j\gamma$$
Recursive definition: for $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$
What happens in last step in alignment of x[1..i] to y[1..j]?
1 of 3 cases applies:
• xi aligns to yj: S(i-1,j-1) + σ(xi,yj)
• xi aligns to a gap: S(i-1,j) - γ
• yj aligns to a gap: S(i,j-1) - γ
2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix:
[Figure: matrix with indices 0, 1, …, N and 0, 1, …, M, starting at S(0,0)=0; cell S(i,j) is computed from its neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1)]
Initialization: $S(i,0) = -i\gamma$, $S(0,j) = -j\gamma$
Recursion: $S(i,j) = \max\{\, S(i-1,j-1) + \sigma(x_i, y_j),\; S(i-1,j) - \gamma,\; S(i,j-1) - \gamma \,\}$
2- cont: Fill in DP Matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix).
[Figure: matrix from S(0,0)=0 to S(N,M), with S(i,j) and its neighbors S(i-1,j-1), S(i-1,j), S(i,j-1) highlighted]
3- Calculate Score S(N,M) of Optimum Alignment - for Global Alignment
[Figure: matrix from S(0,0)=0 to S(N,M); the score of the optimum global alignment is S(N,M)]
Case 1: Line up xi with yj
Case 2: Line up xi with space
Case 3: Line up yj with space
[Figure: for each case, a partial alignment of x and y with positions i-1, i and j-1, j marked]
Fill in the matrix
Example
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5
   A  -10
   T  -15
   T  -20
   C  -25
   A  -30
   C  -35
+10 for match, -2 for mismatch, -5 for space
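The initialization and fill steps for global alignment can be sketched as follows. This is a minimal illustration, not a full Needleman-Wunsch implementation: the name nw_fill is made up here, the left sequence is assumed to index the rows, and the gap value is passed as a negative number (-5) and added, which is equivalent to subtracting the penalty γ = 5.

```python
def nw_fill(x: str, y: str, match=10, mismatch=-2, gap=-5) -> list[list[int]]:
    """Fill the (len(x)+1) x (len(y)+1) score matrix S for global alignment.

    S[i][j] is the best score for aligning the prefix x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: aligning a prefix against nothing costs one gap per character.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    # Recursion, row by row: best of the three possible last columns.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                S[i - 1][j] + gap,        # x_i aligned to a gap
                S[i][j - 1] + gap,        # y_j aligned to a gap
            )
    return S


# Lecture example: left sequence CATTCAC, top sequence CTCGCAGC.
S = nw_fill("CATTCAC", "CTCGCAGC")
print(S[-1][-1])  # score of the optimum global alignment, S(N,M)
```

Each cell needs only its three neighbors, so the whole matrix is filled in time proportional to N x M.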
Calculate score of optimum alignment
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space
4- Trace back through matrix to recover optimum alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix
Result? Optimal alignment(s) of sequences
Traceback to Recover Alignment
Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of sequence
alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
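The traceback rules above can be sketched as below. This is one possible illustration under stated assumptions: instead of storing a pointer matrix, it re-checks which of the three cases produced each score, and when several cases tie it prefers diagonal, then vertical, then horizontal, so it returns only one of the possibly several optimum alignments. The name nw_align and the choice of x as the left (row) sequence are assumptions made here.

```python
def nw_align(x: str, y: str, match=10, mismatch=-2, gap=-5):
    """Global alignment of x (left sequence) and y (top sequence).

    Returns (score, aligned_x, aligned_y) for one optimum alignment.
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Traceback: start in the lower right corner and walk back to the upper left.
    i, j = n, m
    out_x, out_y = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma:
            out_x.append(x[i - 1]); out_y.append(y[j - 1])  # diagonal move
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_x.append(x[i - 1]); out_y.append("-")       # vertical: gap in top sequence
            i -= 1
        else:
            out_x.append("-"); out_y.append(y[j - 1])       # horizontal: gap in left sequence
            j -= 1
    return S[n][m], "".join(reversed(out_x)), "".join(reversed(out_y))


score, left, top = nw_align("CATTCAC", "CTCGCAGC")
print(score)
print(top)
print(left)
```

Because each step adds one column to the front of the alignment, the columns are collected in reverse and flipped at the end.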
[Figure: the same score matrix as on the previous slide (top sequence CTCGCAGC, left sequence CATTCAC), with the traceback path marked from 33 in the lower right corner back to 0 in the upper left corner]
* Can have >1 optimum alignment; this example has 2
What are the 2 Alignments with Optimum Score = 33?
1:
C - T C G C A G C
C A T - T C A - C
2:
C - T C G C A G C
C A T T - C A - C
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein
Local Alignment: Algorithm
• S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"
Recall: for Global Alignment,
• S[i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column with gap penalties
Local Alignment: Example
g g t c t g a g
a a a c g a
Match: +2
Mismatch or space: -1
Best local alignment:
g g t c t g a g
a a a c - g a -
Score = 5
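The local alignment fill step differs from the global one in two ways: the borders are initialized with 0, and no cell is allowed to drop below 0. A minimal sketch is shown below; the function name sw_fill is made up here, and the default scores are the ones from the example above (+2 for a match, -1 for a mismatch or space).

```python
def sw_fill(x: str, y: str, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix; return (best_score, S).

    S[i][j] is the best score of an alignment between a suffix of x[:i]
    and a suffix of y[:j]; the empty alignment scores 0.
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # top row and left column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                0,                        # start a new local alignment here
                S[i - 1][j - 1] + sigma,  # extend with x_i aligned to y_j
                S[i - 1][j] + gap,        # extend with x_i aligned to a space
                S[i][j - 1] + gap,        # extend with y_j aligned to a space
            )
            best = max(best, S[i][j])
    return best, S


best, _ = sw_fill("ggtctgag", "aaacga")
print(best)  # 5, from aligning "ctga" with "c-ga"
```

For local alignment the traceback starts at the cell holding the highest score anywhere in the matrix and stops when it reaches a cell with score 0.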
Traceback - for Local Alignment
        λ    C    T    C    G    C    A    G    C
   λ    0    0    0    0    0    0    0    0    0
   C    0    1    0    1    0    1    0    0    1
   A    0    0    0    0    0    0    2    0    0
   T    0    0    1    0    0    0    0    1    0
   T    0    0    1    0    0    0    0    0    0
   C    0    1    0    2    0    1    0    0    1
   A    0    0    0    0    1    0    2    0    0
   C    0    1    0    1    0    2    0    1    1
+1 for a match, -1 for a mismatch, -5 for a space
Some Results re: Alignment Algorithms
(for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in closely related proteins
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)
BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: the % aa identity threshold used to cluster sequences in the blocks (MSA) from which the matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences
Statistical Significance of Sequence Alignment
Affine Gap Penalty Functions
Gap penalty = h + gk
where
k = length of gap
h = gap opening penalty
g = gap extension penalty
Can also be solved in
O(nm) time using
dynamic programming
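To see why the affine form matters, the sketch below compares one long gap with several short ones. The function name and the values h = 10, g = 1 are hypothetical, chosen only for illustration; the slides do not give numeric values.

```python
def affine_gap_penalty(k: int, h: float, g: float) -> float:
    """Penalty h + g*k for a single gap of length k (k >= 1).

    h = gap opening penalty, g = gap extension penalty.
    """
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return h + g * k


# Hypothetical values: opening a gap is expensive, extending it is cheap.
h, g = 10, 1
one_long_gap = affine_gap_penalty(3, h, g)          # 10 + 1*3 = 13
three_short_gaps = 3 * affine_gap_penalty(1, h, g)  # 3 * (10 + 1) = 33
print(one_long_gap, three_short_gaps)
```

With h > 0, a single gap of length 3 is penalized less than three separate gaps of length 1, whereas a purely linear penalty would treat the two cases the same.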