#6 - Scoring Matrices & Alignment Statistics 8/31/07 Lecture 6

BCB 444/544
Lecture 6
Finish Dynamic Programming
Scoring Matrices
Alignment Statistics
#6_Aug31

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4
Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5
Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
  http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab #2:
Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6
Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49
BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats
8/31/07

Announcements
Fri Aug 31 - Revised notes for Lecture 5 posted online
Changes? mainly re-ordering, symbols, color "coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!
Tues Sept 4 - Lab #2 Exercise Writeup Due by 5 PM (or sooner!)
Send via email to Pete Zaback petez@iastate.edu
(HW#2 assignment will be posted online)
Fri Sept 14 - HW#2 Due by 5 PM (or sooner!)
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• √Evolutionary Basis
• √Sequence Similarity versus Sequence Identity
• √Sequence Homology versus Sequence Similarity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages

Methods
• √Global and Local Alignment
• √Alignment Algorithms
  • √Dot Matrix Method
  • Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
    • PAM
    • BLOSUM
    • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that
there is maximum correspondence between residues
• DNA
  4 letter alphabet (+ gap)
  TTGACAC
  TTTACAC
• Proteins
  20 letter alphabet (+ gap)
  RKVA-GMA
  RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
  s--e-----qu---en--ce
  sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that rewards matches (+) and penalizes mismatches (-) and gaps (-)
Scoring Function (S):
  S = α(#matches) - β(#mismatches) - γ(#gaps)
               symbol   e.g.
  Match:         α       1
  Mismatch:      β       1
  Gap:           γ       0
Note: I changed symbols & colors on this slide!

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others (physicochemical properties are similar)
  e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)
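
The idea of substitution-dependent "mismatch costs" can be sketched as a small lookup function. This is only an illustration: the scores below are made-up toy values (not taken from PAM or BLOSUM), the set of "similar" pairs is an assumption, and the names sub_score and score_alignment are hypothetical.

```python
# Toy substitution scores for illustration only; these values are made up
# and are NOT taken from PAM or BLOSUM.
SIMILAR_PAIRS = {frozenset("ST"), frozenset("VI")}  # assumed "exchangeable" pairs


def sub_score(a, b):
    """Return a toy score for aligning residue a with residue b."""
    if a == b:
        return 2    # identical residues
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 1    # physicochemically similar residues
    return -1       # dissimilar residues


def score_alignment(row1, row2, gap_penalty=2):
    """Sum the toy scores over aligned columns; each gap column costs gap_penalty."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += sub_score(a, b)
    return total


# Ser/Thr scores higher than Trp/Ala under these toy values.
print(sub_score("S", "T"), sub_score("W", "A"))
# Protein example from the "Goal of Sequence Alignment" slide.
print(score_alignment("RKVA-GMA", "RKIAVAMA"))
```

With these toy values, a Ser/Thr column is rewarded while a Trp/Ala column is penalized, which is the behavior a real amino acid matrix provides with empirically derived numbers.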

Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences
(more details later)

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Global alignment
  CTGTCGCTGCACG
  -TG-C-C-G--TG
Local alignment
  CTGTCGCTGCACG
  -TGCCG-TG----

  CTGTCGCTGCACG
  -TGCCG-T----G
Which is better?

Global vs Local Alignment
Which should be used when?
It is critical to choose correct method!
Global Alignment vs Local Alignment?
Shout out the answers!! Which should we use for?
1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths?
   Hmm - we'll work on that
Excellent!

Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels)
    A C G C G
  A
  C
  A
  C
  G
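
The dot matrix procedure above can be sketched in a few lines. This is a minimal illustration using the "identical match" rule only; the function name dot_plot and the '*' / '.' rendering are choices made here, not part of the lecture.

```python
def dot_plot(top, left):
    """Return one text row per character of `left`; '*' marks an identical match."""
    return ["".join("*" if t == l else "." for t in top) for l in left]


# Sequences from the slide: one along the top row, one along the left column.
top, left = "ACGCG", "ACACG"
print("  " + top)
for label, row in zip(left, dot_plot(top, left)):
    print(label + " " + row)
```

Runs of '*' along a diagonal correspond to stretches where the two sequences match position by position.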

Interpretation of Dot Plots
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate inversions
• What do such patterns mean when comparing a sequence with itself (or its reverse complement)?
  • e.g.: Reverse diagonals crossing diagonals (X's) indicate palindromes
Exploring Dot Plots

Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with spaces inserted in both to reveal similarity
  C A T - T C A - C
  |   |     | |   |
  C - T C G C A G C

Global Alignment: Scoring
  CTGTCG-CTGCACG
  -TGC-CG-TG----
Reward for matches: α
Mismatch penalty: β
Space/gap penalty: γ
Score = αw - βx - γy
w = #matches
x = #mismatches
y = #spaces
Note: I changed symbols & colors on this slide!

Global Alignment: Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5
    C    T    G    T    C    G    -    C    T    G    C
    -    T    G    C    -    C    G    -    T    G    -
   -5   10   10   -2   -5   -2   -5   -5   10   10   -5
Total = 11
We could have done better!!
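
The scoring rule Score = αw - βx - γy can be checked with a short sketch. It assumes the two rows are already aligned (equal length, '-' for a space) and uses the slide's example values α = 10, β = 2, γ = 5 as defaults; the function name score_alignment is hypothetical.

```python
def score_alignment(row1, row2, alpha=10, beta=2, gamma=5):
    """Score two aligned rows: alpha per match, -beta per mismatch, -gamma per space."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    w = x = y = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            y += 1      # column contains a space
        elif a == b:
            w += 1      # match
        else:
            x += 1      # mismatch
    return alpha * w - beta * x - gamma * y


# The 11 columns scored on the slide.
print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))
```

For the 11 columns shown on the slide this gives 4 matches, 2 mismatches and 5 spaces, i.e. 40 - 4 - 25 = 11, matching the slide's total.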

Alignment Algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices

Dynamic Programming - Key Idea:
The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to:
  the score of best alignment ending just previous to those two positions (i.e., ending at i-1, j-1)
  PLUS
  the score for aligning xi and yj

Global Alignment: DP Problem Formulation & Notations
Given two sequences (strings)
• X = x1x2…xN of length N   (x = AGC, N = 3)
• Y = y1y2…yM of length M   (y = AAAC, M = 4)
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj
Which means: Score of best alignment of a prefix of X and a prefix of Y
S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)
[Matrix figure with axes labeled x1 x2 x3 and y1 y2 y3 y4]

Dynamic Programming - 4 Steps:
1. Define score of optimum alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimum alignment(s)
4. Trace back through matrix to recover optimum alignment(s) that generated optimal score

1- Define Score of Optimum Alignment using Recursion
Define:
  x1..i = Prefix of length i of x
  y1..j = Prefix of length j of y
  S(i,j) = Score of optimum alignment of x1..i and y1..j
Initial conditions:
  S(i,0) = -i × γ
  S(0,j) = -j × γ
Recursive definition:
For 1 ≤ i ≤ N, 1 ≤ j ≤ M:
  S(i,j) = max { S(i-1,j-1) + σ(xi,yj),
                 S(i-1,j) - γ,
                 S(i,j-1) - γ }

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix:
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix).
[Matrix figure: indices 0, 1, …, N and 0, 1, …, M; S(0,0)=0 at one corner and S(N,M) at the opposite corner; S(i,j) shown with its neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1)]

2- cont: Fill in DP Matrix
Initialization
  S(i,0) = -i × γ
  S(0,j) = -j × γ
Recursion
  S(i,j) = max { S(i-1,j-1) + σ(xi,yj),
                 S(i-1,j) - γ,
                 S(i,j-1) - γ }
[Matrix figure: same layout, with S(0,0)=0, S(i-1,j-1), S(i-1,j), S(i,j-1), S(i,j) and S(N,M) labeled]

3- Calculate Score S(N,M) of Optimum Alignment - for Global Alignment
What happens in last step in alignment of x[1..i] to y[1..j]?
1 of 3 cases applies:
  xi aligns to yj          xi aligns to a gap       yj aligns to a gap
  x1 x2 . . . xi-1 xi      x1 x2 . . . xi-1 xi      x1 x2 . . . xi   -
  y1 y2 . . . yj-1 yj      y1 y2 . . . yj   -       y1 y2 . . . yj-1 yj
  S(i-1,j-1) + σ(xi,yj)    S(i-1,j) - γ             S(i,j-1) - γ

Example
Case 1: Line up xi with yj
Case 2: Line up xi with space
Case 3: Line up yj with space
[Figure: each case extends the partial alignment x: C A T T C A over y: C - T T C A by one final column, with positions i-1, i and j-1, j marked]

Fill in the matrix
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C   -5   10    5
  A  -10
  T  -15
  T  -20
  C  -25
  A  -30
  C  -35
+10 for match, -2 for mismatch, -5 for space
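
The initialization and recursion for global alignment can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes a simple match/mismatch score for σ and a linear gap penalty γ, puts x on the rows and y on the columns, and uses the hypothetical function name nw_fill.

```python
def nw_fill(x, y, match=10, mismatch=-2, gamma=5):
    """Fill the global-alignment DP matrix S, with x on rows and y on columns."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: S(i,0) = -i * gamma
        S[i][0] = -i * gamma
    for j in range(1, m + 1):          # initialization: S(0,j) = -j * gamma
        S[0][j] = -j * gamma
    for i in range(1, n + 1):          # fill row by row
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,   # xi aligns to yj
                          S[i - 1][j] - gamma,       # xi aligns to a gap
                          S[i][j - 1] - gamma)       # yj aligns to a gap
    return S


# Lecture example: CATTCAC down the left side, CTCGCAGC along the top.
S = nw_fill("CATTCAC", "CTCGCAGC")
for row in S:
    print(" ".join(f"{v:4d}" for v in row))
print("S(N,M) =", S[-1][-1])
```

With the lecture's parameters (+10 for match, -2 for mismatch, -5 for space), the first row and column hold multiples of -5 and the bottom-right cell S(N,M) is 33.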

Calculate score of optimum alignment
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C   -5   10    5    0   -5  -10  -15  -20  -25
  A  -10    5    8    3   -2   -7    0   -5  -10
  T  -15    0   15   10    5    0   -5   -2   -7
  T  -20   -5   10   13    8    3   -2   -7   -4
  C  -25  -10    5   20   15   18   13    8    3
  A  -30  -15    0   15   18   13   28   23   18
  C  -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space

4- Trace back through matrix to recover optimum alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from position with highest score and following path, position by position, back through matrix
Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of sequence alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence

Traceback to Recover Alignment
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C   -5   10    5    0   -5  -10  -15  -20  -25
  A  -10    5    8    3   -2   -7    0   -5  -10
  T  -15    0   15   10    5    0   -5   -2   -7
  T  -20   -5   10   13    8    3   -2   -7   -4
  C  -25  -10    5   20   15   18   13    8    3
  A  -30  -15    0   15   18   13   28   23   18
  C  -35  -20   -5   10   13   28   23   26   33
[Figure: traceback arrows through the matrix, from 33 in the lower right corner back to 0 in the upper left]
Can have >1 optimum alignment; this example has 2
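
The traceback rules above can be sketched as follows. This is an illustration under stated assumptions: σ is a simple match/mismatch score, the gap penalty γ is linear, ties are broken by preferring a diagonal move, then a vertical one, then a horizontal one (so only one of the optimal alignments is returned), and the names nw_fill and nw_traceback are hypothetical.

```python
def nw_fill(x, y, match, mismatch, gamma):
    """Fill the global-alignment DP matrix (x on rows / left, y on columns / top)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gamma
    for j in range(1, m + 1):
        S[0][j] = -j * gamma
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] - gamma,
                          S[i][j - 1] - gamma)
    return S


def nw_traceback(x, y, match=10, mismatch=-2, gamma=5):
    """Return (score, aligned x, aligned y) for one optimal global alignment."""
    S = nw_fill(x, y, match, mismatch, gamma)
    i, j = len(x), len(y)              # start in the lower right corner
    ax, ay = [], []
    while i > 0 or j > 0:
        sigma = match if (i > 0 and j > 0 and x[i - 1] == y[j - 1]) else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1])   # diagonal move
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gamma:
            ax.append(x[i - 1]); ay.append("-")        # vertical move: gap in top sequence
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])        # horizontal move: gap in left sequence
            j -= 1
    return S[-1][-1], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = nw_traceback("CATTCAC", "CTCGCAGC")
print(score)
print(ax)
print(ay)
```

With the diagonal-first tie-breaking assumed here, the sketch recovers the alignment CAT-TCA-C over C-TCGCAGC with score 33; the other optimal alignment differs only in where the gap next to the TT pair is placed.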

Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
  g g t c t g a g
  a a a c g a
Match: +2
Mismatch or space: -1
Best local alignment:
  g g t c t g a g
  a a a c - g a -
Score = 5
(aligned region c t g a over c - g a: 2 - 1 + 2 + 2 = 5)

Local Alignment: Algorithm
• S [i, j] = Score for optimally aligning a suffix of X[1..i] with a suffix of Y[1..j] (i.e., the best local alignment ending at positions i and j)
• Initialize top row & leftmost column of matrix with "0"
Recall: for Global Alignment,
• S [i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalty

Traceback - for Local Alignment
       λ    C    T    C    G    C    A    G    C
  λ    0    0    0    0    0    0    0    0    0
  C    0    1    0    1    0    1    0    0    1
  A    0    0    0    0    0    0    2    0    0
  T    0    0    1    0    0    0    0    1    0
  T    0    0    1    0    0    0    0    0    0
  C    0    1    0    2    0    1    0    0    1
  A    0    0    0    0    1    0    2    0    0
  C    0    1    0    1    0    2    0    1    1
+1 for a match, -1 for a mismatch, -5 for a space
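
The local alignment variant can be sketched as follows. This is an illustration, not a reference implementation: it assumes the standard Smith-Waterman rule that no cell drops below 0 (consistent with the zero-initialized matrix above), a simple match/mismatch score, a linear gap penalty, and a traceback that starts at the highest-scoring cell and stops at the first 0. The function name smith_waterman is hypothetical.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gamma=1):
    """Return (best score, aligned x region, aligned y region) for one best local alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # top row and leftmost column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,                          # start a new local alignment
                          S[i - 1][j - 1] + sigma,
                          S[i - 1][j] - gamma,
                          S[i][j - 1] - gamma)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j
    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j, ax, ay = bi, bj, [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        sigma = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - gamma:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# Example from the "Local Alignment: Example" slide: match +2, mismatch or space -1.
print(smith_waterman("ggtctgag", "aaacga"))
# Matrix example above: +1 match, -1 mismatch, -5 space.
print(smith_waterman("CATTCAC", "CTCGCAGC", match=1, mismatch=-1, gamma=5)[0])
```

For the slide example this yields the region c t g a over c - g a with score 5; for the matrix example the highest cell value is 2, as in the table.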

Some Results re: Alignment Algorithms
(for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping the same O(mn) run-time [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in closely related proteins
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences
Statistical Significance
of Sequence Alignment

Affine Gap Penalty Functions
Gap penalty = h + gk
where
k = length of gap
h = gap opening penalty
g = gap extension penalty
Can also be solved in O(nm) time using dynamic programming
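
The affine gap penalty h + gk can be illustrated with a short sketch that charges each run of consecutive gap characters once for opening and once per position for extension. The values h = 10 and g = 1 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re


def affine_gap_penalty(k, h, g):
    """Penalty for a single gap of length k: opening cost h plus extension cost g per position."""
    return h + g * k


def total_gap_penalty(aligned_row, h, g):
    """Sum the affine penalty over every run of '-' in one aligned row."""
    return sum(affine_gap_penalty(len(run), h, g)
               for run in re.findall(r"-+", aligned_row))


# Aligned rows of sequence 2 from the "Global vs Local Alignment - example" slide.
print(total_gap_penalty("-TG-C-C-G--TG", h=10, g=1))   # five separate gaps
print(total_gap_penalty("-TGCCG-TG----", h=10, g=1))   # three gaps, one of them long
```

Both rows contain six gap characters, but under these assumed values the row with fewer, longer gaps is penalized less (36 versus 56), which is the effect an affine penalty is meant to produce.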