#8 Finish DP, Scoring Matrices, Stats 9/7/07 & BLAST BCB 444/544

BCB 444/544
Lecture 8 (#8_Sept7)
• Finish: Dynamic Programming
• Global vs Local Alignment
• Scoring Matrices & Alignment Statistics
• BLAST

Required Reading (before lecture)
√ Last week - for Lectures 4-7
Pairwise Sequence Alignment, Dynamic Programming,
Global vs Local Alignment, Scoring Matrices, Statistics
• Xiong: Chp 3
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
√ Wed Sept 5 - for Lecture 7 & Lab 3
Database Similarity Searching: BLAST (nope, more DP)
• Chp 4 - pp 51-62
Fri Sept 7 - for Lecture 8 (will finish on Monday)
BLAST variations; BLAST vs FASTA
• Chp 4 - pp 51-62
BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST
Assignments & Announcements
√ Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM
Send via email to Pete Zaback petez@iastate.edu
(For now, no late penalty - just send ASAP)
√ Wed Sept 5 - Notes for Lecture 5 posted online
- HW#2 posted online & sent via email & handed out in class
Fri Sept 14 - HW#2 Due by 5 PM
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Methods - cont
• √ Global and Local Alignment
• √ Alignment Algorithms
• √ Dot Matrix Method
• Dynamic Programming Method - cont
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Dynamic Programming - 4 Steps:
1. Define score of optimal alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal
scores of subproblems, by solving smallest
subproblems first (bottom-up approach)
3. Calculate score of optimal alignment(s)
4. Trace back through matrix to recover optimal
alignment(s) that generated optimal score
1- Define Score of Optimal Alignment using Recursion
Define:
x1..i = Prefix of length i of x
y1..j = Prefix of length j of y
S(i, j) = Score of optimal alignment of x1..i and y1..j
α = Match Reward
β = Mismatch Penalty
γ = Gap penalty
σ(xi,yj) = α or β
Initial conditions:
\[ S(0,0) = 0, \qquad S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma \]
Recursive definition:
For 1 ≤ i ≤ N, 1 ≤ j ≤ M:
\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + \sigma(x_i, y_j) \\
S(i-1,\,j) - \gamma \\
S(i,\,j-1) - \gamma
\end{cases}
\]

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best
possible score for each alignment ending at residues at [i,j]
[Figure: DP matrix with rows 0..N and columns 0..M; S(0,0)=0 in the top left cell, S(i,j) in the interior, S(N,M) in the bottom right cell]
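The initialization and recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `global_score_matrix` is made up here, and the default values (+10 match, -2 mismatch, gap penalty 5) are taken from the worked example used later in this lecture.

```python
def global_score_matrix(x, y, match=10, mismatch=-2, gap=5):
    """Fill the global-alignment DP matrix.

    S[i][j] is the best score for aligning the prefixes x[:i] and y[:j].
    `match` and `mismatch` play the roles of alpha and beta; `gap` is gamma
    (given as a positive number and subtracted).
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]

    # Initial conditions: S(i,0) = -i*gamma and S(0,j) = -j*gamma
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap

    # Recursion, filled row by row so every dependency is already known
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                S[i - 1][j] - gap,        # x_i aligned to a gap
                S[i][j - 1] - gap,        # y_j aligned to a gap
            )
    return S


# Left (row) sequence and top (column) sequence from the lecture example
S = global_score_matrix("CATTCAC", "CTCGCAGC")
print(S[-1][-1])  # 33, the score in the lower right cell
```

Filling row by row is the bottom-up order described in step 2: each cell only looks at its upper, left, and upper-left neighbours.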
How do we calculate S(i,j)?
i.e., Score for alignment of x[1..i] to y[1..j]?
1 of 3 cases ⇒ optimal score for this subproblem:

xi aligns to yj:
x1 x2 . . . xi-1 xi
y1 y2 . . . yj-1 yj
Score: S(i-1,j-1) + σ(xi,yj)

xi aligns to a gap:
x1 x2 . . . xi-1 xi
y1 y2 . . . yj   -
Score: S(i-1,j) - γ

yj aligns to a gap:
x1 x2 . . . xi   -
y1 y2 . . . yj-1 yj
Score: S(i,j-1) - γ

Specific Example:
Note: I changed sequences on this slide (to match the rest of DP example)
Scoring Consequence?
• Case 1: Line up xi with yj (A at i over A at j): Match Bonus
• Case 2: Line up xi with space (A at i over -): Space Penalty
• Case 3: Line up yj with space (- over A at j): Space Penalty
[Figure: the two example sequences with the last column aligned in each of the three ways]

Ready? Fill in DP Matrix
α = Match Reward
β = Mismatch Penalty
γ = Gap penalty
Initialization
\[ S(0,0) = 0, \qquad S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma \]
Recursion
\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + \sigma(x_i, y_j) \\
S(i-1,\,j) - \gamma \\
S(i,\,j-1) - \gamma
\end{cases}
\]
Keep track of dependencies of scores (in a pointer matrix)
[Figure: cell S(i,j) receives arrows from S(i-1,j-1) labeled σ(xi,yj) = α or β, from S(i-1,j) labeled -γ, and from S(i,j-1) labeled -γ; the matrix runs from S(0,0)=0 to S(N,M)]

Fill in the DP matrix !!
+10 for match, -2 for mismatch, -5 for space
      λ    C    T    C    G    C    A    G    C
λ     0   -5  -10  -15  -20  -25  -30  -35  -40
C    -5   10    5    .    .    .    .    .    .
A   -10    .    .    .    .    .    .    .    .
T   -15    .    .    .    .    .    .    .    .
T   -20    .    .    .    .    .    .    .    .
C   -25    .    .    .    .    .    .    .    .
A   -30    .    .    .    .    .    .    .    .
C   -35    .    .    .    .    .    .    .    .
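As a worked check of the first two filled cells, the recursion can be evaluated by hand. This sketch assumes the convention used in this example: rows are indexed by the left sequence x = CATTCAC, columns by the top sequence y = CTCGCAGC, with α = +10, β = -2 and γ = 5.

```latex
\begin{align*}
S(1,1) &= \max\{\, S(0,0) + \sigma(C,C),\; S(0,1) - \gamma,\; S(1,0) - \gamma \,\}
        = \max\{\, 0 + 10,\; -5 - 5,\; -5 - 5 \,\} = 10 \\
S(1,2) &= \max\{\, S(0,1) + \sigma(C,T),\; S(0,2) - \gamma,\; S(1,1) - \gamma \,\}
        = \max\{\, -5 - 2,\; -10 - 5,\; 10 - 5 \,\} = 5
\end{align*}
```

The 10 comes from the diagonal (a C/C match) and the 5 comes from the left (a space), which is exactly the dependency information the pointer matrix records.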
3- Calculate Score S(N,M) of Optimal Alignment - for Global Alignment
      λ    C    T    C    G    C    A    G    C
λ     0   -5  -10  -15  -20  -25  -30  -35  -40
C    -5   10    5    0   -5  -10  -15  -20  -25
A   -10    5    8    3   -2   -7    0   -5  -10
T   -15    0   15   10    5    0   -5   -2   -7
T   -20   -5   10   13    8    3   -2   -7   -4
C   -25  -10    5   20   15   18   13    8    3
A   -30  -15    0   15   18   13   28   23   18
C   -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space

4- Trace back through matrix to recover optimal alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order,
starting from position with highest score and
following path, position by position, back through
matrix
Result? Optimal alignment(s) of sequences
Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
[Figure: filled DP matrix for top sequence C T C G C A G C and left sequence C A T T C A C, with traceback arrows running from the lower right cell (33) to the upper left cell (0)]
Can have >1 optimal alignment; this example has 2
Traceback to Recover Alignment
[Figure: filled DP matrix for top sequence C T C G C A G C and left sequence C A T T C A C, with the traceback path drawn as red arrows]
+10 for match, -2 for mismatch, -5 for space
Where did red arrows come from?
• Where did 33 come from? Match = 10, so 33-10 = 23
Must have come from diagonal
• Where did 23 come from? (Not a match)
Left? 28-5 = 23; Diag? 13-2 = 11; Top? 8-5 = 3
Only the left cell gives 23, so it must have come from the left
Traceback to Recover Alignment
[Figure: filled DP matrix for top sequence C T C G C A G C and left sequence C A T T C A C, with the traceback path branching at the cell holding 8]
+10 for match, -2 for mismatch, -5 for space
• Where did 8 come from?
Two possibilities: 13-5 = 8 or 10-2 = 8
• Then, follow both paths
Great - but what are the alignments? #1
[Figure: DP matrix with path 1 annotated cell by cell]
Path 1, read from upper left to lower right:
C with C, - with A, T with T, C with T, G with -, C with C, A with A, G with -, C with C

Great - but what are the alignments? #2
[Figure: DP matrix with path 2 annotated cell by cell]
Path 2, read from upper left to lower right:
C with C, - with A, T with T, C with -, G with T, C with C, A with A, G with -, C with C

What are the 2 Global Alignments with Optimal Score = 33?
Top: C T C G C A G C
Left: C A T T C A C
1:
C - T C G C A G C
C A T T - C A - C
2:
C - T C G C A G C
C A T - T C A - C

or, Check Traceback?
• h = horizontal move puts a gap in left sequence
• v = vertical move puts a gap in top sequence
• d = diagonal move uses one character from each sequence
Path 1: d v d d h d d h d
Path 2: d v d h d d d h d
Check the scores: +10 for match, -2 for mismatch, -5 for space
1: 10 - 5 + 10 - 2 - 5 + 10 + 10 - 5 + 10 = 33
2: 10 - 5 + 10 - 5 - 2 + 10 + 10 - 5 + 10 = 33
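The traceback rules (diagonal, vertical, horizontal) can be sketched as a short recursive walk that follows every arrow consistent with the recursion, so that ties produce more than one alignment. This is an illustrative sketch only: the function name `global_alignments` and the string representation of alignments are assumptions made here, and the scores are the +10 / -2 / -5 values from the example.

```python
def global_alignments(left, top, match=10, mismatch=-2, gap=5):
    """Return (optimal score, list of all optimal global alignments).

    Each alignment is a pair of gapped strings (top_row, left_row).
    Rows of the DP matrix follow `left`, columns follow `top`.
    """
    n, m = len(left), len(top)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap)

    results = []

    def walk(i, j, top_row, left_row):
        # Rows are built back to front, so reverse them at the upper left corner
        if i == 0 and j == 0:
            results.append((top_row[::-1], left_row[::-1]))
            return
        if i > 0 and j > 0:
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:
                # diagonal move: one character from each sequence
                walk(i - 1, j - 1, top_row + top[j - 1], left_row + left[i - 1])
        if i > 0 and S[i][j] == S[i - 1][j] - gap:
            # vertical move: gap in top sequence
            walk(i - 1, j, top_row + "-", left_row + left[i - 1])
        if j > 0 and S[i][j] == S[i][j - 1] - gap:
            # horizontal move: gap in left sequence
            walk(i, j - 1, top_row + top[j - 1], left_row + "-")

    walk(n, m, "", "")
    return S[n][m], results


score, alignments = global_alignments("CATTCAC", "CTCGCAGC")
print(score)
for top_row, left_row in alignments:
    print(top_row)
    print(left_row)
    print()
```

Instead of storing a separate pointer matrix, this sketch re-derives each pointer by checking which neighbour could have produced the cell's score, which is the same "where did 33 come from?" reasoning done by hand on the slides.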
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
• Non-coding regions (if "non-functional") are more likely to
contain mutations than coding regions
• Local alignment between two protein-encoding sequences is
likely to be between two exons
• To locate protein domains or motifs:
• Proteins with similar structures and/or similar functions but
from different species (for example), often exhibit local
sequence similarities
• Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic
genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by
RNA processing & are not translated into protein

Local Alignment: Example
G G T C T G A G
A A A C G A
Match: +2
Mismatch or space: -1
Best local alignment (C T G A over C - G A):
G G T C T G A G
A A A C - G A -
Score = 5

Local Alignment: Algorithm
(This slide has been changed!)
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix:
In local alignment, no negative scores
Assign "0" to cells with negative scores
3) Optimal score? in highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell
containing the optimal score, until a cell with "0" is
reached (not just from lower right corner)

Local Alignment DP: Initialization & Recursion
(New Slide)
\[ S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0 \]
\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + \sigma(x_i, y_j) \\
S(i-1,\,j) - \gamma \\
S(i,\,j-1) - \gamma \\
0
\end{cases}
\]
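The local alignment recursion differs from the global one only in the zero initialization and the extra "0" option, which a short Python sketch makes visible. The function names `local_score_matrix` and `best_cells` are hypothetical; the scores in the usage lines (match +2, mismatch or space -1) are the ones from the G G T C T G A G / A A A C G A example.

```python
def local_score_matrix(x, y, match, mismatch, gap):
    """Fill the local-alignment DP matrix; no cell is allowed to go below 0."""
    n, m = len(x), len(y)
    # Top row and leftmost column stay at 0
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                0,                        # start a fresh local alignment here
                S[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                S[i - 1][j] - gap,        # x_i aligned to a gap
                S[i][j - 1] - gap,        # y_j aligned to a gap
            )
    return S


def best_cells(S):
    """Return the highest score and every cell (i, j) that holds it."""
    best = max(max(row) for row in S)
    cells = [(i, j)
             for i, row in enumerate(S)
             for j, value in enumerate(row)
             if value == best]
    return best, cells


S = local_score_matrix("AAACGA", "GGTCTGAG", match=2, mismatch=-1, gap=1)
best, cells = best_cells(S)
print(best, cells)  # 5 [(6, 7)]
```

`best_cells` returns every cell holding the optimal score because, as step 4 says, traceback has to start from each of them, not just from the lower right corner.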
Filling in DP Matrix for Local Alignment
No negative scores - fill in "0"
+1 for match, -1 for mismatch, -5 for space
     λ  C  T  C  G  C  A  G  C
λ    0  0  0  0  0  0  0  0  0
C    0  1  0  1  0  1  0  0  1
A    0  0  0  0  0  0  2  0  0
T    0  0  1  0  0  0  0  1  0
T    0  0  1  0  0  0  0  0  0
C    0  1  0  2  0  1  0  0  1
A    0  0  0  0  1  0  2  0  0
C    0  1  0  1  0  2  0  1  1

Traceback - for Local Alignment
[Figure: the same local alignment matrix, with traceback arrows starting at each cell that holds the highest score and running back until a cell with 0 is reached]
+1 for match, -1 for mismatch, -5 for space
What are the 4 Local Alignments with Optimal Score = 2?
Top: C T C G C A G C
Left: C A T T C A C
1:
C A (Top, positions 5-6)
C A (Left, positions 1-2)
2:
C A (Top, positions 5-6)
C A (Left, positions 5-6)
3:
T C (Top, positions 2-3)
T C (Left, positions 4-5)
4:
T C G C (Top, positions 2-5)
T C A C (Left, positions 4-7)
Check the scores: +1 for match, -1 for mismatch, -5 for space

Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties
used to reflect cost differences between opening a
gap and extending an existing gap
Total Gap Penalty is linear function of gap length:
W = γ + δ × (k - 1)
where
γ = gap opening penalty
δ = gap extension penalty
k = length of gap
Can also be solved in O(nm) time using DP
Sometimes, a Constant Gap Penalty is used, but it is usually
less realistic than the Affine Gap Penalty

Some Results re: Alignment Algorithms
(for ComS, CprE & Math types)
• Most pairwise sequence alignment problems can be
solved in O(mn) time
• Space requirement can be reduced to O(m+n), while
keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn)
time, where d measures the distance between the
sequences [Landau86]
for Biologists: Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to
solve a problem
• expressed as a function of size of the problem
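The affine gap penalty W = γ + δ × (k - 1) given earlier in this section is easy to compare with a plain per-space penalty in code. In this small sketch the function names and the numeric values (opening 10, extension 1, per-space 5) are illustrative assumptions, not values from the lecture.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Total penalty W for one gap of length k: gap_open + gap_extend * (k - 1)."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


def per_space_penalty(k, space):
    """Penalty when every space in the gap costs the same amount."""
    return space * k


for k in (1, 2, 5):
    print(k, affine_gap_penalty(k, gap_open=10, gap_extend=1), per_space_penalty(k, space=5))
```

With these made-up numbers a single long gap of length 5 costs 14 under the affine scheme but 25 when charged per space, so the affine scheme favors one long gap over several scattered short ones.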
Methods
√ Global and Local Alignment
√ Alignment Algorithms
√ Dot Matrix Method
√ Dynamic Programming Method - cont
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in closely related proteins
• Model includes defined rate for each type of
sequence change
• Suffix number (n) reflects amount of "time"
passed: rate of expected mutation if n% of amino
acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity:
average % aa identity in the MSA from which the
matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences

PAM250 vs BLOSUM 62
See Text
Fig 3.5 = PAM250
Fig 3.6 = BLOSUM62
Usually only 1/2 of matrix is
displayed (it is symmetric)
Here:
s(a,b) corresponds to score of
aligning character a with
character b

Which is Better?
PAM or BLOSUM
• PAM matrices
• derived from evolutionary model
• often used in reconstructing phylogenetic trees - but, not
very good for highly divergent sequences
• BLOSUM matrices
• based on direct observations
• more "realistic" - and outperform PAM matrices in terms of
accuracy in local alignment

Which Type of Matrix Should You Use?
Several other types of matrices available:
• Gonnet & Jones-Taylor-Thornton:
• very robust in tree construction
• "Best" matrix depends on task:
• different matrices for different applications
ADVICE: if unsure, try several different matrices
& choose the one that gives best alignment result

Sequence Alignment Statistics
• Distribution of similarity scores in sequence alignment
is not a simple "normal" distribution
• "Gumbel extreme value distribution" - a highly skewed
distribution with a long tail
How Assess Statistical Significance of an Alignment?
• Compare score of an alignment with distribution of scores
of alignments for many 'randomized' (shuffled) versions of
the original sequence
• If score is in extreme margin, then unlikely due to random
chance
• P-value = probability that original alignment is due to
random chance (lower P is better)
P = 10^-5 - 10^-50: sequences have clear homology
P > 10^-1: no better than random
Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html
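The shuffling idea can be sketched as a small Monte Carlo estimate. This is not how PRSS itself is implemented; it is a rough illustration under stated assumptions: the function names, the simple +1 / -1 / gap 2 local score, the number of trials, and the (hits + 1) / (trials + 1) estimate are all choices made here for the example.

```python
import random


def local_score(x, y, match=1, mismatch=-1, gap=2):
    """Best local alignment score, keeping only one previous row in memory."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + sigma, prev[j] - gap, cur[j - 1] - gap))
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_p_value(x, y, trials=200, seed=0):
    """Estimate how often a shuffled version of y scores at least as well as y."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = local_score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, randomized order
        if local_score(x, "".join(letters)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly 0
    return observed, (hits + 1) / (trials + 1)


observed, p = shuffle_p_value("ACGTTGCAATGCCGTA", "ACGTTGCAATGCCGTA")
print(observed, p)
```

With only a few hundred shuffles the smallest P-value this can report is about 1/trials, so very small P-values such as 10^-50 are obtained in practice by fitting the score distribution rather than by counting shuffles.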
Chp 4 - Database Similarity Searching
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 4
Database Similarity Searching
• Unique Requirements of Database Searching
• Heuristic Database Searching
• Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method

Exhaustive vs Heuristic Methods
Exhaustive - tests every possible solution
• guaranteed to give best answer
(identifies optimal solution)
• can be very time/space intensive!
• e.g., Dynamic Programming
as in Smith-Waterman algorithm
Heuristic - does NOT test every possibility
• no guarantee that answer is best
(but, often can identify optimal solution)
• sacrifices accuracy (potentially) for speed
• uses "rules of thumb" or "shortcuts"
• e.g., BLAST & FASTA

Today's Lab: focus on BLAST
Basic Local Alignment Search Tool
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters)
from query sequence
2. Search database to identify sequences that contain
matching words
3. Score match of word with sequence, using a substitution
matrix
4. Extend match (seed) in both directions, while calculating
alignment score at each step
5. Continue extension until score drops below a threshold
(due to mismatches)
High Scoring Segment Pair (HSP) - contiguous aligned
segment pair (no gaps)

Lab3: focus on BLAST
Basic Local Alignment Search Tool
BLAST - a few details
Developed by Stephen Altschul at NCBI in 1990
• Word length?
Typically:
3 aa for protein sequence
11 nt for DNA sequence
• Substitution matrix?
Default is BLOSUM62
Can change under Algorithm Parameters
Choose other BLOSUM or PAM matrices
• Stop-Extension Threshold?
Typically:
22 for proteins
20 for DNA

BLAST Results?
• Original version of BLAST?
List of HSPs = Maximum Scoring Pairs
• More recent, improved version of BLAST?
Allows gaps: Gapped Alignment
How? Allows score to drop below threshold,
(but only temporarily)
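Steps 1, 2, 4 and 5 of the BLAST outline can be illustrated with a toy seed-and-extend sketch. It is a deliberate simplification and not the real BLAST implementation: words are matched exactly rather than scored with a substitution matrix (step 3 is skipped), only extension to the right is shown, and the stopping rule (stop once the score falls more than `drop` below the best score seen) is an assumption made for the example. All function names are hypothetical.

```python
def words(sequence, w):
    """Step 1: every overlapping word of length w in the sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def find_seeds(query, subject, w):
    """Step 2: (query position, subject position) pairs where a word matches exactly."""
    index = {}
    for pos, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(pos)
    seeds = []
    for qpos, word in enumerate(words(query, w)):
        for spos in index.get(word, []):
            seeds.append((qpos, spos))
    return seeds


def extend_right(query, subject, qpos, spos, w, match=1, mismatch=-1, drop=2):
    """Steps 4-5: extend a seed to the right without gaps.

    Returns (best score, length of the best ungapped segment).
    """
    score = best = w * match  # the seed itself is w exact matches
    length = best_len = w
    while qpos + length < len(query) and spos + length < len(subject):
        same = query[qpos + length] == subject[spos + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop:
            break  # score has fallen too far below the best seen
    return best, best_len


query, subject = "ACGTA", "TTACGTT"
seeds = find_seeds(query, subject, w=3)
print(seeds)                                          # [(0, 2), (1, 3)]
print(extend_right(query, subject, *seeds[0], w=3))   # (4, 4)
```

Indexing the subject words in a dictionary is what makes the search fast compared with filling a full DP matrix: only positions that share a word with the query are ever examined.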
BLAST - Statistical Significance?
1. E-value: E = m × n × P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random
chance
lower E-value, less likely to result from
random chance, thus higher significance
2. Bit Score: S'
normalized score, to account for differences in
sequence length & size of database
3. Low Complexity Masking
remove repeats that confound scoring