BCB 444/544 Lecture 8 #8_Sept7 Finish:

BCB 444/544
Lecture 8
Finish: Dynamic Programming
Global vs Local Alignment
Scoring Matrices & Alignment Statistics
BLAST
#8_Sept7
BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST
9/7/07
1
Required Reading
(before lecture)
√Last week: - for Lectures 4-7
Pairwise Sequence Alignment, Dynamic Programming,
Global vs Local Alignment, Scoring Matrices, Statistics
• Xiong: Chp 3
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
√Wed Sept 5 - for Lecture 7 & Lab 3
Database Similarity Searching: BLAST (nope, more DP)
• Chp 4 - pp 51-62
Fri Sept 7 - for Lecture 8 (will finish on Monday)
BLAST variations; BLAST vs FASTA
• Chp 4 - pp 51-62
Assignments & Announcements
√Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM
Send via email to Pete Zaback petez@iastate.edu
(For now, no late penalty - just send ASAP)
√Wed Sept 5 - Notes for Lecture 5 posted online
- HW#2 posted online & sent via email
& handed out in class
Fri Sept 14
- HW#2 Due by 5 PM
Fri Sept 21
- Exam #1
Chp 3- Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• √Evolutionary Basis
• √Sequence Homology versus Sequence Similarity
• √Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Methods
• √Global and Local Alignment
• √Alignment Algorithms
• √Dot Matrix Method
• Dynamic Programming Method - cont
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
Dynamic Programming - 4 Steps:
1. Define score of optimal alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal
scores of subproblems, by solving smallest
subproblems first (bottom-up approach)
3. Calculate score of optimal alignment(s)
4. Trace back through matrix to recover optimal
alignment(s) that generated optimal score
1- Define Score of Optimal Alignment
using Recursion
Define:
$x_{1..i}$ = Prefix of length $i$ of $x$
$y_{1..j}$ = Prefix of length $j$ of $y$
$S(i,j)$ = Score of optimal alignment of $x_{1..i}$ and $y_{1..j}$
$\alpha$ = Match Reward
$\beta$ = Mismatch Penalty
$\gamma$ = Gap penalty
(penalties are written here as negative scores that are added, e.g., $\beta = -2$, $\gamma = -5$)

Initial conditions:
$$S(0,0) = 0, \qquad S(i,0) = i \cdot \gamma, \qquad S(0,j) = j \cdot \gamma$$

Recursive definition:
For $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i,y_j) \\ S(i-1,j) + \gamma \\ S(i,j-1) + \gamma \end{cases}$$
where $\sigma(x_i,y_j) = \alpha$ if $x_i = y_j$, or $\beta$ otherwise
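A minimal Python sketch of the recursion, assuming a linear gap score and treating penalties as negative numbers that are added. The function name and its default scores (+10 match, -2 mismatch, -5 space, the values used in the worked example of this lecture) are illustrative choices, not part of the original slides.

```python
def global_alignment_matrix(x, y, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment DP matrix.

    S[i][j] holds the best score for aligning the prefixes x[:i] and y[:j].
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]

    # Initial conditions: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap

    # Recursion: best of the diagonal, vertical and horizontal moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sigma,  # x_i aligned with y_j
                S[i - 1][j] + gap,        # x_i aligned with a gap
                S[i][j - 1] + gap,        # y_j aligned with a gap
            )
    return S


S = global_alignment_matrix("CATTCAC", "CTCGCAGC")
print(S[-1][-1])  # 33
```

The lower right cell, S(N,M), is the score of the optimal global alignment.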
2- Initialize & Fill in DP Matrix for
Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best
possible score for each alignment ending at residues at [i,j]
[Figure: DP matrix with rows 0..N and columns 0..M; S(0,0) = 0 in the upper left corner, S(i,j) in a general cell, S(N,M) in the lower right corner]
How do we calculate S(i,j)?
i.e., Score for alignment of x[1..i] to y[1..j]?
1 of 3 cases gives the optimal score for this subproblem
(σ = match or mismatch score, γ = gap penalty):

Case 1: xi aligns to yj
   x1 x2 . . . xi-1 xi
   y1 y2 . . . yj-1 yj
   Score: S(i-1,j-1) + σ(xi,yj)

Case 2: xi aligns to a gap
   x1 x2 . . . xi-1 xi
   y1 y2 . . . yj   -
   Score: S(i-1,j) + γ

Case 3: yj aligns to a gap
   x1 x2 . . . xi   -
   y1 y2 . . . yj-1 yj
   Score: S(i,j-1) + γ
Note: I changed sequences on this slide (to match the rest of DP example)
Specific Example:
Scoring Consequence?
Case 1: Line up xi with yj (here A with A): Match Bonus
Case 2: Line up xi with space (here A with -): Space Penalty
Case 3: Line up yj with space (here - with A): Space Penalty
[Figure: partial alignments of x and y for each case, with positions i-1, i and j-1, j marked]
Ready? Fill in DP Matrix
Keep track of dependencies of scores (in a pointer matrix)
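[Figure: DP matrix with rows 0..N and columns 0..M; cell S(i,j) receives arrows from S(i-1,j-1), S(i-1,j) and S(i,j-1); S(0,0) = 0 in the upper left corner, S(N,M) in the lower right corner]
$\alpha$ = Match Reward
$\beta$ = Mismatch Penalty
$\gamma$ = Gap penalty
(penalties are written here as negative scores that are added)
Initialization
$$S(0,0) = 0, \qquad S(i,0) = i \cdot \gamma, \qquad S(0,j) = j \cdot \gamma$$
Recursion
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i,y_j) \\ S(i-1,j) + \gamma \\ S(i,j-1) + \gamma \end{cases}$$
where $\sigma(x_i,y_j) = \alpha$ if $x_i = y_j$, or $\beta$ otherwise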
Fill in the DP matrix !!
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5
    A  -10
    T  -15
    T  -20
    C  -25
    A  -30
    C  -35
+10 for match, -2 for mismatch, -5 for space
3- Calculate Score S(N,M) of Optimal
Alignment - for Global Alignment
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space
4- Trace back through matrix to recover
optimal alignment(s) that generated
the optimal score
How? "Repeat" alignment calculations in reverse order,
starting from the position with the highest score and
following the path, position by position, back through
the matrix
Result? Optimal alignment(s) of sequences
Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
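The traceback rules can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name is hypothetical, the scores default to the example values (+10, -2, -5), and instead of storing a pointer matrix it re-checks which neighbouring cell could have produced each score, following every tie so that all optimal alignments are returned.

```python
def all_global_alignments(left, top, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, list of (top_row, left_row) alignments)."""
    n, m = len(left), len(top)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    results = []

    def walk(i, j, top_row, left_row):
        # Rows are built back to front, so reverse them at the end.
        if i == 0 and j == 0:
            results.append((top_row[::-1], left_row[::-1]))
            return
        if i > 0 and j > 0:
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:
                # Diagonal move: one character from each sequence.
                walk(i - 1, j - 1, top_row + top[j - 1], left_row + left[i - 1])
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            # Horizontal move: gap in the left sequence.
            walk(i, j - 1, top_row + top[j - 1], left_row + "-")
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            # Vertical move: gap in the top sequence.
            walk(i - 1, j, top_row + "-", left_row + left[i - 1])

    walk(n, m, "", "")
    return S[n][m], results


score, alignments = all_global_alignments("CATTCAC", "CTCGCAGC")
print(score)
for top_row, left_row in alignments:
    print(top_row)
    print(left_row)
    print()
```

Following every tie can produce many paths for long, repetitive sequences; practical tools usually report a single optimal alignment.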
Traceback to Recover Alignment
[Figure: the same DP matrix as on the previous matrix slide, with traceback arrows marking the optimal paths from 33 back to 0]
Can have >1 optimal alignment; this example has 2
Traceback to Recover Alignment
[Figure: the same DP matrix, with the traceback arrows shown in red]
Where did red arrows come from?
Traceback to Recover Alignment
[Figure: the same DP matrix, with traceback arrows]
+10 for match, -2 for mismatch, -5 for space
• Where did 33 come from? Match = 10, so 33-10 = 23
  Must have come from diagonal
• Where did 23 come from? (Not a match)
  Left? 28-5 = 23; Diag? 13-2 = 11; Top? 8-5 = 3
  Only the left cell gives 23, so it must have come from the left
Traceback to Recover Alignment
[Figure: the same DP matrix, with traceback arrows]
+10 for match, -2 for mismatch, -5 for space
• Where did 8 come from?
  Two possibilities: 13-5 = 8 (from left) or 10-2 = 8 (from diagonal)
• Then, follow both paths
Traceback to Recover Alignment
[Figure: the same DP matrix, with each cell on traceback path #1 labeled with the pair it aligns (top residue with left residue)]
Path #1, from lower right to upper left:
   33: C with C
   23: G with -
   28: A with A
   18: C with C
    8: G with -
   13: C with T
   15: T with T
    5: - with A
   10: C with C
Great - but what are the alignments? #1
Traceback to Recover Alignment
[Figure: the same DP matrix, with each cell on traceback path #2 labeled with the pair it aligns (top residue with left residue)]
Path #2, from lower right to upper left:
   33: C with C
   23: G with -
   28: A with A
   18: C with C
    8: G with T
   10: C with -
   15: T with T
    5: - with A
   10: C with C
Great - but what are the alignments? #2
What are the 2 Global Alignments
with Optimal Score = 33?
Top: C T C G C A G C
Left: C A T T C A C
1: C - T C G C A G C
2: C - T C G C A G C
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
What are the 2 Global Alignments
with Optimal Score = 33?
Top: C T C G C A G C
Left: C A T T C A C
1:
   Top:  C - T C G C A G C
   Left: C A T T - C A - C
2:
   Top:  C - T C G C A G C
   Left: C A T - T C A - C
Check the scores: +10 for match, -2 for mismatch, -5 for space
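Checking the scores by hand is easy to get wrong, so here is a small Python sketch that scores a finished alignment column by column. The function name is hypothetical; the scores are the ones used in this example.

```python
def alignment_score(top_row, left_row, match=10, mismatch=-2, gap=-5):
    """Sum the column scores of two aligned rows ('-' marks a space)."""
    if len(top_row) != len(left_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top_row, left_row):
        if a == "-" or b == "-":
            score += gap        # space in either sequence
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("C-TCGCAGC", "CATT-CA-C"))  # 33
print(alignment_score("C-TCGCAGC", "CAT-TCA-C"))  # 33
```

Both alignments have 5 matches, 1 mismatch and 3 spaces: 5(10) + 1(-2) + 3(-5) = 33.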
or, Check Traceback?
[Figure: the same DP matrix, with each step on the two optimal paths labeled d, v or h]
Moves from upper left to lower right:
   Path 1: d v d d h d d h d
   Path 2: d v d h d d d h d
• h = horizontal move puts a gap in left sequence
• v = vertical move puts a gap in top sequence
• d = diagonal move uses one character from each sequence
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
• Non-coding regions (if "non-functional") are more likely to
contain mutations than coding regions
• Local alignment between two protein-encoding sequences is
likely to be between two exons
• To locate protein domains or motifs:
• Proteins with similar structures and/or similar functions but
from different species (for example), often exhibit local
sequence similarities
• Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic
genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by
RNA processing & are not translated into protein
Local Alignment: Example
G G T C T G A G
A A A C G A
Match: +2
Mismatch or space: -1
Best local alignment:
G G T C T G A G
A A A C – G A -
Score = 5
Local Alignment: Algorithm
This slide has
been changed!
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix:
In local alignment, no negative scores
Assign "0" to cells with negative scores
3) Optimal score? in highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell
containing the optimal score, until a cell with "0" is
reached (not just from lower right corner)
Local Alignment DP:
Initialization & Recursion
New Slide
$$S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0$$
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i,y_j) \\ S(i-1,j) + \gamma \\ S(i,j-1) + \gamma \\ 0 \end{cases}$$
where $\sigma(x_i,y_j)$ = match or mismatch score and $\gamma$ = gap penalty (penalties are negative scores that are added)
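A minimal Python sketch of the local-alignment recursion. The function name and default scores are illustrative assumptions; the only changes from the global version are the zero-filled first row and column and the extra 0 inside the max.

```python
def local_alignment_matrix(x, y, match=1, mismatch=-1, gap=-5):
    """Fill the local-alignment DP matrix.

    Returns (S, best score, list of (i, j) cells holding the best score).
    """
    n, m = len(x), len(y)
    # Top row and leftmost column stay 0.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                0,                        # no negative scores
                S[i - 1][j - 1] + sigma,
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
    best = max(max(row) for row in S)
    cells = [(i, j) for i in range(n + 1) for j in range(m + 1)
             if S[i][j] == best and best > 0]
    return S, best, cells


# Same pair of sequences as the global example, scored +1 / -1 / -5.
S, best, cells = local_alignment_matrix("CATTCAC", "CTCGCAGC")
print(best, cells)
```

Traceback would start from each cell in `cells` and stop at the first cell containing 0.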
Filling in DP Matrix for Local Alignment
No negative scores - fill in "0"
         λ    C    T    C    G    C    A    G    C
    λ    0    0    0    0    0    0    0    0    0
    C    0    1    0    1    0    1    0    0    1
    A    0    0    0    0    0    0    2    0    0
    T    0    0    1    0    0    0    0    1    0
    T    0    0    1    0    0    0    0    0    0
    C    0    1    0    2    0    1    0    0    1
    A    0    0    0    0    1    0    2    0    0
    C    0    1    0    1    0    2    0    1    1
+1 for match, -1 for mismatch, -5 for space
Traceback - for Local Alignment
[Figure: the same local-alignment DP matrix; the four cells holding the optimal score 2 are numbered 1-4, each with its traceback path back to a 0]
   1: row A (2nd), column A (6th)
   2: row A (6th), column A (6th)
   3: row C (7th), column C (5th)
   4: row C (5th), column C (3rd)
+1 for match, -1 for mismatch, -5 for space
What are the 4 Local Alignments with
Optimal Score = 2?
Top:  C T C G C A G C
Left: C A T T C A C
[Figure: the two sequences written out four times (1-4), for marking the aligned region of each local alignment]
What are the 4 Local Alignments with
Optimal Score = 2?
Top:  C T C G C A G C
Left: C A T T C A C
1: Top C A (positions 5-6) with Left C A (positions 1-2)
2: Top C A (positions 5-6) with Left C A (positions 5-6)
3: Top T C G C (positions 2-5) with Left T C A C (positions 4-7)
4: Top T C (positions 2-3) with Left T C (positions 4-5)
Check the scores: +1 for match, -1 for mismatch, -5 for space
Some Results re: Alignment Algorithms
(for ComS, CprE & Math types)
• Most pairwise sequence alignment problems can be
solved in O(mn) time
• Space requirement can be reduced to O(m+n), while
keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn)
time, where d measures the distance between the
sequences [Landau86]
for Biologists: Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to
solve a problem
• expressed as a function of size of the problem
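To make the space claim concrete: if only the optimal score is needed, each row of the DP matrix depends only on the row above it, so two rows are enough. The Python sketch below shows that simple case only; the function name is hypothetical, and recovering the alignment itself in linear space requires a divide-and-conquer scheme that is not shown here.

```python
def global_score_two_rows(x, y, match=10, mismatch=-2, gap=-5):
    """Optimal global alignment score, keeping two rows instead of a full matrix."""
    prev = [j * gap for j in range(len(y) + 1)]   # row 0
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)           # first column of row i
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sigma,    # diagonal
                          prev[j] + gap,          # from above
                          curr[j - 1] + gap)      # from the left
        prev = curr
    return prev[-1]


print(global_score_two_rows("CATTCAC", "CTCGCAGC"))  # 33
```

Run time is still proportional to the number of cells, O(mn); only the memory use shrinks.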
Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties
used to reflect cost differences between opening a
gap and extending an existing gap
Total Gap Penalty is linear function of gap length:
$$W = \gamma + \delta \times (k - 1)$$
where
$\gamma$ = gap opening penalty
$\delta$ = gap extension penalty
$k$ = length of gap
Can also be solved in O(nm) time using DP
Sometimes, a Constant Gap Penalty is used, but it is usually
less realistic than the Affine Gap Penalty
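The affine formula is a one-liner in Python. The function name and the penalty values in the example (opening 11, extension 1) are assumptions chosen only for illustration.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Total penalty W for one gap of length k: opening + extension * (k - 1)."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


# Hypothetical penalties: opening = 11, extension = 1.
for k in (1, 2, 5):
    print(k, affine_gap_penalty(k, gap_open=11, gap_extend=1))
```

With these values a single gap of length 5 costs 15, while five separate gaps of length 1 cost 55, which is why affine penalties favor a few long gaps over many short ones.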
Methods
• √Global and Local Alignment
• √Alignment Algorithms
• √Dot Matrix Method
• √Dynamic Programming Method - cont
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed
differences in closely related proteins
• Model includes defined rate for each type of
sequence change
• Suffix number (n) reflects amount of "time"
passed: rate of expected mutation if n% of amino
acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)
BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of
conserved sequences within evolutionarily divergent
proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity:
average % aa identity in the MSA from which the
matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences
PAM250 vs BLOSUM 62
See Text
Fig 3.5 = PAM250
Fig 3.6= BLOSUM62
Usually only 1/2 of matrix is
displayed (it is symmetric)
Here:
s(a,b) corresponds to score of
aligning character a with
character b
Which is Better?
PAM or BLOSUM
• PAM matrices
• derived from evolutionary model
• often used in reconstructing phylogenetic trees - but, not
very good for highly divergent sequences
• BLOSUM matrices
• based on direct observations
• more "realistic" - and outperform PAM matrices in terms of
accuracy in local alignment
Which Type of Matrix Should
You Use?
Several other types of matrices available:
• Gonnet & Jones-Taylor-Thornton:
• very robust in tree construction
• "Best" matrix depends on task:
• different matrices for different applications
ADVICE: if unsure, try several different matrices
& choose the one that gives best alignment result
Sequence Alignment Statistics
• Distribution of similarity scores in sequence alignment
is not a simple "normal" distribution
• "Gumbel extreme value distribution" - a highly skewed
distribution with a long tail
How to Assess Statistical Significance
of an Alignment?
• Compare score of an alignment with distribution of scores
of alignments for many 'randomized' (shuffled) versions of
the original sequence
• If score is in extreme margin, then unlikely due to random
chance
• P-value = probability of getting a score at least as high as
that of the original alignment by random chance (lower P is better)
P = 10^-5 to 10^-50: sequences have clear homology
P > 10^-1: no better than random
Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html
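The shuffling idea can be sketched in Python as below. This is an illustration under stated assumptions, not the PRSS implementation: the function names, the scoring values (+1, -1, -2) and the number of shuffles are all hypothetical, and the P-value is a plain empirical fraction rather than a fit to an extreme value distribution.

```python
import random


def local_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (score only, two rows of the DP matrix)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sigma, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Estimate how often a shuffled y scores at least as well as the real y."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    observed = local_score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # keeps composition, destroys order
        if local_score(x, "".join(letters)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly 0.
    return observed, (hits + 1) / (n_shuffles + 1)


observed, p = shuffle_p_value("ACGTACGTACGT", "ACGTACGTACGT")
print(observed, p)
```

With only a few hundred shuffles the smallest P-value this can report is about 1/(n_shuffles + 1), so very small P-values need a fitted distribution instead of raw counts.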
Chp 4- Database Similarity Searching
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 4
Database Similarity Searching
• Unique Requirements of Database Searching
• Heuristic Database Searching
• Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method
Exhaustive vs Heuristic Methods
Exhaustive - tests every possible solution
• guaranteed to give best answer
(identifies optimal solution)
• can be very time/space intensive!
• e.g., Dynamic Programming
as in Smith-Waterman algorithm
Heuristic - does NOT test every possibility
• no guarantee that answer is best
(but, often can identify optimal solution)
• sacrifices accuracy (potentially) for speed
• uses "rules of thumb" or "shortcuts"
• e.g., BLAST & FASTA
Today's Lab: focus on BLAST
Basic Local Alignment Search
Tool
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters)
from query sequence
2. Search database to identify sequences that contain
matching words
3. Score match of word with sequence, using a substitution
matrix
4. Extend match (seed) in both directions, while calculating
alignment score at each step
5. Continue extension until score drops below a threshold
(due to mismatches)
High Scoring Segment Pair (HSP) - contiguous aligned
segment pair (no gaps)
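The seed-and-extend steps above can be sketched in Python. This is a simplified illustration, not NCBI's code: all function names are hypothetical, words must match exactly (rather than being scored against a substitution matrix), scoring is a plain +1/-1, and `max_drop` is an assumed stand-in for the stop-extension threshold.

```python
def query_words(query, w=3):
    """Step 1: every overlapping word of length w, with its start position."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def find_seeds(query, subject, w=3):
    """Step 2: (query position, subject position) of exactly matching words."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j) for i, word in query_words(query, w) for j in index.get(word, [])]


def _best_extension(pairs, match, mismatch, max_drop):
    """Walk along residue pairs; stop once the score falls max_drop below its best."""
    score = best = best_len = 0
    for k, (a, b) in enumerate(pairs, start=1):
        score += match if a == b else mismatch
        if score > best:
            best, best_len = score, k
        elif best - score > max_drop:
            break
    return best, best_len


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, max_drop=2):
    """Steps 4-5: ungapped extension of a seed in both directions.

    Returns (score, query start, subject start, length) of the segment pair.
    """
    right = zip(query[i + w:], subject[j + w:])
    left = zip(reversed(query[:i]), reversed(subject[:j]))
    r_gain, r_len = _best_extension(right, match, mismatch, max_drop)
    l_gain, l_len = _best_extension(left, match, mismatch, max_drop)
    return w * match + l_gain + r_gain, i - l_len, j - l_len, w + l_len + r_len


seeds = find_seeds("ACGT", "TTACGTT")
print(seeds)
print(extend_seed("ACGT", "TTACGTT", *seeds[0]))
```

Building an index of the subject's words first is what makes the search fast: each query word is looked up directly instead of being compared against every position.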
Lab3: focus on BLAST
Basic Local Alignment
Search Tool
BLAST Results?
• Original version of BLAST?
List of HSPs = Maximum Scoring Pairs
• More recent, improved version of BLAST?
Allows gaps: Gapped Alignment
How? Allows score to drop below threshold,
(but only temporarily)
BLAST - a few details
Developed by Stephen Altschul at NCBI in 1990
• Word length?
   • Typically:
     3 aa for protein sequence
     11 nt for DNA sequence
• Substitution matrix?
   • Default is BLOSUM62
   • Can change under Algorithm Parameters
   • Choose other BLOSUM or PAM matrices
• Stop-Extension Threshold?
   • Typically:
     22 for proteins
     20 for DNA
BLAST - Statistical Significance?
1. E-value: E = m × n × P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random
chance
lower E-value, less likely to result from
random chance, thus higher significance
2. Bit Score: S'
normalized score, to account for differences in
sequence length & size of database
3. Low Complexity Masking
remove repeats that confound scoring
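As a quick worked example of the E-value formula on this slide, here is a Python sketch. The function name and all of the numbers are hypothetical, chosen only to show how database size affects E for the same P.

```python
def e_value(db_residues, query_residues, p):
    """E = m x n x P, following the formula on this slide."""
    return db_residues * query_residues * p


# Hypothetical numbers: same query and same P, two database sizes.
small_db = e_value(db_residues=1_000_000, query_residues=100, p=1e-10)
large_db = e_value(db_residues=1_000_000_000, query_residues=100, p=1e-10)
print(small_db, large_db)
```

The same HSP is less significant (higher E) in the larger database, because more chance matches are expected when there is more sequence to search.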