#10 BLAST Details + some Gene 9/12/07 Jargon BCB 444/544

BCB 444/544
Lecture 10
BLAST Details
Plus some Gene Jargon
#10_Sept12

Required Reading (before lecture)
√ Mon Sept 10 - for Lecture 9
BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62
√ Wed Sept 12 - for Lecture 10 & Lab 4
Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 11
Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)
• Good Additional Resource re: Sequence Alignment?
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment
BCB 444/544 F07 ISU Dobbs #10 - BLAST details + some Gene Jargon
9/12/07

Assignments & Announcements - #1
Revised Grading Policy has been sent via email
Please review!
√ Mon Sept 10 - Lab 3 Exercise due 5 PM:
to: terrible@iastate.edu
Thu Sept 13 - Graded Labs 2 & 3
will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM

Review: Gene Jargon #1
(for HW2, 1c)
Exons = "protein-encoding" (or "kept" parts) of eukaryotic genes
vs
Introns = "intervening sequences"
= segments of eukaryotic genes that "interrupt" exons
• Introns are transcribed into pre-RNA
• but are later removed by RNA processing
• & do not appear in mature mRNA
• so are not translated into protein

Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2
will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
Chps 2-6 (but not HMMs)
Eddy: What is Dynamic Programming

SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment

Chp 3- Sequence Alignment
√ Evolutionary Basis
√ Sequence Homology versus Sequence Similarity
√ Sequence Similarity versus Sequence Identity
√ Methods - (Dot Plots, DP; Global vs Local Alignment)
√ Scoring Matrices (PAM vs BLOSUM)
√ Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Local Alignment: Algorithm
(This slide has been changed!)
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix:
In local alignment, no negative scores
Assign "0" to cells with negative scores
3) Optimal score? in highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell
containing the optimal score, until a cell with "0" is
reached (not just from lower right corner)

Local Alignment DP:
Initialization & Recursion
(New Slide)
Initialization:
$$ S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0 $$
Recursion (the substitution score and gap penalty symbols were lost in extraction; they are written here as σ and γ):
$$
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + \sigma(x_i, y_j) \\
S(i-1,\,j) - \gamma \\
S(i,\,j-1) - \gamma \\
0
\end{cases}
$$
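
The four steps of the local alignment algorithm can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the function names and the simple match/mismatch scoring (+1 / -1) with a linear gap penalty of 2 are assumptions chosen for readability, standing in for a real substitution matrix.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=2):
    """Fill the local-alignment DP matrix; return best score, best cells, matrix.

    Scoring values are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(x) + 1, len(y) + 1
    # Step 1: top row and leftmost column are initialized with 0.
    S = [[0] * cols for _ in range(rows)]
    best, best_cells = 0, []
    # Step 2: fill in the matrix; negative scores are replaced by 0.
    for i in range(1, rows):
        for j in range(1, cols):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap,
                          0)
            # Step 3: the optimal score is in the highest scoring cell(s).
            if S[i][j] > best:
                best, best_cells = S[i][j], [(i, j)]
            elif S[i][j] == best and best > 0:
                best_cells.append((i, j))
    return best, best_cells, S


def traceback(x, y, S, i, j, match=1, mismatch=-1, gap=2):
    """Step 4: walk back from cell (i, j) until a cell containing 0 is reached."""
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        sigma = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sigma:
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] - gap:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


best, cells, S = smith_waterman("ACGTT", "TTACGAA")
print(best, cells, traceback("ACGTT", "TTACGAA", S, *cells[0]))
```

Note that the traceback starts from the highest scoring cell, wherever it is in the matrix, rather than from the lower right corner as in global alignment.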

Calculating an Alignment Score using
a Substitution Matrix &
an Affine Gap Penalty
• Alignment score is sum of all match/mismatch
scores (from substitution matrix) with an affine
penalty subtracted for each gap

Alignment:                  a  b  c  -  -  d
                            a  c  c  e  f  d
Match scores:               9  2  7        6   => 24   (values from substitution matrix)
Gap opening + extension:    (10 + 2) = 12
Alignment score:            24 - 12 = 12

A Few Words about Parameter Selection
in Sequence Alignment
Optimal alignment between a pair of sequences depends critically
on the selection of substitution matrix &
gap penalty function
(the substitution score and gap penalty symbols were lost in extraction; they are written here as σ and γ)
$$
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + \sigma(x_i, y_j) \\
S(i-1,\,j) - \gamma \\
S(i,\,j-1) - \gamma
\end{cases}
$$
In using BLAST or similar software, it is important to understand and,
sometimes, to adjust these parameters (default is NOT always best!)
How do we pick parameters that give the most biologically
meaningful alignments and alignment scores?
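
The affine gap score from the worked example (match scores 9 + 2 + 7 + 6, one gap of length 2 costing 10 + 2) can be reproduced with a short Python sketch. The function name and the small `PAIR_SCORES` table are hypothetical; the table holds only the four pair scores shown on the slide, not a real substitution matrix.

```python
# Pair scores taken from the slide's example only (not a full substitution matrix).
PAIR_SCORES = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}


def affine_alignment_score(top, bottom, gap_open=10, gap_extend=2):
    """Score an existing gapped alignment.

    Convention assumed here (it matches the slide's 10 + 2 = 12): the first
    position of a gap costs gap_open, each further position costs gap_extend.
    """
    total = 0
    open_side = None  # which sequence currently carries a gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            total -= gap_extend if side == open_side else gap_open
            open_side = side
        else:
            total += PAIR_SCORES[tuple(sorted((a, b)))]
            open_side = None
    return total


print(affine_alignment_score("abc--d", "accefd"))
```

Other programs define the opening cost so that a gap of length k costs open + k × extend; which convention is in use matters when comparing parameter settings between tools.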

SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 4
Database Similarity Searching

Chp 4- Database Similarity Searching
• Unique Requirements of Database Searching
• Heuristic Database Searching
• Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method

Database searching
[Figure: Query Sequence + Sequence database -> Sequence comparison algorithm -> Target sequences ranked by score]

Why search a database?
• Given a newly discovered gene,
• Does it occur in other species?
• Is its function known in another species?
• Given a newly sequenced genome, which regions align
with genomes of other organisms?
• Identification of potential genes
• Identification of other functional parts of chromosomes
• Find members of a multigene family

SECTION II
Recall: There are 3 Basic Types of
Alignment Algorithms?
Xiong: Chp 3
1) Dot Matrix
2) Dynamic Programming
Xiong: Chp 4
3) Word or k-tuple methods
(BLAST & FASTA)
Wikipedia:
Word methods, also known as k-tuple methods, are heuristic methods
that are not guaranteed to find an optimal alignment solution, but are
significantly more efficient than dynamic programming.

Why do we Need Fast Search Algorithms?
• DP for pairwise alignment is O(NM)
• Searching in a database is O(NMK)
→ Need faster algorithms for searching in large
databases!
• Your query is 200 amino acids long (N)
• You are searching a non-redundant database, which
currently contains >10^6 proteins (K)
• If proteins in database have avg length 200 aa (M), then:
→ Must fill in 200 × 200 × 10^6 = 4 × 10^10 DP entries!!
• 4 × 10^10 operations just to fill in the DP matrix!
• can be very time/space intensive!

Exhaustive vs Heuristic Methods
Exhaustive - tests every possible solution
• guaranteed to give best answer
(identifies optimal solution)
• e.g., Dynamic Programming
(as in Smith-Waterman algorithm)
Heuristic - does NOT test every possibility
• no guarantee that answer is best
(but, often can identify optimal solution)
• sacrifices accuracy (potentially) for speed
• uses "rules of thumb" or "shortcuts"
• e.g., BLAST & FASTA

FASTA vs BLAST
• Both FASTA, BLAST are based on heuristics
• Tradeoff:
Sensitivity vs Speed
• DP is slower, but more sensitive
• FASTA
• user defines value for k = word length
• Slower, but more sensitive than BLAST at lower values of k,
(preferred for searches involving a very short query sequence)
• BLAST family
• Family of different algorithms optimized for particular types of
queries, such as searching for distantly related sequence matches
• BLAST was developed to provide a faster alternative to FASTA
without sacrificing much accuracy

Lab3: focus on BLAST
Basic Local Alignment
Search Tool
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters)
from query sequence
2. Search database to identify sequences that contain
matching words
3. Score match of word with sequence, using a substitution matrix
4. Extend match (seed) in both directions, while calculating alignment
score at each step
5. Continue extension until score drops below a threshold (due to
mismatches)
High Scoring Segment Pair (HSP) - contiguous aligned
segment pair (no gaps)
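
The cost estimate from the "Why do we Need Fast Search Algorithms?" slide (N × M × K cells for a DP search of a whole database) can be checked with a few lines of Python. The function name and the rate of 10^8 cells per second are hypothetical, chosen only to turn the cell count into a rough time.

```python
def dp_cells(query_len, avg_target_len, num_targets):
    """Number of DP matrix entries for an exhaustive database search: N * M * K."""
    return query_len * avg_target_len * num_targets


cells = dp_cells(query_len=200, avg_target_len=200, num_targets=10**6)
print(f"{cells:.1e} DP entries")

# Assumed throughput, for illustration only.
cells_per_second = 10**8
print(f"about {cells / cells_per_second:.0f} s per query at that rate")
```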

What are the Results of a BLAST Search?
Original version of BLAST?
• List of HSPs called Maximum Scoring Pairs
More recent, improved version of BLAST?
• Allows gaps: Gapped Alignment
• How? Allows score to drop below threshold,
(but only temporarily)

Why is Gapped Alignment Harder?
Without gaps, there are N+M-1 possible alignments between
sequences of length N and M
Once we start allowing gaps, there are many more possible
arrangements to consider:
abcbcd    abcbcd    abcbcd
|||  |    |  |||    ||  ||
abc--d    a--bcd    ab--cd
Becomes a very large number when we also allow mismatches,
because we need to look at every possible pairing between elements:
Roughly N^M possible alignments!
e.g.: for N=M=100, there are 100^100 = 10^200 possible alignments
& 100 aa is a small protein!

BLAST algorithms can generate both
"global" and "local" alignments
[Figure: Global alignment vs Local alignment]

BLAST - a few details
Developed by Stephen Altschul at NCBI in 1990
Word length?
• Typically:
3 aa for protein sequence
11 nt for DNA sequence
Substitution matrix?
• Default is BLOSUM62
• Can change under Algorithm Parameters
• Can choose other BLOSUM or PAM matrices
• Change other parameters here, too
Stop-Extension Threshold?
• Typically:
22 for proteins
20 for DNA

BLAST - Statistical Significance?
1. E-value: E = m × n × P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random chance
lower E-value, less likely to result from
random chance, thus higher significance
2. Bit Score: S'
normalized score, to account for differences in size of
database (m) & sequence length (n) - more later
3. Low Complexity Masking
remove repeats that confound scoring - more sooner

BLAST - a Family of Programs:
Different BLAST "flavors"
• BLASTP - protein sequence query against protein DB
• BLASTN - DNA/RNA seq query against DNA DB (GenBank)
• BLASTX - 6-frame translated DNA seq query against protein DB
• TBLASTN - protein query against 6-frame DNA translation
• TBLASTX - 6-frame DNA query to 6-frame DNA translation
• PSI-BLAST - protein "profile" query against protein DB
• PHI-BLAST - protein pattern against protein DB
• Newest: MEGA-BLAST - optimized for highly similar sequences
Which tool should you use?
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml

Review: Gene Jargon #2.1
6-Frame translated DNA Sequence?
Remember GeneBoy exercise?

Review: Gene Jargon #2.2
6-Frame translated DNA Sequence?
Try NCBI tools:
http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi
http://www.ncbi.nlm.nih.gov/
Or - for some Biology review re: DNA/RNA & ORFs,
see next 3 slides borrowed from EMBL-EBI:
http://www.ebi.ac.uk/

Review: Gene Jargon #2.3
DNA Strands
[Figure from http://www.ebi.ac.uk/]

Review: Gene Jargon #2.4
RNA Strands - copied from DNA
[Figure from http://www.ebi.ac.uk/]

Review: Gene Jargon #2.5
Reading Frames
[Figure from http://www.ebi.ac.uk/]
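
As a reminder of what "6-frame translated" means for BLASTX, TBLASTN and TBLASTX, here is a minimal Python sketch: three reading frames on the given strand plus three on its reverse complement. The function names and frame labels (+1 to -3) are assumptions for illustration; the codon table is the standard genetic code, and incomplete trailing codons are simply dropped.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Map frame label ('+1'..'+3', '-1'..'-3') to its protein translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Stop codons are shown as `*`; an ORF finder would look for long stretches without `*` in any of the six frames.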

BLAST - How does it work?
Main idea - based on dot plots!
[Figure: dot plot of GATCAACTGACGTA (horizontal) against GTTCAGCTGCGTAC (vertical)]

Dot Plots - apply in BLAST:
[Figure: dot plot of GATCAACTGACGTA (horizontal) against GTTCAGCTGCGTAC (vertical)]
Here, use 4-base "window"
75% identity (allow mismatches)
Perform fast, approximate
local alignments to find
sequences in database that
are related to query sequence

Detailed Steps in BLAST algorithm
1. Remove low-complexity regions (LCRs)
2. Make a list (dictionary): all words of length 3 aa or 11 nt
3. Augment list to include similar words
4. Store list in a search tree (data structure)
5. Scan database for occurrences of words in search tree
6. Connect nearby occurrences
7. Extend matches (words) in both directions
8. Prune list of matches using a score threshold
9. Evaluate significance of each remaining match
10. Perform Smith-Waterman to get alignment
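
The windowed dot plot idea (4-base window, 75% identity, so at least 3 of 4 positions must match) can be sketched in Python using the two sequences from the slide. The function name and the (row, column) coordinate convention are assumptions for illustration.

```python
def dot_plot(seq_x, seq_y, window=4, min_identity=0.75):
    """Return (i, j) pairs where seq_y[i:i+window] and seq_x[j:j+window]
    match in at least min_identity of their positions."""
    needed = window * min_identity
    dots = []
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq_y[i:i + window], seq_x[j:j + window]))
            if matches >= needed:
                dots.append((i, j))
    return dots


horizontal = "GATCAACTGACGTA"
vertical = "GTTCAGCTGCGTAC"
dots = dot_plot(horizontal, vertical)

# Crude text rendering: '*' marks a window start that passes the filter.
for i in range(len(vertical) - 3):
    print(vertical[i], "".join("*" if (i, j) in dots else "."
                               for j in range(len(horizontal) - 3)))
```

Runs of `*` along a diagonal correspond to the locally similar regions that BLAST tries to find without filling in a full DP matrix.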

1: Filter low-complexity regions
(LCRs)
(This slide has been changed!)
• Low complexity regions,
transmembrane regions and
coiled-coil regions often display
significant similarity without
homology.
• Low complexity sequences can
yield false positives.
• Screen them out of your query
sequences! When appropriate!
K = computational complexity;
varies from 0 (very low complexity)
to 1 (high complexity)
$$ K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right) $$
L = window length (usually 12)
N = alphabet size (4 or 20)
n_i = frequency of ith letter in the window
e.g., for GGGG:
L! = 4! = 4×3×2×1 = 24
n_G = 4, n_T = n_A = n_C = 0
Π n_i! = 4! × 0! × 0! × 0! = 24
K = 1/4 log_4 (24/24) = 0
For CGTA: K = 1/4 log_4 (24/1) = 0.57

2: List all words in query
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG
GGF
GFM
FMT
MTS
TSE
SEK
…

3: Augment word list
Query words: YGG, GGF, GFM, FMT, MTS, TSE, SEK, …
Candidate words: AAA, AAB, AAC, …, YYY
20^3 = 8000 possible matches

3: Augment word list
BLOSUM62 scores
G G F
A A A
0 + 0 + -2 = -2   Non-match
G G F
G G Y
6 + 6 + 3 = 15   Match
A user-specified threshold, T, determines which 3-letter
words are considered matches and non-matches

3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Query words: YGG, GGF, GFM, FMT, MTS, TSE, SEK, …
GGF is augmented with: GGI, GGL, GGM, GGF, GGW, GGY, …

3: Augment word list
Observation:
Selecting only words with score > T greatly reduces
number of possible matches
otherwise, 20^3 for 3-letter words from amino acid sequences!
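
The complexity measure K used in step 1 (filtering low-complexity regions) is easy to evaluate directly. The sketch below is one reading of the formula K = (1/L) log_N (L! / Π n_i!), with the function name and default alphabet size chosen for illustration; it reproduces the two worked examples, GGGG and CGTA.

```python
from collections import Counter
from math import factorial, log, prod


def complexity(window, alphabet_size=4):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for one sequence window.

    L is the window length, N the alphabet size (4 for DNA, 20 for protein),
    n_i the count of the i-th letter in the window (absent letters give 0! = 1).
    """
    L = len(window)
    denominator = prod(factorial(n) for n in Counter(window).values())
    return log(factorial(L) / denominator, alphabet_size) / L


print(complexity("GGGG"))            # a homopolymer has the lowest complexity
print(round(complexity("CGTA"), 2))  # all four letters different
```

A filter would slide this over the query (window length 12 is quoted as typical) and mask windows whose K falls below some cutoff; the cutoff itself is not given in these slides.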

Example
Find all words that match EAM with a score greater
than or equal to 11

BLOSUM62
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Scores against EAM:
EAM   5 + 4 + 5 = 14
DAM   2 + 4 + 5 = 11
QAM   2 + 4 + 5 = 11
ESM   5 + 1 + 5 = 11
EAL   5 + 4 + 2 = 11

4: Store words in search tree
Augmented list of query words -> Search tree
"Does this query contain GGF?"
"Yes, at position 2."
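
Steps 2 and 3 (list the query words, then augment each with similar words scoring at or above T) can be sketched in Python for the EAM example. To keep the block short, only the three BLOSUM62 rows needed to score candidates against E, A and M are included; the function names are hypothetical, and a real implementation would carry the full matrix and avoid enumerating all 20^3 candidates.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the rows below

# BLOSUM62 rows for E, A and M only (enough to score any word against "EAM").
BLOSUM62_ROWS = {
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "M": [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
}


def query_words(query, w=3):
    """Step 2: every overlapping word of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(query_word, candidate):
    """Sum of substitution scores, position by position."""
    return sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
               for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """Step 3: all candidate words scoring >= threshold against query_word."""
    return sorted("".join(c)
                  for c in product(AMINO_ACIDS, repeat=len(query_word))
                  if word_score(query_word, "".join(c)) >= threshold)


print(query_words("YGGFMTSEK"))
print(neighborhood("EAM", 11))
```

With a threshold of 11 this returns the five words scored above (DAM, EAL, EAM, ESM, QAM). Lowering the threshold to 10 returns 12 words, the same EAM-derived words that appear in the search-tree example word list, which shows how quickly the list grows as T drops.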

Example
Put this word list into a search tree
GGF GGL GGM GGW GGY
DAM QAM EAM KAM
ECM EGM ESM ETM EVM
EAI EAL EAV
[Figure: search tree built from the word list. The first level branches on the first letter (G, D, Q, E, K); G-G leads to leaves F, L, M, W, Y; D, Q and K each lead to A then M; E leads to A (leaves M, I, L, V) and to C, G, S, T, V (each followed by M).]

5: Scan the database sequences
[Figure: dot plot with Query sequence on one axis and Database sequence on the other; each dot marks a word hit]

Example
Scan this "database" for occurrences of your words
MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA
[Figure labels: words E-A-M, P-Q-L, S-V, D-A-M marked on the sequence]
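
Steps 4 and 5 (store the augmented word list in a search tree, then scan the database for occurrences) can be sketched with a nested-dictionary trie. The function names and the `END` marker are hypothetical, and all words are assumed to have the same length w; the word list and the database sequence are the ones from the example slides.

```python
END = "$"  # marker key storing the complete word at a leaf


def build_trie(words):
    """Step 4: nested dicts, one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = word
    return root


def scan(database, trie, w=3):
    """Step 5: report (word, position) for every window found in the trie."""
    hits = []
    for pos in range(len(database) - w + 1):
        node = trie
        for letter in database[pos:pos + w]:
            node = node.get(letter)
            if node is None:
                break  # this window is not in the word list
        else:
            if END in node:
                hits.append((node[END], pos))
    return hits


WORDS = ["GGF", "GGL", "GGM", "GGW", "GGY", "DAM", "QAM", "EAM", "KAM",
         "ECM", "EGM", "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"]
DATABASE = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQ"
            "VQHGSNVNIHRLVEGNVKAMENA")

for word, pos in scan(DATABASE, build_trie(WORDS)):
    print(word, pos)  # positions are 0-based
```

Each database window costs at most w dictionary lookups, independent of how many words are in the list, which is the point of using a tree instead of comparing against every word.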

6: Connect nearby occurrences
(diagonal matches in Gapped BLAST)
[Figure: dot plot, Query sequence vs DB, from the Scan step]
Two dots are connected
IFF they are less
than A letters apart &
are on diagonal

7: Extend matches in both directions
[Figure: dot plot, Query sequence vs Database sequence, with hits extended along diagonals]

7: Extend matches,
calculating score at each step
Query sequence:      L P P Q G L L
Database sequence:   M P P E G L L
BLOSUM62 scores:     2 7 7 2 6 4 4
<word> = P Q G / P E G: 7 + 2 + 6, word score = 15
<-----> extended match: 2 + 7 + 7 + 2 + 6 + 4 + 4, HSP SCORE = 32
(High Scoring Pair)
• Each match is extended to left & right until a
negative BLOSUM62 score is encountered
• Extension step typically accounts for > 90% of
execution time

8: Prune matches
• Discard all matches that score below defined
threshold
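
The extension in step 7 can be sketched in Python using the slide's example (word PQG/PEG, score 15, extended to an HSP of score 32). The sketch follows the simple stopping rule stated on the slide, extending until a negative pair score is met; the function names are hypothetical, and `PAIR_SCORES` contains only the BLOSUM62 entries needed here (G/W = -2 is included to show a stop).

```python
# BLOSUM62 entries needed for this example only, keyed by sorted residue pair.
PAIR_SCORES = {("L", "M"): 2, ("P", "P"): 7, ("E", "Q"): 2,
               ("G", "G"): 6, ("L", "L"): 4, ("G", "W"): -2}


def pair_score(a, b):
    return PAIR_SCORES[tuple(sorted((a, b)))]


def extend_hit(query, db, q_start, d_start, w=3):
    """Ungapped extension of a word hit; returns (score, query_start, query_end)."""
    total = sum(pair_score(query[q_start + k], db[d_start + k]) for k in range(w))
    left = 0
    while q_start - left - 1 >= 0 and d_start - left - 1 >= 0:
        s = pair_score(query[q_start - left - 1], db[d_start - left - 1])
        if s < 0:
            break  # stop at the first negative score on the left
        total += s
        left += 1
    right = w
    while q_start + right < len(query) and d_start + right < len(db):
        s = pair_score(query[q_start + right], db[d_start + right])
        if s < 0:
            break  # stop at the first negative score on the right
        total += s
        right += 1
    return total, q_start - left, q_start + right


print(extend_hit("LPPQGLL", "MPPEGLL", 2, 2))
print(extend_hit("WLPPQGLL", "GMPPEGLL", 3, 3))
```

Step 8 then amounts to keeping only the extended matches whose score reaches the chosen cutoff.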

9: Evaluate significance
(This slide has been changed!)
• BLAST uses an analytical statistical significance
calculation
RECALL:
1. E-value: E = m × n × P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random chance
lower E-value, less likely to result from random chance,
thus higher significance
2. Bit Score: S' =
normalized score, to account for differences in size of database (m) & sequence
length (n); Note (below) that bit score is linearly related to raw alignment
score, so: higher S' means alignment has higher significance
S' = (λ × S - ln K) / ln 2 where:
λ = Gumbel distribution constant
S = raw alignment score
K = constant associated with scoring matrix
For more details - see text & BLAST tutorial

10: Use Smith-Waterman algorithm
(DP) to generate alignment
• ONLY significant matches are re-analyzed using
Smith-Waterman DP algorithm.
• Alignments reported by BLAST are produced by
dynamic programming
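
The bit score formula S' = (λ × S - ln K) / ln 2 can be tried out numerically with the sketch below. Two things are assumptions rather than content from the slides: the function names, and the step from bit score to E-value, which uses the usual Karlin-Altschul form E = m × n × 2^(-S') (that is, P is approximated by 2^(-S')). The values of λ, K and the database size are illustrative placeholders; real values depend on the scoring matrix, gap penalties and database.

```python
from math import log


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - log(K)) / log(2)


def e_value(bits, m, n):
    """Assumed Karlin-Altschul form: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameter values only (assumptions, not taken from the slides).
lam, K = 0.267, 0.041
m, n = 10**9, 200  # residues in database, residues in query

for raw in (32, 60, 100):
    bits = bit_score(raw, lam, K)
    print(raw, round(bits, 1), f"{e_value(bits, m, n):.2e}")
```

Because S' is linear in S, every additional bit halves the E-value for a fixed search space m × n.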

BLAST: What is a "Hit"?
• A hit is a w-length word in database that aligns with a
word from query sequence with score > T
• BLAST looks for hits instead of exact matches
• Allows word size to be kept larger for speed, without sacrificing
sensitivity
• Typically, w = 3-5 for amino acids,
w = 11-12 for DNA
• T is the most critical parameter:
• ↑T ⇒ ↓ "background" hits (faster)
• ↓T ⇒ ↑ ability to detect more distant relationships
(at cost of increased noise)

Tips for BLAST Similarity Searches
• If you don't know, use default parameters first
• Try several programs & several parameter settings
• If possible, search on protein sequence level
• Scoring matrices:
PAM1 / BLOSUM80: if expect/want less divergent proteins
PAM120 / BLOSUM62: "average" proteins
PAM250 / BLOSUM45: if need to find more divergent proteins
• Proteins:
>25-30% identity (and >100aa) -> likely related
15-25% identity -> twilight zone
<15% identity -> likely unrelated

Practical Issues
Searching on DNA or protein level?
In general,
protein-encoding DNA should be translated!
• DNA yields more random matches:
• 25% for DNA vs. 5% for proteins
• DNA databases are larger and grow faster
• Selection (generally) acts on protein level
• Synonymous mutations are usually neutral
• DNA sequence similarity decays faster

BLAST vs FASTA
• Seeding:
• BLAST integrates scoring matrix into first phase
• FASTA requires exact matches (uses hashing)
• BLAST increases search speed by finding fewer, but
better, words during initial screening phase
• FASTA uses shorter word sizes - so can be more
sensitive
• Results:
• BLAST can return multiple best scoring alignments
• FASTA returns only one final alignment

BLAST & FASTA References
• FASTA - developed first
• Pearson & Lipman (1988) Improved Tools for Biological
Sequence Comparison. PNAS 85:2444-2448
• BLAST
• Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)
• Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman
(1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res.
25:3389-402

BLAST Notes - & DP Alternatives
• BLAST uses heuristics: it may miss some good matches
• But, it's fast: 50 - 100X faster than Smith-Waterman (SW) DP
• Large impact:
• NCBI's BLAST server handles more than 100,000 queries/day
• Most used bioinformatics program in the world!
→ But - Xiong says: "It has been estimated that for some families of
protein sequences BLAST can miss 30% of truly significant matches."
• Increased availability of parallel processing has made DP-based
approaches feasible:
• 2 DP-based web servers: both more sensitive than BLAST
• Scan Protein Sequence: http://www.ebi.ac.uk/scanps/index.html
Implements modified SW optimized for parallel processing
• ParAlign www.paralign.org - parallel SW or heuristics

NCBI - BLAST Programs
Glossary & Tutorials
• BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html