BLAST: Basic Local Alignment Search Tool

Jonathan M. Urbach
Bioinformatics Group
Department of Molecular Biology
Topics to be covered:
- BLAST as a Sequence Alignment Tool
- Uses of BLAST
- Types of BLAST
- How BLAST works
    Scanning for 'hits'
    Scoring with Substitution Matrices
- Common Databases for Use with BLAST available at NCBI
- Interpretation of BLAST Results
- BLAST options: on the net or on your computer
- Learning More About BLAST
- A BLAST demo
gi|13325078|gb|AAG33875.2| (AF232004) HrpL [Pseudomonas syringae pv. tomato]
Length = 184
Score = 347 bits (889), Expect = 4e-95
Identities = 182/184 (98%), Positives = 183/184 (98%)

Query: 1   MFQKIVILDSTQPRQPSSSAGIRQMTADQIQMLRAFIQKRVMNPDDVDDILQCVFLEALR 60
           MFQKIVILDSTQPRQPSSSAGIRQMTADQIQMLRAFIQKRVMNPDDVDDILQCVFLEALR
Sbjct: 1   MFQKIVILDSTQPRQPSSSAGIRQMTADQIQMLRAFIQKRVMNPDDVDDILQCVFLEALR 60

Query: 61  NEHKFQHASKPQTWLCGIALNLIRNHFRKMYRQPYQESWEDEVHSELEGHGDVSHQVDGH 120
           NEHKFQHASKPQTWLCGIALNLIRNHFRKMYRQPYQESWEDEVHSELEGHGDVSHQV+GH
Sbjct: 61  NEHKFQHASKPQTWLCGIALNLIRNHFRKMYRQPYQESWEDEVHSELEGHGDVSHQVEGH 120

Query: 121 RQLARVIQAIDCLPSNMQKVLEVSLEMDGNYQETANSLGVPIGTVRSRLSRARVQLKQQI 180
           RQLARVIQAIDCLPSNMQKVLEVSLEMDGNYQETANSLGVPIGTVRSRLS ARVQLKQQI
Sbjct: 121 RQLARVIQAIDCLPSNMQKVLEVSLEMDGNYQETANSLGVPIGTVRSRLSGARVQLKQQI 180

Query: 181 DPFA 184
           DPFA
Sbjct: 181 DPFA 184
Sequence Alignment Tools
Database Searching:
- BLAST:
    NCBI, Web Interface: http://www.ncbi.nlm.nih.gov/BLAST/
    WU-BLAST: http://blast.wustl.edu
- FASTA: http://www.ebi.ac.uk/fasta3/
- Smith-Waterman:
    Par-Align: http://dna.uio.no/search/
Multiple Sequence Alignment:
- CLUSTALW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
- DiAlign, Web Interface: http://genomatix.gsf.de/cgi-bin/dialign/dialign.pl
- MSA: http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html
    Web Interface: http://bioweb.pasteur.fr/seqanal/interfaces/msa-simple.html
Uses of BLAST:
Query a database for sequences similar to an input sequence.
- Identify previously characterized sequences.
- Find phylogenetically related sequences.
- Identify possible functions based on similarities to known sequences.
Types of BLAST:
Graphic courtesy of Joel Graber.
How BLAST Works
(1) BLAST scans the database for 'words' of a predetermined length that
score at least some minimum threshold, T, when aligned with a word in
the query (a 'hit').
(2) BLAST then extends the hit in both directions until the score falls
below the maximum score yet attained minus some value X.
Altschul, S. F. et al., Nucleic Acids Research, 25, 3389-3402 (1997)
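The stopping rule in step (2) can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation: the match/mismatch scores (+5/-4), the toy DNA strings, and the names `pair_score`, `xdrop_extend` and `extend_hit` are all assumptions made for the example. Real protein searches score each pair with a substitution matrix.

```python
def pair_score(a, b, match=5, mismatch=-4):
    # Hypothetical toy scoring; real BLAST looks scores up in a substitution matrix.
    return match if a == b else mismatch


def xdrop_extend(pairs, x_drop):
    """Accumulate score along aligned residue pairs.

    Stops once the running score has fallen more than x_drop below the
    best score seen so far. Returns (best_score, pairs_kept).
    """
    best = running = kept = 0
    for i, (a, b) in enumerate(pairs, start=1):
        running += pair_score(a, b)
        if running > best:
            best, kept = running, i
        elif best - running > x_drop:
            break
    return best, kept


def extend_hit(query, subject, q_pos, s_pos, word_len, x_drop):
    """Ungapped extension of a word hit to the right and to the left."""
    seed = sum(pair_score(a, b) for a, b in
               zip(query[q_pos:q_pos + word_len], subject[s_pos:s_pos + word_len]))
    right = zip(query[q_pos + word_len:], subject[s_pos + word_len:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    r_score, r_len = xdrop_extend(right, x_drop)
    l_score, l_len = xdrop_extend(left, x_drop)
    return {
        "score": seed + l_score + r_score,
        "query_start": q_pos - l_len,           # 0-based, inclusive
        "query_end": q_pos + word_len + r_len,  # 0-based, exclusive
    }


# Toy example: the word "GGC" at offset 3 is the seed hit.
hsp = extend_hit("AAAGGCTTACCC", "TTTGGCTTAGGG", q_pos=3, s_pos=3, word_len=3, x_drop=8)
print(hsp)
```

Keeping the best score and its length separately means the trailing mismatches that triggered the stop are trimmed from the reported segment.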
Query:
MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK

Use 2 or 3-letter words...

Scan against subject sequence:
>gi|507311|gb|AAA25685.1| aminoglycoside 6'-N-acetyltransferase
MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG
SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR
GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA

A hit!
Query: YFP
       Y P
Sbjct: YLP

Extension:
Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTKVI
       M  H  A  Y  L  S  V  W  E  R  L  V   Y P  L  E  T  I
Sbjct: MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYI

HSP: A High-Scoring Segment Pair
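The word scan in the walk-through above can be sketched as a brute-force comparison of every 3-letter query word against every 3-letter subject word. The scores are the 20 standard-residue columns of the BLOSUM62 table reproduced in these slides. The threshold T = 13 and the names `parse_matrix`, `word_score` and `find_hits` are assumptions for illustration. A real BLAST implementation avoids this all-against-all loop by precomputing a lookup table of high-scoring words.

```python
BLOSUM62_TEXT = """
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""


def parse_matrix(text):
    """Turn the whitespace-separated table into a {(row, col): score} dict."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    return {(row[0], col): int(val)
            for row in rows[1:] for col, val in zip(columns, row[1:])}


SCORES = parse_matrix(BLOSUM62_TEXT)


def word_score(w1, w2):
    """Sum of substitution scores over two equal-length words."""
    return sum(SCORES[a, b] for a, b in zip(w1, w2))


def find_hits(query, subject, word_len=3, threshold=13):
    """Brute-force scan; threshold (T) of 13 is an assumed value."""
    hits = []
    for i in range(len(query) - word_len + 1):
        q_word = query[i:i + word_len]
        for j in range(len(subject) - word_len + 1):
            s_word = subject[j:j + word_len]
            score = word_score(q_word, s_word)
            if score >= threshold:
                hits.append((i, j, q_word, s_word, score))
    return hits


QUERY = "MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK"
SUBJECT = ("MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG"
           "SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR"
           "GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA")

print(word_score("YFP", "YLP"))  # Y/Y + F/L + P/P = 7 + 0 + 7
for hit in find_hits(QUERY, SUBJECT):
    print(hit)
```

Offsets in the printed hits are 0-based, so the YFP/YLP pair at residue 38 of both sequences appears at offset 37.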
Towards BLAST Scoring
- The expected score for aligning two random residues is negative.
- A perfect match receives the maximal score.
- Combinations of residues that can commonly substitute for one another in proteins may have a positive score.
BLAST Scoring
- Nominal HSP scores (S) are sums of scores from substitution matrices.
- Nominal scores are normalized to give 'bit scores' (S'):

(I)  $S' = \dfrac{\lambda S - \ln K}{\ln 2}$

K and λ are statistical parameters that relate the calculated score to the probability of finding a hit with at least that score.
Normalization allows comparison of alignments scored by different methods.
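Equation (I) is easy to check numerically. The sketch below assumes λ = 0.267 and K = 0.041, values commonly quoted for gapped BLOSUM62 searches. They are not given in these slides, and the parameters actually used for the example reports may differ slightly. With these assumed values, the nominal scores 889 and 857 from the example reports land within about one bit of the reported 347 and 334 bits.

```python
import math

# Assumed Karlin-Altschul parameters (not taken from the slides).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Equation (I): S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


for raw in (889, 857):
    print(raw, bit_score(raw))
```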
Relating Scores to Probability:
E-Values and P-Values
The expected number of HSPs with scores of at least S is given by the following equations:

(II)   $E = K m n \, e^{-\lambda S}$
(III)  $E = m n \, 2^{-S'}$

K and λ are statistical 'normalization' parameters.
m and n are the lengths of the query sequence and database.
S is a nominal score.
S' is a bit score.

P-Values are the probability of finding at least one match with a score of at least S:

(IV)   $P = 1 - e^{-E}$
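Equations (II) to (IV) can be checked against each other with a short sketch. The parameter values (λ, K), the lengths m and n, and the nominal score below are hypothetical, chosen only to show that the nominal-score and bit-score forms give the same E-value and that P is approximately E when E is small.

```python
import math


def e_value_from_raw(raw_score, m, n, lam, k):
    """Equation (II): E = K m n exp(-lambda S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Equation (III): E = m n 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Equation (IV): P = 1 - exp(-E); expm1 keeps precision for tiny E."""
    return -math.expm1(-e)


# Hypothetical inputs for illustration only.
LAMBDA, K = 0.267, 0.041
M, N = 192, 5_000_000   # query length, database length
raw = 100               # nominal score S

bits = (LAMBDA * raw - math.log(K)) / math.log(2)  # bit score S' from S
e_raw = e_value_from_raw(raw, M, N, LAMBDA, K)
e_bits = e_value_from_bits(bits, M, N)
print(e_raw, e_bits, p_value(e_raw))
```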
Substitution Matrices
Scores in the substitution matrix are expressed in 'log-odds' format:

(V)  $s_{ij} = \dfrac{1}{\lambda} \ln\!\left(\dfrac{q_{ij}}{p_i \, p_j}\right)$

q_ij = target frequency
p_i, p_j = frequency those residues appear by chance
λ = normalization parameter

The more frequently the substitution occurs, the higher the score.
The less frequently the residue occurs in the sequence as a whole, the higher the score.
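A small worked example of equation (V), under stated assumptions: the frequencies below are illustrative values for a rare residue, not figures quoted in these slides, and λ is set to ln(2)/2 so that scores come out in half-bit units like the BLOSUM62 table. The function name `log_odds` is likewise hypothetical.

```python
import math

HALF_BIT_LAMBDA = math.log(2) / 2  # lambda for scores in 1/2 bit units


def log_odds(q_ij, p_i, p_j, lam=HALF_BIT_LAMBDA):
    """Equation (V): s_ij = (1 / lambda) * ln(q_ij / (p_i * p_j))."""
    return math.log(q_ij / (p_i * p_j)) / lam


# Illustrative (assumed) frequencies for a rare residue aligned with itself.
p_rare = 0.013    # background frequency
q_rare = 0.0065   # target frequency of the aligned pair
score = log_odds(q_rare, p_rare, p_rare)
print(round(score))

# A pair seen exactly as often as chance predicts scores zero.
print(log_odds(p_rare * p_rare, p_rare, p_rare))
```

With these inputs the rounded score is 11, the same size as the W/W entry in the BLOSUM62 table: a rare residue that is strongly conserved earns a high score.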
- Derived from empirically observed substitution frequencies.
- Higher scores for substitution with similar residues.
- Random substitutions give negative scores.
Types of Substitution Matrices
Each is tailored to a specific degree of evolutionary divergence.
PAM Matrices:
'Percent Accepted Mutation'
Start with closely related sequences, and extrapolate substitution probabilities for more distantly related sequences.
1 PAM unit = 1 accepted mutation event per 100 residues.
e.g.: PAM100 is tailored for 100 mutation events per 100 residues.
Barker, W.C. & Dayhoff, M.O. Atlas of Protein Sequence and Structure, pp 101-110, National Biomedical Research Foundation (1972).
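The extrapolation step can be illustrated with a toy model: a one-step mutation probability matrix is raised to the n-th power to obtain the n-step matrix. The two-letter alphabet and the 1% per-step mutation probability below are assumptions for illustration only. A real PAM1 matrix covers 20 amino acids with unequal rates.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue has a 1% chance of mutating per step (assumed).
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam100_toy = mat_pow(PAM1_TOY, 100)
unchanged = pam100_toy[0][0]
print(unchanged)
```

In this toy model more than half of the positions still show the original residue after 100 steps, because repeated and reverted mutations at the same position overlap. This is why 100 mutation events per 100 residues does not mean every residue differs.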
BLOSUM Matrices:
'BLOck SUbstitution Matrix'
Values are inferred from blocks of aligned sequences sharing at most the given percent identity.
e.g.: BLOSUM62 is derived from sequences no more than 62% identical.
Henikoff, S. & Henikoff, J.G., Proc. Natl. Acad. Sci., USA, 89, 10915-10919 (1992).
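One way to picture the identity cutoff: sequences in a block that are at least 62% identical are grouped into a cluster and counted as one, so the substitution counts come from comparisons between clusters. The sketch below is a simplified illustration with a hypothetical four-sequence block and single-linkage grouping. The function names are assumptions, and the subsequent weighting and counting steps are not shown.

```python
def percent_identity(a, b):
    """Percent of identical positions in two equal-length aligned segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def cluster(segments, threshold=62.0):
    """Single-linkage grouping: a segment joins (and merges) every cluster
    containing at least one member that is >= threshold percent identical."""
    clusters = []
    for seg in segments:
        merged = [seg]
        remaining = []
        for group in clusters:
            if any(percent_identity(seg, other) >= threshold for other in group):
                merged.extend(group)
            else:
                remaining.append(group)
        clusters = remaining + [merged]
    return clusters


# Hypothetical block of four aligned segments.
BLOCK = ["ACDEFGHIKL", "ACDEFGHWWW", "ACDEWWWWWW", "MNPQRSTVYY"]
groups = cluster(BLOCK)
for group in groups:
    print(sorted(group))
```

The first and third segments are only 40% identical, yet they end up in the same cluster because each is 70% identical to the second.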
Comparing Substitution Matrices
Similar Evolutionary Distances:
PAM120 <----> BLOSUM80
PAM160 <----> BLOSUM62
PAM250 <----> BLOSUM45
BLOSUM matrices are more tolerant of hydrophobic substitutions than PAM matrices, but less tolerant of hydrophilic substitutions.
#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
BLAST Databases at NCBI:
Database          Description
nr                All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST,
                  STS, GSS, or HTGS sequences).
month             All new or revised GenBank+EMBL+DDBJ+PDB sequences released in
                  the last 30 days.
dbest             Non-redundant database of GenBank+EMBL+DDBJ EST Divisions.
dbsts             Non-redundant database of GenBank+EMBL+DDBJ STS Divisions.
mouse ests        The non-redundant database of GenBank+EMBL+DDBJ EST Divisions
                  limited to the organism mouse.
human ests        The non-redundant database of GenBank+EMBL+DDBJ EST Divisions
                  limited to the organism human.
other ests        The non-redundant database of GenBank+EMBL+DDBJ EST Divisions,
                  all organisms except mouse and human.
yeast             Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences.
                  Not a collection of all yeast nucleotide sequences, but the
                  sequence fragments from the yeast complete genome.
E. coli           E. coli (Escherichia coli) genomic nucleotide sequences.
pdb               Sequences derived from the 3-dimensional structure of proteins.
kabat [kabatnuc]  Kabat's database of sequences of immunological interest. For
                  more information: http://immuno.bme.nwu.edu/
patents           Nucleotide sequences derived from the Patent division of GenBank.
vector            Vector subset of GenBank(R), NCBI
                  (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
swissprot         The last major release of the SWISS-PROT protein sequence
                  database (no updates). These are uploaded to our system when
                  they are received from EMBL.
alu               Translations of select Alu repeats from REPBASE, suitable for
                  masking Alu repeats from query sequences. It is available at
                  ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by
                  Claverie and Makalowski, Nature vol. 371, page 752 (1994).
Adapted from: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/query_tutorial.html
Interpreting BLAST Results
>gi|6580755|gb|AAF18265.1|U22895_1 (U22895) alternative sigma factor AlgU [Azotobacter vinelandii]
Length = 193
Score = 334 bits (857), Expect = 2e-91
Identities = 180/192 (93%), Positives = 189/192 (97%)

Query: 1   MLTQEQDQQLVERVQRGDKRAFDLLVLKYQHKILGLIVRFVHDAQEAQDVAQEAFIKAYR 60
           ML QEQDQQLVERVQRGD+RAFDLLVLKYQHKILGLIVRFVHDA EAQDVAQEAFIKAYR
Sbjct: 1   MLNQEQDQQLVERVQRGDRRAFDLLVLKYQHKILGLIVRFVHDAHEAQDVAQEAFIKAYR 60

Query: 61  ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVTAEDAEFFEGDHALKDIESPE 120
           ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDV+A DAEF+EGDHALKDIESPE
Sbjct: 61  ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVSAGDAEFYEGDHALKDIESPE 120

Query: 121 RAMLRDEIEATVHQTIQQLPEDLRTALTLREFEGLSYEDIATVMQCPVGTVRSRIFRARE 180
           R++LRDEIEATVH+TIQQLPEDLRTALTLREF+GLSYEDIA+VMQCPVGTVRSRIFRARE
Sbjct: 121 RSLLRDEIEATVHRTIQQLPEDLRTALTLREFDGLSYEDIASVMQCPVGTVRSRIFRARE 180

Query: 181 AIDKALQPLLRE 192
           AIDKALQPLL+E

Reading the report:
- Hit name: the line beginning with '>'.
- Score: the normalized bit score (334 bits), followed by the nominal HSP score (857) in parentheses.
- Expect: the expectation value (2e-91).
- Identities: the number of identical residues out of the aligned positions (180/192).
- Positives: the number of aligned positions with a positive substitution score, i.e. identities plus the positions marked '+' (189/192).
- Query/Sbjct lines: the alignment of the hit with the query sequence.
BLAST: On the Net, and On Your Computer
Advantages/Disadvantages of Net-Based BLAST:
(1) Uses databases hosted remotely at NCBI.
(2) Little or no setup required.
(3) But cannot use a customized database.
Advantages/Disadvantages of Local Microcomputer-Based BLAST:
(1) Can use a customized database.
(2) Better suited to scripting / automation, or when a large number
of queries will be performed (UNIX).
(3) But requires some setup and computer expertise.
On the Net:
http://www.ncbi.nlm.nih.gov/BLAST/
On Your Computer:
UNIX/MacOS/Windows: ftp://ncbi.nlm.nih.gov/blast/executables/
NCBI Tools for UNIX: ftp://ncbi.nlm.nih.gov/toolbox/
WU-BLAST: http://blast.wustl.edu
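As an illustration of the scripting advantage, the sketch below builds the commands for a custom protein database and a batch of queries. It assumes the NCBI BLAST+ command-line tools (`makeblastdb`, `blastp`), which postdate these slides, and all file and database names are hypothetical. The commands are only printed here. `run_all` executes them when BLAST+ is installed and the input files exist.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta, db_name, dbtype="prot"):
    """Command to build a custom BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


def blastp_cmd(query, db_name, out, evalue=1e-5):
    """Command to search one protein query; -outfmt 6 gives tabular output."""
    return ["blastp", "-query", query, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out]


def run_all(commands):
    """Run the commands in order if the first executable is on PATH."""
    if shutil.which(commands[0][0]) is None:
        print("BLAST+ not found on PATH; nothing was run")
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


# Hypothetical file names for illustration.
queries = ["query1.fasta", "query2.fasta"]
commands = [makeblastdb_cmd("my_proteins.fasta", "my_db")]
commands += [blastp_cmd(q, "my_db", q + ".tsv") for q in queries]

for cmd in commands:
    print(" ".join(cmd))
```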
Learning More about BLAST
How Blast Works:
Altschul, S.F. et al., Nucleic Acids Research, 25, 3389-3402 (1997).
Scoring Schemes:
Karlin, S., and Altschul, S.F., Proc. Natl. Acad. Sci., 87, 2264-2268
(1990).
Henikoff, S., and Henikoff, J.G., Proc. Natl. Acad. Sci., 89, 10915-10919
(1992).
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
Online Tutorial
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html