BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology 1 Topics to be covered: " BLAST as a Sequence Alignment Tool " Uses of BLAST " Types of BLAST " How BLAST works Scanning for 'hits' Scoring with Substitution Matrices " Common Databases for Use with BLAST available at NCBI " Interpretation of Blast Results " Blast options: on the net or on your computer " Learning More About BLAST, " A BLAST demo 2 gi|13325078|gb|AAG33875.2| (AF232004) HrpL [Pseudomonas syringae pv. tomato] Length = 184 Score = 347 bits (889), Expect = 4e-95 Identities = 182/184 (98%), Positives = 183/184 (98%) Query: 1 MFQKIVILDSTQPRQPSSSAGIRQMTADQIQMLRAFIQKRVMNPDDVDDILQCVFLEALR 60 MFQKIVILDSTQPRQPSSSAGIRQMTADQIQMLRAFIQKRVMNPDDVDDILQCVFLEALR Sbjct: 1 MFQKIVILDSTQPRQPSSSAGIRQMTADQIQMLRAFIQKRVMNPDDVDDILQCVFLEALR 60 Query: 61 NEHKFQHASKPQTWLCGIALNLIRNHFRKMYRQPYQESWEDEVHSELEGHGDVSHQVDGH 1 NEHKFQHASKPQTWLCGIALNLIRNHFRKMYRQPYQESWEDEVHSELEGHGDVSHQV+GH Sbjct: 61 NEHKFQHASKPQTWLCGIALNLIRNHFRKMYRQPYQESWEDEVHSELEGHGDVSHQVEGH 12 Query: 121 RQLARVIQAIDCLPSNMQKVLEVSLEMDGNYQETANSLGVPIGTVRSRLSRARVQLKQQI 180 RQLARVIQAIDCLPSNMQKVLEVSLEMDGNYQETANSLGVPIGTVRSRLS ARVQLKQQI Sbjct: 121 RQLARVIQAIDCLPSNMQKVLEVSLEMDGNYQETANSLGVPIGTVRSRLSGARVQLKQQI 180 Query: 181 DPFA 184 DPFA Sbjct: 181 DPFA 184 3 4 Sequence Alignment Tools Database Searching: BLAST: NCBI, Web Interface: http://www.ncbi.nlm.nih.gov/BLAST/ WuBLAST http://blast.wustl.edu FASTA: http://www.ebi.ac.uk/fasta3/ Smith-Waterman Par-Align: http://dna.uio.no/search/ Multiple Sequence Alignment: CLUSTALW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html DiAlign, Web Interface: http://genomatix.gsf.de/cgi-bin/dialign/dialign.pl MSA:http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html Web Interface: http://bioweb.pasteur.fr/seqanal/interfaces/msa-simple.html 5 Uses of BLAST: Query a database for sequences similar to an input sequence. 6 Uses of BLAST: Query a database for sequences similar to an input sequence. " Identify previously characterized sequences. 7 Uses of BLAST: Query a database for sequences similar to an input sequence. " " Identify previously characterized sequences. Find phylogenetically related sequences. 8 Uses of BLAST: Query a database for sequences similar to an input sequence. " " " Identify previously characterized sequences. Find phylogenetically related sequences. Identify possible functions based on similarities to known sequences. 9 Types of BLAST: Graphic courtesy of Joel Graber. 10 How BLAST Works (1) BLAST scans database for 'words' of a predetermined length (a 'hit') with some minimum threshold parameter, T. (2) BLAST then extends the hit until the score falls below the maximum score yet attained minus some value X. Altschul, S. F. et al., Nucleic Acids Research, 25, 3389-3402 (1997) 11 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK 12 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK Use 2 or 3-letter words... MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK 13 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK Scan against subject sequence: >gi|507311|gb|AAA25685.1| aminoglycoside 6'-N-acetyltransferase MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA 14 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK A hit! >gi|507311|gb|AAA25685.1| aminoglycoside 6'-N-acetyltransferase MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA 15 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK >gi|507311|gb|AAA25685.1| aminoglycoside 6'-N-acetyltransferase MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA Extension: Query: Sbjct: YFP YP YLP 16 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK >gi|507311|gb|AAA25685.1| aminoglycoside 6'-N-acetyltransferase MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA Extension: Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTKVI M H A Y L S V W E R L V YP L E T I L Sbjct: MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPY 17 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK >gi|507311|gb|AAA25685.1| aminoglycoside 6'-N-acetyltransferase MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA Extension: Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTKVI M H A Y L S V W E R L V YP L E T I L Sbjct: MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPY 18 Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTK >gi|507311|gb|AAA25685.1| aminoglycoside 6'-N-acetyltransferase MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPYIAMLNGEPIG SGDGWWEEETDPGVRGIDQSLANASQLGKGLGTKLVRALVELLFNDPEVTKIQTDPSPSNLR GFERQGTVTTPDGPAVYMVQTRQAFERTRSDA Extension: Query: MVSHSAAQAYSMLTNSEFVSMWSAESCRTPLCSVNNSYFPGALGDEKTTKVI M H A Y L S V W E R L V YP L E T I L Sbjct: MTEHDLAMLYEWLNRSHIVEWWGGEEARPTLADVQEQYLPSVLAQESVTPY HSP: A High-Scoring Segment Pair 19 Towards BLAST Scoring " " " Expected negative score for alignment of two random residues. Maximal score for a perfect match. Combinations of residues that can commonly substitute for one another in proteins may have positive score. 20 # # # # # # Matrix made by matblas from blosum62.iij * column uses minimum score BLOSUM Clustered Scoring Matrix in 1/2 Bit Units Blocks Database = /data/blocks_5.0/blocks.dat Cluster Percentage: >= 62 Entropy = 0.6979, Expected = -0.5209 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 21 BLAST Scoring Nominal HSP scores (S) are sums of scores from substitution matrices. " " Nominal scores are normalized to give 'bit scores' (S'): (I) 1Allows S' = ⇔SΒ ln K ln 2 andλλare arestatistical statisticalparameters parameters KKand thatrelate relatethe thecalculated calculatedscore score that theprobability probabilityfinding findingaahit hit totothe withatatleast leastthat thatscore. score. with comparison of alignments scored by different methods 22 Relating Scores to Probability: E-Values and P-Values The expected number of HSPs with scores of at least S is given by the following equations: (II) E= Kmn e (III) E= mn 2 Β ⇔S Β S' andλλare arestatistical statistical KKand 'normalization'parameters. parameters. 'normalization' andnnare arethe thelengths lengthsofofthe the mmand query query sequenceand anddatabase. database. sequence nominalscore. score. SSisisa anominal bitscore. score. S'S'isisa abit P-Values are the likelihood of finding a match with a score of at least S: (IV) ΒE P= 1Β e 23 Substitution Matrices Scores in the substitution matrix are expressed in 'log-odds' format: (V) 1 1 s ij = ln q ij pi p j /⇔ targetfrequency frequency qqij=ij=target pi,pjpj==frequency frequencythose those pi, residuesappear appearby bychance chance residues normalizationparameter parameter λλ==normalization The more frequently the substitution occurs, the higher the score. The less frequently the residue occurs in the sequence as a whole, the higher the score. 24 Substitution Matrices " " " Derived from empirically observed substitution frequencies Higher scores for substitution with similar residues. Random substitutions give negative scores 25 Types of Substitution Matrices Each tailored to a specific degree of evolutionary divergence. PAM Matrices: 'Percent Accepted Mutation' start with closely related sequences, and extrapolate substitution probabilities for more distantly related sequences. 1 PAM unit=1 mutation event per 100 bases. e.g.: PAM 100 tailored for 100 mutation events per 100 bases. Barker, W.C. & Dayhoff, M.O. Atlas of Protein Sequence and Structure, pp 101-110, National Biomedical Research Foundation (1972). 26 Types of Substitution Matrices BLOSUM Matrices 'BLOck SUbstitution Matrix' Values inferred from sequences sharing a maximum of the given value. e.g.: BLOSUM62 derived from sequences no more than 62% identical. Henikoff, S. & Henikoff, J.G., Proc. Natl. Acad. Sci., USA, 89, 10915-10919 (1992). 27 Comparing Substitution Matrices Similar Evolutionary Distances PAM 120<----> BLOSUM80 PAM160<----> BLOSUM62 PAM250 <----> BLOSUM45 BLOSUM more tolerant to hydrophobic than PAM but less tolerant to hydrophilic substitutions. 28 # # # # # # Matrix made by matblas from blosum62.iij * column uses minimum score BLOSUM Clustered Scoring Matrix in 1/2 Bit Units Blocks Database = /data/blocks_5.0/blocks.dat Cluster Percentage: >= 62 Entropy = 0.6979, Expected = -0.5209 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 29 BLAST Databases at NCBI: Da ta ba s e nr m o nth dbe s t dbs ts m o us e e s ts hum an e s ts o the r e s ts ye as t E. c o li pdb ka b a t [ka b a tn u c ] p a te n ts v e c to r s wis s pro t a lu De s c riptio n All n o n -re d u n d a n t G e n Ba n k+E MBL+DDBJ +P DB s e q u e n c e s (b u t n o E S T, S TS , G S S , o r HTG S s e q u e n c e s ). All n e w o r re vis e d G e n Ba n k+E MBL+DDBJ +P DB s e q u e n c e s re le a s e d in th e la s t 3 0 d a ys . No n -re d u n d a n t d a ta b a s e o f Ge n Ba n k+EMBL+DDBJ ES T Divis io n s . No n -re d u n d a n t d a ta b a s e o f Ge n Ba n k+EMBL+DDBJ S TS Divis io n s . Th e n o n -re d u n d a n t Da ta b a s e o f G e n Ba n k+E MBL+DDBJ E S T Divis io n s lim ite d to th e o rg a n is m m o u s e . Th e No n -re d u n d a n t Da ta b a s e o f G e n Ba n k+E MBL+DDBJ E S T Divis io n s lim ite d to th e o rg a n is m h u m a n . Th e n o n -re d u n d a n t d a ta b a s e o f G e n Ba n k+E MBL+DDBJ E S T Divis io n s a ll o rg a n is m s e xc e p t m o u s e a n d h u m a n . Ye a s t (S a c c h a ro m yc e s c e re vis ia e ) g e n o m ic n u c le o tid e s e q u e n c e s . No t a c o lle c tio n o f a ll Ye a s t n u c e lo tid e s s e q u e n c e s , b u t th e s e q u e n c e fra g m e n ts fro m th e Ye a s t c o m p le te g e n o m e . E. c o li (Es c h e ric h ia c o li) g e n o m ic n u c le o tid e s e q u e n c e s . S e q u e n c e s d e rive d fro m th e 3 -d im e n s io n a l s tru c tu re o f p ro te in s . Ka b a t's d a ta b a s e o f s e q u e n c e s o f im m u n o lo g ic a l in te re s t. F o r m o re in fo rm a tio n h ttp ://im m u n o .b m e .n wu .e d u / Nu c le o tid e s e q u e n c e s d e rive d fro m th e P a te n t d ivis io n o f Ge n Ba n k. Ve c to r s u b s e t o f G e n Ba n k(R ), NC BI, (ftp ://n c b i.n lm .n ih .g o v/p u b /b la s t/d b / d ire c to ry). Th e la s t m a jo r re le a s e o f th e S W IS S -P R O T p ro te in s e q u e n c e d a ta b a s e (n o u p d a te s ). Th e s e a re u p lo a d e d to o u r s ys te m wh e n th e y a re re c e ive d fro m E MBL. Tra n s la tio n s o f s e le c t Alu re p e a ts fro m R E P BAS E , s u ita b le fo r m a s kin g Alu re p e a ts fro m q u e ry s e q u e n c e s . It is a va ila b le a t ftp ://n c b i.n lm .n ih .g o v/p u b /jm c /a lu . S e e "Alu a le rt" b y Cla ve rie a n d Ma ka lo ws ki, Na tu re vo l. 3 7 1 , p a g e 7 5 2 (1 9 9 4 ) . DNA Pro te in 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Adapted From:http://www.ncbi.nlm.nih.gov/Education/B LASTinfo/query_tutorial.html 30 Interpreting Blast Results >gi|6580755|gb|AAF18265.1|U22895_1 (U22895) alternative sigma factor AlgU [Azotobacter vinelandii] Length = 193 Score = 334 bits (857), Expect = 2e-91 Identities = 180/192 (93%), Positives = 189/192 (97%) Query: 1 MLTQEQDQQLVERVQRGDKRAFDLLVLKYQHKILGLIVRFVHDAQEAQDVAQEAFIKAYR 60 ML QEQDQQLVERVQRGD+RAFDLLVLKYQHKILGLIVRFVHDA EAQDVAQEAFIKAYR Sbjct: 1 MLNQEQDQQLVERVQRGDRRAFDLLVLKYQHKILGLIVRFVHDAHEAQDVAQEAFIKAYR 60 Query: 61 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVTAEDAEFFEGDHALKDIESPE 120 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDV+A DAEF+EGDHALKDIESPE Sbjct: 61 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVSAGDAEFYEGDHALKDIESPE 120 Query: 121 RAMLRDEIEATVHQTIQQLPEDLRTALTLREFEGLSYEDIATVMQCPVGTVRSRIFRARE 180 R++LRDEIEATVH+TIQQLPEDLRTALTLREF+GLSYEDIA+VMQCPVGTVRSRIFRARE Sbjct: 121 RSLLRDEIEATVHRTIQQLPEDLRTALTLREFDGLSYEDIASVMQCPVGTVRSRIFRARE 180 Query: 181 AIDKALQPLLRE 192 AIDKALQPLL+E 31 Interpreting Blast Results Hit name >gi|6580755|gb|AAF18265.1|U22895_1 (U22895) alternative sigma factor AlgU [Azotobacter vinelandii] Length = 193 Score = 334 bits (857), Expect = 2e-91 Identities = 180/192 (93%), Positives = 189/192 (97%) Query: 1 MLTQEQDQQLVERVQRGDKRAFDLLVLKYQHKILGLIVRFVHDAQEAQDVAQEAFIKAYR 60 ML QEQDQQLVERVQRGD+RAFDLLVLKYQHKILGLIVRFVHDA EAQDVAQEAFIKAYR Sbjct: 1 MLNQEQDQQLVERVQRGDRRAFDLLVLKYQHKILGLIVRFVHDAHEAQDVAQEAFIKAYR 60 Query: 61 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVTAEDAEFFEGDHALKDIESPE 120 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDV+A DAEF+EGDHALKDIESPE Sbjct: 61 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVSAGDAEFYEGDHALKDIESPE 120 Query: 121 RAMLRDEIEATVHQTIQQLPEDLRTALTLREFEGLSYEDIATVMQCPVGTVRSRIFRARE Alignment with query 180 sequence R++LRDEIEATVH+TIQQLPEDLRTALTLREF+GLSYEDIA+VMQCPVGTVRSRIFRARE Sbjct: 121 RSLLRDEIEATVHRTIQQLPEDLRTALTLREFDGLSYEDIASVMQCPVGTVRSRIFRARE 180 Query: 181 AIDKALQPLLRE 192 AIDKALQPLL+E 32 Interpreting Blast Results Normalized bit Nominal (U22895) HSP alternative Expectation >gi|6580755|gb|AAF18265.1|U22895_1 sigma factor AlgU [Azotobacter vinelandii] Length = 193 scores scores value Score = 334 bits (857), Expect = 2e-91 Identities = 180/192 (93%), Positives = 189/192 (97%) Query: 1 MLTQEQDQQLVERVQRGDKRAFDLLVLKYQHKILGLIVRFVHDAQEAQDVAQEAFIKAYR 60 Number of ML QEQDQQLVERVQRGD+RAFDLLVLKYQHKILGLIVRFVHDA EAQDVAQEAFIKAYR Number of Sbjct: 1 MLNQEQDQQLVERVQRGDRRAFDLLVLKYQHKILGLIVRFVHDAHEAQDVAQEAFIKAYR Identities Identities 60 Query: 61 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVTAEDAEFFEGDHALKDIESPE 120 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDV+A DAEF+EGDHALKDIESPE Sbjct: 61 ALGNFRGDSAFYTWLYRIAINTAKNHLVARGRRPPDSDVSAGDAEFYEGDHALKDIESPE 120 Query: 121 RAMLRDEIEATVHQTIQQLPEDLRTALTLREFEGLSYEDIATVMQCPVGTVRSRIFRARE 180 R++LRDEIEATVH+TIQQLPEDLRTALTLREF+GLSYEDIA+VMQCPVGTVRSRIFRARE Sbjct: 121 RSLLRDEIEATVHRTIQQLPEDLRTALTLREFDGLSYEDIASVMQCPVGTVRSRIFRARE 180 Query: 181 AIDKALQPLLRE 192 AIDKALQPLL+E 33 BLAST: On the Net, and On Your Computer Advantages/Disadvantages of Net Based Blast: (1) Use databases hosted remotely at NCBI. (2) Little/No setup required. (3) But, Cannot use a customized database. Advantages/Disadvantages of Local Microcomputer-Based Blast: (1) Can Use a Customized Database. (2) Better suited to scripting / automation or when a large number of queries will be performed (UNIX). (3) But, Requires some setup and computer expertise. 34 BLAST: On the Net, and On Your Computer On the Net: http://www.ncbi.nlm.nih.gov/BLAST/ On Your Computer: UNIX/MacOS/Windows ftp://ncbi.nlm.nih.gov/blast/executables/ NCBI Tools for UNIX ftp://ncbi.nlm.nih.gov/toolbox/ WUBLAST http://blast.wustl.edu 35 Learning More about BLAST How Blast Works: Altschul, S.F. et al., Nucleic Acids Research, 25, 3389-3402 (1997). Scoring Schemes: Karlin, S., and Altschul, S.F., Proc. Natl. Acad. Sci., 87, 2264-2268 (1990). Henikoff, S., and Henikoff, J.G., Proc. Natl. Acad. Sci., 89, 10915-10919 (1992). http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html Online Tutorial http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html 36