Searching for similarity in a Database: BLAST

Dominique CELLIER
University of Rouen - France
Atelier Biologie Informatique Statistique Sociolinguistique
Laboratoire de Mathématique Raphaël Salem, UMR CNRS 6085
Dominique.Cellier@univ-rouen.fr

1 - Comparing a sequence with a data bank

The problem of searching for nucleic or protein similarities has led many investigators to abandon dynamic programming algorithms, for which the size of the problem has become too large. What problems arise when searching for similarity in a databank, where a query sequence must be compared with every sequence contained in the bank?

- The size of the databanks.
- The complexity of the exact dynamic programming algorithms: the required computing time and storage space become too large!
- The statistical significance: the sequences have different compositions and different lengths, so their p-values and E-values are not comparable!

1.1 - What solutions exist?

It is necessary to design very fast computational procedures that produce answers that are "nearly" correct with respect to a formally stated optimisation criterion: heuristic algorithms.

- Effective algorithms allowing comparisons and alignments in a reasonable time and at a reasonable cost: these are the heuristic algorithms BLAST and FASTA.
- An automatic procedure based on the statistical significance of the alignments: keeping only the most interesting alignments, and sorting the results according to their statistical significance.

1.2 - The statistical significance of the results

The pairwise comparison results are normalised so that their statistical significances can be compared with one another. For an alignment score t:

- The E-value: $E(t) = K m N \exp(-\lambda t)$, where m is the length of the query sequence, N is the size of the databank, and the constants K and $\lambda$ are adapted to the databank.
- The p-value: $p(t) = 1 - \exp\left(-K m N \exp(-\lambda t)\right)$
- The Z-score: $Z(t) = (t - m)/e$, where m and e here denote the mean and the standard deviation of the random score.

2 - FASTA

The FASTA software, developed by Pearson and Lipman (1988), is based on an algorithm that quickly identifies regions of identity between the query sequence and the sequences of the databank. Thanks to this recognition step, only sequences presenting a region of strong similarity with the query sequence are considered. An optimal alignment algorithm can then be applied (locally) to these sequences. FASTA in fact includes two programs:

- The FASTA program, with a nucleic and a protein version.
- The TFASTA program, which compares a protein sequence with the sequences of a nucleic databank translated in all six reading frames.

3 - BLAST - BLAST2

BLAST (Basic Local Alignment Search Tool) is probably the most widely used database search program. It was developed at the NCBI and is based on statistical methods introduced by Altschul et al. (1990) and Karlin and Altschul (1993) for local comparisons without insertions or deletions.

The fundamental unit of BLAST is the HSP (High-scoring Segment Pair). An HSP corresponds to the longest possible region of similarity between two sequences whose score is at least equal to a threshold score.

The software in fact contains five different programs:

- BLASTn: nucleotide sequence against a nucleic databank.
- BLASTp: amino acid sequence against a protein databank.
- BLASTx: nucleotide sequence translated in all reading frames against a protein databank.
- TBLASTn: protein sequence against a nucleotide sequence databank.
- TBLASTx: six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence databank.

The statistical significance of the alignments is estimated according to the length and the composition of the sequence, the size and the composition of the databank, and the scoring system used.
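As a rough illustration of the significance measures of section 1.2, the following Python sketch evaluates the E-value, the p-value and the Z-score for a given score t. The function names and all numerical values are hypothetical and chosen only for the example; m is read as the length of the query sequence, which is an assumption of this sketch.

```python
import math

def e_value(t, K, lam, m, N):
    """Expected number of chance alignments with score >= t: K*m*N*exp(-lam*t)."""
    return K * m * N * math.exp(-lam * t)

def p_value(t, K, lam, m, N):
    """Probability of at least one chance alignment with score >= t: 1 - exp(-E(t))."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-e_value(t, K, lam, m, N))

def z_score(t, mean, sd):
    """Number of standard deviations separating t from the mean random score."""
    return (t - mean) / sd

# Hypothetical parameters, for illustration only.
K, lam = 0.1, 0.3          # constants adapted to the databank
m, N = 300, 1_000_000      # query length and databank size

for t in (40, 60, 80):
    print(t, e_value(t, K, lam, m, N), p_value(t, K, lam, m, N))
```

When the E-value is small the p-value is almost equal to it, which is why the two are often used interchangeably for highly significant hits.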
The BLAST2 version brought a considerable change by the integration of gaps in the alignments.

4 - BLAST algorithm

5 - Setting up a BLAST search

Step 1. Plan the search.
Decide the goal of the comparison with respect to a biological question:
- Is the query sequence represented in the database?
- Are there homologous or evolutionary relatives of the query sequence in the database?
- Are there proteins whose function is related to the query sequence?

Step 2. Choose the program to use.

Program   Description
blastn    Compares a nucleotide query sequence against a nucleotide sequence database.
blastp    Compares an amino acid query sequence against a protein sequence database.
blastx    Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn   Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx   Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Step 3. Choose the databank to search.
The database is selected according to the biological problem and the program we want to use.
- DNA database for blastn, tblastn and tblastx.
- Protein database for blastp and blastx.
- Inclusive (e.g. nr, nonredundant), organism-specific (e.g. yeast) or specialized (e.g. dbEST) databases.

Step 4. Enter the query sequence.
BLAST accepts input sequences in three formats: FASTA sequence format, NCBI Accession numbers, GI.

FASTA format description
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

>seq
cggagagagataggcttcgctagacaatttcttcaatctctggaagaaga
agttggtctgagccttcatcaacgttgttctttcgggaatgggcactgaa
acagtttcaaaacctgtgatggataatgggtctggagacagtgatgatga
caagcctttagcgttcaagaggaataatacagtggcttctaattcgaatc
aatctaaatccaattcccagagaagcaaggcagttcctaccaccaaggta
tcacctatgagatcacctgtgactagcccaaatggaaccactccttcgaa
taaaacttctatagtgaaatcctctatgccatcatcttcttctaaggctt
caccagcaaagtcaccattgcggaatgatatgccctctactgttaaggat
aggagccagttacagaaagatcagtctgaatgtaaaattgagcatgagga
ttctgaggatgatagacctttaagttccatactatctggaaataaagggc
caacctcttcgcggcaggtttcttcaccgcagccagagaaaaagaataat
ggtgatcgacctctt

NCBI Accession numbers or GIs

Step 5. Choose the scoring system.
The results a local alignment program produces depend strongly upon the scores it uses. No single scoring scheme is best for all purposes, and an understanding of the basic theory of local alignment scores can improve the sensitivity of one's sequence analyses.

The choice of substitution scores:
- DNA: match and mismatch scores (the search shown in section 6 uses 1 and -3).
- Proteins: substitution matrices (BLOSUM62 is the default choice in BLAST).

Gap opening and gap extension penalties are often chosen empirically. An ungapped search may be desirable when hits that align to the entire length of the query are most interesting. An ungapped search can be specified by checking the ungapped option or by increasing the gap existence penalty.

Step 6. Set the other program options or choose defaults.
- Filter: repetitive or low-complexity regions of the query sequence can be filtered using the SEG (protein) or DUST (nucleic) programs. One can also use RepeatMasker before the BLAST search.
- Word size.
- E-value.
- Alignments.

Step 7. Set the output formatting options: number of alignments, alignment view and graphical overview.

Step 8. Run the search!
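The FASTA format described in Step 4 is simple enough to be read with a few lines of code. The following Python sketch is only an illustration: the function name parse_fasta is hypothetical and is not part of BLAST, and the example record reuses the first two lines of the sequence shown above.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):          # a description line opens a new record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:          # sequence data belonging to the current record
            chunks.append(line)
    if header is not None:                # store the last record
        records.append((header, "".join(chunks)))
    return records

example = """>seq
cggagagagataggcttcgctagacaatttcttcaatctctggaagaaga
agttggtctgagccttcatcaacgttgttctttcgggaatgggcactgaa
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))   # prints: seq 100
```

The sequence lines are simply concatenated, so the 80-character recommendation affects readability only, not the sequence that is recovered.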
6 - Deciphering the Output

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

RID: 1000891070-1043-19629
Query= seq (3106 letters)
Database: nt
952,825 sequences; 4,029,116,513 total letters

[Figure: graphical overview]

Descriptions

The description lines are sorted by increasing E value, so the most significant alignments (lowest E values) are at the top. Each description consists of four columns (from the left):
1. Identifier for the database sequence
2. Brief description of the sequence
3. The score (bits) of the highest-scoring HSP found in each database sequence
4. The E value

Sequences producing significant alignments:                     Score (bits)  E value
gi|16557|emb|X57544.1|ATTOP1 A.thaliana TOP1 mRNA for topoi...          3092  0.0
gi|15237134|ref|NC_003076.1| Arabidopsis thaliana chromosom...          1780  0.0
gi|3241927|dbj|AB015479.1|AB015479 Arabidopsis thaliana gen...          1780  0.0
gi|15233324|ref|NC_003075.1| Arabidopsis thaliana chromosom...           163  1e-36
gi|7269481|emb|AL161565.2|ATCHRIV65 Arabidopsis thaliana DN...           163  1e-36
gi|4756963|emb|AL035440.2|ATF10M23 Arabidopsis thaliana DNA...           163  1e-36
gi|4165153|gb|AF115482.1|AF115482 Nicotiana tabacum topoiso...           100  1e-17
gi|1800220|gb|U60440.1|DCU60440 Daucus carota topoisomerase...            98  5e-17
gi|5326993|emb|AJ223326.1|DCAJ3326 Daucus carota mRNA for D...            96  2e-16
gi|2330648|emb|Y14558.1|PSY14558 Pisum sativum mRNA for top...            88  5e-14
gi|14626486|gb|AY038803.1| Nicotiana tabacum topoisomerase ...            74  7e-10
gi|14318439|ref|NC_001147.2| Saccharomyces cerevisiae chrom...            54  7e-04
gi|1419770|emb|Z74748.1|SCYOL006C S.cerevisiae chromosome X...            54  7e-04
gi|173003|gb|K03077.1|YSCTOPI Yeast (S.cerevisiae) topoisom...            54  7e-04
gi|1914416|emb|Z93385.1|CEM01E5 Caenorhabditis elegans cosm...            48  0.042
gi|1934846|emb|X96762.1|CETOPOI C.elegans mRNA for DNA topo...            48  0.042
gi|15217430|ref|NC_003070.1| Arabidopsis thaliana chromosom...            46  0.17
gi|5430744|gb|AC007519.2|F16N3 Sequence of BAC F16N3 from A...            46  0.17
gi|790481|emb|X83758.1|PFTOPOI P.falciparum topoisomerase I...            46  0.17
gi|5688863|dbj|AB030586.1|AB030586 Arabidopsis thaliana AAT..             46  0.17
gi|1015495|emb|Z54334.1|HSL139H8 Human DNA sequence from co...            42  2.6
gi|1762118|gb|U41342.1|CAU41342 Candida albicans topoisomer...            42  2.6
gi|214833|gb|L07777.1|XELTOPOIS Xenopus laevis DNA topoisom...            42  2.6

Alignments

Alignments can be represented in a variety of formats selected by the user either before or after the query is submitted. The default format is the "pairwise alignment", in which the aligned positions of the query and of the database match are arranged one above the other, with one line between them.

>gi|16557|emb|X57544.1|ATTOP1 A.thaliana TOP1 mRNA for topoisomerase I
Length = 3106

Score = 3092 bits (1560), Expect = 0.0
Identities = 1560/1560 (100%)
Strand = Plus / Plus

Query: 1533 tgaagcaagaggagaaatatatgtgggctgttgttgatggtgtcaaagagaagattggta 1592
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1533 tgaagcaagaggagaaatatatgtgggctgttgttgatggtgtcaaagagaagattggta 1592

Query: 3033 aaatgttatgttatttgtaacattactatgattaaagaaatagaaaatccgaagaagaac 3092
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3033 aaatgttatgttatttgtaacattactatgattaaagaaatagaaaatccgaagaagaac 3092

Score = 1873 bits (945), Expect = 0.0
Identities = 945/945 (100%)
Strand = Plus / Plus

Query: 1   cggagagagataggcttcgctagacaatttcttcaatctctggaagaagaagttggtctg 60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1   cggagagagataggcttcgctagacaatttcttcaatctctggaagaagaagttggtctg 60

Query: 901 accaaaaatgaaagctaaacagttatctaccagagaagatggaac 945
           |||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 901 accaaaaatgaaagctaaacagttatctaccagagaagatggaac 945

Score = 999 bits (504), Expect = 0.0
Identities = 504/504 (100%)
Strand = Plus / Plus

Query: 969  ttccgatatccaagagattcaaatcagattcctccaacagtaacacatcatctgcaaagc 1028
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 969  ttccgatatccaagagattcaaatcagattcctccaacagtaacacatcatctgcaaagc 1028

Query: 1449 ccatatatgagtggcatttggaag 1472
            ||||||||||||||||||||||||
Sbjct: 1449 ccatatatgagtggcatttggaag 1472

>gi|214833|gb|L07777.1|XELTOPOIS Xenopus laevis DNA topoisomerase I (TOP1) mRNA, complete cds
Length = 4020

Score = 42.1 bits (21), Expect = 2.6
Identities = 33/37 (89%)
Strand = Plus / Plus

Query: 2697 tcaactacctggatcctagaatcacagttgcatggtg 2733
            |||||||||||||||| |||||| | || ||||||||
Sbjct: 2441 tcaactacctggatcccagaatctctgtggcatggtg 2477

Review details of the search process

Database: nt
Posted date: Sep 17, 2001 11:09 PM
Number of letters in database: -265,850,779
Number of sequences in database: 952,825

Lambda     K       H
  1.37     0.711   1.31

Gapped
Lambda     K       H
  1.37     0.711   1.31

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 8,145,359
Number of Sequences: 952825
Number of extensions: 8145359
Number of successful extensions: 44532
Number of sequences better than 10.0: 57
length of query: 3106
length of database: 4,029,116,513
effective HSP length: 22
effective length of query: 3084
effective length of database: 4,008,154,363
effective search space: 12361148055492
effective search space used: 12361148055492
T: 0
A: 30
X1: 6 (11.9 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 21 (42.1 bits)

7 - Some references

1. Attwood, T.K. and Parry-Smith, D.J. (1999). Introduction to bioinformatics. Longman.
2. Higgins, D. and Taylor, W. (2000). Bioinformatics: Sequence, structure and databanks - a practical approach. Oxford University Press.
3. Setubal, J. and Meidanis, J. (1997). Introduction to computational molecular biology. PWS Publishing Company.
4. Waterman, M.S. (1995). Introduction to computational biology: maps, sequences and genomes. Chapman & Hall.

Web addresses

5. Infobiogen Deambulum: http://www.infobiogen.fr/services/deambulum/fr/index.html

BLAST:
6. NCBI: http://www.ncbi.nlm.nih.gov/BLAST
7. Infobiogen: http://www.infobiogen.fr/services/analyseq/cgi-bin/blast2_in.pl
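The figures reported under "Review details of the search process" in section 6 can be cross-checked numerically. The Python sketch below uses the usual Karlin-Altschul relations, bit score S' = (lambda*S - ln K) / ln 2 and E = (effective search space) * 2^(-S'); it is not BLAST's own code, and the function names are hypothetical. Because the report prints Lambda and K in rounded form, the recomputed bit score agrees only approximately with the reported 42.1 bits of the weakest hit (raw score 21).

```python
import math

def bit_score(raw_score, lam, K):
    """Normalised score in bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expect(bits, search_space):
    """E-value from a bit score and the effective search space."""
    return search_space * 2.0 ** (-bits)

# Values copied from the report in section 6.
LAMBDA, K = 1.37, 0.711
EFFECTIVE_QUERY_LENGTH = 3084
EFFECTIVE_DATABASE_LENGTH = 4008154363
SEARCH_SPACE = EFFECTIVE_QUERY_LENGTH * EFFECTIVE_DATABASE_LENGTH

print(SEARCH_SPACE)                          # 12361148055492, as in the report
print(round(bit_score(21, LAMBDA, K), 1))    # close to the reported 42.1 bits
print(round(expect(42.1, SEARCH_SPACE), 1))  # close to the reported E value of 2.6
```

The same E-value formula explains why the hits at 42 bits are barely significant: each additional bit of score halves the expected number of chance alignments.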