Searching for similarity in a Database: BLAST
Dominique CELLIER
University of Rouen - France
Atelier Biologie Informatique Statistique Sociolinguistique
Laboratoire de Mathématique Raphaël Salem UMR CNRS 6085
Dominique.Cellier@univ-rouen.fr
1 - Comparing a sequence with a data bank
The problem of searching for nucleic acid or protein similarities has led many investigators to
abandon exact dynamic programming algorithms, for which the size of the problem has become too large.
What problems arise when searching for similarity in a databank, where the query sequence must be
compared with every sequence contained in the bank?
• The size of the databanks.
• The complexity of the exact dynamic programming algorithms.
• As a consequence, the required computing time and storage space become too large.
• The statistical significance:
  - sequences of different compositions,
  - sequences of different lengths,
  - so the p-values and E-values of different comparisons are not directly comparable.
1.1 - What solutions exist?
It is necessary to design very fast computational procedures that produce answers that are
"nearly" correct with respect to a formally stated optimisation criterion: heuristic algorithms.
• Effective algorithms that allow comparisons and alignments in a reasonable time
and at a reasonable cost: these are the heuristic algorithms BLAST and FASTA.
• An automatic procedure based on the statistical significance of the alignments:
  - keeping only the most interesting alignments,
  - and sorting the results according to their statistical significance.
1.2 - The statistical significance of the results
The results of the pairwise comparisons are put on a common scale so that their statistical
significances can be compared with one another. For an alignment score t:
• The E-value:
$$E(t) = K\,m\,N\,\exp(-\lambda t)$$
where m is the length of the query sequence, N is the size of the databank, and the constants K and
$\lambda$ are adapted to the databank.
• The p-value:
$$p\text{-value}(t) = 1 - \exp\bigl(-K\,m\,N\,\exp(-\lambda t)\bigr)$$
• The Z-score:
$$Z(t) = \frac{t - m}{e}$$
where, in this last formula only, m and e denote the mean and the standard deviation of the random score.
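As a minimal sketch, the three quantities can be computed as follows in Python. The function names are illustrative, and m is taken here to be the query length. The numerical values are borrowed from the sample blastn search of section 6 (K = 0.711, lambda = 1.37, raw score 21, effective lengths 3084 and 4,008,154,363). Because the printed constants are rounded, the result only approximates the E value of 2.6 reported there. The mean and standard deviation passed to the Z-score are hypothetical.

```python
import math


def e_value(t, K, lam, m, N):
    """Expected number of chance alignments with score >= t."""
    return K * m * N * math.exp(-lam * t)


def p_value(t, K, lam, m, N):
    """Probability of at least one chance alignment with score >= t."""
    # 1 - exp(-E), written with expm1 to stay accurate when E is tiny.
    return -math.expm1(-e_value(t, K, lam, m, N))


def z_score(t, mean, std):
    """Distance of the score t from the mean random score, in standard deviations."""
    return (t - mean) / std


if __name__ == "__main__":
    # Constants and lengths taken from the search summary in section 6.
    E = e_value(21, K=0.711, lam=1.37, m=3084, N=4_008_154_363)
    P = p_value(21, K=0.711, lam=1.37, m=3084, N=4_008_154_363)
    print(f"E-value ~ {E:.2f}, p-value ~ {P:.2f}")
    # Hypothetical random-score mean and standard deviation.
    print("Z-score:", z_score(30, mean=20, std=5))
```

For small E the p-value and the E-value are nearly equal, which is why BLAST reports only the latter.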
2 - FASTA
The FASTA software, developed by Pearson and Lipman (1988), is based on an algorithm that rapidly
identifies regions of identity between the query sequence and the sequences of the databank. This
step retains only the sequences that present a region of strong similarity with the query sequence.
An optimal alignment algorithm can then be applied (locally) to these sequences.
FASTA in fact includes two programs:
• The FASTA program, with a nucleic acid version and a protein version.
• The TFASTA program, which compares a protein query sequence with the sequences of a nucleic acid
bank translated in all six reading frames.
3 - BLAST - BLAST2
BLAST is probably the most widely used database search program.
BLAST (Basic Local Alignment Search Tool) is a program developed at the NCBI, based on
statistical methods introduced by Altschul et al. (1990) and Karlin and Altschul (1993) for local
comparisons without insertions or deletions. The fundamental unit of BLAST is the HSP (High-scoring
Segment Pair). An HSP is a region of similarity between two sequences, extended as far as possible,
whose score is at least equal to a threshold score.
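The notion of an HSP can be illustrated with a small seed-and-extend sketch in Python. This is an illustration under stated assumptions, not the NCBI implementation: the function names, the exact-word seeding and the stopping rule are choices made here. The default parameters (word size 11, scores +1/-3, drop-off 6) are borrowed from the blastn search summary shown in section 6.

```python
def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, i, j, word_size=11,
                    match=1, mismatch=-3, x_drop=6):
    """Extend a seed in both directions without gaps.

    Extension in one direction stops once the running score has fallen
    more than x_drop below the best score seen so far.
    Returns (best_score, (q_start, q_end), (s_start, s_end)), ends exclusive.
    """
    best = word_size * match
    # Extend to the right of the seed.
    run, best_right, k = best, 0, 0
    while i + word_size + k < len(query) and j + word_size + k < len(subject):
        same = query[i + word_size + k] == subject[j + word_size + k]
        run += match if same else mismatch
        k += 1
        if run > best:
            best, best_right = run, k
        elif best - run > x_drop:
            break
    # Extend to the left, starting from the best score reached so far.
    run, best_left, k = best, 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        same = query[i - k - 1] == subject[j - k - 1]
        run += match if same else mismatch
        k += 1
        if run > best:
            best, best_left = run, k
        elif best - run > x_drop:
            break
    q_span = (i - best_left, i + word_size + best_right)
    s_span = (j - best_left, j + word_size + best_right)
    return best, q_span, s_span


if __name__ == "__main__":
    # The two 37-base segments of the Xenopus laevis hit shown in section 6.
    q = "tcaactacctggatcctagaatcacagttgcatggtg"
    s = "tcaactacctggatcccagaatctctgtggcatggtg"
    i, j = find_seeds(q, s)[0]
    print(extend_ungapped(q, s, i, j))
```

A segment pair found this way counts as an HSP only if its score reaches the chosen threshold.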
The software in fact comprises five different programs:
• BLASTn: nucleotide sequence against a nucleic acid databank.
• BLASTp: amino acid sequence against a protein databank.
• BLASTx: nucleotide sequence translated in all reading frames against a protein databank.
• TBLASTn: protein sequence against a nucleotide sequence databank.
• TBLASTx: compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database.
The statistical significance of the alignments is estimated according to the length and the
composition of the sequence, the size and the composition of the databank and the scoring system
used.
The BLAST2 version brought a considerable change by the integration of gaps in the
alignments.
4 - BLAST algorithm
5 - Setting up a BLAST search
• Step 1. Plan the search.
Decide the goal of the comparison with respect to a biological question:
  - Is the query sequence represented in the database?
  - Are there homologues or evolutionary relatives of the query sequence in the database?
  - Are there proteins whose function is related to that of the query sequence?
• Step 2. Choose the program to use.
Program   Description
blastn    Compares a nucleotide query sequence against a nucleotide sequence database.
blastp    Compares an amino acid query sequence against a protein sequence database.
blastx    Compares a nucleotide query sequence translated in all reading frames against a
          protein sequence database. You could use this option to find potential translation
          products of an unknown nucleotide sequence.
tblastn   Compares a protein query sequence against a nucleotide sequence database
          dynamically translated in all reading frames.
tblastx   Compares the six-frame translations of a nucleotide query sequence against the
          six-frame translations of a nucleotide sequence database.
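The "translation in all reading frames" used by blastx, tblastn and tblastx can be sketched in Python as follows. This is a minimal illustration and not BLAST's own code: it assumes the standard genetic code, lower-case unambiguous DNA, and frame labels (+1 to +3, -1 to -3) chosen here for convenience. Codons containing any other character are rendered as X.

```python
BASES = "tcag"
# Standard genetic code, codons enumerated in t, c, a, g order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.lower().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    seq = seq.lower()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return the three forward and three reverse translations of seq."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("atggcctaa").items():
        print(name, protein)
```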
• Step 3. Choose the databank to search.
The database is selected according to the biological problem and the program to be used.
  - DNA database for blastn, tblastn and tblastx.
  - Protein database for blastp and blastx.
  - Inclusive (e.g. nr, nonredundant), organism-specific (e.g. yeast) or specialized (e.g. dbEST) databases.
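Steps 2 and 3 together amount to a small lookup from the type of query and the type of database to the admissible programs. The Python sketch below encodes the table of Step 2. The dictionary and function names are illustrative and are not part of BLAST.

```python
# (query type, database type) for each program, as listed in Step 2.
PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def candidate_programs(query_type, database_type):
    """Return the programs that accept this query/database combination."""
    return [name for name, types in PROGRAMS.items()
            if types == (query_type, database_type)]


if __name__ == "__main__":
    print(candidate_programs("nucleotide", "protein"))
    print(candidate_programs("nucleotide", "nucleotide"))
```

For a nucleotide query against a nucleotide database, the choice between blastn and tblastx depends on whether the comparison should be made at the DNA level or at the level of the translated products.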

• Step 4. Enter the query sequence.
BLAST accepts input sequences in three formats:
  - FASTA sequence format,
  - NCBI Accession numbers,
  - GI numbers.
• FASTA format description
A sequence in FASTA format begins with a single-line description, followed by lines of sequence
data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the
first column. It is recommended that all lines of text be shorter than 80 characters in length. An
example sequence in FASTA format is:
> seq
cggagagagataggcttcgctagacaatttcttcaatctctggaagaaga
agttggtctgagccttcatcaacgttgttctttcgggaatgggcactgaa
acagtttcaaaacctgtgatggataatgggtctggagacagtgatgatga
caagcctttagcgttcaagaggaataatacagtggcttctaattcgaatc
aatctaaatccaattcccagagaagcaaggcagttcctaccaccaaggta
tcacctatgagatcacctgtgactagcccaaatggaaccactccttcgaa
taaaacttctatagtgaaatcctctatgccatcatcttcttctaaggctt
caccagcaaagtcaccattgcggaatgatatgccctctactgttaaggat
aggagccagttacagaaagatcagtctgaatgtaaaattgagcatgagga
ttctgaggatgatagacctttaagttccatactatctggaaataaagggc
caacctcttcgcggcaggtttcttcaccgcagccagagaaaaagaataat
ggtgatcgacctctt
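The format just described is simple enough to be read with a few lines of Python. The parser below is a minimal sketch with an illustrative function name. It assumes that every record starts with a ">" description line and it ignores blank lines. The example reuses the description line and the first two sequence lines of the record above.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = (
        "> seq\n"
        "cggagagagataggcttcgctagacaatttcttcaatctctggaagaaga\n"
        "agttggtctgagccttcatcaacgttgttctttcgggaatgggcactgaa\n"
    )
    for description, sequence in parse_fasta(example):
        print(description, len(sequence))
```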
• NCBI Accession numbers or GIs
• Step 5. Choose the scoring system:
The results that a local alignment program produces depend strongly on the scores it uses. No
single scoring scheme is best for all purposes, and an understanding of the basic theory of local
alignment scores can improve the sensitivity of one's sequence analyses.
• The choice of substitution scores:
  - DNA: a match score and a mismatch score (1 and -3 in the blastn search shown in section 6).
  - Proteins: a substitution matrix (BLOSUM62 is the default choice in BLAST).
• Gap opening and gap extension penalties are often chosen empirically.
An ungapped search may be desirable when hits that align to the entire length of the query are
most interesting. An ungapped search can be specified by checking the ungapped option or by
increasing the gap existence penalty.
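The effect of the gap penalties can be made concrete with a short Python sketch. It assumes the convention in which a gap of length k costs existence + k × extension, and it uses as defaults the blastn values printed in the search summary of section 6 (match 1, mismatch -3, existence 5, extension 2). The function names are illustrative.

```python
def gap_cost(length, existence=5, extension=2):
    """Cost of one gap of the given length under affine penalties."""
    if length <= 0:
        return 0
    return existence + extension * length


def score_gapped_alignment(top, bottom, match=1, mismatch=-3,
                           existence=5, extension=2):
    """Score two aligned strings of equal length in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    open_gap = None  # row ("top" or "bottom") holding the current run of '-'
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != open_gap:
                # A new gap starts: pay the existence penalty once.
                score -= existence
                open_gap = row
            score -= extension
        else:
            open_gap = None
            score += match if a == b else mismatch
    return score


if __name__ == "__main__":
    print(gap_cost(2))
    print(score_gapped_alignment("acgtacgt", "ac--acgt"))
```

With these values a two-base gap costs 9, more than the six flanking matches earn, which shows how raising the existence penalty pushes the search towards ungapped alignments.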
• Step 6. Set the other program options or choose defaults.
  - Filter: repetitive or low-complexity regions of the query sequence can be filtered
    using the SEG (protein) or DUST (nucleic acid) programs. One can also run RepeatMasker
    before the BLAST search.
  - Word size.
  - E-value.
  - Alignments.
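To give an idea of what low-complexity filtering does, the Python sketch below masks windows whose Shannon entropy is low. It is deliberately not SEG or DUST, whose actual algorithms differ. The window length, the entropy threshold, the mask character and the function names are all assumptions made for this illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy, in bits, of the letter composition of a string."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="n"):
    """Replace every window whose entropy is below min_entropy by mask_char."""
    seq = seq.lower()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("acgtacgtacgtacgt"))       # balanced composition
    print(mask_low_complexity("aaaaaaaaaaaaaaaaaaaa"))   # homopolymer run
```

Masked positions can no longer seed a hit, which removes many matches that are statistically significant but biologically uninteresting.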
• Step 7. Set the output formatting options:
number of alignments, alignment view and graphical overview.
• Step 8. Run the search!
6 - Deciphering the Output
Reference:
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
RID: 1000891070-1043-19629
Query= seq
(3106 letters)
Database: nt
952,825 sequences; 4,029,116,513 total letters
[Figure: graphical overview of the hits]
Descriptions
The description lines are sorted by increasing E value, so the most significant alignments
(lowest E values) are at the top.
The description consists of four columns (from the left):
1. Identifier for the database sequence
2. Brief description of the sequence
3. The score (bits) of the highest-scoring HSP found in each database sequence
4. The E value
                                                                Score    E
Sequences producing significant alignments:                     (bits)  value
gi|16557|emb|X57544.1|ATTOP1 A.thaliana TOP1 mRNA for topoi...   3092  0.0
gi|15237134|ref|NC_003076.1| Arabidopsis thaliana chromosom...   1780  0.0
gi|3241927|dbj|AB015479.1|AB015479 Arabidopsis thaliana gen...   1780  0.0
gi|15233324|ref|NC_003075.1| Arabidopsis thaliana chromosom...    163  1e-36
gi|7269481|emb|AL161565.2|ATCHRIV65 Arabidopsis thaliana DN...    163  1e-36
gi|4756963|emb|AL035440.2|ATF10M23 Arabidopsis thaliana DNA...    163  1e-36
gi|4165153|gb|AF115482.1|AF115482 Nicotiana tabacum topoiso...    100  1e-17
gi|1800220|gb|U60440.1|DCU60440 Daucus carota topoisomerase...     98  5e-17
gi|5326993|emb|AJ223326.1|DCAJ3326 Daucus carota mRNA for D...     96  2e-16
gi|2330648|emb|Y14558.1|PSY14558 Pisum sativum mRNA for top...     88  5e-14
gi|14626486|gb|AY038803.1| Nicotiana tabacum topoisomerase ...     74  7e-10
gi|14318439|ref|NC_001147.2| Saccharomyces cerevisiae chrom...     54  7e-04
gi|1419770|emb|Z74748.1|SCYOL006C S.cerevisiae chromosome X...     54  7e-04
gi|173003|gb|K03077.1|YSCTOPI Yeast (S.cerevisiae) topoisom...     54  7e-04
gi|1914416|emb|Z93385.1|CEM01E5 Caenorhabditis elegans cosm...     48  0.042
gi|1934846|emb|X96762.1|CETOPOI C.elegans mRNA for DNA topo...     48  0.042
gi|15217430|ref|NC_003070.1| Arabidopsis thaliana chromosom...     46  0.17
gi|5430744|gb|AC007519.2|F16N3 Sequence of BAC F16N3 from A...     46  0.17
gi|790481|emb|X83758.1|PFTOPOI P.falciparum topoisomerase I...     46  0.17
gi|5688863|dbj|AB030586.1|AB030586 Arabidopsis thaliana AAT..      46  0.17

gi|1015495|emb|Z54334.1|HSL139H8 Human DNA sequence from co...     42  2.6
gi|1762118|gb|U41342.1|CAU41342 Candida albicans topoisomer...     42  2.6
gi|214833|gb|L07777.1|XELTOPOIS Xenopus laevis DNA topoisom...     42  2.6
Alignments
Alignments can be displayed in a variety of formats, selected by the user either before or
after the query is submitted.
The default format is the "pairwise alignment", in which the aligned positions of the query and
of the database match are written one above the other, with one line between them that marks the
identical positions.
>gi|16557|emb|X57544.1|ATTOP1 A.thaliana TOP1 mRNA for topoisomerase I
Length = 3106
Score = 3092 bits (1560), Expect = 0.0
Identities = 1560/1560 (100%)
Strand = Plus / Plus
Query: 1533 tgaagcaagaggagaaatatatgtgggctgttgttgatggtgtcaaagagaagattggta 1592
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1533 tgaagcaagaggagaaatatatgtgggctgttgttgatggtgtcaaagagaagattggta 1592

Query: 3033 aaatgttatgttatttgtaacattactatgattaaagaaatagaaaatccgaagaagaac 3092
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3033 aaatgttatgttatttgtaacattactatgattaaagaaatagaaaatccgaagaagaac 3092
Score = 1873 bits (945), Expect = 0.0
Identities = 945/945 (100%)
Strand = Plus / Plus
Query: 1   cggagagagataggcttcgctagacaatttcttcaatctctggaagaagaagttggtctg 60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1   cggagagagataggcttcgctagacaatttcttcaatctctggaagaagaagttggtctg 60

Query: 901 accaaaaatgaaagctaaacagttatctaccagagaagatggaac 945
           |||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 901 accaaaaatgaaagctaaacagttatctaccagagaagatggaac 945
Score = 999 bits (504), Expect = 0.0
Identities = 504/504 (100%)
Strand = Plus / Plus
Query: 969  ttccgatatccaagagattcaaatcagattcctccaacagtaacacatcatctgcaaagc 1028
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 969  ttccgatatccaagagattcaaatcagattcctccaacagtaacacatcatctgcaaagc 1028

Query: 1449 ccatatatgagtggcatttggaag 1472
            ||||||||||||||||||||||||
Sbjct: 1449 ccatatatgagtggcatttggaag 1472
>gi|214833|gb|L07777.1|XELTOPOIS Xenopus laevis DNA topoisomerase I (TOP1) mRNA, complete cds
Length = 4020
Score = 42.1 bits (21), Expect = 2.6
Identities = 33/37 (89%)
Strand = Plus / Plus
Query: 2697 tcaactacctggatcctagaatcacagttgcatggtg 2733
            |||||||||||||||| |||||| | || ||||||||
Sbjct: 2441 tcaactacctggatcccagaatctctgtggcatggtg 2477
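The figures printed for this last alignment can be recomputed from the two aligned segments. The Python sketch below rebuilds the line of "|" marks, counts the identities and derives the raw score, assuming the match and mismatch scores 1 and -3 reported for this blastn search. The function name is illustrative.

```python
def describe_pairwise(query, sbjct, match=1, mismatch=-3):
    """Return (midline, identities, raw_score) for two ungapped aligned segments."""
    if len(query) != len(sbjct):
        raise ValueError("segments must have the same length")
    midline = "".join("|" if q == s else " " for q, s in zip(query, sbjct))
    identities = midline.count("|")
    mismatches = len(midline) - identities
    raw_score = identities * match + mismatches * mismatch
    return midline, identities, raw_score


if __name__ == "__main__":
    query = "tcaactacctggatcctagaatcacagttgcatggtg"
    sbjct = "tcaactacctggatcccagaatctctgtggcatggtg"
    midline, identities, raw = describe_pairwise(query, sbjct)
    print(query)
    print(midline)
    print(sbjct)
    percent = round(100 * identities / len(query))
    print(f"Identities = {identities}/{len(query)} ({percent}%), raw score = {raw}")
```

The 33 identities and 4 mismatches give 33 × 1 - 4 × 3 = 21, the raw score shown in parentheses after the bit score.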
Review details of the search process
Database: nt
Posted date: Sep 17, 2001 11:09 PM
Number of letters in database: -265,850,779
Number of sequences in database: 952,825
Lambda     K        H
  1.37     0.711    1.31
Gapped
Lambda     K        H
  1.37     0.711    1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 8,145,359
Number of Sequences: 952825
Number of extensions: 8145359
Number of successful extensions: 44532
Number of sequences better than 10.0: 57
length of query: 3106
length of database: 4,029,116,513
effective HSP length: 22
effective length of query: 3084
effective length of database: 4,008,154,363
effective search space: 12361148055492
effective search space used: 12361148055492
T: 0
A: 30
X1: 6 (11.9 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 21 (42.1 bits)
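Several figures of this summary can be cross-checked with a short Python sketch. The conversions are the usual Karlin-Altschul ones: bit score S' = (lambda × S - ln K) / ln 2 and E = search space × 2^(-S'). The rule used for the effective lengths (subtracting the effective HSP length from the query, and once per sequence from the database) is inferred here because it reproduces the printed numbers. The function names are illustrative.

```python
import math


def effective_search_space(query_len, db_len, n_sequences, hsp_len):
    """Return (effective query length, effective database length, their product)."""
    eff_query = query_len - hsp_len
    eff_db = db_len - n_sequences * hsp_len
    return eff_query, eff_db, eff_query * eff_db


def bit_score(raw_score, lam, K):
    """Convert a raw score into a normalized score expressed in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, search_space):
    """Expected number of chance hits reaching this bit score."""
    return search_space * 2.0 ** (-bits)


if __name__ == "__main__":
    eff_q, eff_db, space = effective_search_space(3106, 4_029_116_513, 952_825, 22)
    print(eff_q, eff_db, space)
    # S2 = 21 with the rounded constants printed above.
    print(round(bit_score(21, lam=1.37, K=0.711), 1))
    print(round(e_value_from_bits(42.1, space), 1))
```

The bit score obtained from the rounded constants differs slightly from the printed 42.1 bits, since the program works with unrounded values of Lambda and K.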
7 - Some references
1. Attwood, T.K. and Parry-Smith, D.J. (1999). Introduction to bioinformatics. Longman.
2. Higgins, D. and Taylor, W. (2000). Bioinformatics: Sequence, structure and databanks - a
practical approach. Oxford University Press.
3. Setubal, J. and Meidanis, J. (1997). Introduction to computational molecular biology. PWS
Publishing Company.
4. Waterman, M.S. (1995). Introduction to computational biology: maps, sequences and
genomes. Chapman & Hall.
Web addresses
5. Infobiogen Deambulum: http://www.infobiogen.fr/services/deambulum/fr/index.html
BLAST:
6. NCBI: http://www.ncbi.nlm.nih.gov/BLAST
7. Infobiogen: http://www.infobiogen.fr/services/analyseq/cgi-bin/blast2_in.pl