InterPro as a new tool for whole genome analysis

INTERPRO AS A NEW TOOL FOR WHOLE GENOME
ANALYSIS. A COMPARATIVE ANALYSIS OF MYCOBACTERIUM
TUBERCULOSIS, BACILLUS SUBTILIS AND ESCHERICHIA COLI
AS A CASE STUDY.
Apweiler R.*, Fleischmann W., Mulder N.J.
An expanded abstract for the Proceedings of the Second International Conference on
Bioinformatics of Genome Regulation and Structure.
EMBL Outstation – European Bioinformatics Institute,
Wellcome Trust Genome Campus,
Hinxton,
Cambridge,
CB10 1SD,
United Kingdom
*To whom correspondence should be addressed.
Tel: +44 (0)1223 494 435
Fax: +44 (0)1223 494 468
Email: rolf.apweiler@ebi.ac.uk
Key words: database, protein, domain, function, family, repeat, computer tool
Motivation: Several pattern-recognition methods have evolved to address different
protein sequence analysis problems, resulting in rather different and mostly independent
databases. InterPro was developed as a new integrated documentation resource for
protein families, domains and functional sites, to rationalise the complementary efforts
of the PROSITE, PRINTS, Pfam and ProDom database projects. InterPro has
applications in computational functional classification of newly determined sequences
lacking biochemical characterisation, and in comparative genome analysis.
Results: The first release of InterPro was built from Pfam 5.0, PRINTS 25.0 and
PROSITE 16.0, and contains nearly 3,000 entries, representing families, domains, repeats
and post-translational modifications (PTMs) encoded by 4,879 different regular
expressions, profiles, fingerprints and HMMs.
Overall, InterPro entries match more than 300,000 sequences in SWISS-PROT and
TrEMBL. This new resource provides an integrated view of the pattern databases, and
provides an intuitive interface for text- and sequence-based searches.
InterPro was used for whole proteome analysis of the pathogenic microorganism, M.
tuberculosis, and comparison with the predicted protein coding sequences of the
complete genomes of B. subtilis and E. coli. 55.6% of the M. tuberculosis proteins in
the proteome matched InterPro entries, and these could be classified according to
function. A large percentage of these are hypothetical proteins. The comparison with B.
subtilis and E. coli provided information on the most common protein families and
domains, the most highly represented families, and the representation of different
regulatory protein families in each organism.
Availability: The database is accessible for text- and sequence-based searches at
http://www.ebi.ac.uk/interpro/. The InterPro flatfile may be retrieved from the EBI
anonymous-ftp server ftp://ftp.ebi.ac.uk/pub/databases/interpro.
Introduction
Pattern databases have become vital tools for identifying distant relationships in novel
sequences and hence for inferring protein function. Currently, the most commonly-used
pattern databases include PROSITE, home of regular expressions and profiles
(Hofmann et al., 1999); Pfam, keeper of hidden Markov models (HMMs) (Bateman et
al., 2000); and PRINTS, provider of fingerprints (groups of aligned, un-weighted
motifs) (Attwood et al., 2000). These methods have evolved to address sequence
analysis problems, and provide tools for identifying sequence relationships and inferring
protein function. However, these individual resources have different areas of optimum
application owing to the different strengths and weaknesses of their underlying analysis
methods. The creation of a single coherent resource for diagnosis and documentation of
protein families is difficult, given entirely different database formats, different search
tools and different search outputs. Nevertheless, in an attempt to address some of these
issues, we have developed InterPro, and found applications for the database in
developing automatic methods for annotation of sequence data, and for whole proteome
analysis.
Source databases and methods
The first release of InterPro (Release 1.0, March 2000) was built from Pfam 5.0 (2,008
domains), PRINTS 25.0 (1,260 fingerprints) and PROSITE 16.0 (1,370 families). While
the initial InterPro release was created around PRINTS, PROSITE and Pfam, ProDom
will shortly also be included. Flat-files submitted by each of the groups were
systematically merged and dismantled. Where relevant, family annotations were
amalgamated, and all method-specific annotation separated out. This process was
complicated by the relationships that can exist, both between entries in the same
database, and between entries in different databases. Different types of parent-child
relationship were evident, leading us to recognise ‘sub-types’ and ‘sub-strings’. All
recognisably distinct entities were assigned unique accession numbers (which take the
form IPR000000). An InterPro entry contains the list of member database signatures,
HMMs, profiles or fingerprints associated with the entry and an abstract describing the
domain, repeat, family or PTM, which is derived from merged annotation of the
member databases. An entry also contains links to a tabular or graphical view of the
matches to the SWISS-PROT and TrEMBL protein sequence databases.
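
The structure of an entry described above can be pictured as a small record. The following Python sketch is illustrative only: the class and field names are hypothetical and do not reproduce the InterPro flat-file format, the signature identifiers are placeholders, and the accession pattern assumes the six-digit form seen in the Table 1 accessions (e.g. IPR000084).

```python
import re
from dataclasses import dataclass, field

# Assumption: accessions are "IPR" followed by six digits, as in Table 1.
ACCESSION_RE = re.compile(r"^IPR\d{6}$")

# Entry kinds named in the text: family, domain, repeat, PTM.
ENTRY_TYPES = {"family", "domain", "repeat", "PTM"}


@dataclass
class Signature:
    """One member-database method (hypothetical fields)."""
    database: str    # e.g. "PROSITE", "PRINTS", "Pfam"
    identifier: str  # the member database's own identifier


@dataclass
class InterProEntry:
    """Illustrative record for one InterPro entry."""
    accession: str
    name: str
    entry_type: str
    abstract: str = ""
    signatures: list = field(default_factory=list)

    def __post_init__(self):
        # Reject malformed accessions and unknown entry kinds early.
        if not ACCESSION_RE.match(self.accession):
            raise ValueError(f"unexpected accession: {self.accession!r}")
        if self.entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {self.entry_type!r}")

    def member_databases(self):
        """Sorted names of the member databases contributing signatures."""
        return sorted({s.database for s in self.signatures})


# Usage with placeholder signature identifiers (not real accessions).
entry = InterProEntry(
    accession="IPR001647",
    name="Bacterial regulatory proteins, TetR family",
    entry_type="family",
    signatures=[Signature("Pfam", "placeholder-1"),
                Signature("PRINTS", "placeholder-2")],
)
print(entry.member_databases())
```

Keeping the method-specific signatures in a separate list from the merged abstract mirrors the separation of family annotation from method-specific annotation described above.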
A comparative analysis of the predicted protein coding sequences of three complete
prokaryotic genomes, M. tuberculosis, B. subtilis and E. coli, was performed by running
a non-redundant set of proteins from each organism against the InterPro database. The
results of the InterPro runs were then inspected manually to calculate general statistics
for protein families.
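
The bookkeeping behind such a comparison can be sketched as follows. This is a minimal illustration, not the procedure used for this study: it assumes the match results have already been reduced to (organism, protein, InterPro accession) tuples, and all function names and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def proteins_per_entry(matches):
    """Count distinct proteins per InterPro entry for each organism.

    `matches` is an iterable of (organism, protein_id, interpro_acc)
    tuples; this input layout is an assumption made for the sketch.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for organism, protein_id, acc in matches:
        # A set ensures a protein with several hits to the same entry
        # is counted once.
        seen[organism][acc].add(protein_id)
    return {
        org: {acc: len(ids) for acc, ids in by_acc.items()}
        for org, by_acc in seen.items()
    }


def coverage_percent(matches, proteome_sizes):
    """Percentage of each proteome with at least one InterPro match."""
    matched = defaultdict(set)
    for organism, protein_id, _acc in matches:
        matched[organism].add(protein_id)
    return {
        org: 100.0 * len(matched[org]) / size
        for org, size in proteome_sizes.items()
    }


def top_entries(counts, n=10):
    """The n entries matching the most proteins (ties broken by accession)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Toy input: invented identifiers, for illustration only.
toy_matches = [
    ("org_A", "p1", "IPR000001"),
    ("org_A", "p1", "IPR000002"),
    ("org_A", "p2", "IPR000001"),
    ("org_B", "q1", "IPR000001"),
]
counts = proteins_per_entry(toy_matches)
print(top_entries(counts["org_A"], n=2))
print(coverage_percent(toy_matches, {"org_A": 4, "org_B": 2}))
```

Counting distinct proteins rather than raw hits matters because a single protein can match the same entry through several member-database signatures.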
Implementation and results
We illustrate the use of InterPro in whole proteome analysis of M. tuberculosis as
shown in the graph in Figure 1.
Figure 1. A pie graph representing the coverage of InterPro protein functions in the M.
tuberculosis proteome.
An application of InterPro in the comparative genome analysis of M. tuberculosis, B.
subtilis and E. coli is shown in Table 1 and Figure 2.
Table 1. The 10 biggest InterPro families for M. tuberculosis, a comparative view with
B. subtilis and E. coli.
InterPro    InterPro Entry Name                            M. tub     B. subtilis  E. coli
Acc. No.                                                   proteins   proteins     proteins
IPR000084   PE family                                      86         0            0
IPR000030   PPE family                                     66         0            0
IPR000379   Esterase/lipase/thioesterase                   65         36           23
IPR000051   SAM (and other nucleotide) binding motif       53         27           30
IPR002198   Short-chain dehydrogenase/reductase family     52         33           18
IPR001617   ABC transporters family                        42         81           78
IPR001647   Bacterial regulatory proteins, TetR family     42         19           11
IPR000873   AMP-binding domain                             41         24           9
IPR001051   ATP-binding transport protein, P-loop motif    40         85           81
IPR000205   NAD binding site                               34         22           30
Figure 2. Graph of the relative representation of specific protein families in M.
tuberculosis, B. subtilis and E. coli based on an InterPro analysis.
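
As a worked illustration of reading Table 1, the sketch below takes the counts as listed in the table, in row order, and flags the entries that are specific to, or most numerous in, M. tuberculosis. It compares raw protein counts only: proteome sizes are not given here, so no normalisation is attempted, and the function names are hypothetical.

```python
# Counts transcribed from Table 1 in row order:
# accession -> (entry name, M. tuberculosis, B. subtilis, E. coli).
TABLE_1 = {
    "IPR000084": ("PE family", 86, 0, 0),
    "IPR000030": ("PPE family", 66, 0, 0),
    "IPR000379": ("Esterase/lipase/thioesterase", 65, 36, 23),
    "IPR000051": ("SAM (and other nucleotide) binding motif", 53, 27, 30),
    "IPR002198": ("Short-chain dehydrogenase/reductase family", 52, 33, 18),
    "IPR001617": ("ABC transporters family", 42, 81, 78),
    "IPR001647": ("Bacterial regulatory proteins, TetR family", 42, 19, 11),
    "IPR000873": ("AMP-binding domain", 41, 24, 9),
    "IPR001051": ("ATP-binding transport protein, P-loop motif", 40, 85, 81),
    "IPR000205": ("NAD binding site", 34, 22, 30),
}


def mtb_only(table):
    """Entries matched in M. tuberculosis but in neither other organism."""
    return [acc for acc, (_name, mtb, bsu, eco) in table.items()
            if mtb > 0 and bsu == 0 and eco == 0]


def mtb_most_numerous(table):
    """Entries with more M. tuberculosis proteins than either other organism.

    Raw counts only; not corrected for differences in proteome size.
    """
    return [acc for acc, (_name, mtb, bsu, eco) in table.items()
            if mtb > max(bsu, eco)]


print(mtb_only(TABLE_1))                # ['IPR000084', 'IPR000030']
print(len(mtb_most_numerous(TABLE_1)))  # 8
```

On these counts, only the PE and PPE families are absent from both B. subtilis and E. coli, while the two transport-related entries (IPR001617 and IPR001051) are the only ones in the table with fewer proteins in M. tuberculosis than in the other two organisms.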
Discussion
We have developed the InterPro database, an integrated resource of protein domains and
functional sites. By uniting the databases, we have capitalised on their individual
strengths, producing a single entity that is far greater than the sum of its parts. InterPro
can streamline the analysis of newly determined sequences for the individual user, and
makes a significant contribution in the demanding task of automatic annotation of
predicted proteins from genome sequencing projects. InterPro is also likely to highlight
key areas where none of the databases has yet made a contribution and hence where the
development of some sort of pattern might be useful. It has been used here for the
comparative genome analysis of the complete proteomes of M. tuberculosis, B. subtilis
and E. coli, and has also proven its usefulness for whole proteome analysis of
Drosophila melanogaster (Rubin et al., 2000).
Acknowledgements
The InterPro Consortium: R.Apweiler 1, T.K.Attwood 4, A.Bairoch 2, A.Bateman 5,
E.Birney 1, M.Biswas 1, P.Bucher 3, L.Cerutti 5, M.D.R.Croning 1,4, R.Durbin 5,
W.Fleischmann 1, H.Hermjakob 1, N.Hulo 2, D.Kahn 6, A.Kanapin 1,
Y.Karavidopoulou 1, R.Lopez 1, B.Marx 1, N.J.Mulder 1, T.M.Oinn 1, C.J.A.Sigrist 2,
E.Zdobnov 1.
(1 EMBL Outstation – European Bioinformatics Institute, Wellcome Trust Genome
Campus, Hinxton, Cambridge, UK; 2 Swiss Institute for Bioinformatics, Geneva,
Switzerland; 3 Swiss Institute for Experimental Cancer Research, Lausanne,
Switzerland; 4 School of Biological Sciences, The University of Manchester,
Manchester, UK; 5 The Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambridge, UK; 6 CNRS/INRA, Toulouse, France)
The InterPro project is supported by grant number BIO4-CT98-0052 of the European
Commission. TKA is a Royal Society University Research Fellow.
References
Attwood, T.K., Croning, M.D.R., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordis, P.,
Selley, J.N. and Wright, W. (2000) PRINTS-S: the database formerly known as
PRINTS. Nucleic Acids Res., 28, 225-227.
Bairoch, A. and Apweiler, R. (2000) The SWISS-PROT protein sequence database and
its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45-48.
Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L. and Sonnhammer, E.L.L.
(2000) The Pfam Protein Families Database. Nucleic Acids Res., 28, 263-266.
Corpet, F., Servant, F., Gouzy, J. and Kahn, D. (2000) ProDom and ProDom-CG: tools
for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28,
267-269.
Etzold, T., Ulyanov, A. and Argos, P. (1996) SRS: information retrieval system for
molecular biology data banks. Methods Enzymol., 266, 114-128.
Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999) The PROSITE database,
its status in 1999. Nucleic Acids Res., 27, 215-219.
Rubin, G.M. et al. (2000) Comparative genomics of the eukaryotes. Science, 287,
2204-2215.