INTERPRO AS A NEW TOOL FOR WHOLE GENOME ANALYSIS. A COMPARATIVE ANALYSIS OF MYCOBACTERIUM TUBERCULOSIS, BACILLUS SUBTILIS AND ESCHERICHIA COLI AS A CASE STUDY.

Apweiler R.*, Fleischmann W., Mulder N.J.

An expanded abstract for the Proceedings of the Second International Conference on Bioinformatics of Genome Regulation and Structure.

EMBL Outstation – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom

*To whom correspondence should be addressed. Tel: +44 (0)1223 494 435 Fax: +44 (0)1223 494 468 Email: rolf.apweiler@ebi.ac.uk

Key words: database, protein, domain, function, family, repeat, computer tool

Motivation: Several pattern-recognition methods have evolved to address different protein sequence analysis problems, resulting in rather different and mostly independent databases. InterPro was developed as a new integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. InterPro has applications in computational functional classification of newly determined sequences lacking biochemical characterisation, and in comparative genome analysis.

Results: The first release of InterPro was built from Pfam 5.0, PRINTS 25.0 and PROSITE 16.0, and contains nearly 3,000 entries, representing families, domains, repeats and post-translational modification sites (PTMs) encoded by 4,879 different regular expressions, profiles, fingerprints and HMMs. Overall, InterPro entries match more than 300,000 sequences in SWISS-PROT and TrEMBL. This new resource provides an integrated view of the pattern databases and an intuitive interface for text- and sequence-based searches. InterPro was used for whole proteome analysis of the pathogenic microorganism M. tuberculosis, and for comparison with the predicted protein coding sequences of the complete genomes of B. subtilis and E. coli. 55.6% of the M.
tuberculosis proteins in the proteome matched InterPro entries, and these could be classified according to function. A large percentage of these are hypothetical proteins. The comparison with B. subtilis and E. coli provided information on the most common protein families and domains, the most highly represented families, and the representation of different regulatory protein families in each organism.

Availability: The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. The InterPro flat file may be retrieved from the EBI anonymous-ftp server at ftp://ftp.ebi.ac.uk/pub/databases/interpro.

Introduction

Pattern databases have become vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. Currently, the most commonly used pattern databases include PROSITE, home of regular expressions and profiles (Hofmann et al., 1999); Pfam, keeper of hidden Markov models (HMMs) (Bateman et al., 2000); and PRINTS, provider of fingerprints (groups of aligned, un-weighted motifs) (Attwood et al., 2000). These methods have evolved to address different sequence analysis problems, and each provides tools for identifying sequence relationships and inferring protein function. However, the individual resources have different areas of optimum application, owing to the different strengths and weaknesses of their underlying analysis methods. The creation of a single coherent resource for diagnosis and documentation of protein families is difficult, given entirely different database formats, different search tools and different search outputs. Nevertheless, in an attempt to address some of these issues, we have developed InterPro, and have found applications for the database in developing automatic methods for annotation of sequence data, and in whole proteome analysis.
Source databases and methods

The first release of InterPro (Release 1.0, March 2000) was built from Pfam 5.0 (2,008 domains), PRINTS 25.0 (1,260 fingerprints) and PROSITE 16.0 (1,370 families). While the initial InterPro release was created around PRINTS, PROSITE and Pfam, ProDom will shortly also be included. Flat files submitted by each of the groups were systematically merged and dismantled. Where relevant, family annotations were amalgamated, and all method-specific annotation was separated out. This process was complicated by the relationships that can exist, both between entries in the same database and between entries in different databases. Different types of parent-child relationship were evident, leading us to recognise ‘sub-types’ and ‘sub-strings’. All recognisably distinct entities were assigned unique accession numbers (which take the form IPR000000).

An InterPro entry contains the list of member database signatures (regular expressions, HMMs, profiles or fingerprints) associated with the entry, and an abstract describing the domain, repeat, family or PTM, which is derived from the merged annotation of the member databases. An entry also contains links to a tabular or graphical view of the matches to the SWISS-PROT and TrEMBL protein sequence databases.

A comparative analysis of the predicted protein coding sequences of three complete prokaryotic genomes, M. tuberculosis, B. subtilis and E. coli, was performed by running a non-redundant set of proteins against the InterPro database. The results of the InterPro runs were inspected manually to calculate general statistics of protein families.

Implementation and results

We illustrate the use of InterPro in whole proteome analysis of M. tuberculosis in the graph in Figure 1.

Figure 1. A pie graph representing the coverage of InterPro protein functions in the M. tuberculosis proteome.

An application of InterPro in the comparative genome analysis of M. tuberculosis, B. subtilis and E.
coli is shown in Table 1 and Figure 2.

Table 1. The 10 biggest InterPro families for M. tuberculosis, a comparative view with B. subtilis and E. coli.

InterPro Acc. No.  InterPro Entry Name                          M. tub proteins  B. subtilis proteins  E. coli proteins
IPR000084          PE family                                    86               0                     0
IPR000030          PPE family                                   66               0                     0
IPR000379          Esterase/lipase/thioesterase                 65               36                    23
IPR000051          SAM (and other nucleotide) binding motif     53               27                    30
IPR002198          Short-chain dehydrogenase/reductase family   52               33                    18
IPR001617          ABC transporters family                      42               81                    78
IPR001647          Bacterial regulatory proteins, TetR family   42               19                    11
IPR000873          AMP-binding domain                           41               24                    9
IPR001051          ATP-binding transport protein, P-loop motif  40               85                    81
IPR000205          NAD binding site                             34               22                    30

Figure 2. Graph of the relative representation of specific protein families in M. tuberculosis, B. subtilis and E. coli based on an InterPro analysis.

Discussion

We have developed the InterPro database, an integrated resource of protein domains and functional sites. By uniting the databases, we have capitalised on their individual strengths, producing a single entity that is far greater than the sum of its parts. InterPro can streamline the analysis of newly determined sequences for the individual user, and makes a significant contribution to the demanding task of automatic annotation of predicted proteins from genome sequencing projects. InterPro is also likely to highlight key areas where none of the databases has yet made a contribution, and hence where the development of some sort of pattern might be useful. It has been used here for the comparative genome analysis of the complete proteomes of M. tuberculosis, B. subtilis and E. coli, and has also proven its usefulness for whole proteome analysis of Drosophila melanogaster (Rubin et al., 2000).
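
As a rough illustration of how the comparison in Table 1 can be read programmatically, the following Python sketch transcribes the table and derives two simple summaries: families found only in M. tuberculosis, and families with more matches in M. tuberculosis than in either of the other two organisms. The names FamilyCount, TABLE_1, ACCESSION_RE, specific_to_m_tuberculosis and more_numerous_in_m_tuberculosis are hypothetical helpers introduced for this example; they are not part of InterPro or of the analysis pipeline used here, and the six-digit accession pattern is assumed from the accession numbers listed in Table 1.

```python
import re
from typing import List, NamedTuple


class FamilyCount(NamedTuple):
    """One row of Table 1 (hypothetical container, for illustration only)."""
    accession: str
    name: str
    m_tuberculosis: int
    b_subtilis: int
    e_coli: int


# Counts transcribed from Table 1.
TABLE_1 = [
    FamilyCount("IPR000084", "PE family", 86, 0, 0),
    FamilyCount("IPR000030", "PPE family", 66, 0, 0),
    FamilyCount("IPR000379", "Esterase/lipase/thioesterase", 65, 36, 23),
    FamilyCount("IPR000051", "SAM (and other nucleotide) binding motif", 53, 27, 30),
    FamilyCount("IPR002198", "Short-chain dehydrogenase/reductase family", 52, 33, 18),
    FamilyCount("IPR001617", "ABC transporters family", 42, 81, 78),
    FamilyCount("IPR001647", "Bacterial regulatory proteins, TetR family", 42, 19, 11),
    FamilyCount("IPR000873", "AMP-binding domain", 41, 24, 9),
    FamilyCount("IPR001051", "ATP-binding transport protein, P-loop motif", 40, 85, 81),
    FamilyCount("IPR000205", "NAD binding site", 34, 22, 30),
]

# Assumed accession shape: "IPR" followed by six digits, as in Table 1.
ACCESSION_RE = re.compile(r"IPR\d{6}")


def specific_to_m_tuberculosis(rows: List[FamilyCount]) -> List[str]:
    """Accessions matched in M. tuberculosis but in neither other organism."""
    return [
        r.accession
        for r in rows
        if r.m_tuberculosis > 0 and r.b_subtilis == 0 and r.e_coli == 0
    ]


def more_numerous_in_m_tuberculosis(rows: List[FamilyCount]) -> List[str]:
    """Accessions with more matches in M. tuberculosis than in either other organism."""
    return [
        r.accession
        for r in rows
        if r.m_tuberculosis > max(r.b_subtilis, r.e_coli)
    ]


if __name__ == "__main__":
    print("Only in M. tuberculosis:", specific_to_m_tuberculosis(TABLE_1))
    print("More numerous in M. tuberculosis:", more_numerous_in_m_tuberculosis(TABLE_1))
```

Applied to Table 1, the first function returns the PE and PPE families (IPR000084, IPR000030), and the second returns eight of the ten families; the exceptions are the ABC transporters family (IPR001617) and the ATP-binding transport protein, P-loop motif entry (IPR001051), both of which have more matches in B. subtilis and E. coli. The sketch compares raw match counts only and does not normalise for proteome size.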
Acknowledgements

The InterPro Consortium: R.Apweiler 1, T.K.Attwood 4, A.Bairoch 2, A.Bateman 5, E.Birney 1, M.Biswas 1, P.Bucher 3, L.Cerutti 5, M.D.R.Croning 1,4, R.Durbin 5, W.Fleischmann 1, H.Hermjakob 1, N.Hulo 2, D.Kahn 6, A.Kanapin 1, Y.Karavidopoulou 1, R.Lopez 1, B.Marx 1, N.J.Mulder 1, T.M.Oinn 1, C.J.A.Sigrist 2, E.Zdobnov 1.

(1 EMBL Outstation – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK; 2 Swiss Institute for Bioinformatics, Geneva, Switzerland; 3 Swiss Institute for Experimental Cancer Research, Lausanne, Switzerland; 4 School of Biological Sciences, The University of Manchester, Manchester, UK; 5 The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK; 6 CNRS/INRA, Toulouse, France)

The InterPro project is supported by grant number BIO4-CT98-0052 of the European Commission. TKA is a Royal Society University Research Fellow.

References

Attwood, T.K., Croning, M.D.R., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordis, P., Selley, J.N. and Wright, W. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res., 28, 225-227.

Bairoch, A. and Apweiler, R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45-48.

Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L. and Sonnhammer, E.L.L. (2000) The Pfam Protein Families Database. Nucleic Acids Res., 28, 263-266.

Corpet, F., Servant, F., Gouzy, J. and Kahn, D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28, 267-269.

Etzold, T., Ulyanov, A. and Argos, P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114-128.

Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res., 27, 215-219.

Rubin, G.M. et al. (2000) Comparative genomics of the eukaryotes. Science, 287, 2204-2215.