Towards a map of the protein universe

Daniel Chubb*, Benjamin R. Jefferys, Michael J.E. Sternberg, Lawrence A. Kelley

Structural Bioinformatics Group
Division of Molecular Biosciences
Imperial College London

* Corresponding author

Abstract

Protein sequence space is sparsely populated by the proteins present in organisms. This populated region of sequence space can be thought of as the protein universe. How near is our current sampling of the protein sequence space to providing a representative map of this universe? Our current sample is widely used to characterise the structure, function and evolutionary relationships of protein sequences, and elaboration of the map through sequencing projects is expected to improve our ability to do this. We have plotted our progress in exploring the protein universe over the last two decades. We find the rate of novel sequence discovery is in a sustained period of decline. As a consequence, we observe a plateau in our ability to detect remote evolutionary relationships using sequence analysis which relies upon the accumulation of novel sequence data. This in turn will have a negative effect upon our ability to annotate proteins using these evolutionary relationships, contrary to the widely-held assumption that more sequencing will help annotation work. We interpret this trend as signalling our approach to a representative map of sequence space and discuss its implications.

Introduction

Only a tiny fraction of the vast number of all possible protein sequences is populated by proteins present in existing organisms. Our knowledge of these populated islands provides us with a map of the protein universe. As sequencing projects continue to provide us with new data, the resolution of our map increases, permitting insights into protein function and evolution.
Our map currently covers 1,214 published genomes (1): in comparison, there are over 1.4 million known organisms (2) while estimates of the total number of species on Earth vary between 4 and 100 million (3). Even given the exponential growth in sequencing over the past two decades it would appear that our journey towards a comprehensive map of the protein universe is far from complete. Fortunately, this is a journey with a shortcut: the reality that our map does not need to be comprehensive, only representative. A substantial proportion of our current map has been shown to be composed of very similar homologous sequences (4-7) whose diversity can be captured by far fewer representative sequences. We will converge upon a representative map of the protein universe long before we have sequenced it in its entirety. Here, we investigate our progress towards such a map.

The protein universe is a complex space whose inhabitants and their relationships can be considered at a variety of levels. For example, a protein can be considered in its entirety or as a collection of self-folding evolutionary units called domains (8-10). The relationship between each protein or domain can then be characterised by a number of continuous or discrete measurements, examples including functional similarity, membership of the same structural classification (for example, SCOP fold or superfamily (11)) or, more commonly, sequence similarity. For the purpose of this work, we consider the protein universe to be populated by protein sequences clustered by sequence identity into islands of varying degrees of density. As more genomes are sequenced, our map of the protein universe can become more detailed either through the creation of new islands or the progressive population of existing islands. We expect the early stages of sequencing to be dominated by the discovery of novel sequence islands (Fig. 1).
As we approach a representative map we expect an increasing proportion of newly determined sequences to fall within existing islands in sequence space and for the creation of new islands to become progressively less common.

Often, the evolutionary relationship between proteins is distant: in our representation this would be seen as points appearing far apart within an island or even in a separate island if our map has insufficient resolution. For us to map these remote relationships between proteins, we need sophisticated homology detection tools. As we accumulate novel sequences and bridge gaps in our map of the protein universe, we expect these tools to be able to detect more of these remote relationships (12, 13).

Others have previously investigated the contribution of newly sequenced genomes to the number of islands within our map of the sequence universe. Kunin et al. observed the growth of islands whilst incrementally adding 83 genomes consisting of 311,256 proteins (14). Marsden et al. later used a larger dataset of 633,546 sequences from 203 genomes (15). Both investigations show a linear increase in islands with the addition of each genome. Both methods are based upon whole-sequence similarity and do not take account of domain combinations. The novelty of single-domain and multi-domain architectures was specifically investigated by Levitt (16), who analysed sequence profile matches to historical sequence databases. He found that single-domain sequences are growing slowly and appear to be saturating in the sequence database. Novelty in the form of rearrangements of multi-domain architectures is growing linearly with added sequences. These results, however, rely upon the accuracy and the coverage of the curated profiles used for the analyses (17, 18).

We recreated the sequence databases of the past two decades to estimate how the rate of novel sequence discovery has changed over time.
In contrast to previous work, we found that the rate of discovery is in a sustained period of decline, and we expect that at least 90% of new protein sequences will fall within existing protein islands by 2040. A significant proportion of the remaining 10% are likely to be the result of simple domain shuffling or are homologous to existing sequences but with a low sequence identity, both of which indicate limited novelty in terms of protein structure and function.

We do not yet fully understand the complex relationship between amino acid sequence and protein three-dimensional structure and function. However, homologous proteins share a common evolutionary ancestor, adopt highly similar three-dimensional structures and often share related functions. Thus, the detection of homology provides us with a method of structure and function prediction in the absence of a full understanding of that relationship. The sequence variation observed between homologous protein sequences indicates those sequence changes that are compatible with a given structure and/or function. The more information we have regarding such acceptable mutations and their frequency, the better we can detect remote structural and functional relationships. The primary source of such information is the growing sequence database.

The two most widely used methods for harnessing this information are BLAST and PSI-BLAST (19). PSI-BLAST is an iterative technique that employs the information in a sequence database to build statistical models, called profiles, of the mutational propensities of each position in a protein sequence of interest. In the first iteration, close homologs are gathered using the standard BLAST algorithm. The alignment of these homologs to a sequence of interest provides information on the amino acid substitutions observed at each position. This information permits the generation of a profile that can be used to search in the next iteration.
This process can be continued, repeatedly refining the profile, until no further homologs are detected. This procedure is highly successful and detects more than twice as many homologous proteins with high confidence compared to BLAST, and for the last decade it has constituted the standard benchmark of remote homology detection against which new techniques are judged. Its power stems from information in the sequence database in the form of sequence variation of homologs. We show that the decline in novel sequence discovery we have observed is reflected by a plateau in our ability to map remote homology using PSI-BLAST.

Results

The rate of novel island formation is in decline. We define a sequence island as a set of proteins that share more than 50% global sequence identity to the longest sequence in the island. For each year, the database was clustered into such islands using the program CD-HIT (4). The size of these clustered databases is considerably reduced (by 60% on average) and represents an estimate of the number of sequence islands for each year. Our measure is a highly conservative upper estimate for two reasons. Firstly, homologous protein domains may often share far lower sequence identity (less than 25% (20)) than the lowest threshold achievable by the clustering method. Secondly, the clustering technique works at the level of whole proteins and so does not take into account domain combinations. Briefly, this means that two multi-domain proteins that share one or more domains are placed in different clusters or islands if their domains are in a different order or there is at least one domain that they do not share. Thus homologous domains will often fall into separate clusters. It has been previously shown that a large proportion of protein novelty is seen in the arrangement of domains in multi-domain proteins (16); this novelty will therefore be seen in our analysis as new islands where the combination of domains is not already present in the database.
Our recreated databases show the classic exponential growth that has been noted by others, expanding from approximately 5,000 sequences in 1987 to approximately 4 million in 2007 (Fig. 2a). In each database, the number of sequence islands steadily increases but is far smaller than the total number of sequences. This agrees with previous observations that sequence databases contain a substantial amount of redundancy in the form of closely related homologs with high sequence similarity (5-7). While previous observations of redundant information are focused on a single static database, our analysis is far broader. By comparing the size of the clustered and unclustered databases, we have calculated how the rate of novel island discovery is changing over time (Fig. 2a and 2b). Up to 1995, the number of islands per sequence increases each year. After this point, however, the rate is in constant decline. In 1995 the number of islands was approximately half of the total number of sequences; by 2007 this figure had fallen to just over a third.

One possible explanation for this trend is bias in sequencing projects towards highly similar organisms. To investigate this effect we extended the analysis using metagenomic data derived from environmental sequencing. Metagenomic projects sample a substantial diversity of extant sequences across habitats and will be far less prone to systematic bias. We calculated the novel sequence island contribution made by merging the 2007 database and the Global Ocean Survey (21) data as provided by the UniProt UniMES database at the conservative 50% threshold with CD-HIT (Fig. 2a and 2b). We observed an even steeper decline in the rate of novel island discovery than seen to date. In addition, the metagenomic dataset is likely to contain a substantial number of artefactual sequences (22), which will artificially increase the total number of sequences and islands, leading to an overestimate of the rate of novel island discovery.
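As a rough illustration of how a discovery rate can be derived from database and island counts, and of the kind of power-law extrapolation described in Materials and Methods, the following Python sketch may be helpful. It is not the code used in this work. The function names, the choice of "years since an origin year" as the independent variable, and the log-log least-squares fit are all assumptions made for illustration, and the counts in the demonstration are hypothetical placeholders rather than values from our databases.

```python
import math


def discovery_rates(years, n_sequences, n_islands):
    """Return (year, new islands per new sequence) for consecutive snapshots."""
    rates = []
    for i in range(1, len(years)):
        new_seqs = n_sequences[i] - n_sequences[i - 1]
        new_islands = n_islands[i] - n_islands[i - 1]
        if new_seqs <= 0:
            raise ValueError("database must grow between snapshots")
        rates.append((years[i], new_islands / new_seqs))
    return rates


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on log(x), log(y).

    Assumption: all xs and ys are strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    sxx = sum((u - mean_x) ** 2 for u in lx)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def year_rate_falls_below(a, b, origin_year, threshold):
    """Solve a * t**b = threshold for t (years since origin_year), b < 0."""
    if b >= 0:
        raise ValueError("rate is not declining; no crossing year exists")
    t = (threshold / a) ** (1.0 / b)
    return origin_year + t


if __name__ == "__main__":
    # Hypothetical snapshot counts, for illustration only.
    years = [1995, 1997, 1999, 2001]
    seqs = [100_000, 250_000, 600_000, 1_400_000]
    islands = [50_000, 115_000, 250_000, 520_000]
    rates = discovery_rates(years, seqs, islands)
    origin = 1986  # assumed origin so that every t is positive
    a, b = fit_power_law([y - origin for y, _ in rates], [r for _, r in rates])
    # A rate of 0.1 means 90% of new sequences fall within existing islands.
    print(rates, a, b, year_rate_falls_below(a, b, origin, 0.1))
```

A rate below 0.1 new islands per new sequence corresponds to the statement that at least 90% of new sequences fall within an existing island. The fitted exponent, and therefore the crossing year, depends on the choice of origin year, so any such extrapolation should be read as indicative only.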
Prediction for future island growth. Our results show an increasing proportion of newly determined sequences falling within existing islands, which we believe to be indicative of an approach to the representative map of the protein universe. If this trend continues, we predict that by approximately 2040, at least 90% of new sequences will fall within an existing island (Fig. 3). This does not imply that the remaining 10% are entirely novel. Recall that this is a conservative estimate because the analysis does not cluster sequences with less than 50% sequence identity, or those which are simple rearrangements of the same set of domains. For this reason, in reality, fewer than 10% of sequences will add novelty to our map of the protein universe.

Lack of novelty affects homology detection. This trend has important implications for one of the primary roles of protein sequence databases: their use in homology detection for the purpose of characterising the structure, function and evolutionary relationships of a protein. Increases in our ability to detect remote homology with methods such as PSI-BLAST currently rely on the steady discovery of novel sequence information. If, as indicated above, this rate of novel sequence discovery is indeed declining, then we would expect this to be reflected in a slowing in improvements in the detection of remote homology, which in turn has implications across the biological sciences.

To investigate this we examined the performance of PSI-BLAST on the sequence databases of the past twenty years. We asked whether there has been an improvement over time in our ability to identify remote homologs in a set of 6,982 proteins from the Structural Classification of Proteins (SCOP) (11) sharing less than 30% sequence identity (SCOP30).
SCOP is a database of protein structural domains curated by experts that provides an authoritative classification of remote homologs on the basis of sequence and known structure, exploiting the fact that structure is more conserved than sequence in evolution. PSI-BLAST profiles were created for each of these sequences by scanning against each database from 1987 to 2007. Each sequence profile was then used to search the SCOP30 version 1.73 sequence database and the number of detected homologs was recorded (see Methods). In contrast to the CD-HIT clustering method, PSI-BLAST is able to routinely identify relationships with less than 25% sequence identity and, as it operates on local alignments, is capable of considering independent domains.

As expected, in the early phases of database growth we see a steady rise in our ability to detect remote homology (Fig. 2c). However, this improvement does not directly scale with the massive increase in available sequences particularly evident in the last decade. Even more striking, homology detection plateaus in 2004 and subsequently shows a slight decline. The same method was applied to the combined 2007 UniProt and UniMES metagenomic databases (Fig. 2, rightmost sample) and substantially fewer homologs were detected from SCOP30. This is likely to be due to the large number of hypothetical sequences and sequence fragments within metagenomic datasets, which have previously been shown to adversely affect the quality of PSI-BLAST profiles (24).

The cutting edge of remote homology detection is based on matching hidden Markov models and achieves substantially superior performance to PSI-BLAST. We performed the same historical analysis with one of the leading examples of such a program, HHsearch (25). While not plateauing, performance improvements over time are slowing, and this slowing is particularly evident when compared to the scale of sequence discovery (Fig. 2c).
It is also worth noting that although HHsearch performs far better than PSI-BLAST, it is impractical to use on full sequence databases, as it would require an all-vs-all search of the sequence database to build models for every sequence. It is restricted to smaller databases such as those containing sequences of known structures or individual genomes. This highlights one problem with the increase in size shown in sequence databases: the increasing impossibility of certain analyses, including all-vs-all searches (26).

Some of the issues in dealing with large databases with high redundancy have been previously discussed (7) and are hinted at by the slight decline seen in the PSI-BLAST homology hits for 2007. When the PSI-BLAST and HHsearch protocols were re-run on a database consisting of sequence representatives from our islands (see Supporting Information Fig. 4), we found that the reduced information performed better in the majority of cases, especially in the later databases with more redundancy. These results indicate that the idea of a representative, as opposed to global, map of the protein universe is a realistic and possibly important goal.

Discussion

Our map of the protein universe is reliant on the sequences available to us from various sequencing projects. As we gain novel data we expect our resolution of this map to increase and eventually become a global map fully representing the space inhabited by all existing sequences. Obtaining a representative map will be achievable long before we have sequenced all of Earth's biodiversity because of the similarities inherent in proteins due to sharing a common evolutionary origin. These similarities mean that proteins will exist in islands, the distribution, shape and density of which reflect their relationship to each other. When we reach a point where any new sequence can be placed in a pre-existing island we will have obtained the representative map.
In our study we investigated our progress towards such a map over the last twenty years of sequence data acquisition.

The method for clustering sequences into islands was chosen due to its speed and the large amount of data analysed. However, for reasons given in the Results section, this choice means our sequence islands are a conservative measure of sequence novelty. Previous work (14, 15) showing a linear growth in islands with added genomes suffered from similar limitations, which also makes their estimations conservative. Both studies used a small fraction of our dataset, corresponding approximately to our 1999 and 2001 databases respectively.

Methods for remote homology detection such as PSI-BLAST are able to routinely identify relationships with sequence identity less than 25%. Being able to detect these remote relationships is vital to our ability to map the protein universe and predict the function and structure of proteins. The PSI-BLAST paper (19) has over 30,000 citations in the literature, making it the most highly cited paper of the past decade. This reflects the crucial role played by remote homology detection for the accurate inference of the relationships between protein sequence, structure, function and evolution. It has been assumed (12, 13) that the continued growth of the sequence database, even in the absence of novel algorithm development, will bring with it a steady improvement in our ability to detect remote homology.

One of the sources of information on which powerful methods of remote homology detection rely is bridging sequences that connect distantly related protein families. If, as it appears, we are approaching a representative picture of the global sequence map, then the rate of discovery of such bridging sequences will progressively decline whilst the vast majority of sequences will fall into pre-existing clusters. There are obvious parallels between these elusive bridging sequences and the 'missing links' in palaeontology.
As with the missing links, many of these bridging sequences will be transitional forms, or present in extinct lineages that will never be sequenced. Although sequence space is continuous, its population by evolution is not.

That a representative protein sequence map may nearly be within our grasp may not be wholly surprising. In terms of protein three-dimensional structure, such a map appears to be near completion. Fold space, the space of distinct three-dimensional protein topologies, appears to be populated by a relatively small number of protein folds, variously estimated at between 1,000 and 10,000 (28). Although there is some debate regarding the discreteness or continuity of fold space, it is clear that the majority of protein structures fall within a limited range of fold islands (29). With the aid of structural genomics initiatives the number of experimentally determined protein structures has been growing rapidly while the rate of novel fold discovery is slowing considerably (16). This indicates our image of fold space is changing ever more slowly and that we are approaching a full representation of the protein structural repertoire (30).

Two primary factors have governed our progress to date in remote homology detection and the insights it generates into the relationships between protein sequence, structure, function and evolution: novel algorithm development and the growth in available sequence information. In light of the evidence presented here, we have reason to expect a diminishing role for the latter sooner rather than later. If we wish to further our understanding of evolution by connecting the branches of the tree of life, we will require both sustained development of new and more powerful algorithms for searching for homology and a greater reliance on different sources of experimental data, such as the structures being provided by structural genomics initiatives.
In the face of data lost in evolutionary time, such as transient bridging sequences that connect the evolutionary map, it may now be timely to focus attention on attempts to generate these missing links artificially. Some groups have created extra diversity in the sequence databases by creating artificial sequences using multiple sequence alignments and a set of structural rules (31) or using phylogenies as a guide to re-create ancestral sequences (32). When these sequences are added to databases, an improvement is seen in remote homology detection.

It is of course important to recognise that the findings reported here could be modified by the sequencing of radically different organisms to those already analysed. Unarguably, sequencing projects demonstrate some degree of bias in their choice of organism to sequence (33). However, metagenomics has little or no such bias and yet demonstrates no improvements in the rate of discovery of novel sequence islands.

Although the number of unique protein domain sequences is vast, it is nonetheless finite. It is inevitable that at some point the discovery of truly novel sequences will become an extremely rare event and eventually all but cease. The surprising indication from this work is how close we already appear to be to this representative map of the protein universe.

Materials and Methods

The entire analysis presented here took approximately 10 CPU years.

Recreation of past databases. The UniProt (34) databases from 1987 to 2007 were recreated using the January 2008 UniProt_trembl.dat and UniProt_trembl.fasta files (available from the UniProt FTP site). Sequences were added to each database if they were found to have existed in UniProt before a given date according to the DT line within the UniProt_trembl.dat file. PSI-BLAST searchable binary files were created for each of the new databases using formatdb.

Metagenomic dataset.
Metagenomic sequences were downloaded from the UniProt Metagenomic and Environmental Sequences database (UniMES) (35). UniMES contains data from the Global Ocean Sampling Expedition (GOS) (21). The downloaded FASTA file is non-redundant to a 100% threshold and contains approximately 6 million predicted sequences. A database was created which contained the full 2007 sequence data plus this metagenomic dataset. A 50% non-redundant version of this database was also created (see below).

Sequence clustering using CD-HIT. CD-HIT (4) is a program that clusters sequence databases according to a sequence identity threshold using a short word filtering heuristic. Representative sequences are selected from each cluster and are used to form a new sequence database. CD-HIT is the standard tool for creating representative databases and has been used by UniProt to create their UniRef (6) reduced redundancy databases. CD-HIT uses greedy incremental clustering. First, a sequence database is sorted according to sequence length and the longest sequence is chosen as the representative of the first cluster. Every other remaining sequence is then compared to the cluster representative and added to the cluster if the similarity is above a certain threshold (50% in our study). The next longest remaining sequence is then selected as a representative of a new cluster and the process continues until all sequences are assigned to a cluster. CD-HIT was run on databases for every other year from 1987 and at a relatively high threshold of 50% sequence identity, due to the high level of computer resources required (100 CPU weeks for this data). In addition, a combined 2007 + UniMES (metagenomic sequence) database was created and CD-HIT was run on this database of just over 10 million sequences at a threshold of 50%. For each processed database, a new database was produced, consisting of the representative sequence from each cluster.
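The greedy incremental procedure described above can be sketched in a few lines of Python. This is a toy illustration, not CD-HIT itself: the function names are hypothetical, and `toy_identity` is a deliberately crude stand-in that counts identical residues position by position without gaps, whereas CD-HIT uses a short word filter and proper alignments.

```python
def toy_identity(a, b):
    """Crude ungapped identity: matching positions / length of shorter sequence.

    Assumption for illustration only; real tools align the sequences first.
    """
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0.0
    matches = sum(1 for x, y in zip(shorter, longer) if x == y)
    return matches / len(shorter)


def greedy_cluster(sequences, threshold=0.5, identity=toy_identity):
    """Greedy incremental clustering, longest sequence first.

    Returns a list of (representative, members) pairs. A sequence joins a
    cluster when its identity to the representative exceeds the threshold.
    """
    # Sort by length, longest first; Python's sort is stable for ties.
    remaining = sorted(sequences, key=len, reverse=True)
    clusters = []
    while remaining:
        representative = remaining[0]
        members = [representative]
        unassigned = []
        for seq in remaining[1:]:
            if identity(representative, seq) > threshold:
                members.append(seq)
            else:
                unassigned.append(seq)
        clusters.append((representative, members))
        # The next pass picks the longest sequence that is still unassigned.
        remaining = unassigned
    return clusters


def representative_database(sequences, threshold=0.5):
    """The representatives alone; their count is the number of islands."""
    return [rep for rep, _ in greedy_cluster(sequences, threshold)]
```

Calling `len(representative_database(seqs))` on a list of sequence strings then gives an island count in the sense used here, subject to the simplifications noted above.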
The size of each of these representative databases provided our measure for the number of islands.

Prediction of future growth in sequence islands. The number of new islands per new sequence was calculated for each year between 1987 and 2007. A power-law curve was then fitted to this data and extended until 2050.

Creating the SCOP30 test set. SCOP30, containing SCOP version 1.73 (11) sequences which share no more than 30% global sequence similarity, was downloaded from ASTRAL (36) (http://astral.berkeley.edu). These sequences were placed into homologous groups according to superfamily membership defined within the SCOP domain classification. A PSI-BLAST searchable binary SCOP30 file was created using formatdb.

Construction of database specific PSI-BLAST profiles. Each sequence within the SCOP30 test set was searched against each of the recreated UniProt databases using four iterations of PSI-BLAST with an inclusion threshold of 10^-3. At the end of the fourth iteration a checkpoint file and PSSM were output.

To subsequently test the robustness of our result, PSI-BLAST profiles were also output from 2, 3 and 5 iterations. The results are shown in Supporting Figure 1. For the setting that consistently detected the most homologs (4 iterations), another run was made with a more stringent inclusion threshold of 10^-6 (Supporting Fig. 2). The same trend was observed under all these conditions.

Identification of remote homologies within the SCOP30 test set. Each SCOP30 sequence was searched against the entire SCOP30 database using the profiles output from the previous searches of the recreated sequence databases. This was accomplished by initiating a single iteration of PSI-BLAST, restarting using the checkpoint files previously created. The number of SCOP30 sequences belonging to the same superfamily as the query below an e-value threshold of 0.1 was recorded.
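The counting step can be illustrated with a short Python sketch. It assumes the search output has already been parsed into (domain identifier, e-value) pairs and that a lookup from domain identifier to SCOP superfamily is available; the function name, the data layout, the identifiers in the usage note and the choice to skip the query's hit to itself are all assumptions made for illustration, not a description of the scripts used here.

```python
def score_hits(query_id, hits, superfamily_of, evalue_cutoff=0.1):
    """Count same-superfamily and different-superfamily hits for one query.

    hits: iterable of (domain_id, evalue) pairs from a profile search.
    superfamily_of: dict mapping domain_id -> superfamily label.
    Returns (true_positives, false_positives).
    """
    query_sf = superfamily_of[query_id]
    true_pos = 0
    false_pos = 0
    for domain_id, evalue in hits:
        if domain_id == query_id:
            continue  # assumption: the self-hit is not counted
        if evalue >= evalue_cutoff:
            continue  # keep only hits below the e-value threshold
        if superfamily_of[domain_id] == query_sf:
            true_pos += 1
        else:
            false_pos += 1
    return true_pos, false_pos


def false_positive_fraction(per_query_counts):
    """Fraction of all accepted hits that fall outside the query superfamily."""
    tp = sum(t for t, _ in per_query_counts)
    fp = sum(f for _, f in per_query_counts)
    total = tp + fp
    return fp / total if total else 0.0
```

For example, with hypothetical identifiers `q`, `d1` and `d2` in superfamily `c.1.8` and `d4` in `a.4.5`, the call `score_hits("q", [("d1", 1e-5), ("d2", 0.05), ("d4", 0.01)], lookup)` counts two same-superfamily hits and one false positive.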
The proportion of predictions that were false positives at this threshold varied between 3.3% and 6.9% with a mean of 5.1%.

Randomised databases analysis. It is possible that the trend identified by PSI-BLAST is a result of the order of discovery of sequences. For example, later databases might contain certain pathological sequences, adversely affecting homology detection. To test this, the datasets were recreated with randomised orders of discovery, whilst fixing the number of sequences for a given year. The same trend was observed (Supporting Fig. 3).

HHsearch. HHsearch (25) is a highly sensitive sequence alignment method based on the pairwise comparison of profile hidden Markov models (HMMs). An HMM was created for each sequence in the SCOP30 test set using each UniProt database (1987-2007). The PSI-BLAST parameters used by HHsearch in the creation of the HMMs were identical to those used in the PSI-BLAST runs (4 iterations with an inclusion threshold of 0.001). No secondary structure information was used. An all-against-all search of the HMMs was conducted for each year and all hits within the same superfamily with a confidence (HHsearch score) greater than 95% were recorded. The proportion of predictions that were false positives at this threshold varied between 0.2% and 4.6% with a mean of 2.6%.

Homology detection using representative databases. The same PSI-BLAST and HHsearch procedures were applied to the sequence databases formed from the island representatives that were created by CD-HIT (Supporting Fig. 4).

Acknowledgements

D.C., B.R.J. and L.A.K. are supported by the Biotechnology and Biological Sciences Research Council.

Author contributions

D.C. designed and wrote all code, performed the analysis and wrote the paper. B.R.J. supervised the main analysis, prepared the figure, and contributed additional analyses. L.A.K. conceived the study, supervised the analysis and wrote the paper. M.J.E.S. supervised the work and contributed to its interpretation.
Figure Legends

Figure 1. Cartoon illustration of the progressive population of our sequence map. Early stages are characterised by the creation of new sequence islands. In contrast, later stages are characterised by the population of existing islands leading to an asymptotic approach to the complete protein map.

Figure 2. Plots (a) to (c) show three different views of the change in sequence space over the last two decades. All three plots use the same horizontal axis. 2007 plus GOS indicates the combination of the 2007 sequence database with the Global Ocean Survey metagenomics data. (a) The protein sequence database (black bars) has grown exponentially over the past two decades. A much smaller increase is seen in the number of sequence islands (red bars) at a level of 50% sequence identity. This is particularly the case with the metagenomic data (striped) which appears to have high redundancy. Note that the vertical axis uses a different scale for the top and bottom halves, in order to show growth both in the early sequence databases and the more recent ones. (b) The black line indicates the ratio of the number of islands to the total number of sequences, representing the rate of novel island discovery. Until 1995 this rate was steady or growing. Since that time this rate has been falling. There is a sharp drop on the addition of metagenomic data. (c) This plot shows the change in the ability of PSI-BLAST (blue line) and HHsearch (red line) to detect homology using profiles built from the databases of each year. It is clear that the more computationally intense HHsearch detects more homologs in the majority of cases. Although there is a general improvement in both methods over time, this improvement slows and, in the case of PSI-BLAST, it plateaus and even begins to decline. The addition of metagenomic data adversely affects both methods.

Figure 3.
A prediction of future island growth is made by fitting a power-law curve to the number of new islands per new sequence in each year from 1987-2007. The GOS metagenomic data is not included in our projection. We predict that by approximately 2040, 90% of new sequences will fit within an existing island.

Figures

[Figure 1]
[Figure 2]
[Figure 3]

References

1. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC (2007) The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucl. Acids Res.:gkm884.
2. Leipe DD (1996) Biodiversity, genomes, and DNA sequence databases. Current Opinion in Genetics & Development 6:686-691.
3. Crandall KA, Buhay JE (2004) EVOLUTION: Genomic Databases and the Tree of Life. Science 306:1144-1145.
4. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658-9.
5. Park J, Holm L, Heger A, Chothia C (2000) RSDB: representative protein sequence databases have high information content. Bioinformatics 16:458-64.
6. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23:1282-1288.
7. Li W, Jaroszewski L, Godzik A (2002) Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng 15:643-9.
8. Chothia C, Gough J, Vogel C, Teichmann SA (2003) Evolution of the protein repertoire. Science 300:1701-1703.
9. Apic G, Gough J, Teichmann SA (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol 310:311-325.
10. Todd AE, Orengo CA, Thornton JM (2001) Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology 307:1113-1143.
11. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures.
J Mol Biol 247:536-40. 12. Lobley A, Sadowski MI, Jones DT (2009) pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination. Bioinformatics 25:1761-1767. 13. Sandhya S, Kishore S, Sowdhamini R, Srinivasan N (2003) Effective detection of remote homologues by searching in sequence dataset of a protein domain fold. FEBS Letters 552:225-230. 14. Kunin V, Cases I, Enright A, de Lorenzo V, Ouzounis C (2003) Myriads of protein families, and still counting. Genome Biology 4:401. 11 15. Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA (2006) Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucl. Acids Res. 34:1066-1080. 16. Levitt M (2009) Nature of the protein universe. Proc. Natl. Acad. Sci. U.S.A 106:1107911084. 17. Finn RD et al. (2010) The Pfam protein families database. Nucl. Acids Res. 38:D211-222. 18. Geer LY, Domrachev M, Lipman DJ, Bryant SH (2002) CDART: Protein Homology by Domain Architecture. Genome Research 12:1619-1623. 19. Altschul S et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25:3389-3402. 20. Pearson WR, Sierk ML (2005) The limits of protein sequence comparison? Curr. Opin. Struct. Biol 15:254-260. 21. Yooseph S et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5:e16. 22. Li W, Wooley JC, Godzik A (2008) Probing metagenomics by rapid cluster analysis of very large datasets. PLoS ONE 3:e3375. 23. Ostell J (2005) Databases of Discovery. Queue 3:40-48. 24. Tress ML, Cozzetto D, Tramontano A, Valencia A (2006) An analysis of the Sargasso Sea resource and the consequences for database composition. BMC Bioinformatics 7:213. 25. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951-60. 26. Hutchison CA (2007) DNA sequencing: bench to bedside and beyond. Nucl. 
Acids Res.:gkm688. 27. Thornton JM, Orengo CA, Todd AE, Pearl FMG (1999) Protein folds, functions and evolution. Journal of Molecular Biology 293:333-342. 28. Wolf YI, Grishin NV, Koonin EV (2000) Estimating the number of protein folds and families from complete genome data. Journal of Molecular Biology 299:897-905. 29. Sadreyev RI, Kim B, Grishin NV (2009) Discrete-continuous duality of protein structure space. Curr. Opin. Struct. Biol 19:321-328. 30. Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J (2006) On the origin and highly likely completeness of single-domain protein structures. Proceedings of the National Academy of Sciences of the United States of America 103:2605-2610. 31. Pei J, Dokholyan NV, Shakhnovich EI, Grishin NV (2003) Using protein design for 12 homology detection and active site searches. Proc. Natl. Acad. Sci. U.S.A 100:1136111366. 32. Cai W, Pei J, Grishin NV (2004) Reconstruction of ancestral protein sequences and its applications. BMC Evol Biol. 4:33. 33. Kyrpides NC (2009) Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat Biotech 27:627-632. 34. Wu CH et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34:D187-91. 35. Consortium TU (2009) The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 37:D169–D174. 36. Brenner SE, Koehl P, Levitt M (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucl. Acids Res. 28:254-256. 13
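The projection described in the Figure 3 legend can be illustrated with a minimal sketch. It assumes the power law takes the form rate = a * t^b, where t is the number of years since an arbitrary origin, and that the fit is done by linear regression in log-log space; the legend does not state either detail. The data values, the `ORIGIN` year and the function names are hypothetical and are not taken from the paper, so the crossing year this sketch produces is not the paper's estimate.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


def x_at_level(a, b, level):
    """Solve a * x**b = level for x."""
    return (level / a) ** (1.0 / b)


# Hypothetical illustration data, NOT values from the paper:
# new islands per new sequence for each year 1987 to 2007,
# generated from an assumed power law so the fit can be checked.
ORIGIN = 1980                 # assumed time origin
TRUE_A, TRUE_B = 2.0, -0.7    # assumed parameters
years = list(range(1987, 2008))
xs = [year - ORIGIN for year in years]
rates = [TRUE_A * x ** TRUE_B for x in xs]

a, b = fit_power_law(xs, rates)

# 90% of new sequences falling into existing islands corresponds to
# 0.10 new islands per new sequence.
crossing_year = ORIGIN + x_at_level(a, b, 0.10)
print(f"a = {a:.3f}, b = {b:.3f}, crossing year = {crossing_year:.0f}")
```

With real yearly counts in place of the synthetic `rates`, the same two functions would give the fitted curve and the year at which the discovery rate falls to the chosen threshold. Fitting in log-log space weights relative rather than absolute errors, which may or may not match the authors' procedure.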