Towards a map of the protein universe
Daniel Chubb*, Benjamin R. Jefferys, Michael J.E. Sternberg, Lawrence A. Kelley
Structural Bioinformatics Group
Division of Molecular Biosciences
Imperial College London
* Corresponding author
Abstract
Protein sequence space is sparsely populated by the proteins present in organisms. This
populated region of sequence space can be thought of as the protein universe. How near is
our current sampling of the protein sequence space to providing a representative map of this
universe? Our current sample is widely used to characterise the structure, function and
evolutionary relationships of protein sequences, and elaboration of the map through
sequencing projects is expected to improve our ability to do this. We have plotted our
progress in exploring the protein universe over the last two decades. We find the rate of
novel sequence discovery is in a sustained period of decline. As a consequence, we observe
a plateau in our ability to detect remote evolutionary relationships using sequence analysis
which relies upon the accumulation of novel sequence data. This in turn will have a negative
effect upon our ability to annotate proteins using these evolutionary relationships – contrary
to the widely-held assumption that more sequencing will help annotation work. We interpret
this trend as signalling our approach to a representative map of sequence space and
discuss its implications.
Introduction
Only a tiny fraction of the vast number of all possible protein sequences is populated by proteins
present in existing organisms. Our knowledge of these populated islands provides us with a map of
the protein universe. As sequencing projects continue to provide us with new data, the resolution of
our map increases, permitting insights into protein function and evolution. Our map currently covers
1,214 published genomes (1): in comparison, there are over 1.4 million known organisms (2) while
estimates of the total number of species on earth vary between 4 and 100 million (3). Even given the
exponential growth in sequencing over the past two decades it would appear that our journey towards
a comprehensive map of the protein universe is far from complete. Fortunately, this is a journey with
a shortcut: the reality that our map does not need to be comprehensive, only representative. A
substantial proportion of our current map has been shown to be composed of very similar homologous
sequences (4-7) whose diversity can be captured by far fewer representative sequences. We will
converge upon a representative map of the protein universe long before we have sequenced it in its
entirety. Here, we investigate our progress towards such a map.
The protein universe is a complex space whose inhabitants and their relationships can be considered at
a variety of levels. For example, a protein can be considered in its entirety or as a collection of self-folding evolutionary units called domains (8-10). The relationship between each protein or domain
can then be characterised by a number of continuous or discrete measurements, examples including
functional similarity, membership of the same structural classification (for example, SCOP fold or
superfamily (11)) or more commonly through sequence similarity. For the purpose of this work, we
consider the protein universe to be populated by protein sequences clustered by sequence identity into
islands of varying degrees of density. As more genomes are sequenced, our map of the protein
universe can become more detailed either through the creation of new islands or the progressive
population of existing islands. We expect the early stages of sequencing to be dominated by the
discovery of novel sequence islands (Fig. 1). As we approach a representative map we expect an
increasing proportion of newly determined sequences to fall within existing islands in sequence space
and for the creation of new islands to become progressively less common. Often, the evolutionary
relationship between proteins is distant: in our representation this would be seen as points appearing
far apart within an island or even in a separate island if our map has insufficient resolution. For us to
map these remote relationships between proteins, we need sophisticated homology detection tools. As
we accumulate novel sequences and bridge gaps in our map of the protein universe, we expect these
tools to be able to detect more of these remote relationships (12, 13).
Others have previously investigated the contribution of newly sequenced genomes to the number of
islands within our map of the sequence universe. Kunin et al. observed the growth of islands whilst
incrementally adding 83 genomes consisting of 311,256 proteins (14). Marsden et al. later used a
larger dataset of 633,546 sequences from 203 genomes (15). Both investigations show a linear
increase in islands with the addition of each genome. Both methods are based upon whole-sequence
similarity and do not take account of domain combinations. The novelty of single-domain and multi-domain architectures was specifically investigated by Levitt (16), who performed an analysis of sequence
profile matches to historical sequence databases. He found that single domain sequences are growing
slowly and appear to be saturating in the sequence database. Novelty in the form of the rearrangements of multi-domain architectures is growing linearly with added sequences. These results,
however, rely upon the accuracy and the coverage of the curated profiles used for the analyses (17,
18).
We recreated the sequence databases of the past two decades to estimate how the rate of novel
sequence discovery has changed over time. In contrast to previous work we found that the rate of
discovery is in a sustained period of decline, and we expect that at least 90% of new protein sequences
will fall within existing protein islands by 2040. A significant proportion of the remaining 10% are
likely to be the result of simple domain shuffling or are homologous to existing sequences but with a
low sequence identity, both of which indicate limited novelty in terms of protein structure and
function.
We do not yet fully understand the complex relationship between amino acid sequence and protein
three-dimensional structure and function. However, homologous proteins share a common
evolutionary ancestor, adopt highly similar three-dimensional structures and often share related
functions. Thus, the detection of homology provides us with a method of structure and function
prediction in the absence of a full understanding of that relationship. The sequence variation observed
between homologous protein sequences indicates those sequence changes that are compatible with a
given structure and/or function. The more information we have regarding such acceptable mutations
and their frequency, the better we can detect remote structural and functional relationships. The
primary source of such information is the growing sequence database. The two most widely used
methods for harnessing this information are BLAST and PSI-BLAST (19).
PSI-BLAST is an iterative technique that employs the information in a sequence database to build
statistical models, called profiles, of the mutational propensities of each position in a protein sequence
of interest. In the first iteration, close homologs are gathered using the standard BLAST algorithm.
The alignment of these homologs to a sequence of interest provides information on the amino acid
substitutions observed at each position. This information permits the generation of a profile that can
be used to search in the next iteration. This process can be continued, repeatedly refining the profile,
until no further homologs are detected. This procedure is highly successful and detects more than
twice as many homologous proteins with high confidence compared to BLAST, and has constituted
the standard benchmark of remote homology detection against which new techniques are judged for
the last decade. Its power stems from information in the sequence database in the form of sequence
variation of homologs. We show that the decline in novel sequence discovery we have observed is
reflected by a plateau in our ability to map remote homology using PSI-BLAST.
Results
The rate of novel island formation is in decline. We define a sequence island as a set of
proteins that share more than 50% global sequence identity with the longest sequence in the island. For
each year, the database was clustered into such islands using the program CD-HIT (4). The size of
these clustered databases is considerably reduced (by 60% on average) and represents an estimate of
the number of sequence islands for each year. Our measure is a highly conservative upper estimate for
two reasons: firstly, homologous protein domains often share far lower sequence identity (less than 25% (20)) than the lowest threshold achievable by the clustering method. Secondly, the
clustering technique works at the level of whole proteins and so does not take into account domain
combinations. Briefly, this means that two multi-domain proteins that share one or more domains are
placed in different clusters or islands if their domains are in a different order or there is at least one
domain that they do not share. Thus homologous domains will often fall into separate clusters. It has
been previously shown that a large proportion of protein novelty is seen in the arrangement of
domains in multi-domain proteins (16); this novelty will therefore be seen in our analysis as new
islands where the combination of domains is not already present in the database.
Our recreated databases show the classic exponential growth that has been noted by others, expanding
from approximately 5000 sequences in 1987 to approximately 4 million in 2007 (Fig. 2a). In each
database, the number of sequence islands steadily increases but is far smaller than the total number of
sequences. This agrees with previous observations that sequence databases contain a substantial
amount of redundancy in the form of closely related homologs with high sequence similarity (5-7).
While previous observations of redundant information are focused on a single static database, our
analysis is far broader. By comparing the size of the clustered and unclustered databases, we have
calculated how the rate of novel island discovery is changing over time (Fig. 2a and 2b). Up to 1995,
the number of islands per sequence increases each year. After this point, however, the rate is in
constant decline. In 1995 the number of islands was approximately half of the total number of
sequences; by 2007 this figure had fallen to just over a third.
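The two quantities used in this comparison can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the counts in the example are invented round numbers chosen only to be roughly consistent with the "half" and "just over a third" ratios quoted above.

```python
def island_ratios(snapshots):
    """Map each year to islands per sequence.

    snapshots: {year: (n_sequences, n_islands)}
    """
    return {year: islands / seqs for year, (seqs, islands) in snapshots.items()}


def new_islands_per_new_sequence(snapshots):
    """Map each year (after the first) to new islands gained per new sequence added."""
    years = sorted(snapshots)
    rates = {}
    for prev, curr in zip(years, years[1:]):
        added_seqs = snapshots[curr][0] - snapshots[prev][0]
        added_islands = snapshots[curr][1] - snapshots[prev][1]
        rates[curr] = added_islands / added_seqs
    return rates


# Invented counts for illustration only (not the study's data).
example = {1995: (100_000, 50_000), 2007: (4_000_000, 1_400_000)}
ratios = island_ratios(example)
rates = new_islands_per_new_sequence(example)
```

The first function gives the cumulative ratio plotted against time; the second gives the incremental rate, which is the quantity the Methods describe as being fitted for the projection.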
An initial explanation for this trend is bias in sequencing projects towards highly similar organisms.
To investigate this effect we extended the analysis using metagenomic data derived from
environmental sequencing. Metagenomic projects sample a substantial diversity of extant sequences
across habitats and will be far less prone to systematic bias. We calculated the novel sequence island
contribution made by merging the 2007 database and the Global Ocean Survey (21) data as provided
by the UniProt UniMES database at the conservative 50% threshold with CD-HIT (Fig. 2a and 2b).
We observed an even steeper decline in the rate of novel island discovery than seen to date. In
addition, the metagenomic dataset is likely to contain a substantial number of artefactual sequences
(22), which will artificially increase the total number of sequences and islands leading to an
overestimate of the rate of novel island discovery.
Prediction for future island growth. Our results show an increasing proportion of newly
determined sequences falling within existing islands, which we believe to be indicative of an
approach to the representative map of the protein universe. If this trend continues, we
predict that by approximately 2040, at least 90% of new sequences will fall within an
existing island (Fig. 3). This does not imply that the remaining 10% are entirely novel. Recall
that this is a conservative estimate because the analysis does not cluster sequences with
less than 50% sequence identity, or those which are simple rearrangements of the same set
of domains. For this reason, in reality, fewer than 10% of sequences will add novelty to our
map of the protein universe.
Lack of novelty affects homology detection. This trend has important implications for one of
the primary roles of protein sequence databases: their use in homology detection for the purpose of
characterising the structure, function and evolutionary relationships of a protein.
Increases in our ability to detect remote homology with methods such as PSI-BLAST currently rely
on the steady discovery of novel sequence information. If, as indicated above, this rate of novel
sequence discovery is indeed declining, then we would expect this to be reflected in a slowing in
improvements in the detection of remote homology, which in turn has implications across the
biological sciences. To investigate this we examined the performance of PSI-BLAST on the sequence
databases of the past twenty years.
We asked whether there has been an improvement over time in our ability to identify remote
homologs in a set of 6,982 proteins from the Structural Classification of Proteins (SCOP) (11) sharing
less than 30% sequence identity (SCOP30). SCOP is a database of protein structural domains curated
by experts that provides an authoritative classification of remote homologs on the basis of sequence
and known structure, exploiting the fact that structure is more conserved than sequence in evolution.
PSI-BLAST profiles were created for each of these sequences by scanning against each database from
1987 to 2007. Each sequence profile was then used to search the SCOP30 version 1.73 sequence
database and the number of detected homologs was recorded (see Methods).
In contrast to the CD-HIT clustering method, PSI-BLAST is able to routinely identify relationships
with less than 25% sequence identity and, as it operates on local alignments, is capable of
considering domains independently.
As expected, in the early phases of database growth we see a steady rise in our ability to detect remote
homology (Fig. 2c). However, this improvement does not directly scale with the massive increase in
available sequences particularly evident in the last decade. Even more striking, homology detection
plateaus in 2004 and subsequently shows a slight decline.
The same method was applied to the combined 2007 UniProt and UniMES metagenomic databases
(Fig. 2 – rightmost sample) and substantially fewer homologs were detected from SCOP30. This is
likely to be due to the large number of hypothetical sequences and sequence fragments within
metagenomic datasets, which have previously been shown to adversely affect the quality of PSI-BLAST profiles (24).
The cutting edge of remote homology detection is based on matching hidden Markov models and
achieves substantially superior performance to PSI-BLAST. We performed the same historical
analysis with one of the leading examples of such a program, HHsearch (25). While not plateauing,
performance improvements over time are slowing, and this slowing is particularly evident when
compared to the scale of sequence discovery (Fig. 2c).
It is also worth noting that although HHsearch performs far better than PSI-BLAST it is impractical
for it to be used on full sequence databases as it would require an all-vs-all search of the sequence
database to build models for every sequence. It is restricted to smaller databases such as those
containing sequences of known structures or individual genomes. This highlights one problem with
the increase in size shown in sequence databases: the increasing impossibility of certain analyses,
including all-vs-all searches (26). Some of the issues in dealing with large databases with high
redundancy have been previously discussed (7) and are hinted at by the slight decline seen in the PSI-BLAST homology hits for 2007. When the PSI-BLAST and HHsearch protocols were re-run on a
database consisting of sequence representatives from our islands (see Supporting Information Fig. 4),
we found that the reduced databases performed better in the majority of cases, especially for the later
databases with more redundancy. These results indicate that the idea of a representative – as opposed
to global – map of the protein universe is a realistic and possibly important goal.
Discussion
Our map of the protein universe is reliant on the sequences available to us from various sequencing
projects. As we gain novel data we expect our resolution of this map to increase and eventually
become a global map fully representing the space inhabited by all existing sequences. Obtaining a
representative map will be achievable long before we have sequenced all of Earth’s biodiversity
because of the similarities inherent in proteins due to sharing a common evolutionary origin. These
similarities mean that proteins will exist in islands, the distribution, shape and density of which reflect
their relationship to each other. When we reach a point where any new sequence can be placed in a
pre-existing island we will have obtained the representative map. In our study we investigated our
progress towards such a map over the last twenty years of sequence data acquisition.
The method for clustering sequences into islands was chosen due to its speed and the large amount of
data analysed. However, for reasons given in the Results section, this choice means our sequence
islands are a conservative measure of sequence novelty. Previous work (14, 15) showing a linear
growth in islands with added genomes suffered from similar limitations, which also makes their
estimates conservative. Both studies used only a small fraction of our dataset, roughly the size of our
1999 and 2001 databases respectively.
Methods for remote homology detection such as PSI-BLAST are able to routinely identify
relationships with sequence identity less than 25%. Being able to detect these remote relationships is
vital to our ability to map the protein universe and predict the function and structure of proteins. The
PSI-BLAST paper (19) has over 30,000 citations in the literature, making it the most highly cited paper of
the past decade. This reflects the crucial role played by remote homology detection for the accurate
inference of the relationships between protein sequence, structure, function and evolution.
It has been assumed (12, 13) that the continued growth of the sequence database, even in the absence
of novel algorithm development, will bring with it a steady improvement in our ability to detect
remote homology. One of the sources of information which powerful methods of remote homology
detection rely on is bridging sequences that connect distantly related protein families. If, as it appears,
we are approaching a representative picture of the global sequence map, then the rate of discovery of
such bridging sequences will progressively decline whilst the vast majority of sequences will fall into
pre-existing clusters. There are obvious parallels between these elusive bridging sequences and the
‘missing links’ in palaeontology. As with the missing links, many of these bridging sequences will be
transitional forms, or present in extinct lineages that will never be sequenced. Although sequence
space is continuous, its population by evolution is not.
That a representative protein sequence map may nearly be in our grasp may not be wholly surprising.
In terms of protein three-dimensional structure, such a map appears to be near completion. Fold space,
the space of distinct three-dimensional protein topologies, appears to be populated by a relatively
small number of protein folds, variously estimated at between 1,000 and 10,000 (28). Although there
is some debate regarding the discreteness or continuity of fold space, it is clear that the majority of
protein structures fall within a limited range of fold islands (29). With the aid of structural genomics
initiatives the number of experimentally determined protein structures has been growing rapidly while
the rate of novel fold discovery is slowing considerably (16). This indicates our image of fold space is
changing ever more slowly and that we are approaching a full representation of the protein structural
repertoire (30).
Two primary factors have governed our progress to date in remote homology detection and the
insights it generates into the relationships between protein sequence, structure, function and evolution:
novel algorithm development and the growth in available sequence information. In light of the
evidence presented here, we have reason to expect a diminishing role for the latter sooner rather than
later. If we wish to further our understanding of evolution by connecting the branches of the tree of
life, we will require both sustained development of new and more powerful algorithms for searching
for homology and a greater reliance on different sources of experimental data, such as the structures
being provided by structural genomics initiatives. In the face of data lost in evolutionary time such as
transient bridging sequences that connect the evolutionary map, it may now be timely to focus
attention on attempts to generate these missing links artificially. Some groups have created extra
diversity in the sequence databases by creating artificial sequences using multiple sequence
alignments and a set of structural rules (31) or using phylogenies as a guide to re-create ancestral
sequences (32). When these sequences are added to databases, an improvement is seen in remote
homology detection.
It is of course important to recognise that the findings reported here could be modified by the
sequencing of radically different organisms to those already analysed. Unarguably, sequencing
projects demonstrate some degree of bias in their choice of organism to sequence (33). However,
metagenomics has little or no such bias and yet demonstrates no improvements in the rate of
discovery of novel sequence islands. Although the number of unique protein domain sequences is
vast, it is nonetheless finite. It is inevitable that at some point the discovery of truly novel sequences
will become an extremely rare event and eventually all but cease. The surprising indications from this
work are how close we already appear to be to this representative map of the protein universe.
Materials and Methods
The entire analysis presented here took approximately 10 CPU years.
Recreation of past databases. The UniProt (34) databases from 1987 to 2007 were recreated using
the January 2008 UniProt_trembl.dat and UniProt_trembl.fasta files (available from the UniProt FTP
site).
Sequences were added to each database if they were found to have existed in UniProt before a given
date according to the DT line within the UniProt_trembl.dat file. PSI-BLAST searchable binary files
were created for each of the new databases using formatdb.
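The date filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: it assumes each flat-file entry ends with a `//` line, that the first DT line of an entry begins with a date of the form DD-MMM-YYYY, and that the first accession on the AC line identifies the entry. The function names are hypothetical.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# Assumed layout: "DT   21-JUL-1986, ..." (line code, spaces, then the date).
DT_PATTERN = re.compile(r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4})")


def entry_dates(flat_file_lines):
    """Yield (accession, first DT date) for every entry in the flat file."""
    accession, created = None, None
    for line in flat_file_lines:
        if line.startswith("AC") and accession is None:
            accession = line[2:].split(";")[0].strip()
        elif line.startswith("DT") and created is None:
            match = DT_PATTERN.match(line)
            if match:
                day, month, year = match.groups()
                created = date(int(year), MONTHS[month], int(day))
        elif line.startswith("//"):
            # End of entry: emit it if both fields were found, then reset.
            if accession is not None and created is not None:
                yield accession, created
            accession, created = None, None


def accessions_before(flat_file_lines, cutoff):
    """Accessions of entries whose first DT date is earlier than `cutoff`."""
    return {acc for acc, created in entry_dates(flat_file_lines) if created < cutoff}
```

Calling `accessions_before(open(path), date(1996, 1, 1))` on a flat file would give the accession set for one historical snapshot; the matching sequences could then be pulled from the FASTA file and formatted for searching.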
Metagenomic dataset. Metagenomic sequences were downloaded from the UniProt Metagenomic
and Environmental Sequences database (UniMES) (35). UniMES contains data from the Global
Ocean Sampling Expedition (GOS) (21). The downloaded fasta file is non-redundant to a 100%
threshold and contains approximately 6 million predicted sequences. A database was created which
contained the full 2007 sequence data + this metagenomic dataset. A 50% non-redundant version of
this database was also created (see below).
Sequence clustering using CD-HIT. CD-HIT (4) is a program that clusters sequence databases
according to a sequence identity threshold using a short word filtering heuristic. Representative
sequences are selected from each cluster and are used to form a new sequence database. CD-HIT is
the standard tool for creating representative databases and has been used by UniProt to create their
UniRef (6) reduced redundancy databases.
CD-HIT uses greedy incremental clustering. First, a sequence database is sorted according to
sequence length and the longest sequence is chosen as the representative of the first cluster. Every
other remaining sequence is then compared to the cluster representative and added to the cluster if
the similarity is above a certain threshold (50% in our study). The next longest remaining sequence is
then selected as a representative of a new cluster and the process continues until all sequences are
assigned a cluster.
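The greedy incremental procedure described above can be sketched in a few lines. This is a toy illustration, not CD-HIT itself: `toy_identity` is a hypothetical stand-in that compares ungapped, left-aligned sequences, whereas CD-HIT uses short-word filtering and real alignments.

```python
def toy_identity(a, b):
    """Fraction of identical positions in an ungapped, left-aligned comparison.

    A stand-in for illustration only; it is not how CD-HIT measures identity.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold=0.5, identity=toy_identity):
    """Greedy incremental clustering.

    Returns a list of clusters; the first member of each cluster is its
    representative (the longest sequence not yet assigned).
    """
    remaining = sorted(sequences, key=len, reverse=True)
    clusters = []
    while remaining:
        representative = remaining[0]
        members, unassigned = [representative], []
        for seq in remaining[1:]:
            if identity(representative, seq) > threshold:
                members.append(seq)
            else:
                unassigned.append(seq)
        clusters.append(members)
        remaining = unassigned
    return clusters


clusters = greedy_cluster(["GGGGGGG", "MKTAYIAKQ", "MKTAYIAKQR", "GGGGGGGG"])
representatives = [members[0] for members in clusters]
print(len(clusters))  # 2 islands
```

The list of representatives corresponds to the reduced database, and its length to the island count used throughout the analysis.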
CD-HIT was run on databases for every other year from 1987 and at a relatively high threshold of
50% sequence identity, due to the high level of computer resources required (100 CPU weeks for this
data). In addition, a combined 2007 + UniMES (metagenomic sequence) database was created and
CD-HIT was run on this database of just over 10 million sequences at a threshold of 50%. For each
processed database, a new database was produced, consisting of the representative sequence from
each cluster. The size of each of these representative databases provided our measure for the
number of islands.
Prediction of future growth in sequence islands. The number of new islands per new
sequence was calculated for each year between 1987 and 2007. A power-law curve was then fitted to
this data and extended until 2050.
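A power-law fit of this kind can be sketched as a straight-line fit in log-log space. The details here are assumptions, since the text does not specify them: the independent variable (for example, years elapsed since some origin), the fitting procedure, and the function names are all hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, performed as a line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def x_where_curve_reaches(a, b, target):
    """Solve a * x**b = target for x (meaningful for a decaying curve, b < 0)."""
    return (target / a) ** (1.0 / b)


# Synthetic, noise-free points lying on y = 0.9 * x**-0.5 (not the study's data).
xs = [1, 2, 4, 8, 16]
ys = [0.9 * x ** -0.5 for x in xs]
a, b = fit_power_law(xs, ys)
crossing = x_where_curve_reaches(a, b, 0.1)
```

With the fitted curve in hand, the point at which new islands per new sequence falls to 0.1 corresponds to 90% of new sequences landing in existing islands.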
Creating the SCOP30 test set. SCOP30, containing SCOP version 1.73 (11) sequences which share
no more than 30% global sequence similarity, was downloaded from ASTRAL (36)
(http://astral.berkeley.edu). These sequences were placed into homologous groups according to
superfamily membership defined within the SCOP domain classification. A PSI-BLAST searchable
binary SCOP30 file was created using formatdb.
Construction of database specific PSI-BLAST profiles. Each sequence within the SCOP30 test
set was searched against each of the recreated UniProt databases using four iterations of PSI-BLAST
with an inclusion threshold of 10⁻³. At the end of the fourth iteration a checkpoint file and PSSM was
output.
To test the robustness of our result, PSI-BLAST profiles were also output from 2, 3 and
5 iterations; the results are shown in Supporting Fig. 1. For the setting that consistently
detected the most homologs (4 iterations), another run was made with a more stringent inclusion
threshold of 10⁻⁶ (Supporting Fig. 2). The same trend was observed under all these conditions.
Identification of remote homologies within the SCOP30 test set. Each SCOP30 sequence was
searched against the entire SCOP30 database using the profiles output from the previous searches of
the recreated sequence databases. This was accomplished by initiating a single iteration of PSI-BLAST, restarting using the checkpoint files previously created. The number of SCOP30 sequences
belonging to the same superfamily as the query below an e-value threshold of 0.1 was recorded. The
proportion of predictions that were false positives at this threshold varied between 3.3 and 6.9% with
a mean of 5.1%.
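The counting step can be sketched as below. This is a minimal illustration under assumptions: hits are taken to be already parsed into (subject identifier, e-value) pairs, the superfamily labels in the usage note are placeholder strings, and whether the query's hit to itself is counted is left to the caller. The function names are hypothetical.

```python
def score_hits(query_superfamily, hits, superfamily_of, evalue_cutoff=0.1):
    """Count hits below the e-value cutoff as true or false positives.

    hits: iterable of (subject_id, evalue) pairs for one query.
    superfamily_of: dict mapping subject_id to its superfamily label.
    Returns (true_positives, false_positives).
    """
    true_pos = false_pos = 0
    for subject_id, evalue in hits:
        if evalue >= evalue_cutoff:
            continue  # not significant at this threshold
        if superfamily_of[subject_id] == query_superfamily:
            true_pos += 1
        else:
            false_pos += 1
    return true_pos, false_pos


def false_positive_proportion(true_pos, false_pos):
    """Proportion of accepted predictions that are false positives."""
    total = true_pos + false_pos
    return false_pos / total if total else 0.0
```

For example, a query from superfamily "a.1.1" with hits `[("d1", 1e-5), ("d2", 0.5), ("d3", 0.01)]`, where d1 and d2 belong to "a.1.1" and d3 to "b.2.1", gives one true positive and one false positive. Summing these counts over all queries for a given year's profiles gives the per-year totals.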
Randomised databases analysis. It is possible that the trend identified by PSI-BLAST is a result of
the order of discovery of sequences. For example, later databases might contain certain pathological
sequences, adversely affecting homology detection. To test this, the datasets were recreated with
randomised orders of discovery, whilst fixing the number of sequences for a given year. The same
trend was observed (Supporting Fig. 3).
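One way to build such randomised databases is sketched below. It assumes that the yearly databases are cumulative, so each randomised database is a prefix of a single shuffled ordering; the text does not state how the randomisation was implemented, and the function name and seed are hypothetical.

```python
import random


def randomised_databases(all_sequences, size_by_year, seed=0):
    """Shuffle the order of discovery once, then cut a prefix for each year.

    size_by_year: {year: number of sequences in that year's real database}
    Returns {year: list of sequences}, each nested within the later ones.
    """
    rng = random.Random(seed)
    shuffled = list(all_sequences)
    rng.shuffle(shuffled)
    return {year: shuffled[:size] for year, size in sorted(size_by_year.items())}
```

Each randomised database has the same size as the real database for that year, so any remaining trend cannot be attributed to which sequences happened to be discovered first.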
HHsearch. HHsearch (25) is a highly sensitive sequence alignment method based on the pairwise
comparison of profile Hidden Markov Models (HMMs). An HMM was created for each sequence in the
SCOP30 test set using each UniProt database (1987-2007). The PSI-BLAST parameters used by
HHsearch in the creation of the HMMs were identical to those used in the PSI-BLAST runs (4
iterations with an inclusion threshold of 0.001). No secondary structure information was used. An all
against all search of the HMMs was conducted for each year and all hits within the same superfamily
with a confidence (HHsearch score) greater than 95% were recorded. The proportion of predictions
that were false positives at this threshold varied between 0.2% and 4.6% with a mean of 2.6%.
Homology detection using representative databases. The same PSI-BLAST and HHsearch
procedures were applied to the sequence databases formed from the island representatives that were
created by CD-HIT (Supporting Fig 4).
Acknowledgements
D.C., B.R.J. and L.A.K. are supported by the Biotechnology and Biological Sciences Research
Council.
Author contributions
D.C. designed and wrote all code, performed the analysis and wrote the paper. B.J. supervised the
main analysis, prepared the figure, and contributed additional analyses. L.A.K. conceived the study,
supervised the analysis and wrote the paper. M.J.E.S. supervised the work and contributed to its
interpretation.
Figure Legends
Figure 1. Cartoon illustration of the progressive population of our sequence map. Early stages are
characterised by the creation of new sequence islands. In contrast, later stages are characterised by the
population of existing islands leading to an asymptotic approach to the complete protein map.
Figure 2. Plots (a) to (c) show three different views of the change in sequence space over the last two
decades. All three plots use the same horizontal axis. 2007 plus GOS indicates the combination of the
2007 sequence database with the Global Ocean Survey metagenomics data. (a) The protein sequence
database (black bars) has grown exponentially over the past two decades. A much smaller increase is
seen in the number of sequence islands (red bars) at a level of 50% sequence identity. This is
particularly the case with the metagenomic data (striped) which appears to have high redundancy.
Note that the vertical axis uses a different scale for the top and bottom halves, in order to show growth
both in the early sequence databases and the more recent ones. (b) The black line indicates the ratio of
the number of islands to the total number of sequences, representing the rate of novel island
discovery. Until 1995 this rate was steady or growing. Since that time this rate has been falling. There
is a sharp drop on the addition of metagenomic data. (c) This plot shows the change in the ability of
PSI-BLAST (blue line) and HHsearch (red line) to detect homology using profiles built from the
databases of each year. It is clear that the more computationally intense HHsearch detects more
homologs in the majority of cases. Although there is a general improvement in both methods over
time, this improvement slows and in the case of PSI-BLAST, it plateaus and even begins to decline.
The addition of metagenomic data adversely affects both methods.
Figure 3. A prediction of future island growth is made by fitting a power-law curve to the number of
new islands per new sequence in each year from 1987-2007. The GOS metagenomic data is not
included in our projection. We predict that by approximately 2040, 90% of new sequences will fit
within an existing island.
References
1. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC (2007) The Genomes On Line
Database (GOLD) in 2007: status of genomic and metagenomic projects and their
associated metadata. Nucl. Acids Res.:gkm884.
2. Leipe DD (1996) Biodiversity, genomes, and DNA sequence databases. Current Opinion
in Genetics & Development 6:686-691.
3. Crandall KA, Buhay JE (2004) EVOLUTION: Genomic Databases and the Tree of Life.
Science 306:1144-1145.
4. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of
protein or nucleotide sequences. Bioinformatics 22:1658-9.
5. Park J, Holm L, Heger A, Chothia C (2000) RSDB: representative protein sequence
databases have high information content. Bioinformatics 16:458-64.
6. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH (2007) UniRef: comprehensive
and non-redundant UniProt reference clusters. Bioinformatics 23:1282-1288.
7. Li W, Jaroszewski L, Godzik A (2002) Sequence clustering strategies improve remote
homology recognitions while reducing search times. Protein Eng 15:643-9.
8. Chothia C, Gough J, Vogel C, Teichmann SA (2003) Evolution of the protein repertoire.
Science 300:1701-1703.
9. Apic G, Gough J, Teichmann SA (2001) Domain combinations in archaeal, eubacterial
and eukaryotic proteomes. J. Mol. Biol 310:311-325.
10. Todd AE, Orengo CA, Thornton JM (2001) Evolution of function in protein
superfamilies, from a structural perspective. Journal of Molecular Biology 307:1113-1143.
11. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification
of proteins database for the investigation of sequences and structures. J Mol Biol
247:536-40.
12. Lobley A, Sadowski MI, Jones DT (2009) pGenTHREADER and pDomTHREADER:
new methods for improved protein fold recognition and superfamily discrimination.
Bioinformatics 25:1761-1767.
13. Sandhya S, Kishore S, Sowdhamini R, Srinivasan N (2003) Effective detection of remote
homologues by searching in sequence dataset of a protein domain fold. FEBS Letters
552:225-230.
14. Kunin V, Cases I, Enright A, de Lorenzo V, Ouzounis C (2003) Myriads of protein
families, and still counting. Genome Biology 4:401.
15. Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA (2006) Comprehensive genome
analysis of 203 genomes provides structural genomics with new insights into protein
family space. Nucl. Acids Res. 34:1066-1080.
16. Levitt M (2009) Nature of the protein universe. Proc. Natl. Acad. Sci. U.S.A 106:11079-11084.
17. Finn RD et al. (2010) The Pfam protein families database. Nucl. Acids Res. 38:D211-222.
18. Geer LY, Domrachev M, Lipman DJ, Bryant SH (2002) CDART: Protein Homology by
Domain Architecture. Genome Research 12:1619-1623.
19. Altschul S et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucl. Acids Res. 25:3389-3402.
20. Pearson WR, Sierk ML (2005) The limits of protein sequence comparison? Curr. Opin.
Struct. Biol 15:254-260.
21. Yooseph S et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding
the universe of protein families. PLoS Biol 5:e16.
22. Li W, Wooley JC, Godzik A (2008) Probing metagenomics by rapid cluster analysis of
very large datasets. PLoS ONE 3:e3375.
23. Ostell J (2005) Databases of Discovery. Queue 3:40-48.
24. Tress ML, Cozzetto D, Tramontano A, Valencia A (2006) An analysis of the Sargasso
Sea resource and the consequences for database composition. BMC Bioinformatics 7:213.
25. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics
21:951-60.
26. Hutchison CA (2007) DNA sequencing: bench to bedside and beyond. Nucl. Acids
Res.:gkm688.
27. Thornton JM, Orengo CA, Todd AE, Pearl FMG (1999) Protein folds, functions and
evolution. Journal of Molecular Biology 293:333-342.
28. Wolf YI, Grishin NV, Koonin EV (2000) Estimating the number of protein folds and
families from complete genome data. Journal of Molecular Biology 299:897-905.
29. Sadreyev RI, Kim B, Grishin NV (2009) Discrete-continuous duality of protein structure
space. Curr. Opin. Struct. Biol 19:321-328.
30. Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J (2006) On the origin and
highly likely completeness of single-domain protein structures. Proceedings of the
National Academy of Sciences of the United States of America 103:2605-2610.
31. Pei J, Dokholyan NV, Shakhnovich EI, Grishin NV (2003) Using protein design for
homology detection and active site searches. Proc. Natl. Acad. Sci. U.S.A 100:11361-11366.
32. Cai W, Pei J, Grishin NV (2004) Reconstruction of ancestral protein sequences and its
applications. BMC Evol Biol. 4:33.
33. Kyrpides NC (2009) Fifteen years of microbial genomics: meeting the challenges and
fulfilling the dream. Nat Biotech 27:627-632.
34. Wu CH et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of
protein information. Nucleic Acids Res 34:D187-91.
35. The UniProt Consortium (2009) The Universal Protein Resource (UniProt) 2009. Nucleic Acids
Res. 37:D169-D174.
36. Brenner SE, Koehl P, Levitt M (2000) The ASTRAL compendium for protein structure
and sequence analysis. Nucl. Acids Res. 28:254-256.