Transfer overview - Structural Bioinformatics Group

Transfer Report
Daniel Chubb
Supervisor : Professor Mike Sternberg
1. Table of Contents
2. Index of figures
3. Transfer overview
4. Introduction
4.1. Protein evolution and function
4.1.1. The protein structure -> sequence -> function evolutionary paradigm
4.1.2. Homology
4.1.3. Protein sequence similarity
4.1.4. Pairwise sequence similarity
4.1.5. Intermediate sequence searching
4.1.6. Substitution matrices
4.1.7. The importance of conservation and sequence profiles
4.1.8. CS-Blast : Context-specific substitutions
4.1.9. Extracting extra information from homology hits – Networks and noise
4.1.10. Ancestral sequences
4.1. The growth of sequence space
5. Turning back the sequence clock
5.1. Results
5.1.1. The effect of sequence increase on homology detection
5.1.2. Homology detection using randomized sequence databases
5.1.3. Investigating family relationships
5.1.4. HHsearch
5.1.5. Detectable homologs are gained and lost in consecutive years
5.1.6. Sequence redundancy has a negative impact on homology detection
5.1.7. Profile flattening and the difficulties of ‘one-size-fits all’ profiles
5.1.8. False positives
5.1.9. Orphans
5.1.10. Transitive homology detection over a network
6. CASP 08
7. Discussion
8. Future plans
9. References
2. Index of figures
Figure 1 – Although proteins A and C do not have the significant similarity required to be
considered homologous, they share significant similarity with an intermediate protein
sequence (B) and can therefore be considered to be homologous themselves.
Figure 2 - CS-BLAST calculates the substitution probabilities for each residue based upon
its surrounding context (red box).
Figure 3. An alignment algorithm (in this case BLAST) is used to identify a list of hits for each
sequence in a protein sequence database. A new sequence can search for putative
homologs in this sequence database by first identifying hits (using the same method) and
then comparing these hits with the pre-calculated ones of the sequence database.
Figure 4 – The similarity in sequence of related proteins varies according to the time since
they diverged from a common ancestor. In this figure, the similarity between the A-2 and
B-2 generations is less than the similarity between the A-1 and B-1 generations, which in
turn is less than the similarity between the A and B sequences which diverged from the last
common ancestor.
Figure 4. Although all the proteins in groups A, B and C from year X are homologous, only a
few of these relationships can be detected using the methods described in the preceding
sections (shown as an edge). New sequences are introduced in year X+1 and
homologies are detected within the currently held sequences. These new sequences have
acted as bridges in sequence space, allowing homology to be inferred between all sequences
of A, B and C.
Figure 5. A chart showing the size of the UNIPROT protein database from 1987-2007. Due to
the scale, to aid readability the number of sequences present in each year for the first
decade has been inserted above the bars.
Figure 6. Sequence database yearly test procedure. (1) End of year databases are recreated
from the UNIPROT database. (2) Sequences of SCOP30 domains are extracted from SCOP.
(3) Each SCOP30 sequence is searched against each UNIPROT database and a PSIBLAST
profile is output. (4) Each PSIBLAST profile is searched against the full SCOP30 database and
putative homologs are identified.
Figure 7. The number of distant homologs detected by PSIBLAST is compared to the size of
the database used in the construction of the sequence profiles.
Figure 8. The results from Figure 7 are compared to the number of homologs found by
PSIBLAST when the order of protein sequence discovery is randomized. Because of the close
agreement of all four randomizations, it is only possible to see three of the plots.
Figure 9 - The number of same family homologs detected by PSIBLAST is compared to the
size of the database used in the construction of the sequence profiles.
Figure 10 - The number of distant homologs detected by HHsearch is compared to the size
of the database used in the construction of the HMM profiles.
Figure 11. A chart comparing the homologs gained and lost each year using PSIBLAST. For
example, in 2007, the blue bar shows that approximately 3000 sequences from 2006 are not
found, while approximately 2500 are present which were not in 2005.
Figure 12 - A chart comparing the homologs gained and lost each year using HHsearch.
Figure 13. An orange bar is added to Figure 7, showing the amount of non-redundant (at 60%)
sequence being added to the sequence database each year.
Figure 14. The homolog detection analysis against the Uniref50 database is added to Figure
7. Uniref50 contains no sequences with a pair-wise identity >50%. Profiles created using
Uniref50 perform better than all other databases when detecting homologs.
Figure 15 – The SD of each residue (row in the PSIBLAST profile) is calculated and then
averaged to calculate the SD for a given profile.
Figure 16 - The standard deviation is calculated for each PSIBLAST profile.
Figure 17 – The number of false positives found using HHsearch and PSIBLAST.
Figure 18 – The number of sequences that are unable to find any homologs based upon
profiles created from the sequences present in a given year.
Figure 19 – By allowing all hits to be transitively assigned to each sequence in a connected
component, all false positives would also be inherited, increasing the number of errors
present.
3. Transfer overview
This report describes work undertaken to investigate homology detection within the context
of a changing sequence information landscape. The primary objective was to explore how
the different protein sequences deposited in databases each year affect homology detection
algorithms and therefore the classification of proteins, including structure and function
prediction.
The most obvious change in sequence data is the huge increase in quantity observed over
the past two decades (a 10^3 to 10^6 increase); it is assumed that this extra information has
helped homology detection by filling gaps in sequence space (Figure 5). The following
hypothesis was tested:
"The massive increase in available protein sequences seen over the last twenty years has
led to improvements in protein homology detection."
The results of the investigation showed that the influx of sequences initially had a net
positive impact, but that homology detection peaked (in 2004 in this dataset) and
further increases in database size (~3x) have resulted in a net reduction in homology
detection from the peak value. It was further demonstrated that each year the change in
sequences results in a combination of gains and losses of detectable homologs. This result
was supported using different homology detection methods, and using both a set of distantly
related homologs and a set of homologs with closer relationships.
It is clear that the makeup of the sequence database has an important effect on homology
prediction algorithms. The remainder of the report investigates the cause of this effect, so
that a method for improving homology detection can be developed that is not prone to the
same errors. This includes both analysis of original data and a comparison with similar work
undertaken by others.
The research undertaken during this project was instrumental in the construction of the
highly successful Imperial College CASP structural prediction server. The contribution to this
server is also discussed.
Finally, a plan of action is put forth for the next 15 months.
4. Introduction
This chapter outlines the principles underlying protein homology detection and their
importance in protein classification. The evolution of proteins and their homologous
relationships is discussed along with a review of the past and present methods of homology
detection, within the context of a changing sequence information landscape.
4.1. Protein evolution and function
4.1.1. The protein structure -> sequence -> function evolutionary paradigm
The work described in this report is informed by and relies upon the basic paradigm that
the function of a protein is determined by, in order of importance, its structure, its amino
acid sequence and its DNA sequence. Redundancy in the genetic code means that a change in the
DNA sequence does not necessarily result in a change of amino acid. Where a change does occur, the
similarity of amino acid properties (e.g. hydrophobicity and polarity) means that such a
change will not necessarily affect the structure of the protein sufficiently to cause a
functional change. Finally, not all parts of a protein structure are vital for a given function, so
a portion of a structure can change while the function is maintained (compare, for example, a
catalytic site with a disordered loop). It is therefore the case that the
structure of important functional sites in proteins is under selective pressure to remain
conserved, and this conservation extends to a varying extent to the amino acid and DNA
sequences.
4.1.2. Homology
Homologous proteins are those that have diverged from a common ancestral sequence.
They can be divided into two groups, orthologs and paralogs, that differ in their mode of
initial divergence (Jensen 2001). Orthologs arise during speciation, where a protein present
in one species is now present in both newly diverged species. Initially, the sequence,
structure and function of these orthologs are identical, but over time they diverge in
response to the unique selective pressures acting upon them. Paralogs are the result of a
gene duplication event; in this case, two copies of a pre-existing gene are now present
within the same organism/species. These two paralogous sequences can evolve separately.
Orthologs are thought to be more likely to retain the same function in the course of
evolution, whereas paralogs are more free to evolve new functions, usually related to the
original one (Tatusov et al. 1997). Successive speciation and duplication events over an
evolutionary timescale can result in large families of homologous proteins, with varying
degrees of similarity in their sequence, structure and function (Orengo & Thornton 2005).
Detecting these homologous relationships has become the most powerful tool for predicting
the biological function of a gene of interest. Given an unannotated protein, one can infer its
function from a well annotated homolog. For example, a major insight into carcinogenesis
was provided by the discovery that some human tumour repressor genes were homologous to
yeast DNA repair genes (Wheals 1985; DeFeo-Jones et al. 1983). The importance of
homology detection for function prediction is highlighted by the range of available methods
and servers utilising this paradigm (Emes 2008; Hunter et al. 2009; D. M. A. Martin et al.
2004; Hawkins et al. 2009; Wass & Sternberg 2008; Lee et al. 2007).
Another area which has been shown to be highly reliant on homology detection is protein
structure prediction. Of the two main varieties of structure prediction, template based and
template free, it is template based prediction that has had the most success in predicting the
structure of a protein from its sequence (Moult et al. 2003). Template based methods, also known
as ‘threading’, take a sequence and attempt to thread it on to a homolog of known
structure. This technique can be used to predict the structure of sequences from even
distantly related proteins. An example of a successful structure prediction server is the
Imperial College Phyre server, which is discussed in section 6.
To assign homology between two proteins, a measurement or proxy of relatedness needs to
be decided upon and algorithms need to be devised to detect said measure. As previously
discussed, homologous proteins share similarity in their structure and their amino acid
sequence - the degree of which is related to the level of chronological and functional
divergence. Of the two, structure is the more likely to be conserved, as sequence is able to
diverge substantially without affecting the structure and therefore the function. The most
reliable assignment of homology is therefore based upon similarity in structural features
between proteins. The SCOP (Murzin et al. 1995) database is an example of protein
classification using similarities in structure. SCOP classifies on the basis of individual protein
domains, as they are thought to be self-folding fundamental evolutionary units. Of course,
assigning structural similarities relies upon the presence of a solved structure, either
from X-ray crystallography or NMR. Solving a structure is a highly time consuming process,
especially in the case of trans-membrane proteins.
By far the most complete data we have for proteins is sequence data. There are currently
(as of January 21st 2009) 4,512 ongoing and complete genomes listed in the GOLD genomes
online database (Liolios et al. 2007), covering all branches of the tree of life. This amounts to
7,001,017 sequences within the UNIPROT (Wu et al. 2006) sequence database, in
comparison with 55,660 solved structures within the Protein Data Bank (PDB) (Berman et al.
2000). It is therefore not surprising that the most commonly used measure for the
detection of homology is protein sequence similarity.
4.1.3. Protein sequence similarity
As homologous sequences have diverged from a common ancestor, their sequences will
share a degree of similarity that varies depending on the chronological and functional
difference between them. In contrast, any similarities in the sequences of unrelated
proteins will most often be due to chance. Sequence based homology detection relies upon
this distinction, identifying homologs from the vast amount of unrelated background
sequences. This process is trivial for closely related homologs, which share a high degree of
sequence similarity; however, the more distantly two proteins are related, the harder it
becomes to align them and to distinguish them from background noise.
The following sections describe the commonly used methods for protein sequence based
homology detection.
4.1.4. Pairwise sequence similarity
The simplest way of detecting homology considers only two sequences at a time, i.e. is
sequence A significantly similar to sequence B? BLAST (S F Altschul et al. 1990) is arguably
the most popular tool that uses this method. BLAST matches short 'words' from a query
sequence to a database of protein sequences; each local match (alignment) is then
extended until its score falls below a threshold. An expectation value (e-value) is given to each
match, representing the number of hits of equal significance one would expect to find by
chance in the search database.
Close homologs have low e-values, while distant homologs have e-values which are
indistinguishable from those of unrelated sequences and therefore cannot be assigned as
homologs without including many false positives.
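As an illustration of how an e-value relates to an alignment score, the sketch below uses the Karlin-Altschul relation E = K m n exp(-lambda S), on which BLAST statistics are based. The function name is hypothetical, and the default values of lambda and K are assumptions chosen for illustration (values of this order are commonly quoted for ungapped BLOSUM62 scoring); the values reported by a real BLAST run should be used in practice.

```python
import math

def expected_chance_hits(raw_score, query_length, database_length,
                         lam=0.3176, k=0.134):
    """Karlin-Altschul estimate of the number of chance hits scoring >= raw_score.

    lam and k are illustrative defaults (an assumption, not taken from the report);
    they depend on the scoring matrix and gap penalties actually used.
    """
    search_space = query_length * database_length
    return k * search_space * math.exp(-lam * raw_score)

# A higher raw score gives a smaller e-value for the same search space.
weak = expected_chance_hits(40, query_length=300, database_length=10**9)
strong = expected_chance_hits(120, query_length=300, database_length=10**9)
```

Because the e-value scales linearly with the size of the search space, the same alignment becomes less significant as the database grows, which is relevant to the database-size effects investigated later in this report.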
4.1.5. Intermediate sequence searching
From considering isolated homologous pairs, the next most intuitive step is using
intermediate sequences (Figure 1).
[Figure 1: diagram of sequences A, B and C. A-B and B-C are significant matches, from which homology between A and C is inferred.]
Figure 1 – Although proteins A and C do not have the significant similarity required to be considered homologous, they
share significant similarity with an intermediate protein sequence (B) and can therefore be considered to be
homologous themselves.
For example, suppose A, B and C are all homologs; A and B are closely related, as are B and C, but
the relationship between A and C is more distant. Even if the relationship between A and C is
undetectable when considered in isolation, it can be inferred from the detectable relationships
A-B and B-C. Intermediate sequence searching (ISS) (Park et al. 1998), which uses
iterative BLAST searches, follows this procedure. ISS is more sensitive than BLAST but is
computationally more expensive, and it fails to detect homology if sufficient intermediate
sequences of high confidence are not available.
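The inference step can be sketched as follows. This is a minimal illustration that assumes the significant pairwise hits have already been computed (for example by BLAST) and stored as a dictionary; the function name and data layout are hypothetical and this is not the implementation of Park et al.

```python
def infer_via_intermediates(significant_hits):
    """Infer homologs reachable through exactly one intermediate sequence.

    significant_hits maps each sequence id to the set of ids with which it
    has a significant direct match (hypothetical, pre-computed input).
    """
    inferred = {}
    for query, direct in significant_hits.items():
        indirect = set()
        for intermediate in direct:
            # Everything the intermediate matches is a candidate homolog.
            indirect |= significant_hits.get(intermediate, set())
        # Keep only relationships that were not already detected directly.
        inferred[query] = indirect - direct - {query}
    return inferred

hits = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
inferred = infer_via_intermediates(hits)
```

In this toy example the A-C relationship is recovered through B even though A and C have no direct significant match. A false A-B or B-C hit would be propagated in exactly the same way, which is why high confidence intermediates are required.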
4.1.6. Substitution matrices
An amino acid is more likely to mutate to a residue with similar chemical properties, as
such a change is less likely to interfere with the correct folding and activity of the protein.
For example, a hydrophobic residue such as leucine is more likely to mutate to another
hydrophobic residue such as valine. Alignment programs such as BLAST utilize this
information in the form of substitution matrices, the most widely used of which are the
BLOSUM matrices (S. Henikoff & J. G. Henikoff 1992).
The BLOSUM matrices are calculated by recording the substitutions within conserved
sequence blocks of homologous proteins from a multiple sequence alignment. Within these
blocks, sequences are clustered according to their percentage sequence identity; substitution
frequencies are then taken between clusters. This method reduces the bias from closely
related sequences. The BLOSUM62 matrix sets the cluster threshold at 62% and has been
shown to provide good detection rates for a range of homologs (S. Henikoff & J. G. Henikoff
1992).
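To make the calculation concrete, the following sketch derives log-odds scores from the residue pairs observed in a handful of alignment columns, using the half-bit log-odds form s = 2 log2(observed / expected) on which BLOSUM scores are based. It is a simplified illustration: the clustering step described above is omitted, the input columns are invented toy data, and the function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(columns):
    """Half-bit log-odds scores from residue pairs seen in alignment columns.

    columns: list of strings, one residue per aligned sequence (toy input).
    Only pairs that are actually observed receive a score.
    """
    pair_counts = Counter()
    for column in columns:
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Marginal residue frequencies implied by the observed pairs.
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = residue_freq[a] * residue_freq[b]
        if a != b:
            expected *= 2  # unordered pair (a, b) can arise in two ways
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores

toy_scores = blosum_like_scores(["LLLV", "LLVV", "KKKK", "KKKR"])
```

With so few columns the resulting numbers are not meaningful in themselves; the point is only that pairs seen more often than chance receive positive scores.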
Substitution matrices such as the BLOSUM matrices described above represent a general
background frequency for amino acid substitutions. They do not take account of the specific
substitution frequencies that arise from conservation within specific regions of a protein family. For
example, it is likely that a polar residue on the surface of a protein is not well conserved and
can be substituted by any other polar residue; the same residue in the core of a protein may
be absolutely conserved as it allows a stabilizing hydrogen bond to be formed. Therefore, a
more reliable method is to take account of the sequence conservation particular to a given
family of related sequences.
4.1.7. The importance of conservation and sequence profiles
Not all positions in a protein sequence are of equal importance when considering homology.
Some positions are highly conserved within homologous groups and are therefore more
informative than other more variable positions. Profile matching techniques take advantage
of this extra information in order to identify distant homologs. The most widely used
example of a profile matching tool, PSIBLAST (SF Altschul et al. 1997) (27,068 citations as of
January 2009), works in the following way. In its first iteration, PSIBLAST identifies
sequences that meet a specific inclusion threshold (the same as a single BLAST search utilizing a
substitution matrix such as BLOSUM62) and uses these sequences to generate a profile, or
position specific score matrix (PSSM), to be used for the next iterative search. This PSSM
records the amino acids present at each aligned position and is then used to align and score
sequences in the database to search for new statistically significant hits. This process is
repeated until a fixed number of iterations has been completed, the model converges,
or no new statistically significant sequences are found.
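The idea of a position specific score matrix can be sketched as below. This is a deliberately simplified illustration and not PSIBLAST's actual procedure: sequence weighting is ignored, a uniform background and a flat pseudocount are assumed, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_sequences, pseudocount=1.0):
    """Build a log-odds profile, one row (dict) per alignment column.

    Assumes equal-length aligned sequences, a uniform background frequency
    and a flat pseudocount (all simplifying assumptions).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for position in range(len(aligned_sequences[0])):
        counts = Counter(seq[position] for seq in aligned_sequences
                         if seq[position] in AMINO_ACIDS)  # skips gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
               for aa in AMINO_ACIDS}
        pssm.append(row)
    return pssm

def score_against_pssm(pssm, sequence):
    """Sum the position specific scores for an ungapped sequence."""
    return sum(row[aa] for row, aa in zip(pssm, sequence))

profile = build_pssm(["ACDK", "ACEK", "ACDR"])
```

Positions that are conserved in the alignment (the first two columns here) contribute large positive scores for the conserved residue and negative scores for everything else, whereas variable positions are scored more permissively.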
Sequence profile techniques such as PSIBLAST or those using HMMs (e.g. the SAM-Tx packages
(K Karplus et al. 1998)) are more sensitive, as they are able to contain more evolutionary
information than a single sequence. They are however prone to errors if a false positive
result is incorporated within a profile. If this occurs the profile is polluted and can bring in
more and more erroneous results, drifting away from the correct region of sequence space.
It is also debatable whether a profile is capable of representing all the sequences within
highly divergent families without being diluted to a point where false positives are included,
again leading to drift; this possibility will be discussed in more detail later on.
A third class of homology detection methods are profile-profile searches. Libraries of
profiles are created for each sequence in a database as described above and homology is
detected by then comparing profiles; this method has been shown to be more sensitive but
much more computationally intensive. An example of a profile-profile matching tool is
HHsearch (Söding 2005), which compares HMM profiles and is currently a part of the
Imperial College structural bioinformatics CASP server.
4.1.8. CS-Blast : Context-specific substitutions
As discussed above, simple substitution matrices lack information regarding context and
conservation, and this situation is remedied by using iterative profile generating methods
such as PSIBLAST. PSIBLAST does however still rely upon an initial BLAST run using a
substitution matrix (usually BLOSUM62). If the query sequence has no close
homologs then this initial BLAST may return no confident hits to be aligned and used in profile
construction; the power of PSIBLAST then cannot be utilized, no hits are found and the
sequence is defined as an orphan. It would therefore be highly beneficial if the initial
search were improved by using extra context based information without the need for profile
construction. Such an approach, termed CS-BLAST, has recently been developed and has shown
highly positive initial results (Biegert & Söding 2009).
Figure 2 - CS-BLAST calculates the substitution probabilities for each residue based upon its surrounding
context (red box). The context is queried against a pre-compiled library of context profiles. A library profile
contributes to the CS-BLAST context profile with a weight determined by its similarity to the sequence context.
Instead of looking at a single position in a sequence, CS-BLAST calculates the substitution
likelihood for each residue based upon its local context, as summarized in Figure 2. As with
the PSIBLAST profiles, the CS-BLAST context-based profiles capture extra structural and
evolutionary constraints which are missed by the simple single residue based substitution
matrices. The resulting profiles detect more evolutionarily distant sequences and can be used
to jump-start PSIBLAST.
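The weighting of library profiles by their similarity to the sequence context can be sketched as follows. This is an illustrative simplification rather than the CS-BLAST implementation: the library format, the three-letter toy alphabet, the absence of positional weights and priors, and the function name are all assumptions.

```python
import math

def context_specific_distribution(window, library):
    """Mix library profiles according to how well each matches the window.

    window: residues surrounding (and including) the position of interest.
    library: list of dicts with 'columns' (one residue->probability dict per
    window position) and 'emission' (substitution probabilities contributed
    for the central residue). Both structures are hypothetical.
    """
    log_likelihoods = [
        sum(math.log(column[residue])
            for column, residue in zip(profile["columns"], window))
        for profile in library
    ]
    # Convert log-likelihoods to normalised weights in a numerically safe way.
    largest = max(log_likelihoods)
    weights = [math.exp(value - largest) for value in log_likelihoods]
    norm = sum(weights)
    weights = [w / norm for w in weights]

    mixed = {}
    for weight, profile in zip(weights, library):
        for residue, probability in profile["emission"].items():
            mixed[residue] = mixed.get(residue, 0.0) + weight * probability
    return mixed

hydrophobic = {"columns": [{"L": 0.7, "A": 0.2, "K": 0.1}] * 3,
               "emission": {"L": 0.6, "A": 0.3, "K": 0.1}}
charged = {"columns": [{"L": 0.1, "A": 0.2, "K": 0.7}] * 3,
           "emission": {"L": 0.1, "A": 0.2, "K": 0.7}}
mixed = context_specific_distribution("LAL", [hydrophobic, charged])
```

For the window "LAL" the hydrophobic library profile dominates the mixture, so the central alanine is given substitution probabilities that favour hydrophobic residues, which a single context-free matrix row could not express.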
4.1.9. Extracting extra information from homology hits – Networks and noise
A fourth class of homology detection methods attempts to extract more information from
the hits generated from the previously described methods. Two examples are described
below.
RankProp (Noble et al. 2005). RankProp creates a weighted graph from the results of
homology searches and then conducts a random walk across the network, starting at a
particular sequence/node, and scores putative homologs based upon the amount of time
spent on any given node during the walk. The initial weights are equal to the score given by
the alignment algorithm (e.g. the e-value for BLAST) and are then normalized to a value between 0
and 1. The chance of making a transition from one node to another in the network is equal
to this normalized score. RankProp therefore manages to extract more information
from the whole network structure than is picked up by the alignment programs, which
only explore local regions of the network. The creation of such a network does, however,
require a significant computational overhead, as an all against all search needs to be
performed and the network itself needs to be constructed. This is not a trivial task for
databases like Uniprot with > 7,000,000 sequences.
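A random walk of this kind can be sketched with a simple power iteration. The version below is a generic random walk with restart, used here as a stand-in to illustrate the principle; the restart probability, the iteration count, the toy weights and the function name are assumptions and do not reproduce the published RankProp algorithm.

```python
def random_walk_scores(weights, start, restart=0.15, iterations=100):
    """Estimate the fraction of time a walker from `start` spends on each node.

    weights: dict node -> dict neighbour -> similarity weight (toy input in
    which every neighbour also appears as a key).
    """
    nodes = list(weights)
    # Row-normalise the weights into transition probabilities.
    transitions = {}
    for node in nodes:
        total = sum(weights[node].values())
        transitions[node] = ({nbr: w / total for nbr, w in weights[node].items()}
                             if total else {})
    scores = {node: 0.0 for node in nodes}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        updated[start] += restart  # walker jumps back to the query
        for node in nodes:
            if not transitions[node]:
                # Dead end: return the walker to the query sequence.
                updated[start] += (1 - restart) * scores[node]
                continue
            for neighbour, probability in transitions[node].items():
                updated[neighbour] += (1 - restart) * scores[node] * probability
        scores = updated
    return scores

graph = {
    "A": {"B": 0.9, "D": 0.1},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"B": 0.8},
    "D": {"A": 0.1},
}
walk = random_walk_scores(graph, start="A")
```

Node C has no direct edge to A but still receives a score through B, which is the additional network information that pairwise alignment alone does not use.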
SCOOP (Bateman & Finn 2007). SCOOP works by ranking the co-occurrence of hits from other
homology detection programs such as BLAST; this is described in Figure 3. If two sequences
share the same hits (scored by comparing the observed number of shared hits with the
expected number) then they are likely to be homologs.
Figure 3. An alignment algorithm (in this case BLAST) is used to identify a list of hits for each sequence in a protein
sequence database. A new sequence can search for putative homologs in this sequence database by first identifying hits
(using the same method) and then comparing these hits with the pre-calculated ones of the sequence database.
SCOOP can find homologs even when the compared hits are insignificant, showing that
there is useful data in the ‘noise’ as well as in the confident hits. Again, this method requires
a library of precompiled BLAST hits and is not realistic for large databases. SCOOP was utilized
in the Imperial College CASP server, using a small database of structurally defined sequences
(see section 6).
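The comparison of hit lists can be sketched as below. The scoring shown here, a ratio of observed to expected shared hits under an independence assumption, is one simple choice made for illustration and is not claimed to be the SCOOP scoring function; the function name and inputs are hypothetical.

```python
def shared_hit_score(hits_a, hits_b, database_size):
    """Ratio of observed to expected shared hits for two hit lists.

    The expectation assumes that hits are drawn independently and uniformly
    from a database of `database_size` sequences (a simplifying assumption).
    """
    set_a, set_b = set(hits_a), set(hits_b)
    observed = len(set_a & set_b)
    expected = len(set_a) * len(set_b) / database_size
    return observed / expected if expected else 0.0

enrichment = shared_hit_score(["s1", "s2", "s3", "s4"],
                              ["s3", "s4", "s5", "s6"],
                              database_size=100)
```

A score well above 1 indicates that two sequences share more hits than chance would suggest, even if none of the individual hits is significant on its own.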
4.1.10. Ancestral sequences
The sequence similarity between homologous proteins is usually dependent upon how
distantly related they are, or more formally: the time since both sequences shared a
common ancestor. One would therefore expect ancestral sequences along parallel related
lineages to have a higher degree of sequence similarity as they diverged more recently from
their common ancestor (Figure 4).
Figure 4 – The similarity in sequence of related proteins varies according to the time since they diverged from a common
ancestor. In this figure, the similarity between the A-2 and B-2 generations is less than the similarity between the A-1
and B-1 generations, which in turn is less than the similarity between the A and B sequences which diverged from the
last common ancestor.
If ancestral sequences could be added to the sequence databases, in theory they would be
able to pull branches closer together, thus possibly allowing previously unidentified
relationships to be discovered. Unfortunately, by definition we do not possess ancestral
sequences; there are however methods of reconstructing them. Ancescon (Cai et al. 2004)
uses distance based phylogenetic inference to recreate ancestral sequences of sequence
lineages. Cai et al. demonstrated that when the hypothetical ancestors generated by
Ancescon were added to databases of present day sequences, there was an improvement in
profile-based sequence similarity searches. The use of such a technique is discussed more in
section 8.
4.1. The growth of sequence space
Improvements in sequencing technology have led to a rapid increase in the accumulation of
sequence information. It is commonly believed that as new protein sequences are obtained,
the above methods will increase in accuracy as gaps in sequence space are bridged, allowing
more distant homologs to be found (described in Figure 5).
[Figure 5: two panels showing groups A, B and C in the protein database with detectable homologies in year X and in year X+1. An edge joins two proteins with detectable homology; in year X+1 a new sequence is added, bridging a gap in sequence space.]
Figure 5. Although all the proteins in groups A, B and C from year X are homologous, only a few of these relationships can
be detected using the methods described in the preceding sections (shown as an edge). New sequences are
introduced in year X+1 and homologies are detected within the currently held sequences. These new sequences have
acted as bridges in sequence space, allowing homology to be inferred between all sequences of A, B and C.
Over the past two decades sequence space has been increasing exponentially (Figure 6). If
this belief is true, one would expect an associated increase in our ability to detect remote
homologs using the profile based methods described above. Work undertaken to test this
expectation is discussed below.
Figure 6. A chart showing the size of the UNIPROT protein database from 1987-2007. Due to the scale, to aid readability
the number of sequences present in each year for the first decade has been inserted above the bars.
5. Turning back the sequence clock
A profile can be constructed for a given protein sequence (as described in 4.1.7), the
contents of which are dependent on the sequence space represented within the searched
database. If the massive increase in the number of available sequences has had an impact
on homology detection, one would expect the sequence profiles generated by PSIBLAST
from databases in different years to have differing abilities to detect a known homology.
This investigation assessed whether this was the case.
The basic methodology is shown in Figure 7 and is described here. The sequence databases
of the past two decades were recreated by extracting sequences from the Universal Protein
Resource database (Uniprot) (Wu et al. 2006) according to their creation date. The
benchmark of homology was same-superfamily membership within the SCOP (Murzin et al.
1995) database of structural protein domains. Specifically, the sequences were taken from
SCOP30, a subset of the SCOP database containing sequences with a similarity of no more than
30% to each other; any homologs within this set are considered to be remote. PSIBLAST
profiles were created for each of these sequences from each sequence database (from 1987
to 2007) using four iterations. Each sequence profile was then searched against the fixed
SCOP30 sequence database and the number of positive hits was recorded.
The effects of the discovery order of new sequences and of redundancy within the databases
were also investigated; these are discussed in more detail in the results (5.1) and discussion (7)
sections.
Figure 7. Sequence database yearly test procedure. (1) End of year databases are recreated from the UNIPROT database.
(2) Sequences of SCOP30 domains are extracted from SCOP. (3) Each SCOP30 sequence is searched against each
UNIPROT database and a PSIBLAST profile is output. (4) Each PSIBLAST profile is searched against the full SCOP30
database and putative homologs are identified.
5.1. Results
5.1.1. The effect of sequence increase on homology detection
For each year between 1987 and 2007, PSIBLAST profiles were created for each sequence
within the SCOP30 set (6982 sequences). When each of these profiles was searched against
the fixed SCOP30 database, the number of hits below the e-value threshold of 0.1 was
recorded and each hit was assessed for same-superfamily membership. If a hit belonged to
the same superfamily as the query sequence/profile, it was counted as a detected
homolog. The number of detected homologs in response to database year/size is shown in
Figure 8.
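The counting rule above can be sketched as follows. This is an illustrative example only: the domain identifiers, superfamily labels and hit list are invented, and the exclusion of self hits is an assumption rather than something stated in the text.

```python
E_VALUE_THRESHOLD = 0.1  # threshold stated in the text

# Hypothetical superfamily labels for four SCOP30 domains.
superfamily = {"d1": "a.1.1", "d2": "a.1.1", "d3": "b.2.3", "d4": "b.2.3"}

# Hypothetical search output: (query, hit, e-value).
hits = [
    ("d1", "d2", 1e-5),   # same superfamily, significant
    ("d1", "d3", 0.05),   # different superfamily, significant
    ("d3", "d4", 0.5),    # same superfamily, not significant
    ("d1", "d1", 1e-30),  # self hit
]

def classify_hits(hits, superfamily, threshold=E_VALUE_THRESHOLD):
    """Split significant hits into detected homologs and false positives."""
    detected, false_positives = [], []
    for query, hit, evalue in hits:
        if query == hit or evalue >= threshold:
            continue  # skip self hits (an assumption) and non-significant hits
        if superfamily[query] == superfamily[hit]:
            detected.append((query, hit))
        else:
            false_positives.append((query, hit))
    return detected, false_positives

detected, false_positives = classify_hits(hits, superfamily)
```

The same function also yields the significant hits that fall outside the query's superfamily, which are the quantity examined in section 5.1.8.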
Figure 8. The number of distant homologs detected by PSIBLAST is compared to the size of the database used in the
construction of the sequence profiles.
The results show that the initial addition of new sequences is accompanied by an increase in
homology detection; however, by 2004 homology detection plateaus and thereafter it
declines. Described below are steps taken to investigate this interesting trend.
5.1.2. Homology detection using randomized sequence databases
The results to some extent rely upon the assumption that sequences are added in a random
fashion. To test whether the results shown can be explained by trends in the addition of
sequences, the datasets were recreated with randomized orders of discovery. The same
tests were run on these new randomized databases. Figure 9 shows the number of
homologs detected in each of the randomizations, plotted alongside the original results;
notice that the same trend can be seen, although the randomizations perform better at
each stage.
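One way to randomize the order of discovery is sketched below. It assumes that the number of sequences added in each year is kept fixed and that only the assignment of sequences to years is permuted; the text does not state exactly how the randomization was done, so this is an assumption, and the accessions and the function name `randomize_discovery_order` are hypothetical.

```python
import random
from collections import Counter

# Hypothetical mapping of accession -> year of creation.
creation_years = {"P1": 1987, "P2": 1987, "P3": 1995, "P4": 2003, "P5": 2007}

def randomize_discovery_order(creation_years, seed):
    """Permute creation years among accessions, keeping yearly counts fixed."""
    rng = random.Random(seed)          # seeded so each randomization is reproducible
    accessions = sorted(creation_years)
    years = [creation_years[a] for a in accessions]
    rng.shuffle(years)
    return dict(zip(accessions, years))

# Four independent randomizations, one per seed.
randomized = [randomize_discovery_order(creation_years, seed) for seed in range(4)]
```

Because the yearly counts are preserved, each randomized database has the same size as the original database for that year, so any difference in detection reflects which sequences are present rather than how many.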
Figure 9. The results from Figure 8 are compared to the number of homologs found by PSIBLAST when the order of
protein sequence discovery is randomized. Because of the close agreement of all four randomizations, it is only possible
to see three of the plots.
The obvious explanation for the improved performance is that by randomizing the order of
discovery, a better coverage of sequence space is obtained, especially in the earlier
databases. This increased coverage would lead to an improvement in homology detection.
5.1.3. Investigating family relationships
The SCOP30 sequence set was originally chosen to a) focus on remote homology, as
identifying close homologs is easy, and b) reduce the number of searches. Although it has
been shown that distantly related homologs can still share a high degree of functional and
structural similarity, this is not always the case (Todd et al. 2001; Redfern et al. 2008; Rost
2002). The results would be of greater interest if a similar trend were also seen for homologs
with a more obvious similarity. To address this, a similar investigation was carried out,
replacing the SCOP30 sequences with sequences selected from the same SCOP family.
Sequences within a SCOP family have a sequence ID >30% and/or similar structures or
functions (Murzin et al. 1995).
Ten random SCOP families belonging to different superfamilies were selected from the
SCOP90 database. Similar to SCOP30, SCOP90 contains only sequences with a sequence
ID<90%. Fifty sequences were selected at random from each family. The reason for using
SCOP90 was to remove the identical and near identical sequences present within SCOP, e.g.
sequences from the same protein but solved by different groups.
Each of these sequences was searched against the yearly databases and the results are
shown in Figure 10.
Figure 10 - The number of same family homologs detected by PSIBLAST is compared to the size of the database used in
the construction of the sequence profiles.
The results in Figure 10 show the same general trend as previously identified for the SCOP30
sequences: an initial positive impact of additional sequences is followed by a leveling off and
then a reduction in detected homologs. The detection response to the changing databases is,
however, much flatter than that seen for the SCOP30 results. Such a response is expected
from a set of highly similar sequences, where one would always expect a large number of
homologs to be detected.
The results show that the previously identified trend is important for both distant homologs
and those which are known to have similar structure and/or function.
5.1.4. HHsearch
As discussed in section 4.1.7, PSIBLAST is not the only method of homology detection.
Profile-profile based tools have been shown to achieve better results, although they are
significantly slower. This is due to the need to construct profiles for every sequence in a
database prior to running a search. This profile construction will take at least as long as a
PSIBLAST run; profile-profile approaches are therefore unrealistic for large databases and
are instead often used to find
homologs within smaller structural sequence databases (PDB, SCOP) for the purposes of fold
recognition and structure prediction. HHsearch (Söding 2005) is one such tool and has been
demonstrated to have cutting edge performance in remote homology prediction. HHsearch
was a component of the Imperial College CASP server (and many other CASP servers), which
came joint third overall in the recent CASP 08 assessment. In order to test whether the
trends reported were specific to PSIBLAST, a similar experiment was run using HHsearch.
HMMs were constructed for each SCOP sequence for each year and an all-against-all
HHsearch was then run on the HMM databases; the results can be seen in Figure 11.
Figure 11. - The number of distant homologs detected by HHsearch is compared to the size of the database used in the
construction of the HMM profiles.
The results in Figure 11 show that HHsearch clearly finds more homologs than PSIBLAST (95%
confidence) and that the number of homologs found does not plateau as it does in the PSIBLAST
results. However, the rate of discovery does not bear a close relationship to the amount of new
information (sequences) present in the database.
5.1.5. Detectable homologs are gained and lost in consecutive years
When studying the PSIBLAST and HHsearch results in Figures 8 and 11, one might assume
that the difference in homologs found between years is solely due to the detection of new
homologs, i.e. 2004 finds all homologs present in 2003 plus x additional homologs, or, in the
case of 2007, no new homologs are found and x are ‘lost’. A more complicated and
possibly more realistic explanation is that in each year the change in the number of homologs
is equal to the number of homologs newly detected minus the number ‘lost’ from the previous
year.
In Figures 12 and 13 the number of homologs detected in a given year by PSIBLAST and
HHsearch that were not present in the previous year, and the number lost from the previous
year, are plotted. The net gain or loss in each year is equal to the difference between the
gains (red/purple) and losses (blue/green) bars.
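The gain and loss bookkeeping can be expressed with set differences. The sketch below assumes the homologs detected in each year are held as a set of identifiers; the data and the function name `yearly_gains_and_losses` are invented for illustration.

```python
# Hypothetical sets of detected homologs (e.g. query/hit pairs) per year.
homologs_by_year = {
    2005: {"a", "b", "c"},
    2006: {"b", "c", "d", "e"},
    2007: {"c", "e"},
}

def yearly_gains_and_losses(homologs_by_year):
    """Return {year: (gained, lost)} relative to the previous year."""
    years = sorted(homologs_by_year)
    result = {}
    for previous, current in zip(years, years[1:]):
        gained = homologs_by_year[current] - homologs_by_year[previous]
        lost = homologs_by_year[previous] - homologs_by_year[current]
        result[current] = (len(gained), len(lost))
    return result

changes = yearly_gains_and_losses(homologs_by_year)
# The net change is gained minus lost, and equals the change in the yearly total.
net = {year: gained - lost for year, (gained, lost) in changes.items()}
```

In this toy example 2006 shows a net gain even though one homolog from 2005 is lost, which is the situation the yearly totals alone would hide.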
[Chart: A comparison of the homologs found using PSIBLAST in consecutive years. x-axis: Year; y-axis: Number of homologs (0 to 7000). Series: Homologs lost from previous year; New homologs detected not present in previous year.]
Figure 12. A chart comparing the homologs gained and lost each year using PSIBLAST. For example, in 2007, the blue
bar shows that approximately 3000 homologs from 2006 are not found, while approximately 2500 are present which
were not in 2006.
[Chart: A comparison of the homologs found using PSIBLAST in consecutive years. x-axis: Year; y-axis: Number of homologs (0 to 12000). Series: Homologs lost from previous year; New homologs detected not present in previous year.]
Figure 13 - A chart comparing the homologs gained and lost each year using HHsearch.
In each year, even when there seems to be significant improvement on the previous year’s
detection rate, homologs are lost. This occurs in both the PSIBLAST and HHsearch results. For
some reason, the profiles are unable to cover both the ‘old’ and ‘new’ homologs. The reasons for
this are explored later.
5.1.6. Sequence redundancy has a negative impact on homology detection
Although this work is the first time that homology detection rates have been compared across
past databases, others have investigated the link between homology detection and databases
whose sizes differ due to redundancy levels (Li et al. 2002; Park et al. 2000a).
Sequence databases such as UNIPROT contain a large proportion of redundant information
in the form of closely related or identical sequences. Non-redundant sequence databases are
those that contain no sequences which share a pairwise identity greater than a defined
threshold. For example, the NCBI provides an nr database which removes trivial redundancy,
i.e. identical sequences, and UNIPROT provides several databases with redundancy removed
at levels of 100, 90 and 50%: the UNIREF databases (Suzek et al. 2007). These smaller
databases have much the same information content as their full counterparts. Both Li and
Park demonstrated that profiles created using non-redundant databases do not lose
sensitivity when searching for homologs. In fact, it has been shown that smaller
non-redundant databases are more efficient at identifying homologs.
Park showed that a 50% non-redundant database performed better than a trivially
non-redundant one and suggested that PSIBLAST profile construction efficiency is decreased by
overrepresented redundant sequences, owing to inefficient weighting of sequences, and that a
more even distribution is present in the more non-redundant set (Park et al. 2000b).
Li et al. show a similar result and suggest that it is due to a situation they term the
'profile trap' (Li et al. 2002), where a large protein family is composed of several sub-families
of uneven sizes. They argue that a profile initiated from a member of a smaller sub-family
could be dominated by proteins from the larger sub-family, diverging away from the original
sequence and its close homologs.
There are obvious parallels between the results of redundant database analysis and the
work described here: in both, smaller databases are more efficient at detecting homologs. It
could therefore be that the majority of the extra sequences being added are redundant
and do not provide any additional information, trapping profiles and leading to worse results.
In Figure 14 the sequence database increase is compared to the same sequences filtered to
a level of 60% redundancy compared to the previous year; i.e. in each year, only those
sequences which are <60% redundant to the previous year are added. Redundancy
calculations were performed using the cd-hit algorithm (Li & Godzik 2006). It is clear that a
significant amount of redundancy is present within the databases. However, new sequences
(below the 60% redundancy cut-off) are still being added each year which should contain
useful information.
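The year-on-year filtering idea can be illustrated with a toy greedy filter. This is not cd-hit: the `identity` function below is a deliberately crude stand-in (fraction of matching positions, no alignment), and both it and `add_non_redundant` are hypothetical names introduced only for this sketch.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def add_non_redundant(existing, new_sequences, cutoff=0.6):
    """Return the new sequences that are below `cutoff` identity to everything kept so far."""
    kept = list(existing)   # sequences already in the previous year's database
    added = []
    for seq in new_sequences:
        if all(identity(seq, other) < cutoff for other in kept):
            kept.append(seq)    # later candidates are also compared against this one
            added.append(seq)
    return added

previous_year = ["AAAAAAAAAA"]
candidates = ["AAAAAAAAAC", "CCCCCCCCCC", "CCCCCCCCCA", "AAAAACCCCC"]
added = add_non_redundant(previous_year, candidates)
```

Here two of the four candidates are rejected as redundant, one against the previous year's database and one against a sequence accepted earlier in the same year.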
Figure 14. An orange bar is added to Figure 8, showing the amount of non-redundant (at 60%) sequence being added to the
sequence database each year.
In order to investigate whether the tailing off trend can be explained by increased levels of
redundancy in the sequence databases, the same analysis was conducted using the Uniref50
database. The Uniref50 database is created by removing sequence redundancy from the
Uniprot database down to a 50% level. The version of Uniref50 used was based upon the
March 2008 Uniprot database.
Figure 15. The homolog detection analysis against the Uniref50 database is added to Figure 8. Uniref50 contains no
sequences with a pair-wise identity >50%. Profiles created using Uniref50 perform better than all other databases when
detecting homologs.
Figure 15 shows that the 50% non redundant Uniref50 database is significantly smaller than
the non redundant 2007 sequence database, backing up the results shown in Figure 14.
There is an obvious jump in homology detection when using the non-redundant database; it
is therefore likely that the sequence trapping and sequence weighting errors described by
Park and Li are having an effect.
5.1.7. Profile flattening and the difficulties of ‘one-size-fits all’ profiles
In order for a profile to detect all available homologs it must fully represent the diversity of
sequences in a given superfamily. In the case of highly diverse superfamilies, this may not be
possible, or if it is, the profile may be so variable that it loses its sensitivity and could
become prone to drift by including false positive sequences. If such a phenomenon was
occurring in this investigation, one might expect to see a flattening in the PSIBLAST profiles.
The presence of flattening was investigated by calculating the mean standard deviation of
each PSIBLAST profile of each SCOP30 sequence for each year (Figure 16).
Figure 16 – The SD of each residue (row in the PSIBLAST profile) is calculated and then averaged to calculate the SD for a
given profile.
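The calculation described in the caption can be sketched as below. The sketch assumes a profile is a list of rows, one per residue position, each holding that position's substitution scores, and it uses the population standard deviation; whether the population or sample form was used is not stated, so that choice is an assumption. The toy rows have four columns rather than the twenty of a real profile.

```python
from statistics import mean, pstdev

def profile_sd(profile):
    """Mean over positions of the standard deviation of each row's scores."""
    return mean(pstdev(row) for row in profile)

def yearly_mean_sd(profiles):
    """Mean of the per-profile SDs for all profiles built in one year."""
    return mean(profile_sd(p) for p in profiles)

# Toy profiles: a strongly position-specific one and a completely flat one.
peaked = [[10, -2, -2, -2], [-2, 10, -2, -2]]
flat = [[1, 1, 1, 1], [1, 1, 1, 1]]
```

A flat profile, in which every residue scores alike at a position, gives an SD of zero, so a falling mean SD over time is read here as a sign of flattening.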
[Chart: The mean SD of profiles over time. y-axis: Mean SD of profile (0 to 12).]
Figure 17 - The standard deviation is calculated for each PSIBLAST profile
The average SD of profiles in a given year was determined by calculating the mean of all
profile SDs in each year; the results are plotted in Figure 17. It appears that the profile SD
increases until 1997 where it peaks and subsequently declines. This shows a similar trend to
that previously shown for the homology detection although the shapes are not identical.
These results show that the single profiles used in PSIBLAST and HHsearch are not able to
contain the diversity of a superfamily, and this could be contributing to deficiencies in
homology detection. It has previously been demonstrated that multiple profiles (in the
form of HMMs) created from different seed sequences can more accurately identify SCOP
superfamily membership (Gough et al. 2001). Other groups have also shown an
improvement in homology detection using multiple PSIBLAST profiles (Sandhya et al. 2005;
Li et al. 2002).
5.1.8. False positives.
A flatter profile should be less specific and lead to more false positive hits. To explore this
hypothesis, the number of false positive hits was investigated. A false positive is defined as
a PSIBLAST or HHsearch hit that falls within the acceptability threshold but does not belong
to the same superfamily as the query.
The number of false positives found using both PSIBLAST and HHsearch is plotted in Figure
18.
[Chart: False positives found with PSIBLAST and HHsearch. x-axis: Year; y-axis: False positives found (0 to 4000). Series: Hhsearch_False_Positives; PSIBLAST_False_Positives.]
Figure 18 – The number of false positives found using HHsearch and PSIBLAST.
Initially, the number of false positives increases with the size of the database, as do the
true positive hits shown in Figures 8 and 11. The false positives also follow the same
plateauing and decreasing pattern, although they peak earlier (2002 for PSIBLAST, 2003 for
HHsearch). If the decrease in true positives were due to a flattening or drift in the profiles,
one would expect false positives to increase as the number of true positives decreases;
instead, in both cases the number decreases. This could be an effect of
redundancy and/or overrepresented families in sequence space. If a profile is dragged
towards a redundant/highly populated region of sequence space, as well as restricting the
number of true positive (homolog) hits, it could restrict the number of false positives, by
focusing the profiles in a more specific region. This hypothesis does, however, seem to
conflict with the flattening shown in Figure 17.
5.1.9. Orphans
The picture of homology detection described so far is a complex one, with hits appearing
and disappearing over the years, culminating recently in a negative net effect. The
identification of many of the homologs relies on the mix of available sequences used in the
profile creation. It is, however, possible that many sequences can never identify a homolog
given our available sequence landscapes. The presence of these ‘orphans’ in the dataset was
investigated; the results are shown in Figure 19. The definition of orphans in this context
includes sequences whose only hit in the SCOP30 database using PSIBLAST is themselves.
[Chart: The number of orphan sequences in different years. x-axis: Year; y-axis: Number of orphans found (0 to 3500). Series: Orphans.]
Figure 19 – The number of sequences that are unable to find any homologs based upon profiles created from the
sequences present in a given year.
Of the 6982 sequences in SCOP30, between 1660 and 3065 cannot be assigned homologs,
depending on the year. In each year from 1987, there is a trend of decreasing numbers of
orphan sequences, as the increased information in profiles allows distantly related sequences
to be identified. This trend continues, even past 2004 where the absolute number of detected
homologs peaks. Again, this
could be a reflection of profiles drifting towards highly represented areas of sequence
space, which would affect the number of homologs detected but not necessarily the
number of orphans.
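Counting orphans amounts to finding the queries with no detected homolog other than themselves. The sketch below assumes the detected homologs are available as (query, hit) pairs; the identifiers and the function name `find_orphans` are hypothetical.

```python
def find_orphans(queries, detected_homologs):
    """Return queries whose profile detects no homolog other than the query itself."""
    with_homolog = {query for query, hit in detected_homologs if query != hit}
    return sorted(set(queries) - with_homolog)

queries = ["d1", "d2", "d3", "d4"]
# Hypothetical detected homologs for one year: d1 finds d2, d3 finds only itself.
detected_homologs = [("d1", "d2"), ("d3", "d3")]
orphans = find_orphans(queries, detected_homologs)
```

Note that d2 counts as an orphan here even though d1 detects it, because orphan status is judged from each sequence's own profile.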
5.1.10. Transitive homology detection over a network
From the results in 5.1.5, it appears that the profiles are not covering the sequence space
required to capture all homologs. It is possible that no profile is able to fit the diversity of all
homologous relationships and is limited to a subset of possible hits by the contents of the
sequence database used in its creation. From the PSIBLAST results, each sequence has a
profile and an associated number of confident hits; do these profiles overlap to cover
sequence space? One way of investigating this is by a form of Intermediate Sequence Search
as previously described.
A network was created (using the NetworkX Python package (Hagberg et al. 2008))
connecting all sequences that were confidently identified by PSIBLAST to be homologous in
2007. If, when combined, the profiles were able to cover the required homology space, then
all homologous proteins should be within the same connected component. In other words,
there should be a path between any two homologous nodes/sequences.
There were, however, a number of false positives present (Figure 18); by including these in
the network, the overall number of false positives will increase, as shown in Figure 20.
Figure 20. – By allowing all hits to be transitively assigned to each sequence in a connected component, all false
positives would also be inherited, increasing the amount of errors present.
The resulting network had 6982 nodes, each representing a scop30 sequence. These
sequences fell into 2138 connected components, of which 1368 contained only one node.
These 1368 singletons are sequences with no significant hits. This number is less than the
1660 orphans identified in 2007 (section 5.1.9) as some of those sequences will have found
false positives or in some cases have been identified as hits by other sequences. This
highlights the fact that profiles are asymmetric, varying depending on what the initial search
sequence was and how it was affected by the local sequence landscape.
Through membership in the same connected component, the number of true positives was
increased from 52,397 to 70,098. The number of false positives, however, increases
dramatically from 3,025 to 1,929,059, an increase of over 600x. Without a method of filtering
out the false positive results, the new homologies are lost in a sea of noise. These results
once again show that the contents of profiles designed to find sequences within a
superfamily depend upon the initial start sequence and it is difficult if not impossible for
them to cover all possible homologs.
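The transitive assignment can be sketched as follows. The report used the NetworkX package; this sketch swaps in a small union-find so that it has no dependencies. Counting ordered pairs within a component is an assumption about how hits were tallied, and all identifiers and data are invented.

```python
from collections import defaultdict
from itertools import permutations

def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

def transitive_counts(components, superfamily):
    """Count ordered pairs within each component as true or false positives."""
    true_pos = false_pos = 0
    for component in components:
        for a, b in permutations(sorted(component), 2):
            if superfamily[a] == superfamily[b]:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos, false_pos

# Hypothetical data: d4 is linked to d3 by a single false positive hit.
superfamily = {"d1": "X", "d2": "X", "d3": "X", "d4": "Y", "d5": "Z"}
edges = [("d1", "d2"), ("d2", "d3"), ("d3", "d4")]
components = connected_components(list(superfamily), edges)
singletons = [c for c in components if len(c) == 1]
true_pos, false_pos = transitive_counts(components, superfamily)
```

In this toy case one false positive edge is inherited by every member of the component, so the false positives grow as fast as the true positives; in larger components the effect is far stronger.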
6. CASP 08
As discussed during the introduction, the identification of homologs is a vital part of
template-based structural prediction methods, which are the most successful techniques for
predicting structure from sequence. The various methods of structural prediction devised by
different groups are assessed in the biennial CASP competition. In the most recent CASP
assessment, the Imperial College Structural Bioinformatics Group prediction server,
Phyredenovo, came 4th overall out of 71 competing groups (http://tinyurl.com/ajbxok).
Part of the CASP server was an advanced fold recognition module which was responsible for
finding the homologs to use as templates for structure prediction. I contributed to this
section of the CASP server in two ways.
1) The work described in section 5 determined that the UNIREF50 non-redundant database was
both faster and more efficient at detecting homologs. It was also shown that HHsearch is
superior to PSIBLAST. For these reasons, the structure template library consisted of
HHsearch profiles of sequences with known structure, constructed using the UNIREF50
sequence database.
2) I implemented an in-house version of the SCOOP algorithm as described in section 4.1.9.
This version of SCOOP compares the HHsearch results of an all against all search of the
fold library sequences to those generated from CASP candidate sequences.
7. Discussion
Over the past two decades, the number of protein sequences has increased exponentially
and here, for the first time, the changing nature of this sequence landscape has been
investigated with respect to one of the most important tools in biological research:
homology detection. The detection of closely related homologs is a trivial problem and was
not directly addressed in this work. Instead, the focus was on the harder task of remote
homology detection.
The methods used to detect remote homologies rely upon the creation of profiles
representing the patterns of conservation and propensity to substitute residues within
homologous groups. The composition of these profiles varies depending upon the
sequences available in the databases used in their creation.
The results in 5.1.5 show that in each year the profiles capture different homologous
relationships as the sequence landscape changes, resulting in either a net gain or loss of
homologs in comparison with the previous year. When the absolute number of homologs
detected each year is compared, we see a clear plateauing in our ability to detect remote
homologs with PSIBLAST (5.1.1). The extra information provided by the massive increase in
protein sequences is not efficiently used by the profiles: during the period 2004-2007, as
sequence space more than triples, more homologs are lost than are gained each year,
resulting in the trend shown in Figure 8. Using HHsearch, a cutting-edge method of
homology detection, a net reduction is not observed; however, the detection does seem to
be slowing (Figure 11) and the same pattern of yearly gains and losses is seen (Figure 13).
When a network was created from all of the significant PSIBLAST hits of 2007 (section
5.1.10), it was clear that there were significant differences in profiles starting from different
points in the sequence space.
It appears that profiles are highly sensitive to the sequence landscape used during their
construction and that simply adding more sequences, as we are currently doing, will not
necessarily result in an improvement in homology detection. The underlying reasons behind
this trend may have already been partially uncovered.
Li et al. describe a situation they term the 'profile trap' (Li et al. 2002) where a large protein
family is composed of several sub-families of uneven sizes. They argue that a profile
initiated from a member of a smaller sub-family could be dominated by proteins from the
larger subfamily, diverging away from the original sequence and its close homologs. Such a
phenomenon is exacerbated by large amounts of redundancy in the databases, adding
density to specific regions of sequence space. The negative impact of redundancy was
demonstrated in 5.1.6 where it was shown that non redundant databases perform better
than their redundant counterparts.
Another possible cause was shown in 5.1.7, where it was noted that the PSIBLAST profiles
tended to become flatter over time. This could be caused by profiles attempting to
represent highly diverse superfamilies.
Li et al. propose an intermediate profile search procedure where the results from one
PSIBLAST iteration are clustered according to a set sequence identity and each cluster is then
aligned and turned into a profile used to initialize another PSIBLAST round. This method
improved PSIBLAST results by ~30% in their test set. The improvement could
have been due to combating both of the above issues, stopping highly redundant sequences
from dominating profiles whilst allowing separate sub-families of diverse superfamilies to be
represented without flattening profiles.
The increase in protein sequences shows no sign of slowing; on the contrary, with large-scale
metagenomic sequencing projects (Williamson et al. 2008), we are likely to see an
acceleration in sequencing. The results suggest that not only will this extra sequence data
not be beneficial for homology detection; it could be actively harmful unless new methods
are developed to gain from this increase. As detecting homologs is an important tool in
annotating proteins (as described in section 4.1), the prediction of function and structure is
likely to have already been, and is likely to continue to be, adversely affected by the trend
discovered here. This was highlighted by the results in 5.1.3, which show that detection
within the same SCOP family group is also adversely affected.
The future work described in the next section will attempt to address the problems
uncovered in the above work, improving homology detection and applying it to structure
and function prediction.
8. Future plans
The discoveries from the above work will be utilized to improve methods of remote
homology prediction.
1. Methods of ancestral reconstruction will be assessed and then utilized to shrink the
distance between sequences in test sequence databases. Initially, this will be used
for defined diverse SCOP superfamilies.
2. A system will be developed using multiple linked profiles to cover the sequence
space required to identify homologs.
3. The addition of CS-BLAST (section 4.1.8) will be assessed, in the hope of adding its extra
power to the homology detection pipeline.
4. The Structural Bioinformatics Group has recently been highly successful in the CASP
assessment of structural and functional prediction. The CASP structural prediction
server utilized a non-redundant database combined with an HHsearch and SCOOP
method due to the findings of the work described here. The success in function
prediction was primarily down to a manual alignment of structural homologs and
expert assessment of annotation transfer using binding site and conservation information.
Improvements made in homology detection using points 1-3 will be applied to both
systems, improving the structural prediction server and automating a function
prediction server, by adding the extra power of remote homology detection and
reducing the input required by an expert user.
9. References
Altschul, S.F. et al., 1990. Basic local alignment search tool. Journal of Molecular Biology,
215(3), 403-10.
Altschul, S. et al., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucl. Acids Res., 25(17), 3389-3402.
Bateman, A. & Finn, R.D., 2007. SCOOP: a simple method for identification of novel protein
superfamily relationships. Bioinformatics (Oxford, England), 23(7), 809-14.
Berman, H.M. et al., 2000. The Protein Data Bank. Nucl. Acids Res., 28(1), 235-242.
Biegert, A. & Soding, J., 2009. Sequence context-specific amino acid similarities: a powerful
paradigm for protein sequence comparison. PNAS (to be published).
Cai, W., Pei, J. & Grishin, N.V., 2004. Reconstruction of ancestral protein sequences and its
applications. BMC Evolutionary Biology, 4, 33.
DeFeo-Jones, D. et al., 1983. ras-Related gene sequences identified and isolated from
Saccharomyces cerevisiae. Nature, 306(5944), 707-9.
Emes, R.D., 2008. Inferring function from homology. Methods in Molecular Biology (Clifton,
N.J.), 453, 149-68.
Gough, J. et al., 2001. Assignment of homology to genome sequences using a library of
hidden Markov models that represent all proteins of known structure. Journal of
Molecular Biology, 313(4), 903-919.
Hagberg, A.A., Schult, D.A. & Swart, P.J., 2008. Exploring network structure, dynamics, and
function using NetworkX. In Proceedings of the 7th Python in Science Conference
(SciPy2008). Pasadena, CA USA, pp. 11–15.
Hawkins, T. et al., 2009. PFP: Automated prediction of gene ontology functional annotations
with confidence scores using protein sequence data. Proteins, 74(3), 566-82.
Henikoff, S. & Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences of the United States of America,
89(22), 10915-10919.
Hunter, S. et al., 2009. InterPro: the integrative protein signature database. Nucleic Acids
Research, 37(Database issue), D211-5.
Jensen, R., 2001. Orthologs and paralogs - we need to get it right. Genome Biology, 2(8),
interactions1002.1-interactions1002.3.
Karplus, K., Barrett, C. & Hughey, R., 1998. Hidden Markov models for detecting remote
protein homologies. Bioinformatics (Oxford, England), 14(10), 846-56.
Lee, D., Redfern, O. & Orengo, C., 2007. Predicting protein function from sequence and
structure. Nature Reviews. Molecular Cell Biology, 8(12), 995-1005.
Li, W. & Godzik, A., 2006. Cd-hit: a fast program for clustering and comparing large sets of
protein or nucleotide sequences. Bioinformatics (Oxford, England), 22(13), 1658-9.
Li, W., Jaroszewski, L. & Godzik, A., 2002. Sequence clustering strategies improve remote
homology recognitions while reducing search times. Protein Engineering, 15(8), 643-9.
Liolios, K. et al., 2007. The Genomes On Line Database (GOLD) in 2007: status of genomic
and metagenomic projects and their associated metadata. Nucl. Acids Res., gkm884.
Martin, D.M.A., Berriman, M. & Barton, G.J., 2004. GOtcha: a new method for prediction of
protein function assessed by the annotation of seven genomes. BMC Bioinformatics,
5, 178.
Moult, J. et al., 2003. Critical assessment of methods of protein structure prediction (CASP)-
round V. Proteins: Structure, Function, and Genetics, 53(S6), 334-339.
Murzin, A.G. et al., 1995. SCOP: a structural classification of proteins database for the
investigation of sequences and structures. Journal of Molecular Biology, 247(4), 536-40.
Noble, W.S. et al., 2005. Identifying remote protein homologs by network propagation. The
FEBS Journal, 272(20), 5119-28.
Orengo, C.A. & Thornton, J.M., 2005. Protein families and their evolution-a structural
perspective. Annual Review of Biochemistry, 74, 867-900.
Park, J. et al., 2000a. RSDB: representative protein sequence databases have high
information content. Bioinformatics (Oxford, England), 16(5), 458-64.
Park, J. et al., 2000b. RSDB: representative protein sequence databases have high
information content. Bioinformatics (Oxford, England), 16(5), 458-64.
Park, J. et al., 1998. Sequence comparisons using multiple sequences detect three times as
many remote homologues as pairwise methods. Journal of Molecular Biology,
284(4), 1201-10.
Redfern, O.C., Dessailly, B. & Orengo, C.A., 2008. Exploring the structure and function
paradigm. Current Opinion in Structural Biology, 18(3), 394-402.
Rost, B., 2002. Enzyme Function Less Conserved than Anticipated. Journal of Molecular
Biology, 318(2), 595-608.
Sandhya, S. et al., 2005. Assessment of a rigorous transitive profile based search method to
detect remotely similar proteins. Journal of Biomolecular Structure & Dynamics,
23(3), 283-98.
Söding, J., 2005. Protein homology detection by HMM-HMM comparison. Bioinformatics
(Oxford, England), 21(7), 951-60.
Suzek, B.E. et al., 2007. UniRef: comprehensive and non-redundant UniProt reference
clusters. Bioinformatics, 23(10), 1282-1288.
Tatusov, R.L., Koonin, E.V. & Lipman, D.J., 1997. A genomic perspective on protein families.
Science (New York, N.Y.), 278(5338), 631-7.
Todd, A.E., Orengo, C.A. & Thornton, J.M., 2001. Evolution of function in protein
superfamilies, from a structural perspective. Journal of Molecular Biology, 307(4),
1113-1143.
Wass, M.N. & Sternberg, M.J.E., 2008. ConFunc--functional annotation in the twilight zone.
Bioinformatics (Oxford, England), 24(6), 798-806.
Wheals, A.E., 1985. Oncogene homologues in yeast. BioEssays, 3(3), 108-112.
Williamson, S.J. et al., 2008. The Sorcerer II Global Ocean Sampling Expedition:
Metagenomic Characterization of Viruses within Aquatic Microbial Samples. PLoS
ONE, 3(1), e1456.
Wu, C.H. et al., 2006. The Universal Protein Resource (UniProt): an expanding universe of
protein information. Nucleic Acids Research, 34(Database issue), D187-91.