Transfer Report

Daniel Chubb
Supervisor: Professor Mike Sternberg

1. Table of Contents

2. Index of figures
3. Transfer overview
4. Introduction
4.1. Protein evolution and function
4.1.1. The protein structure -> sequence -> function evolutionary paradigm
4.1.2. Homology
4.1.3. Protein sequence similarity
4.1.4. Pairwise sequence similarity
4.1.5. Intermediate sequence searching
4.1.6. Substitution matrices
4.1.7. The importance of conservation and sequence profiles
4.1.8. CS-Blast : Context-specific substitutions
4.1.9. Extracting extra information from homology hits – Networks and noise
4.1.10. Ancestral sequences
4.1. The growth of sequence space
5. Turning back the sequence clock
5.1. Results
5.1.1. The effect of sequence increase on homology detection
5.1.2. Homology detection using randomized sequence databases
5.1.3. Investigating family relationships
5.1.4. HHsearch
5.1.5. Detectable homologs are gained and lost in consecutive years
5.1.6. Sequence redundancy has a negative impact on homology detection
5.1.7. Profile flattening and the difficulties of ‘one-size-fits all’ profiles
5.1.8. False positives
5.1.9. Orphans
5.1.10. Transitive homology detection over a network
6. CASP 08
7. Discussion
8. Future plans
9. References

2. Index of figures

Figure 1. Although proteins A and C do not have the significant similarity required to be considered homologous, they share significant similarity with an intermediate protein sequence (B) and can therefore be considered to be homologous themselves.

Figure 2. CS-BLAST calculates the substitution probabilities for each residue based upon its surrounding context (red box).

Figure 3. An alignment algorithm (in this case BLAST) is used to identify a list of hits for each sequence in a protein sequence database. A new sequence can be searched for putative homologs in this sequence database by first identifying its hits (using the same method) and then comparing these hits with the pre-calculated ones of the sequence database.

Figure 4. The similarity in sequence of related proteins varies according to the time since they diverged from a common ancestor. In this figure, the similarity between the A-2 and B-2 generations is less than that between the A-1 and B-1 generations, which in turn is less than that between the A and B sequences, which diverged from the last common ancestor.

Figure 4. Although all the proteins in groups A, B and C from year X are homologous, only a few of these relationships can be detected using the methods described above (each shown as an edge). New sequences are introduced in year X+1 and homologies are detected among the sequences then held. These new sequences have acted as bridges in sequence space, allowing homology to be inferred between all sequences of A, B and C.

Figure 5. A chart showing the size of the UNIPROT protein database from 1987-2007. Because of the scale, the number of sequences present in each year of the first decade is printed above the bars to aid readability.

Figure 6. Sequence database yearly test procedure. (1) End of year databases are recreated from the UNIPROT database. (2) Sequences of SCOP30 domains are extracted from SCOP. (3) Each SCOP30 sequence is searched against each UNIPROT database and a PSIBLAST profile is output. (4) Each PSIBLAST profile is searched against the full SCOP30 database and putative homologs are identified.

Figure 7. The number of distant homologs detected by PSIBLAST is compared to the size of the database used in the construction of the sequence profiles.

Figure 8. The results from Figure 7 are compared to the number of homologs found by PSIBLAST when the order of protein sequence discovery is randomized. Because of the close agreement of all four randomizations, it is only possible to see three of the plots.

Figure 9. The number of same family homologs detected by PSIBLAST is compared to the size of the database used in the construction of the sequence profiles.

Figure 10. The number of distant homologs detected by HHsearch is compared to the size of the database used in the construction of the HMM profiles.

Figure 11. A chart comparing the homologs gained and lost each year using PSIBLAST. For example, in 2007, the blue bar shows that approximately 3000 sequences from 2006 are not found, while approximately 2500 are present which were not in 2005.

Figure 12. A chart comparing the homologs gained and lost each year using HHsearch.

Figure 13. An orange bar is added to Figure 7, showing the amount of non-redundant (at 60%) sequence added to the sequence database each year.

Figure 14. The homolog detection analysis against the Uniref50 database is added to Figure 7. Uniref50 contains no sequences with a pairwise identity >50%. Profiles created using Uniref50 perform better than those from all other databases when detecting homologs.

Figure 15. The standard deviation (SD) of each residue (row in the PSIBLAST profile) is calculated and then averaged to calculate the SD for a given profile.

Figure 16. The standard deviation is calculated for each PSIBLAST profile.

Figure 17. The number of false positives found using HHsearch and PSIBLAST.

Figure 18. The number of sequences that are unable to find any homologs based upon profiles created from the sequences present in a given year.

Figure 19. By allowing all hits to be transitively assigned to each sequence in a connected component, all false positives would also be inherited, increasing the number of errors present.

3. Transfer overview

This report describes work undertaken to investigate homology detection within the context of a changing sequence information landscape. The primary objective was to explore how the different protein sequences deposited in databases each year affect homology detection algorithms and therefore the classification of proteins, including structure and function prediction.
The most obvious change in sequence data is the huge increase in quantity observed over the past two decades (from the order of 10^3 to 10^6 sequences); it is assumed that this extra information has helped homology detection by filling gaps in sequence space (Figure 5). The following hypothesis was tested:

"The massive increase in available protein sequences seen over the last twenty years has led to improvements in protein homology detection."

The results of the investigation showed that the influx of sequences had an initial net positive impact, but that homology detection peaked (in 2004 in this dataset) and the subsequent growth of the database (approximately threefold) has resulted in a net reduction in homology detection from the peak value. It was further demonstrated that the change in sequences each year always results in a combination of gains and losses of detectable homologs. This result was supported using different homology detection methods, and using both a set of distantly related homologs and a set of homologs with closer relationships.

It is clear that the makeup of the sequence database has an important effect on homology prediction algorithms. The remainder of the report investigates the cause of this effect, so that a method for improving homology detection can be developed that is not prone to the same errors. This includes both analysis of original data and a comparison with similar work undertaken by others.

The research undertaken during this project was instrumental in the construction of the highly successful Imperial College CASP structural prediction server. The contribution to this server is also discussed. Finally, a plan of action is put forth for the next 15 months.

4. Introduction

This chapter outlines the principles underlying protein homology detection and their importance in protein classification.
The evolution of proteins and their homologous relationships are discussed, along with a review of past and present methods of homology detection, within the context of a changing sequence information landscape.

4.1. Protein evolution and function

4.1.1. The protein structure -> sequence -> function evolutionary paradigm

The work described in this report is informed by and relies upon the basic paradigm that the function of a protein is determined by, in order of importance, its structure, its amino acid sequence and its DNA sequence. Redundancy in the genetic code means that a change in the DNA sequence does not necessarily result in a change of amino acid. Where a change does occur, the similarity of amino acid properties (e.g. hydrophobicity and polarity) means that such a change will not necessarily affect the structure of the protein sufficiently to cause a functional change. Finally, not all parts of a protein structure are vital for a given function, so a portion of a structure can change while the function is maintained; compare, for example, a disordered loop with the region responsible for a catalytic site. It is therefore the case that the structure of important functional sites in proteins is under selective pressure to remain conserved, and this conservation extends, to a varying extent, to the amino acid and DNA sequences.

4.1.2. Homology

Homologous proteins are those that have diverged from a common ancestral sequence. They can be divided into two groups, orthologs and paralogs, which differ in their mode of initial divergence (Jensen 2001). Orthologs arise during speciation, where a protein present in one species is subsequently present in both newly diverged species. Initially, the sequence, structure and function of these orthologs are identical, but over time they will diverge in response to the unique selective pressures acting upon them. Paralogs are the result of a gene duplication event; in this case, two copies of a pre-existing gene are present within the same organism/species.
These two paralogous sequences can evolve separately. Orthologs are thought to be more likely to retain the same function in the course of evolution, whereas paralogs are freer to evolve new functions, usually related to the original one (Tatusov et al. 1997). Successive speciation and duplication events over an evolutionary timescale can result in large families of homologous proteins, with varying degrees of similarity in their sequence, structure and function (Orengo & Thornton 2005).

Detecting these homologous relationships has become the most powerful tool for predicting the biological function of a gene of interest. Given an unannotated protein, one can infer its function from a well-annotated homolog. For example, a huge insight into carcinogenesis was provided by the discovery that some human tumour repressor genes were homologous to yeast DNA repair genes (Wheals 1985; DeFeo-Jones et al. 1983). The importance of homology detection for function prediction is highlighted by the range of available methods and servers utilising this paradigm (Emes 2008; Hunter et al. 2009; D. M. A. Martin et al. 2004; Hawkins et al. 2009; Wass & Sternberg 2008; Lee et al. 2007).

Another area that has been shown to be highly reliant on homology detection is protein structure prediction. Of the two main varieties of structure prediction, template-based and template-free, it is the template-based approach that has had the most success in predicting the structure of a protein from its sequence (Moult et al. 2003). Template-based methods, also known as ‘threading’, take a sequence and attempt to thread it onto a homolog of known structure. This technique can be used to predict the structure of sequences from even distantly related proteins. An example of a successful structure prediction server is the Imperial College Phyre server, which is discussed in section 6.
To assign homology between two proteins, a measure or proxy of relatedness needs to be decided upon and algorithms need to be devised to detect it. As previously discussed, homologous proteins share similarity in their structure and their amino acid sequence, the degree of which is related to the level of chronological and functional divergence. Of the two, structure is the more likely to be conserved, as sequence is able to diverge substantially without affecting the structure and therefore the function. The most reliable assignment of homology is therefore based upon similarity in structural features between proteins. The SCOP (Murzin et al. 1995) database is an example of protein classification using similarities in structure. SCOP classifies on the basis of individual protein domains, as they are thought to be self-folding, fundamental evolutionary units.

Of course, assigning structural similarities relies upon the presence of a solved structure, from either X-ray crystallography or NMR. Solving a structure is a highly time-consuming process, especially in the case of transmembrane proteins. By far the most complete data we have for proteins is sequence data. There are currently (as of January 21st 2009) 4512 ongoing and complete genome projects listed in the GOLD genomes online database (Liolios et al. 2007), covering all branches of the tree of life. This amounts to 7,001,017 sequences within the UNIPROT (Wu et al. 2006) sequence database, in comparison with 55,660 solved structures within the Protein Data Bank (PDB) (Berman et al. 2000). It is not surprising, therefore, that the most commonly used measure for the detection of homology is protein sequence similarity.

4.1.3. Protein sequence similarity

As homologous sequences have diverged from a common ancestor, their sequences will share a degree of similarity that varies depending on the chronological and functional difference between them.
In contrast, any similarities in the sequences of unrelated proteins will most often be due to chance. Sequence-based homology detection relies upon this distinction, identifying homologs among the vast number of unrelated background sequences. This process is trivial for closely related homologs, which share a high degree of sequence similarity; however, the more distantly two proteins are related, the harder it becomes to align them and to distinguish the match from background noise. The following sections describe the commonly used methods for protein sequence-based homology detection.

4.1.4. Pairwise sequence similarity

The simplest way of detecting homology considers only two sequences at a time, i.e. is sequence A significantly similar to sequence B? BLAST (S. F. Altschul et al. 1990) is arguably the most popular tool that uses this method. BLAST matches short 'words' from a query sequence to a database of protein sequences; each local match (alignment) is then extended until its score falls below a threshold. An expectation value (e-value) is given to each match, representing the number of hits of equal significance one would expect to find by chance in the search database. Close homologs have low e-values, while distant homologs have e-values that are indistinguishable from those of unrelated sequences and therefore cannot be assigned as homologs without including many false positives.

4.1.5. Intermediate sequence searching

From considering isolated homologous pairs, the next most intuitive step is to use intermediate sequences (Figure 1).

[Figure 1 diagram labels: significant match between A and B; significant match between B and C; infer homology between A and C.]

Figure 1. Although proteins A and C do not have the significant similarity required to be considered homologous, they share significant similarity with an intermediate protein sequence (B) and can therefore be considered to be homologous themselves.
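The inference shown in Figure 1 can be written down as a small graph operation. The Python sketch below is a simplified illustration and not the ISS implementation of Park et al.: it assumes the significant pairwise matches have already been computed, and it only considers a single intermediate. The function name `infer_via_intermediates` and the toy identifiers are hypothetical.

```python
from itertools import combinations


def infer_via_intermediates(significant_pairs):
    """Return pairs that lack a direct significant match but share an intermediate.

    significant_pairs is an iterable of (id_1, id_2) tuples, each a pair of
    sequences with a significant direct match (e.g. a BLAST hit below some
    e-value cut-off chosen by the user).
    """
    # Build an undirected adjacency map of the direct matches.
    neighbours = {}
    for a, b in significant_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    inferred = set()
    for linked in neighbours.values():
        # Any two sequences matching the same intermediate are candidates.
        for x, y in combinations(sorted(linked), 2):
            if y not in neighbours[x]:
                inferred.add((x, y))
    return inferred


# A-B and B-C are detectable directly; A-C is only reachable through B.
direct_matches = [("A", "B"), ("B", "C")]
print(infer_via_intermediates(direct_matches))
```

In practice the quality of the inference depends on the confidence of each direct match, since one spurious link to the intermediate would connect unrelated sequences.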
Suppose that A, B and C are all homologs: A and B are closely related, as are B and C, but the relationship between A and C is more distant. Even if the relationship between A and C is undetectable when considered in isolation, it can be inferred from the detectable relationships A-B and B-C. Intermediate sequence searching (ISS) (Park et al. 1998), which uses iterative BLAST searches, follows this procedure. ISS is more sensitive than BLAST but is computationally more expensive, and it fails to detect homology if sufficient intermediate sequences of high confidence are not available.

4.1.6. Substitution matrices

An amino acid is more likely to be substituted by a residue with similar chemical properties, as such a change is less likely to interfere with the correct folding and activity of the protein. For example, a hydrophobic residue such as leucine is more likely to mutate to another hydrophobic residue such as valine. Alignment programs such as BLAST utilize this information in the form of substitution matrices, the most widely used of which are the BLOSUM matrices (S. Henikoff & J. G. Henikoff 1992). The BLOSUM matrices are calculated by recording the substitutions within conserved sequence blocks of homologous proteins from a multiple sequence alignment. Within these blocks, sequences are clustered according to their percentage sequence identity; substitution frequencies are then counted between clusters. This method reduces the bias from closely related sequences. The BLOSUM62 matrix sets the cluster threshold at 62% and has been shown to provide good detection rates for a range of homologs (S. Henikoff & J. G. Henikoff 1992).

Substitution matrices such as BLOSUM represent general background frequencies for amino acid substitutions. They do not take account of the specific substitution frequencies that result from conservation within particular regions of a protein family.
For example, a polar residue on the surface of a protein is likely to be poorly conserved and can be substituted by any other polar residue, whereas the same residue in the core of a protein may be absolutely conserved because it allows a stabilizing hydrogen bond to be formed. Therefore, a more reliable method is to take account of the sequence conservation particular to a given family of related sequences.

4.1.7. The importance of conservation and sequence profiles

Not all positions in a protein sequence are of equal importance when considering homology. Some positions are highly conserved within homologous groups and are therefore more informative than other, more variable positions. Profile matching techniques take advantage of this extra information in order to identify distant homologs. The most widely used example of a profile matching tool, PSIBLAST (S. F. Altschul et al. 1997) (27,068 citations as of January 2009), works in the following way. In its first iteration, PSIBLAST identifies sequences that meet a specific inclusion threshold (in the same way as a single BLAST search using a substitution matrix such as BLOSUM62) and uses these sequences to generate a profile, or position-specific score matrix (PSSM), to be used for the next iteration of the search. This PSSM records the amino acids present at each aligned position and is then used to align and score sequences in the database in a search for new statistically significant hits. This process is repeated until a fixed number of iterations has been completed or the search converges, i.e. no new statistically significant sequences are found.

Sequence profile techniques such as PSIBLAST, or those using hidden Markov models (HMMs) (e.g. the SAM-Tx packages (K. Karplus et al. 1998)), are more sensitive as they contain more evolutionary information than a single sequence. They are, however, prone to errors if a false positive result is incorporated within a profile.
If this occurs, the profile is polluted and can bring in more and more erroneous results, drifting away from the correct region of sequence space. It is also debatable whether a single profile is capable of representing all the sequences within a highly divergent family without being diluted to the point where false positives are included, again leading to drift; this possibility is discussed in more detail later.

A third class of homology detection methods is profile-profile searching. A library of profiles is created, one for each sequence in a database as described above, and homology is then detected by comparing profiles; this method has been shown to be more sensitive but is much more computationally intensive. An example of a profile-profile matching tool is HHsearch (Söding 2005), which compares HMM profiles and is currently part of the Imperial College structural bioinformatics CASP server.

4.1.8. CS-Blast : Context-specific substitutions

As discussed above, simple substitution matrices lack information regarding context and conservation, and this is remedied by iterative profile-generating methods such as PSIBLAST. PSIBLAST does, however, still rely upon an initial BLAST run using a substitution matrix (usually BLOSUM62). If the query sequence has no close homologs, this initial BLAST may return no confident hits to be aligned and used in profile construction; the power of PSIBLAST then cannot be utilized, no hits are found and the sequence is defined as an orphan. It would therefore be highly beneficial if the initial search were improved using extra context-based information, without the need for profile construction. Such an approach, termed CS-BLAST, has recently been developed and has shown highly positive initial results (Biegert & Söding 2009).

Figure 2. CS-BLAST calculates the substitution probabilities for each residue based upon its surrounding context (red box).
The context is queried against a pre-compiled library of context profiles. A library profile contributes to the CS-BLAST context profile with a weight determined by its similarity to the sequence context.

Instead of looking at a single position in a sequence, CS-BLAST calculates the substitution likelihood for each residue based upon its local context, as summarized in Figure 2. As with the PSIBLAST profiles, the CS-BLAST context-based profiles capture extra structural and evolutionary constraints that are missed by simple single-residue substitution matrices. The resulting profiles detect more evolutionarily distant sequences and can be used to jump-start PSIBLAST.

4.1.9. Extracting extra information from homology hits – Networks and noise

A fourth class of homology detection methods attempts to extract more information from the hits generated by the previously described methods. Two examples are described below.

RankProp (Noble et al. 2005). RankProp creates a weighted graph from the results of homology searches and then conducts a random walk across the network, starting at a particular sequence (node), and scores putative homologs according to the amount of time spent at each node during the walk. The initial weights are set from the score given by the alignment algorithm (e.g. the e-value for BLAST) and are then normalized to a value between 0 and 1. The chance of making a transition from one node to another in the network is equal to this normalized score. RankProp therefore manages to extract additional information from the whole network structure that is not picked up by the alignment programs, which explore only local regions of the network. The creation of such a network does, however, carry a significant computational overhead, as an all-against-all search needs to be performed and the network itself needs to be constructed. This is not a trivial task for databases like UNIPROT with > 7,000,000 sequences.

SCOOP (Bateman & Finn 2007).
SCOOP works by ranking the co-occurrence of hits from other homology detection programs such as BLAST, as described in Figure 3. If two sequences share the same hits (scored by comparing the observed number of shared hits with the number expected) then they are likely to be homologs.

Figure 3. An alignment algorithm (in this case BLAST) is used to identify a list of hits for each sequence in a protein sequence database. A new sequence can be searched for putative homologs in this sequence database by first identifying its hits (using the same method) and then comparing these hits with the pre-calculated ones of the sequence database.

SCOOP can find homologs even when the compared hits are individually insignificant, showing that there is useful data in the ‘noise’ as well as in the confident hits. Again, this method requires a library of precompiled BLAST hits and is not realistic for large databases. SCOOP was utilized in the Imperial College CASP server, using a small database of structurally defined sequences (see section 6).

4.1.10. Ancestral sequences

The sequence similarity between homologous proteins usually depends on how distantly related they are or, more formally, on the time since the two sequences shared a common ancestor. One would therefore expect ancestral sequences along parallel related lineages to have a higher degree of sequence similarity, as they have had less time to diverge from their common ancestor (Figure 4).

Figure 4. The similarity in sequence of related proteins varies according to the time since they diverged from a common ancestor. In this figure, the similarity between the A-2 and B-2 generations is less than that between the A-1 and B-1 generations, which in turn is less than that between the A and B sequences, which diverged from the last common ancestor.

If ancestral sequences could be added to the sequence databases, in theory they would pull branches closer together, possibly allowing previously unidentified relationships to be discovered.
Unfortunately, by definition we do not possess ancestral sequences; there are, however, methods of reconstructing them. Ancescon (Cai et al. 2004) uses distance-based phylogenetic inference to reconstruct the ancestral sequences of sequence lineages. Cai et al. demonstrated that when the hypothetical ancestors generated by Ancescon were added to databases of present-day sequences, there was an improvement in profile-based sequence similarity searches. The use of such a technique is discussed further in section 8.

4.1. The growth of sequence space

Improvements in sequencing technology have led to a rapid increase in the accumulation of sequence information. It is commonly believed that as new protein sequences are obtained, the above methods will increase in accuracy as gaps in sequence space are bridged, allowing more distant homologs to be found (illustrated in Figure 5).

[Figure 5 panels: protein database with detectable homologies in year X; protein database with detectable homologies in year X+1. Legend: an edge joins two proteins with detectable homology; a new sequence is added, bridging a gap in sequence space. Groups are labelled A, B and C.]

Figure 5. Although all the proteins in groups A, B and C from year X are homologous, only a few of these relationships can be detected using the methods described above (each shown as an edge). New sequences are introduced in year X+1 and homologies are detected among the sequences then held. These new sequences have acted as bridges in sequence space, allowing homology to be inferred between all sequences of A, B and C.

Over the past two decades the sequence databases have been growing exponentially (Figure 6); if this belief is true, one would expect an associated increase in our ability to detect remote homologs using the profile-based methods described above. Work undertaken to test this expectation is discussed below.

Figure 6. A chart showing the size of the UNIPROT protein database from 1987-2007.
Because of the scale, the number of sequences present in each year of the first decade is printed above the bars to aid readability.

5. Turning back the sequence clock

A profile can be constructed for a given protein sequence (as described in 4.1.7), the contents of which depend on the sequence space represented within the searched database. If the massive increase in the number of available sequences has had an impact on homology detection, one would expect the sequence profiles generated by PSIBLAST from the databases of different years to differ in their ability to detect a known homology. This investigation assessed whether this was the case. The basic methodology is shown in Figure 7 and is described here.

The sequence databases of the past two decades were recreated by extracting sequences from the Universal Protein Resource database (UNIPROT) (Wu et al. 2006) according to their creation date. The benchmark of homology was same-superfamily membership within the SCOP (Murzin et al. 1995) database of structural protein domains. Specifically, the query sequences were taken from SCOP30, a subset of the SCOP database whose sequences share no more than 30% identity with each other; any homologs within this set are considered to be remote. PSIBLAST profiles were created for each of these sequences from each sequence database (1987 to 2007) using four iterations. Each sequence profile was then searched against the fixed SCOP30 sequence database and the number of positive hits was recorded. The effects of the discovery order of new sequences and of redundancy within the databases were also investigated; these are discussed in more detail in the results (5.1) and discussion (7) sections.

Figure 7. Sequence database yearly test procedure. (1) End of year databases are recreated from the UNIPROT database. (2) Sequences of SCOP30 domains are extracted from SCOP. (3) Each SCOP30 sequence is searched against each UNIPROT database and a PSIBLAST profile is output.
(4) Each PSIBLAST profile is searched against the full SCOP30 database and putative homologs are identified. 5.1. Results 5.1.1. The effect of sequence increase on homology detection For each year between 1987 and 2007, PSIBLAST profiles were created for each sequence within the scop30 set (6982 sequences). When each of these profiles was searched against the fixed scop30 database the number of hits below the threshold of 0.1 (e-value) was recorded and each was assessed for same superfamily membership. If the hits belonged to the same superfamily as the query sequence/profile then it was counted as a detected homolog. The number of detected homologs in response to database year/size is shown in Figure 8. Figure 8. The number of distant homologs detected by PSIBLAST is compared to the size of the database used in the construction of the sequence profiles. The results show that the initial addition of new sequences is accompanied by an increase in homology detection; however, by 2004 homology detection plateaus and thereafter it declines. Described below are steps taken to investigate this interesting trend. 5.1.2. Homology detection using randomized sequence databases The results to some extent rely upon the assumption that sequences are added in a random fashion. To test whether the results shown can be explained by trends in the addition of sequences, the datasets were recreated with randomized orders of discovery. The same tests were run on these new randomized databases. Figure 9 shows the number of homologs detected in each of the randomizations, plotted alongside the original results; notice that the same trend can be seen, although the randomizations perform better at each stage. Figure 9. The results from Figure 8 are compared to the number of homologs found by PSIBLAST when the order of protein sequence discovery is randomized. Because of the close agreement of all four randomizations, it is only possible to see three of the plots. 
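The randomization of discovery order can be sketched as follows. The report does not state exactly how the randomized databases were built, so this sketch assumes that each year keeps the same number of newly added sequences and only the identity of those sequences is shuffled. The function names and the toy identifiers are hypothetical.

```python
import random


def randomize_discovery_order(year_to_ids, seed=0):
    """Reassign sequences to years at random.

    Assumption (not stated in the report): every year keeps the same
    number of newly added sequences as in the real database history.
    """
    years = sorted(year_to_ids)
    all_ids = [seq_id for year in years for seq_id in year_to_ids[year]]
    rng = random.Random(seed)
    rng.shuffle(all_ids)

    randomized = {}
    start = 0
    for year in years:
        count = len(year_to_ids[year])
        randomized[year] = all_ids[start:start + count]
        start += count
    return randomized


def cumulative_databases(year_to_ids):
    """Build end-of-year databases: all sequences added up to that year."""
    databases = {}
    seen = set()
    for year in sorted(year_to_ids):
        seen.update(year_to_ids[year])
        databases[year] = frozenset(seen)
    return databases


# Toy input: sequence identifiers grouped by (hypothetical) creation year.
original = {1987: ["a", "b"], 1988: ["c"], 1989: ["d", "e", "f"]}
shuffled = randomize_discovery_order(original, seed=1)
databases = cumulative_databases(shuffled)
```

Under this assumption the randomized end-of-year databases have the same sizes as the real ones, so any difference in homology detection reflects which sequences are present rather than how many.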
The obvious explanation for the improved performance is that randomizing the order of discovery gives better coverage of sequence space, especially in the earlier databases. This increased coverage would lead to an improvement in homology detection.

5.1.3. Investigating family relationships

The SCOP30 sequence set was originally chosen to a) focus on remote homology, as identifying close homologs is easy, and b) reduce the number of searches. Although it has been shown that distantly related homologs can still share a high degree of functional and structural similarity, this is not always the case (Todd et al. 2001; Redfern et al. 2008; Rost 2002). The results would be of greater interest if a similar trend were also seen for homologs with a more obvious similarity. To address this, a similar investigation was carried out, replacing the SCOP30 sequences with sequences selected from the same SCOP family. Sequences within a SCOP family have a sequence identity >30% and/or similar structures or functions (Murzin et al. 1995). Ten random SCOP families belonging to different superfamilies were selected from the SCOP90 database. Similar to SCOP30, SCOP90 contains only sequences with a sequence identity <90%. Fifty sequences were selected at random from each family. SCOP90 was used in order to remove the identical and near-identical sequences present within SCOP, e.g. sequences from the same protein solved by different groups. Each of these sequences was searched against the yearly databases and the results are shown in Figure 10.

Figure 10. The number of same-family homologs detected by PSIBLAST compared to the size of the database used in the construction of the sequence profiles.

The results in Figure 10 show the same general trend as previously identified for the SCOP30 sequences: an initial positive impact of additional sequences is followed by a leveling off and then a reduction in detected homologs. The detection response to the changing databases is, however, a lot flatter than the one seen for the SCOP30 results. Such a response is expected from a set of highly similar sequences, where one would always expect a large number of homologs to be detected. The results show that the previously identified trend is important both for distant homologs and for those known to have similar structure and/or function.

5.1.4. HHsearch

As discussed in section 4.1.7, PSIBLAST is not the only method of homology detection. Profile-profile based tools have been shown to achieve better results, although they are significantly slower. This is due to the need to construct a profile for every sequence in a database prior to running a search. This profile construction takes at least as long as a PSIBLAST run, so profile-profile approaches are unrealistic for large databases. They are therefore often used to find homologs within smaller structural sequence databases (PDB, SCOP) for the purposes of fold recognition and structure prediction. HHsearch (Söding 2005) is one such tool and has been demonstrated to have cutting-edge performance in remote homology prediction. HHsearch was a component of the Imperial College CASP server (and many other CASP servers), which came joint third overall in the recent CASP 08 assessment.

In order to test whether the trends reported were specific to PSIBLAST, a similar experiment was run using HHsearch. HMMs were constructed for each SCOP sequence for each year and an all-against-all HHsearch was then run on the HMM databases. The results can be seen in Figure 11.

Figure 11. The number of distant homologs detected by HHsearch compared to the size of the database used in the construction of the HMM profiles.
The results in Figure 11 show that HHsearch clearly finds more homologs than PSIBLAST (95% confidence) and that the number of homologs found does not plateau as it does in the PSIBLAST results. However, the rate of discovery does not bear a close relationship to the amount of new information (sequences) present in the database.

5.1.5. Detectable homologs are gained and lost in consecutive years

When studying the PSIBLAST and HHsearch results in Figures 8 and 11, one might assume that the difference in homologs found between years is solely due to the detection of new homologs, i.e. 2004 finds all homologs found in 2003 plus x additional ones or, in the case of 2007, no new homologs are found and x are ‘lost’. A more complicated and possibly more realistic explanation is that in each year the change in the number of homologs is equal to the number newly detected minus the number ‘lost’ from the previous year. Figures 12 and 13 plot, for PSIBLAST and HHsearch respectively, the number of homologs detected in a given year that were not detected in the previous one, together with the number lost from the previous year. The net gain or loss in each year is equal to the difference between the gains (red/purple) and losses (blue/green) bars.

[Chart: A comparison of the homologs found using PSIBLAST in consecutive years. Number of homologs (y-axis) by year (x-axis); series: homologs lost from previous year, new homologs detected not present in previous year.]

Figure 12. A chart comparing the homologs gained and lost each year using PSIBLAST. For example, in 2007, the blue bar shows that approximately 3000 homologs found in 2006 are no longer found, while approximately 2500 are present which were not found in 2006.

[Chart: Number of homologs (y-axis) by year (x-axis); series: homologs lost from previous year, new homologs detected not present in previous year.]

Figure 13. A chart comparing the homologs gained and lost each year using HHsearch.
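The bookkeeping behind Figures 12 and 13 amounts to two set differences per pair of consecutive years. The sketch below is a minimal illustration, assuming each detected homolog is stored as a (query, hit) pair; the function name and the toy data are hypothetical and are not the values from the figures.

```python
def yearly_gains_and_losses(homologs_by_year):
    """Compare each year's detected homologs with the previous year's.

    homologs_by_year maps a year to the set of (query, hit) pairs
    detected with profiles built from that year's database.
    """
    years = sorted(homologs_by_year)
    changes = {}
    for previous, current in zip(years, years[1:]):
        before = homologs_by_year[previous]
        after = homologs_by_year[current]
        gained = after - before   # detected now, not in the previous year
        lost = before - after     # detected in the previous year, not now
        changes[current] = {
            "gained": len(gained),
            "lost": len(lost),
            "net": len(gained) - len(lost),
        }
    return changes


# Toy data only.
toy = {
    2005: {("q1", "h1"), ("q1", "h2"), ("q2", "h3")},
    2006: {("q1", "h1"), ("q2", "h3"), ("q2", "h4"), ("q3", "h5")},
    2007: {("q1", "h1"), ("q3", "h5")},
}
changes = yearly_gains_and_losses(toy)
```

In the toy data, 2006 gains two pairs and loses one, so the total rises by one even though a previously detected homolog has disappeared. The net figure is always equal to the change in the yearly total.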
In each year, even when there seems to be a significant improvement on the previous year’s detection rate, homologs are lost. This occurs in both the PSIBLAST and HHsearch results. For some reason, the profiles are unable to cover both the ‘old’ and the ‘new’ homologs. The reasons for this are explored later.

5.1.6. Sequence redundancy has a negative impact on homology detection

Although this work is the first to compare homology detection rates across past databases, others have investigated the link between homology detection and databases whose sizes differ because of their redundancy levels (Li et al. 2002; Park et al. 2000a). Sequence databases such as UniProt contain a large proportion of redundant information in the form of closely related or identical sequences. Non-redundant sequence databases are those that contain no sequences which share a pairwise identity greater than a defined amount. For example, the NCBI provides an nr database which removes trivial redundancy, i.e. identical sequences, and UniProt provides several databases with redundancy removed at levels of 100, 90 and 50%: the UniRef databases (Suzek et al. 2007). These smaller databases have much the same information content as their full counterparts.

Both Li and Park demonstrated that profiles created using non-redundant databases do not lose sensitivity when searching for homologs. In fact, it has been shown that smaller non-redundant databases are more efficient at identifying homologs. Park showed that a 50% non-redundant database performed better than a trivially non-redundant one and suggested that PSIBLAST profile construction is less effective when redundant sequences are overrepresented, owing to inefficient weighting of sequences, whereas the more non-redundant set has a more even distribution of sequences (Park et al. 2000b). Li et al. show a similar result and suggest that it is due to a situation they term the 'profile trap' (Li et al. 2002), where a large protein family is composed of several sub-families of uneven sizes. They argue that a profile initiated from a member of a smaller sub-family could be dominated by proteins from the larger sub-family, diverging away from the original sequence and its close homologs.

There are obvious parallels between the results of redundant database analysis and the work described here: in both, smaller databases are more efficient at detecting homologs. It could therefore be that the majority of the extra sequences being added are redundant and provide no additional information, trapping profiles and leading to worse results. In Figure 14 the sequence database increase is compared to the same sequences filtered to a level of 60% redundancy relative to the previous year; i.e. in each year, only sequences which are <60% redundant with respect to the previous year's database are added. Redundancy calculations were performed using the cd-hit algorithm (Li & Godzik 2006). It is clear that a significant amount of redundancy is present within the databases. However, new sequences (below the 60% redundancy cut-off), which should contain useful information, are still being added each year.

Figure 14. An orange bar is added to Figure 8, showing the amount of non-redundant (at 60%) sequence added to the sequence database each year.

In order to investigate whether the tailing-off trend can be explained by increased levels of redundancy in the sequence databases, the same analysis was conducted using the UniRef50 database. The UniRef50 database is created by removing sequence redundancy from the UniProt database down to a 50% level. The version of UniRef50 used was based upon the March 2008 UniProt database.

Figure 15. The homolog detection analysis against the UniRef50 database is added to Figure 8. UniRef50 contains no sequences with a pairwise identity >50%. Profiles created using UniRef50 perform better than those from all other databases when detecting homologs.
Figure 15 shows that the 50% non-redundant UniRef50 database is significantly smaller than the non-redundant 2007 sequence database, backing up the results shown in Figure 14. There is an obvious jump in homology detection when using the non-redundant database; it is therefore likely that the profile trapping and sequence weighting effects described by Li and Park are having an effect.

5.1.7. Profile flattening and the difficulties of ‘one-size-fits-all’ profiles

In order for a profile to detect all available homologs it must fully represent the diversity of sequences in a given superfamily. In the case of highly diverse superfamilies this may not be possible or, if it is, the profile may be so variable that it loses its sensitivity and becomes prone to drift by including false positive sequences. If such a phenomenon were occurring in this investigation, one might expect to see a flattening of the PSIBLAST profiles. The presence of flattening was investigated by calculating the mean standard deviation (SD) of each PSIBLAST profile of each SCOP30 sequence for each year (Figure 16).

Figure 16. The SD of each residue (row in the PSIBLAST profile) is calculated and then averaged to calculate the SD for a given profile.

[Chart: The mean SD of profiles over time. Y-axis: mean SD of profile.]

Figure 17. The standard deviation is calculated for each PSIBLAST profile.

The average SD of profiles in a given year was determined by calculating the mean of all profile SDs in that year; the results are plotted in Figure 17. The profile SD increases until 1997, where it peaks, and subsequently declines. This is a similar trend to that previously shown for homology detection, although the shapes are not identical. These results indicate that the single profiles used in PSIBLAST and HHsearch are not able to contain the diversity of a superfamily, and this could be contributing to deficiencies in homology detection.
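The flattening measure can be written down in a few lines. This is a minimal sketch, assuming a profile is held as a list of rows (one per residue position) of substitution scores. The report does not say whether the population or sample SD was used, so the population SD is assumed here. The function names and the two toy profiles are hypothetical.

```python
from statistics import mean, pstdev


def profile_sd(profile):
    """Mean of the per-row standard deviations of one profile.

    Each row holds the scores for one residue position. The population
    SD is an assumption; the report does not specify which SD was used.
    """
    return mean(pstdev(row) for row in profile)


def mean_profile_sd(profiles):
    """Average the profile SDs of all profiles built in a given year."""
    return mean(profile_sd(profile) for profile in profiles)


# Toy profiles with four score columns instead of the usual twenty.
conserved = [[10, -3, -3, -4], [-2, 9, -3, -4]]   # strong preferences
flat = [[1, 0, 0, -1], [0, 1, -1, 0]]             # weak preferences

year_average = mean_profile_sd([conserved, flat])
```

A profile with strong positional preferences gives a high value and a flattened one a low value, which is why a falling yearly mean is read as profiles becoming less specific.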
It has been previously demonstrated that multiple profiles (in the form of HMMs) created from different seed sequences can more accurately identify SCOP superfamily membership (Gough et al. 2001). Other groups have also shown an improvement in homology detection using multiple PSIBLAST profiles (Sandhya et al. 2005; Li et al. 2002).

5.1.8. False positives

A flatter profile should be less specific and lead to more false positive hits. To explore this hypothesis, the number of false positive hits was investigated. A false positive is defined as a PSIBLAST or HHsearch hit that falls within the acceptability threshold but does not belong to the same superfamily as the query. The number of false positives found using both PSIBLAST and HHsearch is plotted in Figure 18.

[Chart: False positives found with PSIBLAST and HHsearch. False positives found (y-axis) by year (x-axis); series: Hhsearch_False_Positives, PSIBLAST_False_Positives.]

Figure 18. The number of false positives found using HHsearch and PSIBLAST.

Initially, the number of false positives increases with the size of the database, as with the true positive hits shown in Figures 8 and 11. The false positives also follow the same plateauing and decreasing pattern, although they peak earlier (2002 for PSIBLAST, 2003 for HHsearch). If the decrease in true positives were due to a flattening or drift in the profiles, one would expect false positives to increase as the number of true positives decreases; instead, in both cases the number decreases. This could be an effect of redundancy and/or overrepresented families in sequence space. If a profile is dragged towards a redundant or highly populated region of sequence space, then as well as restricting the number of true positive (homolog) hits it could restrict the number of false positives, by focusing the profile on a more specific region. This hypothesis does, however, seem to conflict with the flattening shown in Figure 17.

5.1.9. Orphans

The picture of homology detection described so far is a complex one, with hits appearing and disappearing over the years, culminating recently in a negative net effect. The identification of many of the homologs relies on the mix of available sequences used in the profile creation. It is, however, possible that many sequences can never identify a homolog given our available sequence landscapes. The presence of these ‘orphans’ in the dataset was investigated and the results are shown in Figure 19. The definition of orphans in this context includes sequences whose only hit in the SCOP30 database using PSIBLAST is themselves.

[Chart: Number of orphans found (y-axis) by year (x-axis).]

Figure 19. The number of sequences that are unable to find any homologs based upon profiles created from the sequences present in a given year.

Of the 6982 sequences in SCOP30, between 1660 and 3065 are orphans in any given year. From 1987 onwards there is a trend of decreasing numbers of orphan sequences, as the increased information in the profiles allows distantly related sequences to be identified. This trend continues even past 2004, where the absolute number of detected homologs peaks. Again, this could be a reflection of profiles drifting towards highly represented areas of sequence space, which would affect the number of homologs detected but not necessarily the number of orphans.

5.1.10. Transitive homology detection over a network

From the results in 5.1.5, it appears that the profiles are not covering the sequence space required to capture all homologs. It is possible that no single profile is able to fit the diversity of all homologous relationships and that each is limited to a subset of possible hits by the contents of the sequence database used in its creation. From the PSIBLAST results, each sequence has a profile and an associated number of confident hits; do these profiles overlap to cover sequence space?
One way of investigating this is by a form of intermediate sequence search, as previously described (4.1.5). A network was created (using the NetworkX Python package (Hagberg et al. 2008)) connecting all sequences that were confidently identified by PSIBLAST to be homologous in 2007. If, when combined, the profiles were able to cover the required homology space, then all homologous proteins should be within the same connected component. In other words, there should be a path between any two homologous nodes/sequences. There were, however, a number of false positives present (Figure 18); by including these in the network, the overall number of false positives will increase, as shown in Figure 20.

Figure 20. By allowing all hits to be transitively assigned to each sequence in a connected component, all false positives are also inherited, increasing the number of errors present.

The resulting network had 6982 nodes, each representing a SCOP30 sequence. These sequences fell into 2138 connected components, of which 1368 contained only one node. These 1368 singletons are sequences with no significant hits. This number is less than the 1660 orphans identified in 2007 (section 5.1.9), as some of those sequences will have found false positives or in some cases will have been identified as hits by other sequences. This highlights the fact that profiles are asymmetric, varying depending on what the initial search sequence was and how it was affected by the local sequence landscape.

Through membership in the same connected component, the number of true positives was increased from 52,397 to 70,098. The number of false positives, however, increased dramatically from 3025 to 1,929,059, an increase of over 600-fold. Without a method of filtering out the false positive results, the new homologies are lost in a sea of noise.
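The way a few false edges inflate the transitive false positive count can be illustrated on a toy network. The report used NetworkX; the sketch below substitutes a small union-find so that it has no dependencies. It counts unordered sequence pairs, which is an assumption and may differ from how hits were counted in the report. All names and data are hypothetical.

```python
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes into connected components with a simple union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def score_transitive_pairs(components, superfamily):
    """Count same- and different-superfamily pairs within components."""
    true_pos = false_pos = 0
    for component in components:
        for a, b in combinations(sorted(component), 2):
            if superfamily[a] == superfamily[b]:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos, false_pos


# Toy data: three direct true hits and one direct false hit (s3-s4).
superfamily = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "C"}
direct_hits = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s5")]

components = connected_components(list(superfamily), direct_hits)
true_pos, false_pos = score_transitive_pairs(components, superfamily)
singletons = sum(1 for component in components if len(component) == 1)
```

In this toy case the single false edge joins two superfamilies into one component, so transitive assignment adds one true pair (three become four) but five false pairs (one becomes six). The number of pairs in a component grows quadratically with its size, so this effect is much larger in big components.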
These results once again show that the contents of profiles designed to find sequences within a superfamily depend upon the initial start sequence, and that it is difficult, if not impossible, for them to cover all possible homologs.

6. CASP 08

As discussed in the introduction, the identification of homologs is a vital part of template-based structural prediction methods, which are the most successful techniques for predicting structure from sequence. The various methods of structural prediction devised by different groups are assessed in the biennial CASP competition. In the most recent CASP assessment, the Imperial College structural bioinformatics group prediction server, Phyredenovo, came 4th overall out of 71 competing groups (http://tinyurl.com/ajbxok). Part of the CASP server was an advanced fold recognition module which was responsible for finding the homologs to use as templates for structure prediction. I contributed to this section of the CASP server in two ways.

1) The work described in section 5 determined that the non-redundant UniRef50 database was both faster and more efficient at detecting homologs. It was also shown that HHsearch is superior to PSIBLAST. For these reasons, the structure template library consisted of HHsearch profiles of sequences with known structure, constructed using the UniRef50 sequence database.

2) I implemented an in-house version of the SCOOP algorithm as described in section 4.1.9. This version of SCOOP compares the HHsearch results of an all-against-all search of the fold library sequences to those generated from CASP candidate sequences.

7. Discussion

Over the past two decades the number of protein sequences has increased exponentially and here, for the first time, the changing nature of this sequence landscape has been investigated with respect to one of the most important tools in biological research: homology detection. The detection of closely related homologs is a trivial problem and was not directly addressed in this work.
Instead, the focus was on the harder task of remote homology detection. The methods used to detect remote homologies rely upon the creation of profiles representing the patterns of conservation and the propensity to substitute residues within homologous groups. The composition of these profiles varies depending upon the sequences available in the databases used in their creation. The results in 5.1.5 show that in each year the profiles capture different homologous relationships as the sequence landscape changes, resulting in either a net gain or a net loss of homologs in comparison with the previous year. When the absolute number of homologs detected each year is compared, we see a clear plateauing in our ability to detect remote homologs with PSIBLAST (5.1.1). The extra information provided by the massive increase in protein sequences is not efficiently used by the profiles. During the period 2004-2007, as sequence space more than triples, more homologs are lost than are gained each year, resulting in the trend shown in Figure 8. Using HHsearch, a cutting-edge method of homology detection, a net reduction is not observed; however, the detection does seem to be slowing (Figure 11) and the same pattern of yearly gains and losses is seen (Figure 13). When a network was created from all of the significant PSIBLAST hits of 2007 (section 5.1.10), it was clear that there were significant differences between profiles starting from different points in sequence space. It appears that profiles are highly sensitive to the sequence landscape used during their construction, and simply adding more sequences, as we are currently doing, will not necessarily result in an improvement in homology detection.

The underlying reasons behind this trend may have already been partially uncovered. Li et al. describe a situation they term the 'profile trap' (Li et al. 2002), where a large protein family is composed of several sub-families of uneven sizes.
They argue that a profile initiated from a member of a smaller sub-family could be dominated by proteins from the larger sub-family, diverging away from the original sequence and its close homologs. Such a phenomenon is exacerbated by large amounts of redundancy in the databases, which add density to specific regions of sequence space. The negative impact of redundancy was demonstrated in 5.1.6, where it was shown that non-redundant databases perform better than their redundant counterparts. Another possible cause was shown in 5.1.7, where it was noted that the PSIBLAST profiles tended to become flatter over time. This could be caused by profiles attempting to represent highly diverse superfamilies. Li et al. propose an intermediate profile search procedure in which the results from a PSIBLAST iteration are clustered according to a set sequence identity, and each cluster is then aligned and turned into a profile used to initialize another PSIBLAST round. This method improved PSIBLAST results by ~30% in their test set. The improvement could be due to the method combating both of the above issues, stopping highly redundant sequences from dominating profiles whilst allowing separate sub-families of diverse superfamilies to be represented without flattening the profiles.

The increase in protein sequences shows no sign of slowing; on the contrary, with large-scale metagenomic sequencing projects (Williamson et al. 2008), we are likely to see an acceleration in sequencing. The results suggest that not only will this extra sequence data not be beneficial for homology detection; it could be actively harmful unless new methods are developed to gain from this increase. As detecting homologs is an important tool in annotating proteins (as described in section 4.1), the prediction of function and structure is likely to have already been, and is likely to continue to be, adversely affected by the trend discovered here.
This was highlighted by the results in 5.1.3, which show that detection within the same SCOP family group is also adversely affected. The future work described in the next section will attempt to address the problems uncovered in the above work, improving homology detection and applying it to structure and function prediction.

8. Future plans

The discoveries from the above work will be utilized to improve methods of remote homology prediction.

1. Methods of ancestral reconstruction will be assessed and then utilized to shrink the distance between sequences in test sequence databases. Initially, this will be used for defined diverse SCOP superfamilies.

2. A system will be developed using multiple linked profiles to cover the sequence space required to identify homologs.

3. The addition of CS-BLAST (4.1.8) will be assessed, in the hope of adding its extra power to the homology detection pipeline.

4. The Structural Bioinformatics group has recently been highly successful in the CASP assessment of structural and functional prediction. The CASP structural prediction server utilized a non-redundant database combined with an HHsearch and SCOOP method due to the findings of the work described here. The success in function prediction was primarily down to a manual alignment of structural homologs and expert assessment of annotation transfer using binding site and conservation. Improvements made in homology detection using points 1-3 will be applied to both systems, improving the structural prediction server and automating a function prediction server, by adding the extra power of remote homology detection and reducing the input required from an expert user.

9. References

Altschul, S.F. et al., 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-10.
Altschul, S. et al., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25(17), 3389-3402.
Bateman, A. & Finn, R.D., 2007. SCOOP: a simple method for identification of novel protein superfamily relationships. Bioinformatics (Oxford, England), 23(7), 809-14.
Berman, H.M. et al., 2000. The Protein Data Bank. Nucl. Acids Res., 28(1), 235-242.
Biegert, A. & Söding, J., 2009. Sequence context-specific amino acid similarities: a powerful paradigm for protein sequence comparison. PNAS (to be published).
Cai, W., Pei, J. & Grishin, N.V., 2004. Reconstruction of ancestral protein sequences and its applications. BMC Evolutionary Biology, 4, 33.
DeFeo-Jones, D. et al., 1983. ras-Related gene sequences identified and isolated from Saccharomyces cerevisiae. Nature, 306(5944), 707-9.
Emes, R.D., 2008. Inferring function from homology. Methods in Molecular Biology (Clifton, N.J.), 453, 149-68.
Gough, J. et al., 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology, 313(4), 903-919.
Hagberg, A.A., Schult, D.A. & Swart, P.J., 2008. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008). Pasadena, CA USA, pp. 11-15.
Hawkins, T. et al., 2009. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins, 74(3), 566-82.
Henikoff, S. & Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919.
Hunter, S. et al., 2009. InterPro: the integrative protein signature database. Nucleic Acids Research, 37(Database issue), D211-5.
Jensen, R., 2001. Orthologs and paralogs - we need to get it right. Genome Biology, 2(8), interactions1002.1-interactions1002.3.
Karplus, K., Barrett, C. & Hughey, R., 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics (Oxford, England), 14(10), 846-56.
Lee, D., Redfern, O. & Orengo, C., 2007. Predicting protein function from sequence and structure. Nature Reviews. Molecular Cell Biology, 8(12), 995-1005.
Li, W. & Godzik, A., 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (Oxford, England), 22(13), 1658-9.
Li, W., Jaroszewski, L. & Godzik, A., 2002. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Engineering, 15(8), 643-9.
Liolios, K. et al., 2007. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucl. Acids Res., gkm884.
Martin, D.M.A., Berriman, M. & Barton, G.J., 2004. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics, 5, 178.
Moult, J. et al., 2003. Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins: Structure, Function, and Genetics, 53(S6), 334-339.
Murzin, A.G. et al., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4), 536-40.
Noble, W.S. et al., 2005. Identifying remote protein homologs by network propagation. The FEBS Journal, 272(20), 5119-28.
Orengo, C.A. & Thornton, J.M., 2005. Protein families and their evolution-a structural perspective. Annual Review of Biochemistry, 74, 867-900.
Park, J. et al., 2000a. RSDB: representative protein sequence databases have high information content. Bioinformatics (Oxford, England), 16(5), 458-64.
Park, J. et al., 2000b. RSDB: representative protein sequence databases have high information content. Bioinformatics (Oxford, England), 16(5), 458-64.
Park, J. et al., 1998. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology, 284(4), 1201-10.
Redfern, O.C., Dessailly, B. & Orengo, C.A., 2008. Exploring the structure and function paradigm. Current Opinion in Structural Biology, 18(3), 394-402.
Rost, B., 2002. Enzyme Function Less Conserved than Anticipated. Journal of Molecular Biology, 318(2), 595-608.
Sandhya, S. et al., 2005. Assessment of a rigorous transitive profile based search method to detect remotely similar proteins. Journal of Biomolecular Structure & Dynamics, 23(3), 283-98.
Söding, J., 2005. Protein homology detection by HMM-HMM comparison. Bioinformatics (Oxford, England), 21(7), 951-60.
Suzek, B.E. et al., 2007. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23(10), 1282-1288.
Tatusov, R.L., Koonin, E.V. & Lipman, D.J., 1997. A genomic perspective on protein families. Science (New York, N.Y.), 278(5338), 631-7.
Todd, A.E., Orengo, C.A. & Thornton, J.M., 2001. Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology, 307(4), 1113-1143.
Wass, M.N. & Sternberg, M.J.E., 2008. ConFunc--functional annotation in the twilight zone. Bioinformatics (Oxford, England), 24(6), 798-806.
Wheals, A.E., 1985. Oncogene homologues in yeast. BioEssays, 3(3), 108-112.
Williamson, S.J. et al., 2008. The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples. PLoS ONE, 3(1), e1456.
Wu, C.H. et al., 2006. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Research, 34(Database issue), D187-91.