Predicting Gene Ontology Functions from ProDom and CDD protein domains

J. Schug, J. Mazzarelli, S. Diskin, B. Brunk, C.J. Stoeckert, Jr.
Computational Biology and Informatics Laboratory, University of Pennsylvania

Abstract: A heuristic algorithm for associating Gene Ontology (GO)-defined molecular functions to protein domains as listed in ProDom and CDD is described. The algorithm generates rules for function-domain association based on the intersection of the GO functions assigned to gene products that contain ProDom and/or CDD domains at varying levels of sequence similarity. The hierarchical nature of GO molecular functions is incorporated into rule generation. Manual review of a subset of the rules generated indicates an accuracy rate of 90-95% for ProDom and approximately 82% for CDD. The utility of these associations is that any novel sequence can be assigned a putative function if sufficient similarity exists to a ProDom or CDD domain with which one or more GO functions have been associated. Although functional assignments are increasingly being made for gene products from model organisms, it is likely that the needs of investigators will continue to outpace the efforts of curators, particularly for non-model organisms. The function-domain rules were applied to the mouse transcriptome, and the distribution of major categories was similar to that reported for Drosophila. A comparison with other methods in terms of coverage and agreement was performed. A file of the domain-function associations is available upon request.

INTRODUCTION

The function of a gene product and its role in cellular processes are ultimately based on the translated sequence. Comparison of sequences is often used for making such assignments, particularly focusing on protein domains. Functional domains tend to be modular in sequence, i.e., they can also be considered as protein domains.
We make the assumption that the functions of a protein are determined by the set of functional domains it contains. The simplest possible model is that each domain independently contributes a function. More complicated models would determine function from pairs of domains, triples of domains, and other more complicated combinations of domains. Our approach is to start with the simplest model and include more complicated models only when necessary. We use proteins that have been annotated by the maintainers of GO to learn domain associations using a heuristic algorithm. Our method utilizes the hierarchical nature of the GO ontology to associate functions to domains based on the (non-ideal) intersection of the functions assigned to the proteins which have similarity to the domain. Use of a functional ‘hierarchy’ allows rules to be as general as necessary, but also as specific as possible. All rule sets and predictions are stored in our data warehouse, GUS (Genomic Unified Schema)(2), along with their supporting evidence.

METHODS

Figure 1. Illustration of the approach used to assign ProDom domains to Gene Ontology (GO) functions. Proteins (gene products) are assigned a molecular function by members of the Gene Ontology Consortium. BLAST similarities are used to associate the GO-assigned proteins to ProDom domains. A heuristic algorithm is then used to assign GO functions to ProDom domains (learned associations). The process followed for CDD is similar and uses RPS-Blast.

Rule Generation Algorithm

1. BLAST GO-annotated proteins against ProDom and CDD. Only keep results with p-values <= 10e-5.
2. For each ProDom and CDD domain:
   a. Generate a list of proteins and their p-values from the BLAST runs. Sort the list according to p-value. If there are no proteins on the list, then generate a “no protein” rule and go on to the next domain.
   b. Go through the list to generate a rule for the domain.
      i. Assign a function(s) to the domain based on the protein with the best p-value.
         This is a “one protein” rule. If there are no more proteins, go on to the next domain.
      ii. Consider the next protein on the list together with those above it. For these proteins, go through the rule generators in the order listed (below) until the rule conditions are met. Assign that rule to the domain at the p-value of the lowest protein on the list considered. Repeat this step until there are no more proteins on the list, then go on to the next domain.
(a)

   Rule type                     Condition
   1. Single Function            All proteins have the same single function.
   2. Consensus Leaf             A protein has multiple functions but shares a common function with the other proteins.
   3. Near Consensus Leaf        A protein has multiple functions but shares a common function with at least 75% of the other proteins.
   4. Near Ancestor Consensus    Go up in the function hierarchy until a common function is found with at least 75% of the other proteins.
(b)

Figure 2. Algorithm for assigning a GO function to a protein domain. Part (a) is the algorithm and part (b) lists the rule types.

Function Prediction Algorithm

1. BLAST sequences (NA or AA) against ProDom and CDD.
2. For each query sequence having a hit with p-value/e-value <= 10e-5:
   a. Generate a list of the domains hit that have an AAMotifGOFunctionRuleSet generated.
   b. Iterate over the list generated in (a) to predict GO functions for the novel sequence:
      i. If the similarity p-value is better than the rule p-value threshold used to generate the rule, then apply the rules to the sequence and continue to the next domain (rule set) in the list.
      ii. If the similarity p-value satisfies the p-value ratio (1) set, then apply the rules to the sequence and continue to the next domain.
      iii. If the similarity p-value satisfies the p-value threshold (2), then apply the rules to the sequence and continue to the next domain.
      iv. Otherwise, continue to the next domain; this similarity does not satisfy the conditions required to apply the rules.
(a)

   Parameter          Description
   1. pv-ratio        -log(sim pv) / -log(rule pv), used to vary the acceptance of similarities that are close to the p-value threshold of the rule.
   2. pv-threshold    The p-value threshold is used as a means to avoid missing predictions when the p-values are very low but the p-value ratio is not met.
(b)

Figure 3. Algorithm for predicting GO functions for novel sequences. Part (a) is the algorithm and part (b) is a description of the pertinent parameters.

MATERIALS

Data Sources for Rule Generation

GO Ontology Version

   Description          Version
   Function Ontology    v2.61

Table 1. GO Function Ontology Version

GO Gene Association Versions

   Description                Version     GO Function Assignments Loaded into GUS
   Mouse Gene Associations    v1.32       5408 terms, 21031 ancestors
   Fly Gene Associations      v1.28       3961 terms, 16624 ancestors
   Yeast Gene Associations    v. 1.302    5872 terms, 16124 ancestors

Table 2. GO Association Versions (GO ontologies and associations obtained from www.geneontology.org.)

Sequence Source Databases

   Description                 Version                     Number of Sequences Loaded into GUS
   SwissProt                   v39.22                      98739
   trEMBL (subset)             v17.0                       911
   ProDom                      2001.1                      271051 (95518 with > 1 contained sequence)
   Flybase (aa_gadfly_dros)    Release2                    13288
   SGD (Translated ORFs)       Downloaded June 28, 2001    6358

Table 3. Sequence Source Databases

Data Sources for Function Prediction

Sequence Source Databases

   Description    Version                 Number of Sequences Loaded into GUS
   SwissProt      v39.22                  98739
   musDOTS        Development Allgenes    363523 assemblies / 65610 “Genes”
   wormpep54      54                      19774
   ATH1_pep       June 13, 2001           25009

Table 4. Sequence Source Databases

Summary of Blast and RPS-Blast

   Queries                              Search Database   Program     Parameters                      Reason for Execution   Number of Query Sequences With Similarities Loaded into GUS
   GO-Associated Proteins in GUS        ProDom            WUBlastp    -wordmask seg+xnu W=3 T=1000    Rule Generation        20946
   SwissProt                            ProDom            WUBlastp    -wordmask seg+xnu W=3 T=1000    Prediction             96438
   MusDOTS                              ProDom            WUBlastx    -wordmask seg+xnu W=3 T=1000    Prediction             71460
   Wormpep54 in GUS                     ProDom            WUBlastp    -wordmask seg+xnu W=3 T=1000    Prediction             19499
   A.Thaliana Translated ORFs in GUS    ProDom            WUBlastp    -wordmask seg+xnu W=3 T=1000    Prediction             24696
   All GO-Associated Proteins in GUS    CDD               RPS-Blast   Defaults                        Rule Generation        16950
   SwissProt                            CDD               RPS-Blast   Defaults                        Prediction             72650
   MusDOTS                              CDD               RPS-Blast   -p=F                            Prediction             37448
   A.Thaliana Translated ORFs in GUS    CDD               RPS-Blast   Defaults                        Prediction             12878
   C.Elegan (wormpep54)                 CDD               RPS-Blast   Defaults                        Prediction             9882

Table 5. Summary of Blast results

RESULTS

Rule Groups and Rule Type Distributions

[Figure: pie charts of the rule type distribution (Single Function, One Similar GO Protein, Consensus Leaf, Near Consensus Leaf, Near Ancestor Consensus) for each rule group.]

   Rule Group       Total Rule Set Count
   CDD No IEA       1427
   ProDom No IEA    11113
   CDD ALL          1862
   ProDom ALL       19785

Analysis of Rule Accuracy

Manual review of approximately 1000 of the 7299 original ProDom rules indicated an accuracy rate of 90-95%. A small sample of the newly generated rules (approximately 50) for ProDom or CDD was examined to assess their accuracy. A range of different rule types was examined, as well as a range of p-values associated with the rules. When the ProDom rules were examined, an accuracy of 91% was found, while for the CDD rules the accuracy was 82%. More ProDom and CDD rules need to be examined to substantiate these percentages.
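
The rule types reported above are produced by the four rule generators of Figure 2, applied to a growing list of proteins sorted by p-value. The following is a minimal Python sketch of that procedure, not the authors' implementation. Several things in it are assumptions: the toy hierarchy in PARENTS and its term names, the function names, the reading of "at least 75% of other proteins" as a fraction of all proteins considered so far, the choice of the most specific qualifying ancestor for the Near Ancestor Consensus rule, and the "No Rule" fallback when no generator applies.

```python
from collections import Counter

# Hypothetical toy fragment of a function hierarchy (child -> parents).
# Term names and structure are illustrative only, not taken from GO v2.61.
PARENTS = {
    "protein kinase": {"kinase"},
    "kinase": {"enzyme"},
    "phosphatase": {"enzyme"},
    "enzyme": set(),
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, ()))
    return seen


def depth(term):
    """Longest distance from a term up to a root of the toy hierarchy."""
    parents = PARENTS.get(term, ())
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def generate_rule(function_sets, min_fraction=0.75):
    """Try the rule generators in order for the proteins considered so far.

    function_sets holds one set of assigned functions per protein.
    Returns (rule_type, functions).
    """
    n = len(function_sets)
    if n == 0:
        return ("No Protein", set())
    if n == 1:
        return ("One Protein", set(function_sets[0]))
    # 1. Single Function: every protein has the same single function.
    first = function_sets[0]
    if len(first) == 1 and all(fs == first for fs in function_sets):
        return ("Single Function", set(first))
    # 2. Consensus Leaf: some function is shared by all proteins.
    common = set.intersection(*function_sets)
    if common:
        return ("Consensus Leaf", common)
    # 3. Near Consensus Leaf: shared by at least min_fraction of proteins
    #    (assumed here to mean a fraction of all proteins considered).
    counts = Counter(f for fs in function_sets for f in fs)
    near = {f for f, c in counts.items() if c / n >= min_fraction}
    if near:
        return ("Near Consensus Leaf", near)
    # 4. Near Ancestor Consensus: climb the hierarchy and keep the most
    #    specific ancestor(s) reaching min_fraction (an assumed tie-break).
    anc_counts = Counter()
    for fs in function_sets:
        reachable = set()
        for f in fs:
            reachable |= ancestors(f)
        anc_counts.update(reachable)
    candidates = {a for a, c in anc_counts.items() if c / n >= min_fraction}
    if candidates:
        deepest = max(depth(a) for a in candidates)
        return ("Near Ancestor Consensus",
                {a for a in candidates if depth(a) == deepest})
    return ("No Rule", set())  # fallback label is an assumption


def generate_rule_set(hits):
    """hits: (p_value, functions) pairs for one domain.

    Returns one (p_value, rule_type, functions) entry per list prefix,
    each recorded at the p-value of the lowest protein considered.
    """
    hits = sorted(hits, key=lambda h: h[0])
    if not hits:
        return [(None, "No Protein", set())]
    rules = []
    for i in range(1, len(hits) + 1):
        rule_type, functions = generate_rule([fs for _, fs in hits[:i]])
        rules.append((hits[i - 1][0], rule_type, functions))
    return rules


# Made-up similarities for a single domain.
example_hits = [
    (1e-50, {"protein kinase"}),
    (1e-30, {"protein kinase"}),
    (1e-10, {"kinase"}),
    (1e-6, {"phosphatase"}),
]
example_rules = generate_rule_set(example_hits)
```

With these made-up hits the rule set moves from a One Protein rule, to a Single Function rule, to Near Ancestor Consensus rules that generalize to "kinase" as less similar proteins are added. This mirrors the intent that rules be as specific as possible at strong similarity and as general as necessary at weaker similarity.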

Comparison to InterPro GO Function Mappings

[Figure: bar chart "Coverage of ProDom and Pfam Domains Mapped To InterPro". Y axis: Coverage Percentage (0 to 100); x axis: Method (ProDom, Pfam); series: InterPro, CBIL. Bar labels: 50, 49, 44, 43.]

[Figure: pie chart "Existence of GO Association for ProDom Domain Mapped To InterPro". Slice categories: Neither have Assoc., InterPro not CBIL, CBIL not InterPro, Both have Assoc. Slice values: 29%, 36%, 14%, 21%. Where both have an association: 89% Agreement, Avg. Depth = 3.26.]

[Figure: pie chart "Existence of GO Association for Pfam Domain Mapped To InterPro". Same slice categories. Slice values: 27%, 34%, 22%, 17%. Where both have an association: 81% Agreement, Avg. Depth = 3.67.]

Summary of Predicted Functions

Function Prediction Summary

   Dataset       Rule Group         Number of Entries with Prediction(s)    Number of Predictions    % Coverage
   Swiss-Prot    NoIEA_GO_ProDom    39193                                   196491                   39.7%
   Swiss-Prot    All_GO_ProDom      47801                                   257905                   46.4%
   Swiss-Prot    NoIEA_GO_CDD       40727                                   176202                   41.2%
   Swiss-Prot    All_GO_CDD         44520                                   210278                   45.1%
   Swiss-Prot    UNION NO IEA       47616                                   246543                   48.2%
   Swiss-Prot    UNION ALL          53444                                   301001                   54.1%
   MusDoTS       ProDom_NoIEA       17754 (9776 “genes”)                    74601                    14.9%
   MusDoTS       ProDom_All_GO      23543 (9916 “genes”)                    94347                    15.1%
   MusDoTS       CDD_NoIEA          17367 (8759 “genes”)                    69911                    13.3%
   MusDoTS       CDD_All_GO         18545 (8715 “genes”)                    75479                    13.3%
   MusDoTS       UNION NOIEA        24198 (11195 “genes”)                   145423                   17.1%
   MusDoTS       UNION All          11769 “genes”                           163746                   17.9%
   A.Thaliana    Union NoIEA        9223                                    56749                    36.9%
   A.Thaliana    Union All          10995                                   62798                    44.0%
   C.Elegan      Union NoIEA        8003                                    38357                    40.5%
   C.Elegan      Union All          9318                                    38758                    47.1%

Top-Level Function Prediction Distribution (ProDom vs. CDD)

[Figure: bar chart "musDOTS Top Level Function Prediction Distribution (ProDom vs. CDD)". Axes: Percentage of Predictions (0 to 35) by GO Function; series: CDD, ProDom. GO functions shown: enzyme, nucleic acid binding, ligand binding or carrier, transporter, signal transducer, structural protein, cell adhesion molecule, chaperone, defense/immunity protein, enzyme inhibitor, cell cycle regulator, motor, microtubule binding, enzyme activator, storage protein, apoptosis regulator, other.]

Coverage Analysis and Comparison To Other Methods

Function Prediction Coverage

[Figure: bar chart "SwissProt Function Prediction". Y axis: Number of SwissProt with Prediction (0 to 60000); x axis: Domain Rule Sets Used (ProDom, CDD, UNION).]

[Figure: bar chart "Mouse "Gene" Prediction". Y axis: Number of Mouse "Genes" with Prediction (0 to 14000); x axis: Domain Rule Sets Used (ProDom, CDD, UNION).]

Prediction Coverage Comparison

[Figure: bar chart. Y axis: Coverage Percentage (0 to 100); x axis: SwissProt, A.Thaliana, C.Elegan; series: CBIL, Alternative Method.]

Agreement with other Computational Methods

[Figure: pie chart "SwissProt Function Prediction Agreement at Top Level, CBIL vs. Compugen". Agree: 70%; Compugen Incorrect: 15%; CBIL Incorrect: 15%.]
Prediction Agreement Assessment of Compugen vs. CBIL at the top level of the GO function ontology. Agreement is evaluated at the level of an individual rule.

[Figure: pie charts "CBIL Agreement with Other Methods". WormBase: 88% Agree, 12% not; TAIR: 94% Agree, 6% not.]
Prediction Agreement Assessment of CBIL vs. WormBase and TAIR at the top level of the GO function ontology. Agreement is evaluated at the sequence level. The two methods agree if they share a common function for a gene product. Percentages are based on the subset of gene products for which both methods predict a function.

The differences between the SwissProt predictions made by ProDom/CDD (CBIL) and the Compugen predictions (CGEN) were manually evaluated by inspecting the annotation associated with the SwissProt entry. Of the 30% of predictions that differed, half (15% of the total) were cases in which the ProDom/CDD prediction was incorrect and half (15% of the total) were cases in which the Compugen prediction was incorrect. From the manual inspection, general performance observations were made.
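
The sequence-level agreement criterion described above (two methods agree on a gene product if they share a function at the top level of the ontology, counted only over gene products that both methods predict) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the procedure actually used. The function names, the TOP_OF mapping, and the entries X1 to X3 are hypothetical; the P79350 and P26431 predictions are the top-level CBIL and Compugen assignments listed among the examples in this poster.

```python
# Hypothetical mapping from a specific function to its top-level parent.
TOP_OF = {"protein kinase": "enzyme"}


def to_top_level(functions, top_of):
    """Map each function to its top-level parent (itself if unmapped)."""
    return {top_of.get(f, f) for f in functions}


def sequence_level_agreement(pred_a, pred_b, top_of):
    """Fraction of gene products predicted by both methods whose
    top-level function sets overlap; None if no gene product is shared."""
    shared = sorted(pred_a.keys() & pred_b.keys())
    if not shared:
        return None
    agreeing = [
        s for s in shared
        if to_top_level(pred_a[s], top_of) & to_top_level(pred_b[s], top_of)
    ]
    return len(agreeing) / len(shared)


# P79350 and P26431 follow the examples in the text; X1 to X3 are made up.
cbil = {
    "P79350": {"signal transducer"},
    "P26431": {"transporter"},
    "X1": {"protein kinase"},
    "X2": {"enzyme"},
}
other = {
    "P79350": {"enzyme"},
    "P26431": {"enzyme"},
    "X1": {"enzyme"},
    "X3": {"enzyme"},
}
example_agreement = sequence_level_agreement(cbil, other, TOP_OF)
```

In this toy data only three gene products are predicted by both methods and only X1 agrees at the top level, so the computed agreement is one third. Gene products predicted by a single method (X2, X3) do not enter the denominator, matching the note that percentages are based on the subset for which both methods predict a function.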

The CBIL rules performed better at predicting the parent functions transporter and signal transducer, whereas Compugen performed better at predicting the parent function enzyme. Particular examples are illustrated below:

P79350, OPRM_BOVIN, MU-TYPE OPIOID RECEPTOR (MOR-1)
   CBIL: P79350 4871 signal transducer
   CGEN: P79350 3824 enzyme

P26431, NAH1_RAT, SODIUM/HYDROGEN EXCHANGER 1 (NA(+)/H(+) EXCHANGER 1) (NHE-1)
   CBIL: P26431 5215 transporter
   CGEN: P26431 3824 enzyme

Q57290, Y740_HAEIN, PROBABLE PHOSPHOMANNOMUTASE (EC 5.4.2.8) (PMM)
   CBIL: Q57290 5554 molecular_function unknown
   CGEN: Q57290 3824 enzyme

CONCLUSION

The heuristic algorithm presented for associating Gene Ontology (GO)-defined molecular functions to protein domains as listed in ProDom and CDD performed well. Initial accuracy of the domain-function rules is estimated to be 90-95% for ProDom and approximately 82% for CDD; additional review is necessary. Future GO annotations may provide useful test sets for this analysis. The CBIL algorithm was comparable to other computational methods in terms of coverage and agreement of predictions. The use of multiple domain databases helped to increase the coverage slightly with little domain sensitivity.

References and Acknowledgements:

REFERENCES

(1) The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genet. 25: 25-29.
(2) S. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, C. Overton and C. Stoeckert (2001) IBM Systems Journal of Life Sciences. In press.
(3) R. Apweiler, et al. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research 29: 37-40.

WEBSITE RESOURCES

Gene Ontology Consortium: http://www.geneontology.org
ProDom: http://www.toulouse.inra.fr/prodom.html
CDD and RPS-Blast: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
Swiss-Prot: http://www.expasy.ch/sprot/sprot-top.html
Allgenes Index: http://www.allgenes.org
FlyBase: http://flybase.bio.indiana.edu/
Saccharomyces cerevisiae Genome Database (SGD): http://genome-www.stanford.edu/Saccharomyces/
Mouse Genome Database (MGD) & Gene Expression Database (GXD): http://www.informatics.jax.org/

ACKNOWLEDGMENTS: This work was funded in part by grants from the DOE (DE-FG02-00ER62893) and NIH (RO1-HG01539).