GOPoster - Computational Biology and Informatics Laboratory

Predicting Gene Ontology Functions from ProDom and CDD protein domains
J. Schug, J. Mazzarelli, S. Diskin, B. Brunk, C.J. Stoeckert, Jr.
Computational Biology and Informatics Laboratory, University of Pennsylvania
Abstract:
A heuristic algorithm for associating Gene Ontology (GO)-defined molecular functions
to protein domains as listed in ProDom and CDD is described. The algorithm generates
rules for function-domain association based on the intersection of functions assigned to
gene products by GO that contain ProDom and/or CDD domains at varying levels of
sequence similarity. The hierarchical nature of GO molecular functions is incorporated
into rule generation. Manual review of a subset of the rules generated indicates an
accuracy rate of 90-95% for ProDom and approximately 82% for CDD. The utility of
these associations is that any novel sequence can be assigned a putative function if
sufficient similarity exists to a ProDom or CDD domain with which one or more GO
functions have been associated. Although functional assignments are increasingly being
made for gene products from model organisms, it is likely that the needs of investigators
will continue to outpace the efforts of curators, particularly for non-model organisms.
The function-domain rules were applied to the mouse transcriptome, and the distribution
of major categories was similar to that reported for Drosophila. A comparison with
other methods in terms of coverage and agreement was performed. A file of the
domain-function associations is available upon request.
INTRODUCTION
The function of a gene product and its role in cellular processes are ultimately based on
the translated sequence. Comparison of sequences is often used for making such
assignments, particularly focusing on protein domains. Functional domains tend to be
modular in sequence, i.e., they can also be considered as protein domains. We make the
assumption that the functions of a protein are determined by the set of the functional
domains it contains. The simplest possible model is that each domain independently
contributes a function. More complicated models would determine function from pairs
of domains, triples of domains, and other more complicated combinations of domains.
Our approach is to start with the simplest model and include more complicated models
only when necessary.
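The simplest model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `functions_from_domains`, the domain identifiers, and the mapping contents are hypothetical and do not come from the poster.

```python
from typing import Dict, Iterable, Set


def functions_from_domains(
    domains: Iterable[str],
    domain_functions: Dict[str, Set[str]],
) -> Set[str]:
    """Simplest model: each domain independently contributes its functions.

    The predicted functions of a protein are the union of the functions
    associated with each domain it contains. Domains without an
    association contribute nothing.
    """
    predicted: Set[str] = set()
    for domain in domains:
        predicted |= domain_functions.get(domain, set())
    return predicted


# Hypothetical domain-to-function associations (illustrative values only).
EXAMPLE_ASSOCIATIONS = {
    "DOMAIN_A": {"enzyme"},
    "DOMAIN_B": {"nucleic acid binding"},
}
```

With these example associations, a protein containing DOMAIN_A, DOMAIN_B and an unassociated domain would be assigned both "enzyme" and "nucleic acid binding". Models based on pairs or triples of domains would replace the per-domain lookup with a lookup keyed on domain combinations.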
We use proteins that have been annotated by the maintainers of GO to learn domain
associations using a heuristic algorithm. Our method utilizes the hierarchical nature of
the GO ontology to associate functions to domains based on the (non-ideal) intersection
of the functions assigned to the proteins which have similarity to the domain. Use of a
functional ‘hierarchy’ allows rules to be as general as necessary, but also as specific as
possible. All rule sets and predictions are stored in our data warehouse, GUS (Genomic
Unified Schema)(2), along with their supporting evidence.
METHODS
Figure 1. Illustration of the approach used to assign ProDom domains to Gene Ontology (GO) functions. Proteins (gene products) are
assigned a molecular function by members of the Gene Ontology Consortium. BLAST similarities are used to associate the
GO-assigned proteins to ProDom domains. A heuristic algorithm is then used to assign GO functions to ProDom domains (learned
associations). The process followed for CDD is similar and uses RPS-Blast.
Rule Generation Algorithm
1. BLAST GO-annotated proteins against ProDom and CDD. Only keep results with p-values <= 10e-5.
2. For each ProDom AND CDD domain:
a. Generate a list of proteins and their p-values from the BLAST runs. Sort the list according to
p-value. If there are no proteins on the list, then generate a “no protein” rule and go on to the next
domain.
b. Go through the list to generate a rule for the domain.
i. Assign a function(s) to the domain based on the best p-value. This is a “one protein” rule. If
there are no more proteins, go on to the next domain.
ii. Consider the next protein on list with those above it. For these proteins, go through the rule
generators in the order listed (below) until the rule conditions are met. Assign that rule to
the domain at the p-value for the lowest protein on the list considered. Repeat this step
until there are no more proteins on the list and go on to the next domain.
(a)
1. Single Function: all proteins have the same single function.
2. Consensus Leaf: has multiple functions but shares a common function with other proteins.
3. Near Consensus Leaf: has multiple functions but shares a common function with at least 75% of other proteins.
4. Near Ancestor Consensus: go up in the function hierarchy until a common function is found with at least 75% of other proteins.
(b)
Figure 2. Algorithm for assigning a GO function to a protein domain. Part (a) is the algorithm and part (b) lists the rule types.
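The rule generators in Figure 2 can be sketched as follows. This is a minimal reading of the rule descriptions, not the authors' implementation: the function names, the representation of the GO hierarchy as a child-to-parents dictionary, and the choice to report only the most specific shared ancestor are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

THRESHOLD = 0.75  # "at least 75%" in the rule descriptions


def shared(protein_funcs: List[Set[str]], min_fraction: float) -> Set[str]:
    """Functions carried by at least min_fraction of the proteins."""
    counts = Counter(f for funcs in protein_funcs for f in funcs)
    needed = min_fraction * len(protein_funcs)
    return {f for f, n in counts.items() if n >= needed}


def with_ancestors(funcs: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return funcs plus every ancestor reachable through parents."""
    seen, stack = set(funcs), list(funcs)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific(funcs: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Drop any term that is an ancestor of another term in the set."""
    redundant: Set[str] = set()
    for f in funcs:
        redundant |= with_ancestors({f}, parents) - {f}
    return funcs - redundant


def generate_rule(
    protein_funcs: List[Set[str]], parents: Dict[str, Set[str]]
) -> Tuple[str, Set[str]]:
    """Try the rule generators in the listed order; return (rule type, functions)."""
    if not protein_funcs:
        return ("no protein", set())
    if len(protein_funcs) == 1:
        return ("one protein", set(protein_funcs[0]))
    if all(len(f) == 1 for f in protein_funcs) and len(set.union(*protein_funcs)) == 1:
        return ("single function", set(protein_funcs[0]))
    common = shared(protein_funcs, 1.0)
    if common:
        return ("consensus leaf", common)
    common = shared(protein_funcs, THRESHOLD)
    if common:
        return ("near consensus leaf", common)
    # Assumption: "going up" is modeled by expanding every protein to all of
    # its ancestors and keeping the most specific sufficiently shared terms.
    expanded = [with_ancestors(f, parents) for f in protein_funcs]
    common = most_specific(shared(expanded, THRESHOLD), parents)
    if common:
        return ("near ancestor consensus", common)
    return ("no rule", set())


def rules_for_domain(
    hits: List[Tuple[float, Set[str]]], parents: Dict[str, Set[str]]
) -> List[Tuple[Optional[float], str, Set[str]]]:
    """hits holds (p-value, functions) per similar protein; one rule per sorted prefix."""
    hits = sorted(hits, key=lambda h: h[0])
    if not hits:
        return [(None, "no protein", set())]
    rules = []
    for i in range(1, len(hits) + 1):
        kind, funcs = generate_rule([f for _, f in hits[:i]], parents)
        # The rule is recorded at the p-value of the lowest protein considered.
        rules.append((hits[i - 1][0], kind, funcs))
    return rules
```

For example, with a toy hierarchy in which "protein kinase" and "lipid kinase" both have the parent "kinase", a domain hit by one protein of each kind yields a "one protein" rule at the best p-value and a "near ancestor consensus" rule for "kinase" at the weaker p-value. This illustrates how the hierarchy lets a rule be as general as necessary while staying as specific as possible.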
Function Prediction Algorithm
1. BLAST sequences (NA or AA) against ProDom and CDD.
2. For each query sequence having a hit with p-value/e-value <= 10e-5:
a. Generate a list of the domains hit that have an AAMotifGOFunctionRuleSet generated.
b. Iterate over the list generated in (a) to predict GO Functions for the novel sequence:
i. If the similarity p-value is better than the rule p-value threshold used to generate the rule, then
apply the rules to the sequence and continue to the next domain (rule set) in the list.
ii. If the similarity p-value satisfies the p-value ratio (1) set, then apply the rules to the
sequence and continue to the next domain.
iii. If the similarity p-value satisfies the p-value threshold (2), then apply the rules to the
sequence and continue to the next domain.
iv. Otherwise, continue to the next domain; this similarity does not satisfy the conditions required to
apply the rules.
(a)
1. pv-ratio: -log(sim pv) / -log(rule pv), used to vary the acceptance of similarities that
are close to the p-value threshold of the rule.
2. pv-threshold: the p-value threshold is used as a means to avoid missing predictions when
the p-values are very low but the p-value ratio is not met.
(b)
Figure 3. Algorithm for predicting GO functions for novel sequences. Part (a) is the algorithm and part (b) is a description of the
pertinent parameters.
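The acceptance test in Figure 3 can be sketched as below. The poster does not give the parameter values, so the defaults `min_ratio=0.8` and `pv_threshold=1e-50` are placeholders chosen for illustration, as are the function names and the one-rule-per-domain data layout.

```python
import math
from typing import Dict, List, Set, Tuple

TINY = 1e-300  # floor so that a reported p-value of 0 does not break the logarithm


def pv_ratio(sim_pv: float, rule_pv: float) -> float:
    """-log(sim pv) / -log(rule pv); assumes both p-values are below 1."""
    return math.log10(max(sim_pv, TINY)) / math.log10(max(rule_pv, TINY))


def rule_applies(
    sim_pv: float,
    rule_pv: float,
    min_ratio: float = 0.8,       # hypothetical value for parameter (1)
    pv_threshold: float = 1e-50,  # hypothetical value for parameter (2)
) -> bool:
    """Apply steps i to iv of the prediction algorithm to one similarity."""
    if sim_pv < rule_pv:                          # i. better than the rule p-value
        return True
    if pv_ratio(sim_pv, rule_pv) >= min_ratio:    # ii. close enough to the rule p-value
        return True
    if sim_pv <= pv_threshold:                    # iii. very low p-value on its own
        return True
    return False                                  # iv. conditions not satisfied


def predict(
    hits: List[Tuple[str, float]],
    rules: Dict[str, Tuple[float, Set[str]]],
) -> Set[str]:
    """hits holds (domain, similarity p-value); rules maps domain to (rule p-value, functions)."""
    predicted: Set[str] = set()
    for domain, sim_pv in hits:
        if domain not in rules:
            continue  # no rule set was generated for this domain
        rule_pv, functions = rules[domain]
        if rule_applies(sim_pv, rule_pv):
            predicted |= functions
    return predicted
```

Under these placeholder settings, a similarity of 1e-18 against a rule generated at 1e-20 is accepted through the ratio (0.9), while a similarity of 1e-60 against a rule generated at 1e-100 fails the ratio (0.6) but is still accepted by the threshold test.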
MATERIALS
Data Sources for Rule Generation
GO Ontology Version
Description | Version
Function Ontology Version | v2.61
Table 1. GO Function Ontology Version
GO Gene Association Versions
Description | Version | GO Function Assignments Loaded into GUS
Mouse Gene Associations | v1.32 | 5408 terms, 21031 ancestors
Fly Gene Associations | v1.28 | 3961 terms, 16624 ancestors
Yeast Gene Associations | v. 1.302 | 5872 terms, 16124 ancestors
Table 2. GO Association Versions (GO ontologies and associations obtained from www.geneontology.org.)
Sequence Source Databases
Description | Version | Number of Sequences Loaded into GUS
SwissProt | v39.22 | 98739
trEMBL (subset) | v17.0 | 911
ProDom | 2001.1 | 271051 (95518 with > 1 contained sequence)
Flybase (aa_gadfly_dros) | Release2 | 13288
SGD (Translated ORFs) | Downloaded June 28, 2001 | 6358
Table 3. Sequence Source Databases
Data Sources for Function Prediction
Sequence Source Databases
Description | Version | Number of Sequences Loaded into GUS
SwissProt | v39.22 | 98739
musDOTS | Development Allgenes | 363523 assemblies / 65610 “Genes”
wormpep54 | 54 | 19774
ATH1_pep | June 13, 2001 | 25009
Table 4. Sequence Source Databases
Summary of Blast and RPS-Blast
Program | Parameters | Queries | Search Database | Reason for Execution | Number of Query Sequences With Similarities Loaded into GUS
WUBlastp | -wordmask seg+xnu W=3 T=1000 | GO-Associated Proteins in GUS | ProDom | Rule Generation | 20946
WUBlastp | -wordmask seg+xnu W=3 T=1000 | SwissProt | ProDom | Prediction | 96438
WUBlastx | -wordmask seg+xnu W=3 T=1000 | MusDOTS | ProDom | Prediction | 71460
WUBlastp | -wordmask seg+xnu W=3 T=1000 | Wormpep54 in GUS | ProDom | Prediction | 19499
WUBlastp | -wordmask seg+xnu W=3 T=1000 | A. thaliana Translated ORFs in GUS | ProDom | Prediction | 24696
RPS-Blast | Defaults | All GO-Associated Proteins in GUS | CDD | Rule Generation | 16950
RPS-Blast | Defaults | SwissProt | CDD | Prediction | 72650
RPS-Blast | -p=F | MusDOTS | CDD | Prediction | 37448
RPS-Blast | Defaults | A. thaliana Translated ORFs in GUS | CDD | Prediction | 12878
RPS-Blast | Defaults | C. elegans (wormpep54) | CDD | Prediction | 9882
Table 5. Summary of Blast results
RESULTS
Rule Groups and Rule Type Distributions
[Figure: pie charts of the rule type distribution (One Similar GO Protein, Single Function, Consensus Leaf, Near Consensus Leaf, Near Ancestor Consensus) for each rule group.]
Total rule set counts: CDD No IEA = 1427; ProDom No IEA = 11113; CDD ALL = 1862; ProDom ALL = 19785.
Analysis of Rule Accuracy
Manual review of approximately 1000 of the 7299 original ProDom rules indicated an accuracy rate of 90-95%.
A small sampling of the newly generated rules (approximately 50) for ProDom or CDD was
examined to assess their accuracy. A range of different rule types was examined, as well as a range of
p-values associated with the rules. When the ProDom rules were examined, an accuracy of 91% was found,
while for the CDD rules the accuracy was 82%. More ProDom and CDD rules need to be examined to
substantiate these percentages.
Comparison to InterPro GO Function Mappings
[Figure: bar chart "Coverage of ProDom and Pfam Domains Mapped To InterPro"; y-axis: Coverage Percentage (0-100); x-axis: ProDom, Pfam; series (Method): InterPro, CBIL; bar labels: 50, 49, 44, 43.]
[Figure: pie chart "Existence of GO Association for ProDom Domain Mapped To InterPro"; categories: Neither have Assoc., InterPro not CBIL, CBIL not InterPro, Both have Assoc.; slice labels: 29%, 36%, 14%, 21%; 89% Agreement, Avg. Depth = 3.26.]
[Figure: pie chart "Existence of GO Association for Pfam Domain Mapped To InterPro"; same categories; slice labels: 27%, 34%, 22%, 17%; 81% Agreement, Avg. Depth = 3.67.]
Summary of Predicted Functions
Function Prediction Summary
Dataset | Rule Group | Number of Entries with Prediction(s) | Number of Predictions | % Coverage
Swiss-Prot | NoIEA_GO_ProDom | 39193 | 196491 | 39.7%
Swiss-Prot | All_GO_ProDom | 47801 | 257905 | 46.4%
Swiss-Prot | NoIEA_GO_CDD | 40727 | 176202 | 41.2%
Swiss-Prot | All_GO_CDD | 44520 | 210278 | 45.1%
Swiss-Prot | UNION NO IEA | 47616 | 246543 | 48.2%
Swiss-Prot | UNION ALL | 53444 | 301001 | 54.1%
MusDoTS | ProDom_NoIEA | 17754 (9776 “genes”) | 74601 | 14.9%
MusDoTS | ProDom_All_GO | 23543 (9916 “genes”) | 94347 | 15.1%
MusDoTS | CDD_NoIEA | 17367 (8759 “genes”) | 69911 | 13.3%
MusDoTS | CDD_All_GO | 18545 (8715 “genes”) | 75479 | 13.3%
MusDoTS | UNION NOIEA | 24198 (11195 “genes”) | 145423 | 17.1%
MusDoTS | UNION All | 11769 “genes” | 163746 | 17.9%
A. thaliana | Union NoIEA | 9223 | 56749 | 36.9%
A. thaliana | Union All | 10995 | 62798 | 44.0%
C. elegans | Union NoIEA | 8003 | 38357 | 40.5%
C. elegans | Union All | 9318 | 38758 | 47.1%
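The % Coverage column appears to be the number of entries with at least one prediction divided by the number of sequences loaded into GUS for that dataset (Table 4), with the MusDoTS rows using the "genes" count over the 65610 "Genes". This reading is inferred from the reported numbers rather than stated on the poster, and the helper name below is hypothetical.

```python
def coverage_percent(entries_with_prediction: int, total_entries: int) -> float:
    """Percentage of entries that received at least one predicted function."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return round(100.0 * entries_with_prediction / total_entries, 1)


# Totals taken from Table 4; counts taken from the summary table.
SWISSPROT_TOTAL = 98739
MOUSE_GENE_TOTAL = 65610
ATHALIANA_TOTAL = 25009

swissprot_prodom_noiea = coverage_percent(39193, SWISSPROT_TOTAL)
mouse_prodom_noiea = coverage_percent(9776, MOUSE_GENE_TOTAL)
athaliana_union_noiea = coverage_percent(9223, ATHALIANA_TOTAL)
```

Under this reading the three example rows reproduce the reported 39.7%, 14.9% and 36.9%.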
Top-Level Function Prediction Distribution (Prodom vs. CDD)
[Figure: musDOTS Top Level Function Prediction Distribution (ProDom vs. CDD). Bar chart of the percentage of predictions (axis 0-35) per GO function for two series, CDD and ProDom. GO functions shown: enzyme, nucleic acid binding, ligand binding or carrier, transporter, signal transducer, structural protein, cell adhesion molecule, chaperone, defense/immunity protein, enzyme inhibitor, cell cycle regulator, motor, microtubule binding, enzyme activator, storage protein, apoptosis regulator, other.]
Coverage Analysis and Comparison To Other Methods
[Figure: Function Prediction Coverage. Two bar charts by domain rule sets used (ProDom, CDD, UNION): "SwissProt Function Prediction" (y-axis: Number of SwissProt with Prediction, 0-60000) and "Mouse "Gene" Prediction" (y-axis: Number of Mouse "Genes" with Prediction, 0-14000).]
[Figure: Prediction Coverage Comparison. Bar chart of Coverage Percentage (0-100) for CBIL vs. Alternative Method on SwissProt, A. thaliana and C. elegans.]
Agreement with other Computational Methods
[Figure: pie chart "SwissProt Function Prediction Agreement at Top Level, CBIL vs. Compugen": Agree 70%; Compugen Incorrect 15%; CBIL Incorrect 15%.]
Prediction Agreement Assessment of Compugen vs. CBIL at the top level of the GO
function ontology. Agreement is evaluated at the level of an individual rule.
[Figure: pie charts "CBIL Agreement with Other Methods": WormBase 88% Agree (12% not in agreement); TAIR 94% Agree (6% not in agreement).]
Prediction Agreement Assessment of CBIL vs. WormBase and TAIR at the top level of the GO function ontology.
Agreement is evaluated at the sequence level. The two methods agree if they share a common function for a gene product.
Percentages are based on the subset of gene products for which both methods predict a function.
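The sequence-level agreement measure described in the caption can be sketched as follows. The function name and the input layout (a dictionary from gene product identifier to a set of top-level GO functions) are assumptions for illustration, and the inputs are assumed to have already been mapped to top-level terms.

```python
from typing import Dict, Optional, Set


def agreement_rate(
    method_a: Dict[str, Set[str]],
    method_b: Dict[str, Set[str]],
) -> Optional[float]:
    """Fraction of gene products on which two methods agree.

    Only gene products for which both methods predict at least one
    function are counted. The methods agree on a gene product if their
    predicted function sets share at least one function.
    Returns None when no gene product is predicted by both methods.
    """
    both = [gene for gene in method_a if method_a[gene] and method_b.get(gene)]
    if not both:
        return None
    agree = sum(1 for gene in both if method_a[gene] & method_b[gene])
    return agree / len(both)


# Toy predictions with hypothetical gene product identifiers.
cbil = {"g1": {"enzyme"}, "g2": {"transporter"}, "g3": {"enzyme"}}
other = {"g1": {"enzyme", "transporter"}, "g2": {"enzyme"}, "g4": {"motor"}}
rate = agreement_rate(cbil, other)  # g1 agrees, g2 does not; g3 and g4 are excluded
```

Restricting the denominator to gene products predicted by both methods keeps agreement separate from coverage, which is reported on its own.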
The differences between the SwissProt predictions made by ProDom/CDD (CBIL) versus the Compugen predictions (CGEN)
were manually evaluated by inspecting the annotation associated with the SwissProt entry. Of the 30% of predictions that
disagreed, half (15% of the total) were cases in which the ProDom/CDD prediction was incorrect and half (15% of the total)
were cases in which the Compugen prediction was incorrect. From the manual inspection, general performance observations were
made. The CBIL rules performed better at predicting the parent terms transporter and signal transducer, whereas Compugen
performed better at predicting the parent term enzyme.
Particular examples are illustrated below:
P79350, OPRM_BOVIN, MU-TYPE OPIOID RECEPTOR (MOR-1)
CBIL: P79350 4871 signal transducer
CGEN: P79350 3824 enzyme
P26431, NAH1_RAT, SODIUM/HYDROGEN EXCHANGER 1 (NA(+)/H(+) EXCHANGER 1) (NHE-1)
CBIL: P26431 5215 transporter
CGEN: P26431 3824 enzyme
Q57290, Y740_HAEIN, PROBABLE PHOSPHOMANNOMUTASE (EC 5.4.2.8) (PMM)
CBIL: Q57290 5554 molecular_function unknown
CGEN: Q57290 3824 enzyme
CONCLUSION
The heuristic algorithm presented for associating Gene Ontology (GO)-defined
molecular functions to protein domains as listed in ProDom and CDD performed well.
Initial accuracy of the domain-function rules is estimated to be 90-95% for ProDom and
approximately 82% for CDD; additional review is necessary. Future GO annotations
may provide useful test sets for this analysis. The CBIL algorithm was comparable to
other computational methods in terms of coverage and agreement of predictions. The
use of multiple domain databases helped to increase the coverage slightly with little
domain sensitivity.
References and Acknowledgements:
REFERENCES
(1) The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genet. 25: 25-29
(2) S. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, C. Overton and C. Stoeckert (2001) IBM Systems Journal of Life Sciences. In press.
(3) R. Apweiler, et al. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research. 29: 37-40
WEBSITE RESOURCES
Gene Ontology Consortium: http://www.geneontology.org
ProDom: http://www.toulouse.inra.fr/prodom.html
CDD and RPS-Blast: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
Swiss-Prot: http://www.expasy.ch/sprot/sprot-top.html
Allgenes Index: http://www.allgenes.org
FlyBase: http://flybase.bio.indiana.edu/
Saccharomyces cerevisiae Genome Database (SGD): http://genome-www.stanford.edu/Saccharomyces/
Mouse Genome Database (MGD) & Gene Expression Database (GXD): http://www.informatics.jax.org/
ACKNOWLEDGMENTS:
This work was funded in part by grants from the DOE (DE-FG02-00ER62893) and NIH (RO1-HG01539).