GO-introduction

http://www.geneontology.org/index.shtml
An Introduction to the Gene Ontology (GO)
The Gene Ontology project provides a controlled
vocabulary to describe gene and gene product
attributes in any organism.
• Search the Gene Ontology Database: search for genes, proteins
or GO terms (by gene or protein name, or by GO term or ID) using
AmiGO. AmiGO is the official GO browser and search engine.
Browse the Gene Ontology with AmiGO.
• What does the Gene Ontology Consortium do?
• Terms in the Gene Ontology
• Species-specific terms
• Obsolete terms
• The Ontologies
• Cellular component
• Biological process
• Molecular function
• Ontology structure
• What GO is NOT
• Annotation and tools
• Downloads
• Beyond GO
• Cross-products
• Mappings to other classification systems
• Contributing to GO
What does the Gene Ontology Consortium do?
Biologists currently waste a lot of time and effort in searching for all of the
available information about each small area of research.
This search is hampered further by the wide variations in terminology that may
be in common usage at any given time, which inhibit effective searching by both
computers and people.
For example, if you were searching for new targets for antibiotics, you might
want to find all the gene products that are involved in bacterial protein
synthesis, and that have significantly different sequences or structures from
those in humans. If one database describes these molecules as being involved
in 'translation', whereas another uses the phrase 'protein synthesis', it will be
difficult for you - and even harder for a computer - to find functionally
equivalent terms.
The Gene Ontology (GO) project is a collaborative effort to address the need for
consistent descriptions of gene products in different databases.
The project began as a collaboration between three model organism databases,
FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse
Genome Database (MGD), in 1998.
Since then, the GO Consortium has grown to include many databases, including
several of the world's major repositories for plant, animal and microbial
genomes. See the GO Consortium page for a full list of member organizations.
The GO project has developed three structured controlled vocabularies
(ontologies) that describe gene products in terms of their associated biological
processes, cellular components and molecular functions in a species-independent manner.
There are three separate aspects to this effort:
first, the development and maintenance of the ontologies themselves;
second, the annotation of gene products, which entails making associations
between the ontologies and the genes and gene products in the collaborating
databases;
and third, development of tools that facilitate the creation, maintenance and use
of ontologies.
The use of GO terms by collaborating databases facilitates uniform queries
across them.
The controlled vocabularies are structured so that they can be queried at
different levels:
for example, you can use GO to find all the gene products in the mouse genome
that are involved in signal transduction,
or you can zoom in on all the receptor tyrosine kinases.
This structure also allows annotators to assign properties to genes or gene
products at different levels, depending on the depth of knowledge about that
entity.
Terms in the Gene Ontology
The building blocks of the Gene Ontology are the terms, so what makes up a GO
term?
Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and
a term name, e.g. cell, fibroblast growth factor receptor binding or signal
transduction.
Each term is also assigned to one of the three ontologies, molecular function,
cellular component or biological process.
The majority of terms have a textual definition, with references stating the
source of the definition.
If any clarification of the definition or remarks about term usage are required,
these are held in a separate comments field.
Many GO terms have synonyms; GO uses 'synonym' in a loose sense, as the
names within the synonyms field may not mean exactly the same as the term
they are attached to. Instead, a GO synonym may be broader or narrower than
the term string; it may be a related phrase; it may be alternative wording,
spelling or use a different system of nomenclature; or it may be a true
synonym. This flexibility allows GO synonyms to serve as valuable search aids,
as well as being useful for applications such as text mining and semantic
matching. The relationship of the synonym to the term is recorded within the
GO file.
The scope of the Gene Ontology overlaps with a number of other databases, and
in cases where a GO term is identical in meaning to an object in another
database, a database cross reference is added to the term. These cross
references can also be downloaded from the mappings to GO page.
Species-specific terms
The Gene Ontology aims to provide a controlled vocabulary
that can be used to describe any organism; nevertheless, many functions,
processes and components are not common to all life forms. The convention is
to include any term that can apply to more than one taxonomic class of
organism. To specify the class of organisms to which a term is applicable, GO
uses the designator sensu, 'in the sense of'; for example, trichome differentiation
(sensu Magnoliophyta) represents the differentiation of plant hair cells
(trichomes).
Obsolete terms
Occasionally, a term is found that is outside the scope of GO, is
misleadingly named or defined, or describes a concept that would be better
represented in another way. Rather than delete the term, it is deprecated or
made obsolete. The term and ID still exist in the GO database, but the term is
marked as obsolete, and a comment added, giving a reason for the obsoletion
and recommending alternative terms where appropriate.
The Ontologies
• The three organizing principles of GO are cellular
component, biological process and molecular function. A
gene product might be associated with or located in one
or more cellular components; it is active in one or more
biological processes, during which it performs one or
more molecular functions. For example, the gene
product cytochrome c can be described by the molecular
function term oxidoreductase activity, the biological
process terms oxidative phosphorylation and induction of
cell death, and the cellular component terms
mitochondrial matrix and mitochondrial inner membrane.
Cellular component
• A cellular component is just that, a
component of a cell, but with the
proviso that it is part of some larger
object; this may be an anatomical
structure (e.g. rough endoplasmic
reticulum or nucleus) or a gene product
group (e.g. ribosome, proteasome or a
protein dimer). See the documentation on
the cellular component ontology for more
details.
Biological process
• A biological process is a series of events accomplished by one or
more ordered assemblies of molecular functions. Examples of
broad biological process terms are cellular physiological
process or signal transduction.
• Examples of more specific terms are pyrimidine metabolism or
alpha-glucoside transport.
• It can be difficult to distinguish between a biological process
and a molecular function, but the general rule is that a
process must have more than one distinct step.
• A biological process is not equivalent to a pathway; at present, GO does
not try to represent the dynamics or dependencies that would
be required to fully describe a pathway.
• Further information can be found in the process ontology documentation.
Molecular function
Molecular function describes activities, such as catalytic or binding activities,
that occur at the molecular level.
GO molecular function terms represent activities rather than the entities
(molecules or complexes) that perform the actions, and do not specify where or
when, or in what context, the action takes place.
Molecular functions generally correspond to activities that can be performed by
individual gene products, but some activities are performed by assembled
complexes of gene products.
Examples of broad functional terms are catalytic activity, transporter activity, or
binding;
examples of narrower functional terms are adenylate cyclase activity or Toll
receptor binding.
It is easy to confuse a gene product name with its molecular function, and for
that reason many GO molecular functions are appended with the word "activity".
The documentation on gene products explains this confusion in more depth. The
documentation on the function ontology explains more about GO functions and the rules
governing them.
Ontology structure
The terms in an ontology are linked by two relationships, is_a and
part_of. is_a is a simple class-subclass relationship,
where A is_a B means that A is a subclass of B; for example, nuclear
chromosome is_a chromosome.
part_of is slightly more complex; C part_of D means that whenever C is
present, it is always a part of D, but C does not always have to be
present. An example would be nucleus part_of cell; nuclei are always
part of a cell, but not all cells have nuclei.
The ontologies are structured as directed acyclic graphs,
which are similar to hierarchies but differ in that
a child, or more specialized, term can have many parents, or less
specialized, terms.
For example, the biological process term hexose biosynthesis has two
parents, hexose metabolism and monosaccharide biosynthesis. This is
because biosynthesis is a subtype of metabolism, and a hexose is a
type of monosaccharide. When any gene involved in hexose
biosynthesis is annotated to this term, it is automatically annotated to
both hexose metabolism and monosaccharide biosynthesis,
because every GO term must obey the true path rule: if the child term
describes the gene product, then all its parent terms must also apply
to that gene product.
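The true path rule can be sketched in a few lines of Python. This is a minimal illustration, not GO tooling: the PARENTS mapping is a hand-written fragment built from the term names in the example, the grandparent term "monosaccharide metabolism" is an assumption added only to show propagation over more than one level, and `ancestors` and `propagate` are hypothetical helper names.

```python
# Toy fragment of the biological process DAG: child term -> direct parents.
# "monosaccharide metabolism" is an assumed extra level, for illustration only.
PARENTS = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    "hexose metabolism": ["monosaccharide metabolism"],
    "monosaccharide biosynthesis": ["monosaccharide metabolism"],
    "monosaccharide metabolism": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents=PARENTS):
    """Apply the true path rule: a gene annotated to a term is also
    annotated to all of that term's ancestors."""
    implied = set(direct_annotations)
    for term in direct_annotations:
        implied |= ancestors(term, parents)
    return implied


if __name__ == "__main__":
    print(sorted(propagate({"hexose biosynthesis"})))
```

Because a term can have several parents, the traversal keeps a `seen` set so that a shared ancestor reached along two routes is only visited once.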
What GO is NOT
It is important to clearly state the scope of GO, and what it does and does not
cover. The ontologies section explains the domains covered by GO; the following
areas are outside the scope of GO, and terms in these domains would not
appear in the ontologies.
• Gene products: e.g. cytochrome c is not in the ontologies, but attributes of
cytochrome c, such as oxidoreductase activity, are.
• Processes, functions or components that are unique to mutants or
diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not
the normal function of any gene.
• Attributes of sequence such as intron/exon parameters: these are not
attributes of gene products and will be described in a separate sequence
ontology (see the OBO website for more information).
• Protein domains or structural features.
• Protein-protein interactions.
• Environment, evolution and expression.
• Anatomical or histological features above the level of cellular components,
including cell types.
GO is not a database of gene sequences, nor a catalog of gene products. Rather,
GO describes how gene products behave in a cellular context.
GO is not a dictated standard, mandating nomenclature across databases. Groups
participate because of self-interest, and cooperate to arrive at a consensus.
GO is not a way to unify biological databases (i.e. GO is not a 'federated
solution'). Sharing vocabulary is a step towards unification, but is not, in
itself, sufficient. Reasons for this include the following:
• Knowledge changes and updates lag behind.
• Individual curators evaluate data differently. While we can agree to use the
word 'kinase', we must also agree to support this by stating how and why we use
'kinase', and consistently apply it. Only in this way can we hope to compare
gene products and determine whether they are related.
• GO does not attempt to describe every aspect of biology; its scope is limited
to the domains described above.
Annotation and tools
How do the terms in GO become associated with their appropriate
gene products? Collaborating databases annotate their genes or gene
products with GO terms, providing references and indicating what
kind of evidence is available to support the annotations. More
information can be found in the GO Annotation Guide. If you browse any
of the contributing databases, you'll find that each gene or gene
product has a list of associated GO terms. Each database also
publishes downloadable files containing these associations; these can
be downloaded from the GO annotations page. You can browse the
ontologies using a range of web-based browsers. A full list of these,
and other tools for analyzing gene function using GO, is available in
the GO Tools section. In addition, the GO consortium has prepared GO
slims, 'slimmed down' versions of the ontologies that allow you to
annotate genomes or sets of gene products to gain a high-level view
of gene functions. Using GO slims you can, for example, work out
what proportion of a genome is involved in signal transduction,
biosynthesis or reproduction. See the GO Slim Guide for more
information.
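As a rough illustration of how a GO slim gives a high-level view, the sketch below maps each gene's annotations up to a small set of slim terms and reports the fraction of genes under each. Everything here is assumed for the example: the PARENTS fragment, the two-term SLIM set, the gene names, and the helper names `map_to_slim` and `slim_fractions`. It is not the Consortium's own mapping tool.

```python
from collections import Counter

# Toy ontology fragment (child -> direct parents) and a two-term slim.
# Both are illustrative assumptions, not real GO slim files.
PARENTS = {
    "hexose biosynthesis": ["monosaccharide biosynthesis"],
    "monosaccharide biosynthesis": ["biosynthesis"],
    "biosynthesis": [],
    "signal transduction": [],
}
SLIM = {"biosynthesis", "signal transduction"}


def map_to_slim(term, parents=PARENTS, slim=SLIM):
    """Return the slim terms found among `term` and its ancestors."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


def slim_fractions(gene_annotations, parents=PARENTS, slim=SLIM):
    """Fraction of genes that map to each slim term (each gene counted once per term)."""
    counts = Counter()
    for terms in gene_annotations.values():
        gene_slim = set()
        for term in terms:
            gene_slim |= map_to_slim(term, parents, slim)
        counts.update(gene_slim)
    total = len(gene_annotations)
    return {term: counts[term] / total for term in slim}


if __name__ == "__main__":
    genes = {  # hypothetical gene names and annotations
        "geneA": ["hexose biosynthesis"],
        "geneB": ["signal transduction"],
        "geneC": ["hexose biosynthesis", "signal transduction"],
        "geneD": [],
    }
    print(slim_fractions(genes))
```

The sketch assumes a non-empty gene set; a gene annotated to several terms under the same slim term is still counted only once for that term.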
Beyond GO
GO allows us to annotate genes and their products with a limited set
of attributes. For example, GO does not allow us to describe genes in
terms of which cells or tissues they're expressed in, which
developmental stages they're expressed at, or their involvement in
disease. It is not necessary for GO to do these things because other
ontologies are being developed for these purposes. The GO
consortium supports the development of other ontologies and makes
its tools for editing and curating ontologies freely available. A list of
freely available ontologies that are relevant to genomics and
proteomics and are structured similarly to GO can be found at the
Open Biomedical Ontologies website. A larger list, which includes the
ontologies listed at OBO and also other controlled vocabularies that do
not fulfill the OBO criteria, is available at the Ontology Working Group
section of the Microarray Gene Expression Data (MGED) Network site.
Download
All data from the GO project is freely available. You can download the
ontology data in a number of different formats, including XML and
MySQL, from the GO Downloads page. For more information on the
syntax of these formats, see the GO File Format Guide.
If you need lists of the genes or gene products that have been associated
with a particular GO term, the Current Annotations table tracks the number of
annotations and provides links to the gene association files for each of
the collaborating databases.
GO term enrichment
• Hypergeometric
• SGD: example using LA output
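The hypergeometric test named in the first bullet can be sketched as follows. This is a minimal standard-library version of the usual one-sided enrichment p-value, P(X >= hits); the function name, parameter names and the example numbers are assumptions for illustration, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` carry the GO term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 6000 genes in the genome, 100 annotated to the
    # term, 50 genes in the list under study, 10 of them annotated.
    print(hypergeom_pvalue(6000, 100, 50, 10))
```

A small p-value indicates that the term is over-represented in the gene list relative to the genome-wide background.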
Transitive functional annotation by shortest-path
analysis of gene expression data
PNAS | October 1, 2002 | vol. 99 | no. 20 | 12783-12788
Xianghong Zhou*, Ming-Chih J. Kao*, and Wing Hung Wong
Fig. 1. (A) Application of the shortest-path (SP) algorithm to gene expression data. Nine genes are
depicted in the graph. The distance between two genes is a decreasing function of their correlation. For
example, there are multiple expression dependence paths leading from gene a to gene e. Among them, the
shortest dependence path is a-b-c-d-e, with genes b, c, and d serving as the transitive genes. This is the
most parsimonious summary of the expression relationship between the terminal genes a and e. (B) Level
0 (L0) and level 1 (L1) matches of genes on the SP a-b-c-d-e defined according to their relationships in the
Gene Ontology (GO) classification tree. With respect to the terminal genes a and e, the transitive gene b is
a L0 match because it is annotated in the informative node where a and e are annotated; the transitive
gene c is a L1 match because it shares the same direct parent as the two terminal genes; the transitive
gene d is neither a L0 nor a L1 match.
Current methods for the functional analysis of microarray gene
expression data make the implicit assumption that genes with similar
expression profiles have similar functions in cells. However, among
genes involved in the same biological pathway, not all gene pairs show
high expression similarity. Here, we propose that transitive expression
similarity among genes can be used as an important attribute to link
genes of the same biological pathway. Based on large-scale yeast
microarray expression data, we use the shortest-path analysis to identify
transitive genes between two given genes from the same biological
process. We find that not only functionally related genes with correlated
expression profiles are identified but also those without. In the latter
case, we compare our method to hierarchical clustering, and show that
our method can reveal functional relationships among genes in a more
precise manner. Finally, we show that our method can be used to reliably
predict the function of unknown genes from known genes lying on the
same shortest path. We assigned functions for 146 yeast genes that are
considered as unknown by the Saccharomyces Genome Database and
by the Yeast Proteome Database. These genes constitute around 5% of
the unknown yeast ORFome.
Data Processing
• Saccharomyces cerevisiae gene expression profiles from the
Rosetta Compendium (6), which includes 300 deletion and drug
treatment experiments. Genes were annotated by using the
biological process ontology of Gene Ontology (GO) (7) provided
by the Saccharomyces Genome Database (SGD) (8).
After removing the genes without GO process annotation and the
20 genes for which there are less than 80 experimental measurements
in the Rosetta Compendium, we were left with 266 mitochondrial,
398 cytoplasmic, and 659 nuclear GO-annotated genes.
For each of the three sets of genes, we calculated the expression
similarities of all gene pairs {a, b} using $C_{a,b}$, the minimum of the
absolute value of leave-one-out Pearson correlation coefficient
estimates. This estimate is a measurement robust against single
experiment outliers and sensitive to overall similarities in expression
patterns.
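The similarity measure described above can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: it assumes complete profiles with no missing measurements, treats a constant profile as having zero correlation, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)


def loo_min_abs_correlation(x, y):
    """Minimum over experiments i of |r| computed with experiment i left out."""
    x, y = list(x), list(y)
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two profiles of equal length >= 3")
    return min(
        abs(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
        for i in range(len(x))
    )


if __name__ == "__main__":
    # A single shared outlier drives the ordinary correlation to 1,
    # but the leave-one-out minimum exposes it.
    a = [0, 0, 0, 0, 10]
    b = [0, 0, 0, 0, 10]
    print(pearson(a, b), loo_min_abs_correlation(a, b))
```

Taking the minimum means a pair of genes scores highly only if the correlation survives the removal of any single experiment, which is what makes the estimate robust to single-experiment outliers.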
Graph Construction and SP Computation.
We constructed three graphs, one for each set of the 266 mitochondrial genes,
the 398 cytoplasmic genes, and the 659 nuclear genes. In each graph, two
genes were assigned an edge if their absolute expression correlation $C_{a,b}$ was
higher than 0.6.
This cut-off, while conservative, nonetheless retains a sufficient number of
connected gene pairs in the graph. The edge length between vertices a and b is
$d_{a,b} = f(C_{a,b}) = (1 - C_{a,b})^k$. The powering factor k is used to enhance the
differences between low and high correlations. Because the length of a path is
the sum of the individual edge lengths, by exaggerating the differences between
edge lengths, the SPs will be more likely to cover more transitive genes. Thus by
increasing k we gain more power to reveal transitive co-expression. We set
k = 6 because for k ≥ 6, the number of transitive genes stabilizes (detailed
results at www.biostat.harvard.edu/complab/SP/). To ensure the quality of SPs, we
consider only SPs with total path lengths <0.008.
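The graph construction and shortest-path step can be sketched as below, taking the edge length to be (1 - C)^k with k = 6, the 0.6 correlation cut-off and the 0.008 path-length limit from the text. The use of Dijkstra's algorithm with a binary heap, the function names and the toy correlations are assumptions for illustration; the slides do not say which shortest-path implementation the authors used.

```python
import heapq

THRESHOLD = 0.6          # minimum absolute correlation for an edge
K = 6                    # powering factor k
MAX_PATH_LENGTH = 0.008  # quality limit on total path length


def build_graph(correlations, threshold=THRESHOLD, k=K):
    """Build an undirected weighted graph from {(a, b): correlation}."""
    graph = {}
    for (a, b), c in correlations.items():
        if c > threshold:
            length = (1.0 - c) ** k
            graph.setdefault(a, {})[b] = length
            graph.setdefault(b, {})[a] = length
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (path, length) or (None, inf)."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, length in graph.get(node, {}).items():
            nd = d + length
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


if __name__ == "__main__":
    # Hypothetical correlations: a and c are weakly correlated directly,
    # but strongly linked through the transitive gene b.
    toy = {("a", "b"): 0.9, ("b", "c"): 0.9, ("a", "c"): 0.65, ("c", "d"): 0.5}
    path, length = shortest_path(build_graph(toy), "a", "c")
    print(path, length, length < MAX_PATH_LENGTH)
```

With k = 6, two strong edges are far shorter than one weak edge, so the shortest path runs through the transitive gene instead of taking the direct link.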
Predicting the Functions of Unknown Genes.
We use the SP method to classify previously unannotated yeast genes by adding the
3,255 ORFs unknown to SGD into the graphs of known genes in the mitochondrial,
cytoplasmic, and nuclear compartments.
As before, an edge is constructed between two genes if their absolute expression correlation
is higher than 0.6.
For all pairs of known genes, we determine the SPs connecting them. For the purpose of
functional prediction, we would like to assign a putative function that is as specific as
possible to the gene. Given all known genes on a SP, we achieve this by tracing back their
annotations along the GO process tree and finding their lowest common ancestor.
If the lowest ancestral node is at least 4 levels below the root of the GO tree, that is, it
defines a sufficiently specific gene function, we then assign this function to the unknown
genes on the SP.
Analogous to the L0 and L1 matches, here the L0 prediction then corresponds to the lowest
common ancestor, and the L1 prediction to its direct parent. In this way, the function
represented by the lowest common ancestor can be more specific than that defined by the
informative nodes.
For each predicted gene function, we provide both the number of support SPs from which
the prediction was derived and the number of unique known genes on those support SPs
(support genes). The more support genes there are, the more confidence we have in the
corresponding prediction.
Note that a gene can be assigned putative functions in multiple graphs, because many
genes are known to function in multiple cellular compartments.
Under two circumstances an unknown gene may be assigned multiple functions: (i)
Because known genes on a SP may each have multiple functions, they may share several
lowest common ancestors in the GO tree. (ii) An unknown gene may reside in different SPs
with different lowest common ancestors.
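The prediction rule can be sketched as a lowest-common-ancestor lookup followed by a depth check. This is a simplified illustration: it treats the ontology as a tree with one parent per term, gives each known gene a single annotation, and uses a hand-made PARENT table whose term names are assumptions. The helper names are hypothetical; only the 4-level depth requirement comes from the text.

```python
# Toy tree: term -> single parent (None marks the root). Illustrative only.
PARENT = {
    "biological process": None,
    "metabolism": "biological process",
    "carbohydrate metabolism": "metabolism",
    "monosaccharide metabolism": "carbohydrate metabolism",
    "hexose metabolism": "monosaccharide metabolism",
    "hexose biosynthesis": "hexose metabolism",
    "hexose catabolism": "hexose metabolism",
    "pentose metabolism": "monosaccharide metabolism",
}
MIN_DEPTH = 4  # the ancestor must be at least 4 levels below the root


def path_to_root(term, parent=PARENT):
    """List of terms from `term` up to the root, starting with `term`."""
    path = [term]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def depth(term, parent=PARENT):
    """Number of levels between `term` and the root."""
    return len(path_to_root(term, parent)) - 1


def lowest_common_ancestor(terms, parent=PARENT):
    """Deepest term that lies on every term's path to the root."""
    paths = [path_to_root(t, parent) for t in terms]
    common = set(paths[0]).intersection(*paths[1:])
    for node in paths[0]:  # walk upward; the first shared node is the lowest
        if node in common:
            return node


def predict(known_terms, parent=PARENT, min_depth=MIN_DEPTH):
    """Return (L0 prediction, L1 prediction) or None if too unspecific."""
    lca = lowest_common_ancestor(known_terms, parent)
    if depth(lca, parent) >= min_depth:
        return lca, parent[lca]
    return None


if __name__ == "__main__":
    print(predict(["hexose biosynthesis", "hexose catabolism"]))
    print(predict(["hexose biosynthesis", "pentose metabolism"]))
```

In the second call the common ancestor sits only three levels below the root, so no function is assigned to unknown genes on that path.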