Gene Annotation and Network Inference by
Phylogenetic Profiling
Jie Wu1, Zhenjun Hu2, Charles DeLisi1,2
1 Department of Biomedical Engineering,
2 Bioinformatics Graduate Program,
Boston University,
44 Cummington St., Boston, MA, 02215, USA
Wu J: jiewu@bu.edu
Hu Z: zjhu@bu.edu
DeLisi C: delisi@bu.edu
Correspondence: Charles DeLisi. E-mail: delisi@bu.edu
Abstract
Background: Phylogenetic analysis is emerging as one of the most informative
computational methods for gene annotation, and identification of evolutionary and
functional modules. The effectiveness with which phylogenetic information can be
utilized depends in part on the availability of both an appropriate measure of
correlation between binary strings of arbitrary length and composition, and an
effective high performance method to use the correlate. Current phylogenetically based methods, though useful, perform at a level well below what is possible. Beyond
that, informative and standardized measures of performance are not in use, making
comparisons between different methods difficult; those that have been used convey
overly optimistic assessments of performance.
Results: We introduce a rigorous and general measure of correlation, a new category
enrichment based method for using it to assign genes to functional categories at
various levels of resolution, and a complete and rigorous performance assessment,
enabling full assessment of yet to be developed methods against our results. In a
systematic allocation test, more than 56% of the 1368 (766) KEGG annotated
orthologs were correctly assigned to one or more of the 3 most likely pathways. Of
the 602 putative false positives, 467 are annotated in the Gene Ontology (GO), and
280 of those share a GO category with the pathway genes at level 5 or higher. The
method is therefore estimated to correctly allocate between 1634 and 2404 of the
2918 previously unannotated genes to one or more functional categories, at a
relatively high level of resolution. A general comparison using mutual information as
a correlate, and standard guilt by association to draw inferences indicates that the
method used here has a positive predictive value nearly eightfold higher at
comparable coverage. The method also assigns accuracies to annotations of individual
genes, rather than just population averaged reliabilities.
Conclusions: The method serves as a general computational tool for annotating large
numbers of unknown genes, and uncovering evolutionary and functional modules. It
appears to perform substantially better than extant stand-alone high throughput
methods.
Background
One of the remarkable characteristics of the genomic era is that the solution to the
challenge of annotation posed by the rapid increase in sequence, comes in part from
the data itself; i.e. the availability of a large number of fully sequenced genomes
provides information that enables the development of new computational approaches
including domain fusion [1-3], chromosomal proximity [4] and phylogenetic profiling
[5-8].
Phylogenetic profiling, in its original form, was used to infer the function of a gene by
finding another gene of known function with an identical pattern of presence and
absence across a set of phylogenetically distributed genomes. Such restricted profiling,
requiring full profile identity, while accurate, has low coverage, assigning pathways to
114 of 1814 unknown orthologous proteins from 44 genomes [9], with an estimated
accuracy in the vicinity of 90%. The restriction can be relaxed in a number of ways,
using a Pearson correlation, mutual information [5, 9], or mathematically exact
statistical significance assignment. In a previous paper [9] we examined each of these
methods, and settled on the last of them as a convenient and generally valid measure.
Briefly, the phylogenetic profile of a gene is a binary string recording the presence (1)
or absence (0) of a gene across a suitable set of genomes. If the correlation between
the profiles of two genes, X and Y , is much greater than would be expected by
chance, then they are assumed to be functionally related. Let N be the number of
genomes over which the profiles are defined, with gene X occurring in x genomes,
Y occurring in y genomes, and both occurring in z genomes. Then P( z | N , x, y ) , the
probability of observing z co-occurrences purely by chance, given N, x and y, is [9]
\[
P(z \mid N, x, y) = \frac{\binom{N-x}{y-z}\binom{x}{z}}{\binom{N}{y}} = \frac{(N-x)!\,(N-y)!\,x!\,y!}{(N+z-x-y)!\,(x-z)!\,(y-z)!\,z!\,N!} \tag{1}
\]
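As an illustration, equation (1) can be evaluated directly from two profile strings. The following is a minimal Python sketch, not the authors' implementation; the function names and the two six-genome example profiles are hypothetical.

```python
from math import comb


def cooccurrence_counts(profile_x, profile_y):
    """Return (N, x, y, z) for two equal-length binary profile strings."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must be defined over the same genomes")
    n = len(profile_x)
    x = profile_x.count("1")
    y = profile_y.count("1")
    # z counts the genomes in which both genes are present
    z = sum(a == "1" and b == "1" for a, b in zip(profile_x, profile_y))
    return n, x, y, z


def chance_probability(n, x, y, z):
    """P(z | N, x, y) of equation (1): a hypergeometric point probability."""
    if z < max(0, x + y - n) or z > min(x, y):
        return 0.0  # impossible number of co-occurrences
    return comb(n - x, y - z) * comb(x, z) / comb(n, y)


# Hypothetical profiles over six genomes
n, x, y, z = cooccurrence_counts("110100", "110010")
p = chance_probability(n, x, y, z)
```

Summing `chance_probability` over all admissible z for fixed N, x and y gives 1, which is a convenient sanity check on any implementation.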
The connection between equation (1) and the more readily calculated mutual
information, MI ( X , Y ) , of the profile pair, is easily if tediously established. In
particular for a given profile pair, define p(i, j ) as the joint probability of the doublet
(i, j ) where each variable can be 0 or 1, so that p (1,1) is the fraction of genomes in
which both genes are present, p(1, 0) is the fraction in which X is present and Y is
absent, etc. Then the relation between equation (1) and the mutual information
\[
MI(X, Y) = \sum_{i=0}^{1} \sum_{j=0}^{1} p_{ij}(X, Y) \log \frac{p_{ij}(X \mid Y)}{p_i(X)} \tag{2}
\]
is [10]
\[
P(z \mid N, x, y) \approx 2^{-N \cdot MI(X, Y)} \quad \text{or} \quad MI(X, Y) = -\lim_{N \to \infty} \frac{1}{N} \log_2 P(z \mid N, x, y) \tag{3a}
\]
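The asymptotic relation between the chance probability and the mutual information can be checked numerically. The sketch below is illustrative only: it computes MI in bits from the 2x2 presence/absence table and compares it with the negative base-2 logarithm of P divided by N. The function names are hypothetical, and the counts are assumed to be admissible (so that every binomial coefficient is at least 1).

```python
from math import comb, log2


def mutual_information(n, x, y, z):
    """MI in bits from the 2x2 table of presence/absence counts."""
    joint = {(1, 1): z, (1, 0): x - z, (0, 1): y - z, (0, 0): n - x - y + z}
    marg_x = {1: x, 0: n - x}
    marg_y = {1: y, 0: n - y}
    mi = 0.0
    for (i, j), count in joint.items():
        if count > 0:  # empty cells contribute nothing
            mi += (count / n) * log2(count * n / (marg_x[i] * marg_y[j]))
    return mi


def neg_log2_p_per_genome(n, x, y, z):
    """-(1/N) log2 P(z | N, x, y), using integer logs to avoid underflow."""
    log_p = log2(comb(n - x, y - z)) + log2(comb(x, z)) - log2(comb(n, y))
    return -log_p / n
```

For perfectly matched profiles with N = 1000 and x = y = z = 500, the mutual information is 1 bit and the second quantity lies slightly below 1; the gap shrinks as N grows.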
In this paper we therefore define a new and fully general measure of correlation
between two binary strings
\[
C(z \mid N, x, y) = -\frac{1}{N} \log_2 P(z \mid N, x, y), \qquad 0 \le C \le 1 \tag{3b}
\]
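The definition of C also gives a direct conversion between a correlation threshold C* and the corresponding chance probability p*, namely p* = 2^(-N C*). The following Python sketch assumes N = 66, the number of genomes reported in Materials and Methods; the function names are illustrative.

```python
from math import log2

N_GENOMES = 66  # number of genomes in the dataset (Materials and Methods)


def correlation_from_p(p, n=N_GENOMES):
    """C = -(1/N) log2 P, the correlation measure of eq 3b."""
    return -log2(p) / n


def threshold_p_value(c_star, n=N_GENOMES):
    """Chance probability p* corresponding to a correlation threshold C*."""
    return 2.0 ** (-n * c_star)


p_star = threshold_p_value(0.35)
```

With N = 66, a threshold C* = 0.35 corresponds to p* of roughly 10^-7, in line with the threshold pairs quoted in the text.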
The simplest way to predictively assign genes to pathways is to use a correlation
threshold, assigning an unannotated gene to a pathway of an annotated gene if the
correlation between their profiles exceeds C*, the threshold value of C. This is
standard guilt by association (SGA) [11], and assessment of its efficacy often looks
promising. For example, as indicated in Results and Discussion, a correlation
threshold C* = 0.35 (p* ≈ 10^-7) links 1025 of the 2,918 unannotated orthologs to at
least one pathway annotated gene, and 80% (820) are estimated to be correctly linked
at least once. As we indicate below, however, such assessment criteria convey an
overly optimistic picture of performance.
When an unannotated gene is linked to a gene that is in more than one pathway, the
unannotated gene is assigned to all pathways of the annotated gene; i.e. there is
nothing in the allocation criterion that favors one pathway over the other. Thus, as
shown in section 3, standard guilt by association using the usual 2-parameter profile
correlates (Pearson correlation coefficient or mutual information [9]) has either low
positive predictive value (defined as the fraction of assignments that are correct) if
coverage is high, or low coverage if positive predictive value is high.
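The SGA rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the correlations between the target and other genes have already been computed; the function name, the dictionary layout, and the gene and pathway labels are hypothetical.

```python
def sga_assign(target_links, annotations, c_star):
    """Assign the target to every pathway of every linked annotated gene.

    target_links: dict mapping gene -> correlation C with the target
    annotations:  dict mapping annotated gene -> set of pathways
    """
    assigned = set()
    for gene, score in target_links.items():
        # a link is kept when the correlation meets the threshold C*
        if score >= c_star and gene in annotations:
            assigned |= annotations[gene]
    return assigned


links = {"g1": 0.40, "g2": 0.20, "g3": 0.60}
pathways = {"g1": {"A", "B"}, "g2": {"C"}, "g3": {"B", "D"}}
result = sga_assign(links, pathways, c_star=0.35)
```

Because every pathway of every linked gene is inherited, the number of assignments grows quickly as the threshold is lowered, which is the source of the low positive predictive value at high coverage.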
The assumption underlying correlated enrichment (CE) is that the greater the number
and strength of signals (magnitude of correlation) from a category, the more likely the
target is to belong to the category. In particular we rank pathways using an optimized
non-linear function of total enrichment score, with eq 1 used to compute correlation.
One of the tasks the field faces is to develop standardized assessments that will allow
comparison between different methods. At present different authors use different
measures of performance; performance is often assessed incompletely; the same
measure is sometimes defined in different ways, and performance as a function of
coverage is more often than not unavailable. In this paper we evaluate a complete set
of performance measures and their response characteristics as coverage is varied,
against three different ontologies. This should facilitate comparison of future
methods against our results.
In Materials and Methods we develop relations for each measure of performance as an
explicit function of the variables that enter the two methods of interest (SGA and CE).
In Results and Discussion we evaluate the performance of each method as a function
of coverage, on KEGG, GO and COG. Evaluation on KEGG is the most demanding,
pathways generally corresponding to the deepest (highest resolution) levels of the GO
ontology. Thus CE correctly assigns more than 56% of the 1368 (766) KEGG
annotated orthologs to one or more of the 3 most likely pathways, whereas of the 602
putative false positives, 280 out of the 467 that are annotated in GO, share a GO
category with the pathway genes at level 5 or higher. The method is therefore
estimated to correctly allocate between 1634 and 2404 of the 2918 previously
unannotated genes to one or more functional categories.
A more general test of the method against GO categories as a function of depth
indicates a reasonably high degree of coverage (~49%) with ~82% accuracy, nearly
an eight-fold increase over previous methods, at relatively high specificity (depths 6-8).
For the COG ontology, a gene is assigned only to the category in which the
enrichment score is highest. More specifically we assigned all 1846 unannotated
genes with an estimate, for each gene, of the probability that the assignment is correct,
based on the magnitude of the enrichment score. Finally, we identify several dozen
cliques or quasi cliques, some only partially annotated, placing unannotated genes in
evolutionarily conserved functional modules with very high probability.
Results and Discussion
Performance of SGA: Allocation of Orthologs to KEGG Pathways
We showed previously that restricted profiling; i.e. making functional assignments
only when profiles are identical, has very low coverage, albeit at very high reliability.
Not surprisingly, a more detailed assessment reported in Table 1, using an expanded
set of genomes, reinforces these results. Although all measures of reliability are very
high, the amount of information is very low, with only 5.4% of pathway unannotated
orthologs assignable.
Relaxing the requirement for an exact match increases coverage and the expected
number of correct predictions, but reduces other performance measures. The usual
measure of accuracy, the fraction of genes correctly assigned to at least one pathway
(i.e., the number of correctly assigned genes divided by the number of assignable
genes, A0 ), suggests a strikingly optimistic conclusion; viz, although performance
varies with coverage, it remains high at all levels of coverage (Figure 1). Thus we find,
for example, that at C* = 0.2 (p* ≈ 10^-4), 1136 (Nc) of the 1273 genes (G) that are
linked to one or more other genes share at least one pathway with one of those genes.
However, high coverage is achieved at low thresholds, and genes that pair at low
thresholds generally have many links; only a small fraction of such pairs are between
genes in the same pathway. For example, on average a qualifying gene at C* = 0.2 is
linked to 62.81 pathways, and the number of these assignments that are correct is
2.45. Consequently, the total true positives divided by the total number of
assignments, i.e. Positive Predictive Value (PPV), is very low, 6% for this example.
Nevertheless profile similarity corresponding to a chance occurrence of 10^-4 generally
implies selective pressure on the correlated genes, even if their functional coupling is
not as strong as those within the same pathway. In particular of the 48,691 links that
are putative false positives in KEGG, 34,024 are annotated at depth 4 in GO. Of these,
46% share the same GO category. The probability that such concordance occurs by
chance is negligible.
More generally PPV and specificity (TN/(TN + FP)) decrease monotonically as
coverage increases (Figure 1). Such behavior is expected intuitively since increased
coverage requires decreased threshold stringency, which in turn means that the
number of false positives increases. However, although PPV drops to zero relatively
rapidly as coverage increases to 1, specificity remains high over a wide range of
coverages, and sensitivity is high over all coverages. The specificity-sensitivity curves
cross at a remarkably high coverage, just under 80%, where their values are just over
80%.
Somewhat less intuitive is that sensitivity (TP/(TP + FN)) is not a monotonic function
of coverage, but reaches a minimum of around 70% (at approximately 60% coverage)
and then rises again toward 1 as coverage continues to increase. The behavior can be
understood by recognizing that in the high threshold limit, the number of assignments
decreases while the fraction that is correctly assigned increases. Since sensitivity is
the fraction of correctly predicted pathways (eq 4a) it increases in this limit. On the
other hand in the low threshold limit all genes are linked and all correct pathways are
recovered. The minimum, found at moderate thresholds, reflects the tradeoff between
link number and link reliability.
A0, the fraction of genes allocated correctly to at least one pathway, is expected to
have the same behavior as sensitivity, as indicated in the discussion following eq 4 in
Materials and Methods. The consistently high specificity with increasing coverage,
even as PPV drops, reflects a constraining relation between TP, FP and TN; viz,
TP ≪ FP ≪ TN.
Performance of Correlated Enrichment: Allocation of Orthologs to KEGG
Pathways
When gene assignment is restricted to three categories (GO or KEGG) having the
largest signal, as described in Materials and Methods, the total number of assignments
decreases markedly relative to those made by SGA, as does the number of false
positives, while the number of true positives changes little. PPV is markedly
increased at high coverage, but is essentially unchanged relative to SGA for coverage
below 30% (Figure 2).
PPV estimates must be viewed as conservative: assignment to an unannotated
pathway could mean that the presence of the gene in that path has not yet been
discovered. That many of these putative false positives are in fact functionally related
to the target is seen by searching the GO ontology. In particular we find that of the
467 COGs that are allocated to KEGG pathways in which they are currently not
annotated, more than 60% share at least one GO category with the pathway genes at a
depth of 5 or greater, and more than 70% at a depth of 4 or greater.
One way to assess the methods as a function of both coverage and PPV is to estimate
the entropy change or reduction in uncertainty in the unannotated gene population. An
entropy measure tends to give coverage a high weight, since even low PPV resolves
some uncertainty, whereas when no prediction can be made about a gene, no
uncertainty is resolved (Table 2). While identical profiling gives the
best average positive predictive value per gene (AC = 1.70/2.26), CE resolves the
greatest amount of uncertainty [12]. Different thresholds indicate a tradeoff between
accuracy and coverage. For SGA, allocation at C* = 0.1 gives the largest decrease in
entropy, while for CE the largest decrease occurs at C* = 0. It should be noted that at C* = 0, CE takes account of
pairwise correlations between the target and all genes in a category; i.e. although there
is no cutoff, profile pair correlations still enter strongly into the decision. The low
optimal cutoff values are a consequence of the definition of entropy reduction which
gives high weight to coverage. It is evident that the total information gain may not
always be the criterion of choice, depending on the question of interest to the user.
Validation and Predictions for Model Organisms using the GO
Evaluations based on orthologs are in a sense reference evaluations, not precisely
applicable to any particular genome. In order to discuss performance in a specific
instance, and to better compare different methods, we have evaluated A0 and
coverage for E. coli against GO. Figure 3 plots results obtained using correlation
enrichment. At all correlations A0 as a function of GO depth drops to a minimum and
then increases, reflecting a tradeoff between the rates at which number of correctly
assigned genes and the number of assignable genes change with depth.
Coverage is relatively invariant across GO depths. For a correlation threshold of 0.55
(p* ≈ 10^-11), A0 is 78% at depth 7, where the coverage is 48%, and 81% at the most
specific depth (level 10), where the coverage is 45%. When C* = 0.9
(p* ≈ 10^-18), A0 is above 90% for all annotation specificity levels, and comparable to
that obtained with identical profiles, while the minimum A0 (at level 7) gives
coverage of 15% (185/1225).
Some perspective on performance can be obtained by comparison, to the extent
possible, with other high performance computational and experimental methods. In
particular, McDermott and Samudrala [13] estimated A0 using, in essence, SGA with
mutual information [5, 14]. They found that A0 as a function of GO depth varies from
98% at the least specific level (3) where coverage is 50%, to 10% at the most specific
level (10), the latter being about eightfold lower than the result using CE at C* = 0.55
at comparable coverage. With respect to experimental high throughput methods, Date
and Marcotte (2003) find that SGA using MI has accuracy comparable to that of
yeast two hybrid screens.
Intergenomic Correlations
The above assessments, and those commonly made of phylogenetic profiling, usually
leave aside discussion of a fundamental limitation; viz, the confounding influence of
correlations between genomes, as opposed to correlations between genes. While we
do not report a complete study of the effect of inter-genome correlations, we
estimated its potential influence by collapsing those genomes that are
phylogenetically close, essentially assuming that all correlations between gene pairs
that are present within a group of related genomes are the result of genome correlation
rather than gene correlation. We find that, for this conservative model, genome
correlations have only a small effect on the performance of the method given a
reasonable number of lineages. In fact, when 66 genomes are collapsed to as few as
32 lineages based on their phylogenetic distances, the corresponding change in PPV
for the same coverage is always less than 1%.
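One way to picture the collapsing step is as a reduction of each profile from one character per genome to one character per lineage. The text does not state how presence is defined for a collapsed group, so the Python sketch below assumes a simple rule: a gene is marked present in a lineage if it occurs in any member genome. The function name and the grouping are hypothetical.

```python
def collapse_profile(profile, lineages):
    """Collapse a binary profile string to one character per lineage.

    lineages is a list of lists of genome (column) indices. A gene is
    marked present in a lineage if it is present in any member genome
    (an assumed rule, not taken from the original analysis).
    """
    return "".join(
        "1" if any(profile[g] == "1" for g in members) else "0"
        for members in lineages
    )


# Six genomes grouped into three hypothetical lineages
collapsed = collapse_profile("110010", [[0, 1], [2, 3], [4, 5]])
```

The collapsed profiles can then be passed through the same correlation and allocation steps, with N equal to the number of lineages.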
Functional and Evolutionary Modules
Annotating the COG Ontology. COG functional categories provide a low resolution,
but fully resolved annotation. Because the COG ontology is a one to one gene:
functional category map, performance assessment is relatively direct; in particular,
PPV = A0 (eq 17).
The number of CE annotated genes by category falls more or less linearly, and is
generally concordant with the distribution of previously annotated genes among the
categories; the latter having translation (J), amino acid metabolism (E) and energy
production and conversion (C) as the largest categories, and signal transduction (T)
and cell division (D) as the smallest. Such mimicry is inherent in the CE method.
The full set of 4,826 genes, annotated and unannotated, cluster into groups of various
sizes (figure 4). An all against all profiling at a threshold of C * = 0.55, returns a 926
gene network fragment, which was processed using the VisANT visual mining tool
([15], visant.bu.edu) to produce figures 5 and 6. The network contains 677 annotated
orthologs, 463 of which are assigned correctly and unambiguously to their COG
category. In addition, (926 - 677 =) 249 previously unannotated orthologs are
assigned explicitly to one or another category. Sets of genes assigned to the same
COG functional categories are grouped together into meta-nodes (Figure 5B), each
containing genes that are classifiable as (Figure 5C) true or false positives (for
annotated genes), or predictions (for unannotated genes). In particular, of the 82 genes
allocated to category H (coenzyme metabolism), 62 of 74 are annotated in category H
(PPV = 0.84), and eight others are predictions. Since the PPV is high, the predictions
are likely to be correct. All the predictions using COG functional categories can be
web accessed [16].
A more detailed version of the TP set (Figure 5) reveals two strikingly dense
clusters—one with 7 orthologs, the other with 11. All the elements of the latter
participate in Porphyrin and chlorophyll metabolism pathway (00860). The cluster is a
highly interdependent functional module and it is also strikingly conserved as
demonstrated by its aligned profiles (Figure 6). The elements in the seven member
cluster are not annotated in KEGG. However, four of them are annotated in GO and
they all share GO category 0006777: molybdopterin cofactor biosynthesis, at depth 8.
It therefore appears likely that the remaining 3 COGs are important components of
molybdopterin cofactor biosynthesis in one or more genomes. These results indicate
the power of CE to uncover evolutionarily conserved highly specific functional
modules, and to reliably assign previously unannotated genes to these modules.
We have already commented on the problem of comparative performance assessment
resulting from disparate definitions in the literature. A specific illustration is shown in
Figure 7 which compares two different performance measures, PPV, and a measure of
accuracy used by [17]. It is evident that PPV provides a much less impressive picture
of SGA, showing almost a 2-fold difference in accuracy over a range in coverage.
Cliques, clusters and inference quality
The above results are obtained by assigning genes to categories based on CE or SGA,
and then obtaining clusters by displaying links between intra-category genes that meet
the threshold condition. This method is useful if the primary goal is to annotate genes,
but it is not efficient if the primary goal is to find sets of tightly linked clusters. The
simplest procedure for finding clusters is to set a threshold, discard all genes that
don’t meet it, and connect those whose profiles meet or exceed it. We examine cliques
(fully connected clusters) and quasi-cliques [18] in our functional linkage network at
different C* and estimate their predictive power.
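The thresholding and clique search can be illustrated with a short Python sketch. The text does not say which clique-finding algorithm was used, so the basic Bron-Kerbosch recursion below is an assumption chosen for brevity; it enumerates cliques only (not quasi-cliques), and the gene labels and scores are hypothetical.

```python
def threshold_graph(pair_scores, c_star):
    """Keep only gene pairs whose correlation meets or exceeds C*."""
    adjacency = {}
    for (a, b), score in pair_scores.items():
        if score >= c_star:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended further
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return cliques


scores = {("a", "b"): 0.90, ("a", "c"): 0.80, ("b", "c"): 0.75,
          ("c", "d"): 0.72, ("d", "e"): 0.30}
found = sorted(maximal_cliques(threshold_graph(scores, c_star=0.71)))
```

Genes with no link at the chosen threshold never enter the adjacency map, which mirrors the step of discarding genes that do not meet the threshold.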
As the threshold decreases from C * = 0.9, the number of clusters containing more than
3 nodes increases, peaking at C* = 0.66 (p* ≈ 10^-13) and then declining as the nodes
coalesce into increasingly larger clusters (Figure 8). At C* = 0.9, four clusters are
cliques, including a 9-component flagella assembly pathway (Figure 9a). There are
nine cliques and quasi-cliques at C* = 0.81 (p* ≈ 10^-16) and twenty (ranging from size
4 to 19) at C* = 0.71 (p* ≈ 10^-14). Of the nine fully determined cliques (functions of
all genes known) ranging in size from 4 to 10, 8 are homogeneous in COG function,
and 5 are homogeneous in both KEGG and COG functions.
Clusters of unannotated orthologs begin to appear at p* ≈ 10^-16. Of the 9 cliques or quasi
cliques with unannotated orthologs that are uncovered at C * = 0.71, one has 12 genes,
with 4 unannotated, and 8 annotated in COG category M (membrane biosynthesis)
and KEGG pathway lipopolysaccharide biosynthesis.
Figure 9 shows some examples. The fully annotated nine node (homogeneous in the
flagella assembly pathway), and six-node cliques (homogeneous in the histidine
metabolism pathway) clearly qualify as both evolutionary and functional modules.
The other three cliques (c) - (e), obtained at C * = 0.71 have both annotated and
unannotated nodes. They are evolutionary and very likely also functional. The 12-node
module (c) has eight of its members in the KEGG lipopolysaccharide
metabolism pathway. Since it is fully connected, with all linkage strengths equal to or
greater than C* = 0.71, enrichment for the lipopolysaccharide metabolism pathway is
very strong, and each of the unknown COGs is almost certainly associated with that
function. A very weak enrichment-based lower bound on PPV is 0.78.
Similar remarks hold for the six node clique, which has four genes implicated in the
aminosugars metabolism pathway. The smallest clique has two of its 4 genes
annotated in different pathways—ubiquinone biosynthesis and oxidative
phosphorylation. It therefore appears plausible that the two unannotated COGs are
likely to be mitochondria related, participating in the latter part of the respiratory
chain during ubiquinone biosynthesis. They are also unambiguously assigned to COG
category P, suggesting a possible role in transport across the mitochondrial
membrane.
Pressure to Reduce Pleiotropy
The premise of profiling is that functionally related pairs tend to be conserved. The
tighter the relation is (e.g. the deeper the level of GO depth), the greater the
correlation between profiles. As functional tightening increases, however, there is
pressure not just to co-evolve, but to reduce the pleiotropy of the pair; i.e. the function
of each gene tends to become increasingly specialized and restricted to the functions
(e.g. pathways) of its partner (figure 10 upper curve). Such pressure reduces
informational entropy, just as physical pressure reduces thermodynamic entropy, all
other variables held constant. Hence we see convergence between functional
specificity reflected in a reduction in the average number of paths per linked gene
(upper curve) and the increase in the number of correctly predicted paths per gene
(lower curve). Such tendency toward increased specialization is perhaps most
pronounced for biosynthetic pathways ([19]) as would be expected of processes that
are strongly interdependent and forced to co-evolve, even in the presence of adaptive
change in more loosely associated genes.
Materials and Methods
The dataset
We adhere to the conventions of the COG database
(http://www.ncbi.nlm.nih.gov/COG/) ([20, 21]) and construct profiles only for genes
that occur in at least three lineages. All paralogs are collapsed; i.e. a set of closely
related genes in a given lineage is treated as a single entity. The analysis was
performed for 4,873 clusters of orthologs (COGs) from 66 fully sequenced microbial
genomes in the three domains of life. Accuracy is evaluated against the Kyoto
Encyclopedia of Genes and Genomes (http://www.genome.ad.jp/kegg/kegg2.html,
[22]); the Gene Ontology Consortium (2000) (http://www.geneontology.org/ [23]); 23
COG broad functional categories; annotations for 6059 Saccharomyces cerevisiae
ORFs (SGD http://www.yeastgenome.org/) and 4410 E. coli K12 ORFs (EcoCyc
http://ecocyc.org/). Of the 206 biochemical pathways in KEGG, we used 133 (mostly
metabolic pathways); in particular those that are generic. These pathways contain a
total of 1,368 orthologs. Sets of 20 or more COGs that have the same phylogenetic
profile were eliminated from the dataset.
Pathway assignments
Definitions
Eq 3b can be used to assign unannotated genes in a number of ways. Here we explore
two relatively simple criteria and rigorously evaluate the reliability of both by
quantitative assessments of the full contingency matrix as a function of coverage. In
particular, for each target gene we assess true positives (TP), i.e. the number of
categories (KEGG pathways; GO categories, or COG ontology categories) correctly
assigned; false positives (FP), the number of categories incorrectly assigned; true
negatives (TN), the number categories to which the target is not assigned, and in
which it is not annotated; false negatives (FN) the number of categories to which the
target is not assigned and in which it is annotated. The sensitivity (SEN), specificity
(SPE), accuracy (ACC) and positive predictive value (PPV) are functions of these four
quantities
\begin{align}
SEN &= \frac{TP}{TP + FN} \tag{4a} \\
SPE &= \frac{TN}{TN + FP} \tag{4b} \\
ACC &= \frac{TP + TN}{TP + TN + FP + FN} \tag{4c} \\
PPV &= \frac{TP}{TP + FP} \tag{4d}
\end{align}
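For concreteness, the four measures of eq 4 can be computed from the contingency counts of a single target gene as follows. This is a minimal Python sketch; the function name and the example counts (chosen to sum to the 133 pathways used here) are hypothetical.

```python
def performance(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and PPV as defined in eq 4."""

    def ratio(num, den):
        return num / den if den else float("nan")  # undefined when den = 0

    return {
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "PPV": ratio(tp, tp + fp),
    }


# Hypothetical target: in 3 pathways, assigned to 3, with 2 in common
measures = performance(tp=2, fp=1, tn=129, fn=1)
```

In this example specificity and accuracy are both close to 1 and close to each other, because true negatives dominate the contingency table, while sensitivity and PPV are both 2/3.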
Related metrics
SPE-ACC. For category allocation, specificity and accuracy will be quantitatively very
similar; i.e. true negatives will invariably be much greater than true positives and false
negatives, owing to the fact that the vast majority of genes are in a small fraction of all
pathways. Consequently we expect SPE ≈ ACC.
SEN-A0. Whereas SPE and ACC are quantitatively similar, SEN and the fraction of
genes that are correctly allocated at least once are qualitatively similar, as explained
below. The similarity is strong enough so that they provide the same measure of
performance. Hence of the 5 measures, only three are independent. These are
traditionally taken as SEN, SPE and PPV. (A fourth measure, negative predictive
value, which adds little to the discussion, is omitted in the interest of brevity).
Performance is measured by their functional dependence on coverage. These
definitions are introduced in terms of a particular gene. Passing to population averaged
quantities is in principle direct, although in practice it involves some care because of
cross correlations between categories.
The final quantity of interest is coverage, defined as the fraction of genes (unannotated
or annotated) that can be linked to at least one annotated gene.
Positive Predictive Value
Each measure serves its own purpose. However, a quantity of natural interest for
predictions based on thresholding is the fraction of assignments that are correct; i.e.
the positive predictive value.
By definition, the population averaged positive predictive value is
\[
PPV = \frac{1}{N_a} \sum_{I=1}^{N_a} PPV_I \tag{5}
\]
where Na is the total number of annotated genes linked at C* and PPV_I, the positive
predictive value for target gene I, is given by eq 4d. Analogous equations hold for
the other measures of performance.
It is instructive to express the PPV as a product of two factors: the fraction of genes
that are correctly assigned to at least one functional category ( A0 ), and the average
fraction of those assignments that are correct ( AC ). Let Nc be the number of genes
that are assigned correctly to at least one functional category. Then
\[
A_0 = \frac{N_c}{N_a} \tag{6}
\]
\[
A_C = \frac{1}{N_c} \sum_{I=1}^{N_c} \frac{TP_I}{TP_I + FP_I} = \frac{1}{N_c} \sum_{I=1}^{N_c} PPV_I \tag{7}
\]
and the population averaged positive predictive value is
\[
PPV = A_0 A_C \tag{8}
\]
i.e. AC is the average positive predictive value of genes that are assigned correctly at
least once, and A0 is the fraction of annotated genes assigned correctly at least once.
Other measures of reliability cannot be similarly decomposed.
Although A0 is sometimes used as a measure of PPV (and sometimes referred to as
accuracy), in general, A0 is a very poor measure of PPV and provides an overly
optimistic assessment of performance.
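The decomposition of the population averaged PPV into A0 and AC can be verified numerically. The Python sketch below is illustrative; the function name and the per-gene counts are hypothetical, and every linked gene is assumed to receive at least one assignment.

```python
def decompose_ppv(per_gene_counts):
    """Return (PPV, A0, AC) from a list of (TP_I, FP_I) pairs.

    Each pair belongs to one annotated gene linked at the threshold.
    """
    n_a = len(per_gene_counts)
    ppv_i = [tp / (tp + fp) for tp, fp in per_gene_counts]
    # genes assigned correctly at least once have PPV_I > 0
    correct = [value for value in ppv_i if value > 0]
    n_c = len(correct)
    ppv = sum(ppv_i) / n_a                        # eq 5
    a0 = n_c / n_a                                # eq 6
    a_c = sum(correct) / n_c if n_c else 0.0      # eq 7
    return ppv, a0, a_c


ppv, a0, a_c = decompose_ppv([(1, 1), (2, 2), (0, 4), (1, 3)])
```

Here three of the four genes are assigned correctly at least once, so A0 = 0.75, yet the population averaged PPV is only 0.3125. The product of A0 and AC reproduces it because genes with no correct assignment contribute zero to the sum in eq 5.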
We apply and assess two decision criteria, one based on SGA, the other on CE.
Standard Guilt by Association
An unannotated target gene generally meets the threshold condition
\[
C(z \mid N, x, y) \ge C^*
\]
with multiple genes, and each associated gene typically participates in more than one
process. The target is of necessity assigned to all categories of the gene to which it is
linked.
In order to develop performance measures, let i be the number of categories that
contain the target, I, whose function is to be predicted; let j(I, J) be the number of
categories that contain a gene J whose profile correlation with I meets the threshold
C*, and let k(I, J) denote the number of common categories, where
0 ≤ k(I, J) ≤ min(i, j). The target gene is therefore correctly assigned to TP = k
categories, and incorrectly assigned to the remaining FP = j - k categories. Also TN =
T - i - j + k and FN = i - k, where T = 133 is the total number of pathways.
Consequently, the PPV_I(J) with which gene I is assigned using linked gene J is
\[
PPV_I(J) = \frac{k}{j} \tag{9}
\]
Note that the maximum PPV_I(J) is not necessarily 1, but min(i, j)/j.
For j > i, PPV_I(J) < 1, whereas when i > j, PPV_I(J) can become 1 when the pathways of
J are a subset of those of I. The positive predictive value for gene I is obtained by
taking sums over all genes to which it is correlated.
\[
PPV_I = \frac{\bigcup_{J=1}^{N_c(I)} k(I, J)}{\bigcup_{J=1}^{G(I)} j(I, J)} = \frac{\bigcup_{J=1}^{G(I)} k(I, J)}{\bigcup_{J=1}^{G(I)} j(I, J)} \tag{10}
\]
where $G(I)$ is the number of genes correlated with gene $I$ and $N_c(I)$ is the number of
those genes that share at least one category with gene $I$. Here the union symbol is used
instead of a sum to indicate that double counting is avoided when a category has more
than a single gene linked to the target gene. $A_c$ is given by substituting eq 10 into eq 7.
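The effect of the union can be sketched as follows. This is one reading of eq 10, in which each category assigned to the target is counted once however many linked genes carry it; the function names and category labels are hypothetical, and the plain-sum variant is included only for contrast.

```python
def sga_gene_ppv(target_cats, linked_cats_per_gene):
    """Union-based PPV for one target gene (one reading of eq 10)."""
    target = set(target_cats)
    assigned = set().union(*linked_cats_per_gene)  # each category counted once
    correct = assigned & target
    return len(correct) / len(assigned) if assigned else float("nan")


def summed_ppv(target_cats, linked_cats_per_gene):
    """Plain-sum alternative, shown only to illustrate double counting."""
    target = set(target_cats)
    k_total = sum(len(target & set(c)) for c in linked_cats_per_gene)
    j_total = sum(len(set(c)) for c in linked_cats_per_gene)
    return k_total / j_total if j_total else float("nan")


# Three linked genes; category "a" is shared with the target by all of them.
linked = [{"a", "c"}, {"a", "c"}, {"a", "d"}]
union_value = sga_gene_ppv({"a", "b"}, linked)
sum_value = summed_ppv({"a", "b"}, linked)
print(union_value, sum_value)
```

In this toy case the union gives 1 correct category out of 3 assigned, whereas summing counts the shared category three times and gives 3/6.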
Correlation Enrichment
Suppose that the target gene $I$ is correlated with $g_I$ other genes ($C \ge C^*$), and
let $m_1, m_2, \ldots, m_r$ be the number of correlated genes in categories $C_1, C_2, \ldots, C_r$,
where $r = g_I$ holds only when each gene is in one category.
Further, let $C'_1, C'_2, \ldots, C'_{T_I}$ denote the $T_I$ categories the target gene is in. For each
of the $r$ categories that have one or more genes meeting the correlation threshold
with $I$, define a weighted sum score, $s_v$:
$$s_v = \sum_{j=1}^{m_v} \left[ -\log P_{I, g_j} \right]^{\alpha}, \qquad v = 1, \ldots, r \qquad (11)$$
The exponent $\alpha$ is a positive integer. Thus a linked pathway is weighted by a combination of
the number of genes in the pathway that exceed the threshold and the similarity of
those genes to the one being tested. Unweighted ranking, in which only the number of
genes is used, is the special case $\alpha = 0$. Tests using different values of $\alpha$ indicate that
$\alpha = 4$ is optimal. $P_{I, g_j}$ is calculated from equation (1) using the profile of gene $I$ and
that of gene $g_j$ in category $C_v$. The category scores $s_v$ are ranked in descending
order and the target gene is allocated to the top $r_0$ categories. The number of true
positives is the size of the intersection between the $T_I$ categories the target is in and these
$r_0$ categories. Then FP = $r_0$ - TP, FN = $T_I$ - TP, and TN = $T - r_0 - T_I$ + TP.
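$$PPV_I = \frac{TP}{TP + FP} = \frac{1}{r_0} \sum_{j=1}^{T_I} \sum_{v=1}^{r_0} \delta(C_v - C'_j) \qquad (12)$$
where
$$\delta(C_v - C'_j) = \begin{cases} 0, & C_v \ne C'_j \\ 1, & C_v = C'_j \end{cases}$$
and $0 \le PPV_I \le 1$.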
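The CE allocation for one target gene can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the score is taken to be a sum of negative log probabilities raised to an exponent (called `alpha` here), the logarithm is assumed to be base 10, ties are broken alphabetically, and the gene names, category labels and probabilities are invented.

```python
import math
from collections import defaultdict


def ce_allocate(neighbor_p, neighbor_cats, target_cats,
                r0=3, alpha=4, total_categories=133):
    """Score each category as in eq 11, keep the top r0, and count outcomes."""
    scores = defaultdict(float)
    for gene, p in neighbor_p.items():
        # Base-10 log is an assumption; the text only says "log".
        weight = (-math.log10(p)) ** alpha
        for cat in neighbor_cats[gene]:
            scores[cat] += weight
    # Descending score; alphabetical tie-breaking is an assumption.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    predicted = ranked[:r0]
    target = set(target_cats)
    tp = len(target & set(predicted))
    fp = len(predicted) - tp   # equals r0 - TP when at least r0 categories are scored
    fn = len(target) - tp
    tn = total_categories - len(predicted) - len(target) + tp
    ppv = tp / len(predicted) if predicted else float("nan")
    return {"predicted": predicted, "TP": tp, "FP": fp, "FN": fn, "TN": tn, "PPV": ppv}


# Hypothetical neighbours of the target, their P values and categories.
neighbor_p = {"g1": 1e-8, "g2": 1e-6, "g3": 1e-5, "g4": 1e-5}
neighbor_cats = {"g1": ["A", "B"], "g2": ["A"], "g3": ["C"], "g4": ["C", "D"]}
weighted = ce_allocate(neighbor_p, neighbor_cats, {"A", "C", "E"}, r0=2, alpha=4)
unweighted = ce_allocate(neighbor_p, neighbor_cats, {"A", "C", "E"}, r0=2, alpha=0)
print(weighted["predicted"], unweighted["predicted"])
```

With the exponent set to 4 the single strongly correlated gene g1 lifts category B above category C, whereas with the exponent set to 0 only gene counts matter and C outranks B. The sketch subtracts the number of categories actually assigned, which coincides with $r_0$ whenever at least $r_0$ categories receive a score.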
An analysis of KEGG and GO indicates that the average number of functional
categories per gene is between 2 and 3. It would therefore seem reasonable to take
r0 = 3 for KEGG and GO, where a relatively large number of categories is available;
i.e. we allocate to at most 3 categories. We use the more stringent condition r0 = 1, for
the relatively coarse-grained COG ontology. For COG categories, $PPV_I = 1$ or $0$, and
$$PPV = A_0 = N_c / N_a. \qquad (13)$$
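A small numerical check of eq 13 is sketched below. It assumes that the overall PPV is the mean of the per-gene values over the annotated linked genes (eq 7 is not reproduced in this section), and it uses the counts quoted in the Figure 4 legend (677 annotated linked genes, 463 correctly assigned) purely as illustrative input; the function name is hypothetical.

```python
def ppv_single_assignment(per_gene_correct):
    """Mean of per-gene PPV values when each gene gets exactly one category.

    With r0 = 1 every per-gene PPV is 0 or 1, so the mean reduces to
    Nc / Na, the fraction of annotated genes assigned correctly.
    """
    flags = [1 if correct else 0 for correct in per_gene_correct]
    return sum(flags) / len(flags)


n_annotated, n_correct = 677, 463   # counts quoted in the Figure 4 legend
flags = [True] * n_correct + [False] * (n_annotated - n_correct)
ppv = ppv_single_assignment(flags)
print(f"{ppv:.2f}")
```

Under these assumptions the value is about 0.68.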
Finally it is useful to ask how much each method reduces the uncertainty in a
population of unannotated genes. This can be answered by estimating the Shannon
information; i.e. the entropy drop [12].
List of abbreviations
SGA: Standard Guilt by Association
CE: Correlation Enrichment
GO: Gene Ontology
KEGG: Kyoto Encyclopedia of Genes and Genomes
COG: Clusters of orthologous groups
PPV: Positive Predictive Value
SEN: Sensitivity
SPE: Specificity
Acknowledgements
This work has been supported by NIGMS, NIH (P20 GM66401). Scripts related to
this study are available upon request.
References
1. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86-90.
2. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402(6757):83-86.
3. Yanai I, Derti A, DeLisi C: Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proc Natl Acad Sci U S A 2001, 98(14):7940-7945.
4. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 1999, 96(6):2896-2901.
5. Date SV, Marcotte EM: Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 2003, 21(9):1055-1062.
6. Gaasterland T, Ragan MA: Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb Comp Genomics 1998, 3(4):199-217.
7. Huynen M, Snel B, Lathe W, 3rd, Bork P: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 2000, 10(8):1204-1210.
8. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285-4288.
9. Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19(12):1524-1530.
10. Relationship between mutual information MI and probability P. [http://visant.bu.edu/jiewu/MI.htm].
11. Aravind L: Guilt by association: contextual information in genome analysis. Genome Res 2000, 10(8):1074-1077.
12. Entropy Change in the unannotated gene population. [http://visant.bu.edu/jiewu/entropy.htm].
13. McDermott J, Samudrala R: Enhanced functional information from predicted protein networks. Trends Biotechnol 2004, 22(2):60-62; discussion 62-63.
14. Lee I, Date SV, Adai AT, Marcotte EM: A probabilistic functional network of yeast genes. Science 2004, 306(5701):1555-1558.
15. Hu Z, Mellor J, Wu J, DeLisi C: VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics 2004, 5(1):17.
16. COG predictions using CE. [http://visant.bu.edu/jiewu/COGpredictions.htm].
17. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302(5643):249-255.
18. At the thresholds in [16], only two clusters are not cliques; the least cliquish node in one has a clustering coefficient of 0.91, and the least cliquish node in the other has a clustering coefficient of 0.74, where the clustering coefficient of a node = total number of observed links between its neighbors normalized by the maximum possible number of links between neighbors.
19. Snel B, Huynen MA: Quantifying modularity in the evolution of biomolecular systems. Genome Res 2004, 14(3):391-397.
20. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN et al: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4(1):41.
21. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29(1):22-28.
22. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28(1):27-30.
23. Creating the gene ontology resource: design and implementation. Genome Res 2001, 11(8):1425-1433.
Figure Legends
Figure 1. Measures of performance as a function of coverage using SGA to allocate unknown
genes to KEGG pathways.
Figure 2. Performance as a function of coverage with allocation using CE. The measures are
defined in Material and methods.
Figure 3. PPV as a function of GO category from broadest (level 3) to most specific (level 10). The
labels appearing along the lowest curve C = 0.2 (P = $10^{-4}$) are fractional coverages; i.e. the number of
annotated genes linked to at least one other gene, normalized by the total number of annotated genes at
the depth shown. The other curves display only the numerators, the denominators being unchanged.
Figure 4. An all-against-all screen shot, at C* = 0.55, of the 4286 orthologs in the COG database
using SGA. 926 genes (677 annotated; 249 unannotated) are linked to at least one annotated gene. Each
gene is unambiguously assigned to a unique COG functional category. Of these 677 annotated genes,
463 are correctly assigned. In total, 1843 of the 4286 COGs are unannotated in the COG classification.
(A) Complete 926-gene network. (B) Meta-network of genes from (A). Each group represents a set of
genes allocated to a COG functional category. (C) Detail of functional category H, coenzyme
metabolism. (D) Of the 926 linked genes, 82 are in category H; 62 of them are true positives (green)
and 8 are predictions (red). The remaining 12 are annotated in a different functional category and are
therefore putative false positives. The minimum PPV for category H is therefore 62/74 = 0.84; the
averaged PPV for all categories is 68%. Refer to the COG web site for definitions of categories.
Coverage = 62/142 = 44%, neglecting predicted genes.
Figure 5. Expanded view of the true positive and prediction cluster in Figure 4C, showing two
strikingly dense clusters of size 11 and 7 respectively.
Figure 6. Phylogenetic Profiles of an 11-member cluster of orthologs across 66 genomes
uncovered by CE. Green represents absence and red, presence.
Figure 7. Comparison of accuracy against coverage of COG ontology allocations, illustrating the
effect of definition on assessment of performance. The method in both cases is SGA. The upper
curve is the number of links between gene pairs having the same function (COG category H)
normalized by the number of links formed by genes with that function [17]. The lower curve uses PPV
(correctly predicted pathways normalized by total predicted pathways) as accuracy measure.
Figure 8. Number of clusters (size >= 3) as a function of C*.
Figure 9. Evolutionarily conserved fully connected clusters. Edge coding: black, C* = 0.9
(P* = $10^{-18}$); yellow, C* = 0.81 (P* = $10^{-16}$); blue, C* = 0.71 (P* = $10^{-14}$). Green nodes are
correctly annotated; red nodes are unannotated. (a) One of 4 cliques uncovered at C* = 0.9. All genes
are known and are in the flagella assembly pathway. (b) One of 9 clusters at P* = $10^{-16}$, ranging in
size from 4 to 13 nodes. All genes are annotated to the KEGG histidine metabolism pathway and to the
COG amino acid metabolism and transport category. (c)-(e) Examples of mixed annotated-unannotated
clusters, with the annotated sets homogeneous in function. Lower bounds on PPV for assigned functions
are 79% (C* = 0.71) and 89% (C* = 0.81).
Figure 10. Pressure to reduce pleiotropy. The average number of pathways per linked gene
decreases for stringent C* while average number of correctly predicted pathways per linked gene
increases.
Tables and Captions
Table 1. Pathway allocation performance using exactly matching phylogenetic
profiles. AA (UU) denotes pairs in which both genes are annotated (unannotated). N is the number
of links. G is the number of genes that form those links (unannotated genes in AU). N* is the number of
links between genes that share at least one pathway; G* is the number of such genes. PPV, A0, AC,
sensitivity and specificity are defined as in Material and Methods.
        N       N*      G       G*
AA      288     254     249     234
AU      271     (239)   159     (149)
UU      1090    NA      603     NA

PPV     A0      AC      SEN     SPC
85%     94%     90%     91%     99%
Table 2. Performance evaluation for different methods using entropy change. G is
the number of genes that can be assigned to at least one pathway. Gc is the number of genes assigned
correctly to at least one pathway. Gp is the average number of pathways assigned per gene. Gpc is the
average number of correct pathways for correctly assigned genes. H is the entropy drop in the system.
Numbers in parentheses are estimated from annotated genes.
Method/Threshold            G       Gc       Gp       Gpc      H
Identical profiling    A    249     234      2.03     1.53     1277.15
                       U    159     (149)    2.26     (1.70)
SGA 2                  A    1355    1321     107.4    2.71     2777.63
                       U    2897    (2824)   84.6     (2.13)
SGA 4                  A    1273    1136     62.81    2.45     2537.98
                       U    2170    (1936)   45.74    (1.78)
SGA 8                  A    801     634      15.76    1.71     1979.48
                       U    723     (572)    17.64    (1.91)
CE 0                   A    1368    765      3.0      1.29     6238.50
                       U    2918    (1632)   3.0      (1.29)
CE 8                   A    801     532      3.0      1.33     2128.69
                       U    723     (480)    3.0      (1.33)
CE 15                  A    178     168      3.0      1.49     151.0
                       U    27      (25)     3.0      (1.49)