Gene Annotation and Network Inference by Phylogenetic Profiling

Jie Wu1, Zhenjun Hu2, Charles DeLisi1,2

1 Department of Biomedical Engineering, 2 Bioinformatics Graduate Program, Boston University, 44 Cummington St., Boston, MA 02215, USA

Wu J: jiewu@bu.edu
Hu Z: zjhu@bu.edu
DeLisi C: delisi@bu.edu

Correspondence: Charles DeLisi. E-mail: delisi@bu.edu

Abstract

Background: Phylogenetic analysis is emerging as one of the most informative computational methods for gene annotation and for the identification of evolutionary and functional modules. The effectiveness with which phylogenetic information can be utilized depends in part on the availability of both an appropriate measure of correlation between binary strings of arbitrary length and composition, and an effective high performance method to use the correlate. Current phylogenetically based methods, though useful, perform at a level well below what is possible. Beyond that, informative and standardized measures of performance are not in use, making comparisons between different methods difficult; those that have been used convey overly optimistic assessments of performance.

Results: We introduce a rigorous and general measure of correlation, a new category enrichment based method for using it to assign genes to functional categories at various levels of resolution, and a complete and rigorous performance assessment, enabling full assessment of yet to be developed methods against our results. In a systematic allocation test, more than 56% (766) of the 1368 KEGG annotated orthologs were correctly assigned to one or more of the 3 most likely pathways. Of the 602 putative false positives, 467 are annotated in the Gene Ontology (GO), and 280 of those share a GO category with the pathway genes at level 5 or higher. The method is therefore estimated to correctly allocate between 1634 and 2404 of the 2918 previously unannotated genes to one or more functional categories, at a relatively high level of resolution.
A general comparison using mutual information as a correlate, and standard guilt by association to draw inferences, indicates that the method used here has a positive predictive value nearly eightfold higher at comparable coverage. The method also assigns accuracies to annotations of individual genes, rather than just population averaged reliabilities.

Conclusions: The method serves as a general computational tool for annotating large numbers of unknown genes, and uncovering evolutionary and functional modules. It appears to perform substantially better than extant stand-alone high-throughput methods.

Background

One of the remarkable characteristics of the genomic era is that the solution to the challenge of annotation posed by the rapid increase in sequence comes in part from the data itself; i.e., the availability of a large number of fully sequenced genomes provides information that enables the development of new computational approaches, including domain fusion [1-3], chromosomal proximity [4] and phylogenetic profiling [5-8]. Phylogenetic profiling, in its original form, was used to infer the function of a gene by finding another gene of known function with an identical pattern of presence and absence across a set of phylogenetically distributed genomes. Such restricted profiling, requiring full profile identity, while accurate, has low coverage, assigning pathways to 114 of 1814 unknown orthologous proteins from 44 genomes [9], with an estimated accuracy in the vicinity of 90%.

The restriction can be relaxed in a number of ways, using a Pearson correlation, mutual information [5, 9], or mathematically exact statistical significance assignment. In a previous paper [9] we examined each of these methods, and settled on the last of them as a convenient and generally valid measure. Briefly, the phylogenetic profile of a gene is a binary string recording the presence (1) or absence (0) of a gene across a suitable set of genomes.
If the correlation between the profiles of two genes, X and Y, is much greater than would be expected by chance, then they are assumed to be functionally related. Let N be the number of genomes over which the profiles are defined, with gene X occurring in x genomes, Y occurring in y genomes, and both occurring in z genomes. Then $P(z \mid N, x, y)$, the probability of observing z co-occurrences purely by chance, given N, x and y, is [9]

$$P(z \mid N, x, y) = \frac{\binom{N-x}{y-z}\binom{x}{z}}{\binom{N}{y}} = \frac{(N-x)!\,(N-y)!\,x!\,y!}{(N+z-x-y)!\,(x-z)!\,(y-z)!\,z!\,N!} \qquad (1)$$

The connection between equation (1) and the more readily calculated mutual information, $MI(X, Y)$, of the profile pair is easily, if tediously, established. In particular, for a given profile pair, define $p(i, j)$ as the joint probability of the doublet $(i, j)$, where each variable can be 0 or 1, so that $p(1, 1)$ is the fraction of genomes in which both genes are present, $p(1, 0)$ is the fraction in which X is present and Y is absent, etc. Then the relation between equation (1) and the mutual information

$$MI(X, Y) = \sum_{i=0}^{1}\sum_{j=0}^{1} p_{ij}(X, Y)\,\log\frac{p_{ij}(X \mid Y)}{p_i(X)} \qquad (2)$$

is [10]

$$P(z \mid N, x, y) \approx 2^{-N \cdot MI(X, Y)} \quad \text{or} \quad MI(X, Y) = \lim_{N \to \infty} -\frac{1}{N}\log_2 P(z \mid N, x, y) \qquad (3a)$$

In this paper we therefore define a new and fully general measure of correlation between two binary strings

$$C(z \mid N, x, y) = -\frac{1}{N}\log_2 P(z \mid N, x, y), \qquad 0 \le C \le 1 \qquad (3b)$$

The simplest way to predictively assign genes to pathways is to use a correlation threshold, assigning an unannotated gene to a pathway of an annotated gene if the correlation between their profiles exceeds $C^*$, the threshold value of $C$. This is standard guilt by association (SGA) [11], and assessment of its efficacy often looks promising. For example, as indicated in Results and Discussion, a correlation threshold $C^* = 0.35$ ($p^* \approx 10^{-7}$) links 1025 of the 2,918 unannotated orthologs to at least one pathway annotated gene, and 80% (820) are estimated to be correctly linked at least once.
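As a hedged illustration of eqs 1 and 3b, the following Python sketch computes the chance co-occurrence probability (the number of ways to place the y occurrences of Y so that exactly z fall in the x genomes containing X, divided by all placements of y among N) and the correlation C, defined as minus the base-2 logarithm of that probability divided by N. The function names are hypothetical and are not taken from the original study. The default N = 66 is an assumption drawn from the number of genomes reported in Materials and Methods, and the threshold conversion is just eq 3b solved for P at C = C*.

```python
from math import comb, log2


def cooccurrence_prob(N, x, y, z):
    """Chance probability of z co-occurrences (eq 1, binomial form)."""
    # z must lie in the range allowed by the hypergeometric distribution.
    if z < max(0, x + y - N) or z > min(x, y):
        raise ValueError("z is outside the range allowed by N, x and y")
    return comb(N - x, y - z) * comb(x, z) / comb(N, y)


def correlation(N, x, y, z):
    """Correlation C = -(1/N) * log2(P) between two binary profiles (eq 3b)."""
    return -log2(cooccurrence_prob(N, x, y, z)) / N


def threshold_pvalue(c_star, N=66):
    """Chance probability p* corresponding to a threshold C* (eq 3b inverted)."""
    return 2.0 ** (-N * c_star)


if __name__ == "__main__":
    # Hypothetical pair of profiles over 66 genomes: X in 20, Y in 22, both in 18.
    print(correlation(66, 20, 22, 18))
    for c_star in (0.2, 0.35, 0.55):
        print(c_star, threshold_pvalue(c_star))
```

With N = 66, thresholds of 0.2, 0.35 and 0.55 correspond to chance probabilities of roughly 10^-4, 10^-7 and 10^-11, in line with the values quoted in the text.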
As we indicate below, however, such assessment criteria convey an overly optimistic picture of performance. When an unannotated gene is linked to a gene that is in more than one pathway, the unannotated gene is assigned to all pathways of the annotated gene; i.e., there is nothing in the allocation criterion that favors one pathway over the other. Thus, as shown in section 3, standard guilt by association using the usual 2-parameter profile correlates (Pearson correlation coefficient or mutual information [9]) has either low positive predictive value (defined as the fraction of assignments that are correct) if coverage is high, or low coverage if positive predictive value is high.

The assumption underlying correlated enrichment (CE) is that the greater the number and strength of signals (magnitude of correlation) from a category, the more likely the target is to belong to the category. In particular, we rank pathways using an optimized non-linear function of total enrichment score, with eq 1 used to compute correlation.

One of the tasks the field faces is to develop standardized assessments that will allow comparison between different methods. At present different authors use different measures of performance; performance is often assessed incompletely; the same measure is sometimes defined in different ways; and performance as a function of coverage is, more often than not, unavailable. In this paper we evaluate a complete set of performance measures and their response characteristics as coverage is varied, against three different ontologies. This should facilitate comparison of future methods against our results. In Materials and Methods we develop relations for each measure of performance as an explicit function of the variables that enter the two methods of interest (SGA and CE). In Results and Discussion we evaluate the performance of each method as a function of coverage, on KEGG, GO and COG.
Evaluation on KEGG is the most demanding, since pathways generally correspond to the deepest (highest resolution) levels of the GO ontology. Thus CE correctly assigns more than 56% (766) of the 1368 KEGG annotated orthologs to one or more of the 3 most likely pathways, whereas of the 602 putative false positives, 280 of the 467 that are annotated in GO share a GO category with the pathway genes at level 5 or higher. The method is therefore estimated to correctly allocate between 1634 and 2404 of the 2918 previously unannotated genes to one or more functional categories. A more general test of the method against GO categories as a function of depth indicates a reasonably high degree of coverage (49%) with 82% accuracy, nearly an eightfold increase over previous methods, at relatively high specificity (depths 6-8). For the COG ontology, a gene is assigned only to the category in which the enrichment score is highest. More specifically, we assigned all 1846 unannotated genes with an estimate, for each gene, of the probability that the assignment is correct, based on the magnitude of the enrichment score. Finally, we identify several dozen cliques or quasi-cliques, some only partially annotated, placing unannotated genes in evolutionarily conserved functional modules with very high probability.

Results and Discussion

Performance of SGA: Allocation of Orthologs to KEGG Pathways

We showed previously that restricted profiling, i.e. making functional assignments only when profiles are identical, has very low coverage, albeit at very high reliability. Not surprisingly, a more detailed assessment reported in Table 1, using an expanded set of genomes, reinforces these results. Although all measures of reliability are very high, the amount of information is very low, with only 5.4% of pathway unannotated orthologs assignable.
Relaxing the requirement for an exact match increases coverage and the expected number of correct predictions, but reduces other performance measures. The usual measure of accuracy, the fraction of genes correctly assigned to at least one pathway (i.e., the number of correctly assigned genes divided by the number of assignable genes, $A_0$), suggests a strikingly optimistic conclusion; viz., although performance varies with coverage, it remains high at all levels of coverage (Figure 1). Thus we find, for example, that at $C^* = 0.2$ ($p^* \approx 10^{-4}$), 1136 ($N_c$) of the 1273 genes ($G$) that are linked to one or more other genes share at least one pathway with one of those genes.

However, high coverage is achieved at low thresholds, and genes that pair at low thresholds generally have many links; only a small fraction of such pairs are between genes in the same pathway. For example, on average a qualifying gene at $C^* = 0.2$ is linked to 62.81 pathways, and the number of these assignments that are correct is 2.45. Consequently, the total true positives divided by the total number of assignments, i.e. the positive predictive value (PPV), is very low: 6% for this example.

Nevertheless, profile similarity corresponding to a chance occurrence of $10^{-4}$ generally implies selective pressure on the correlated genes, even if their functional coupling is not as strong as that of genes within the same pathway. In particular, of the 48,691 links that are putative false positives in KEGG, 34,024 are annotated at depth 4 in GO. Of these, 46% share the same GO category. The probability that such concordance occurs by chance is negligible.

More generally, PPV and specificity (TN/(TN + FP)) decrease monotonically as coverage increases (Figure 1). Such behavior is expected intuitively, since increased coverage requires decreased threshold stringency, which in turn means that the number of false positives increases.
However, although PPV drops to zero relatively rapidly as coverage increases to 1, specificity remains high over a wide range of coverages, and sensitivity is high over all coverages. The specificity-sensitivity curves cross at a remarkably high coverage, just under 80%, where their values are just over 80%. Somewhat less intuitive is that sensitivity (TP/(TP + FN)) is not a monotonic function of coverage, but reaches a minimum of around 70% (at approximately 60% coverage) and then rises again toward 1 as coverage continues to increase. The behavior can be understood by recognizing that in the high threshold limit, the number of assignments decreases while the fraction that is correctly assigned increases. Since sensitivity is the fraction of correctly predicted pathways (eq 4a), it increases in this limit. On the other hand, in the low threshold limit all genes are linked and all correct pathways are recovered. The minimum, found at moderate thresholds, reflects the tradeoff between link number and link reliability. $A_0$, the fraction of genes allocated correctly to at least one pathway, is expected to have the same behavior as sensitivity, as indicated in the discussion following eq 4 in Materials and Methods. The consistently high specificity with increasing coverage, even as PPV drops, reflects a constraining relation between TP, FP and TN; viz., TP FP TN.

Performance of Correlated Enrichment: Allocation of Orthologs to KEGG Pathways

When gene assignment is restricted to the three categories (GO or KEGG) having the largest signal, as described in Materials and Methods, the total number of assignments decreases markedly relative to those made by SGA, as does the number of false positives, while the number of true positives changes little. PPV is markedly increased at high coverage, but is essentially unchanged relative to SGA for coverage below 30% (Figure 2).
PPV estimates must be viewed as conservative: assignment to an unannotated pathway could mean that the presence of the gene in that pathway has not yet been discovered. That many of these putative false positives are in fact functionally related to the target is seen by searching the GO ontology. In particular, we find that of the 467 COGs that are allocated to KEGG pathways in which they are currently not annotated, more than 60% share at least one GO category with the pathway genes at a depth of 5 or greater, and more than 70% at a depth of 4 or greater.

One way to assess the methods as a function of both coverage and PPV is to estimate the entropy change, or reduction in uncertainty, in the unannotated gene population. An entropy measure tends to give coverage a high weight, since even low PPV resolves some uncertainty, whereas when no prediction can be made about a gene, no uncertainty is resolved (Table 2). It is evident that while identical profiling gives the best average positive predictive value per gene ($A_C$ = 1.70/2.26), CE resolves the greatest amount of uncertainty [12]. Different thresholds indicate a tradeoff between accuracy and coverage. For SGA, allocation at $C^* = 0.1$ gives the largest decrease in entropy, while for CE the largest decrease occurs at $C^* = 0$. It should be noted that at $C^* = 0$, CE takes account of pairwise correlations between the target and all genes in a category; i.e., although there is no cutoff, profile pair correlations still enter strongly into the decision. The low optimal cutoff values are a consequence of the definition of entropy reduction, which gives high weight to coverage. It is evident that the total information gain may not always be the criterion of choice, depending on the question of interest to the user.

Validation and Predictions for Model Organisms using the GO

Evaluations based on orthologs are in a sense reference evaluations, not precisely applicable to any particular genome.
In order to discuss performance in a specific instance, and to better compare different methods, we have evaluated $A_0$ and coverage for E. coli against GO. Figure 3 plots results obtained using correlation enrichment. At all correlations, $A_0$ as a function of GO depth drops to a minimum and then increases, reflecting a tradeoff between the rates at which the number of correctly assigned genes and the number of assignable genes change with depth. Coverage is relatively invariant across GO depths. For a correlation threshold of 0.55 ($p^* \approx 10^{-11}$), $A_0$ is 78% at depth 7, where the coverage is 48%, and 81% at the most specific depth (level 10), where the coverage is 45%. When $C^* = 0.9$ ($p^* \approx 10^{-18}$), $A_0$ is above 90% for all annotation specificity levels, and comparable to that obtained with identical profiles, while the minimum $A_0$ (at level 7) gives coverage of 15% (185/1225).

Some perspective on performance can be obtained by comparison, to the extent possible, with other high performance computational and experimental methods. In particular, McDermott and Samudrala [13] estimated $A_0$ using, in essence, SGA with mutual information [5, 14]. They found that $A_0$ as a function of GO depth varies from 98% at the least specific level (3), where coverage is 50%, to 10% at the most specific level (10), the latter being about eightfold lower than the result using CE at $C^* = 0.55$ at comparable coverage. With respect to experimental high-throughput methods, Date and Marcotte [5] find that SGA using MI has accuracy comparable to that of yeast two-hybrid screens.

Intergenomic Correlations

The above assessments, and those commonly made of phylogenetic profiling, usually leave aside discussion of a fundamental limitation; viz., the confounding influence of correlations between genomes, as opposed to correlations between genes.
While we do not report a complete study of the effect of inter-genome correlations, we estimated its potential influence by collapsing those genomes that are phylogenetically close, essentially assuming that all correlations between gene pairs that are present within a group of related genomes are the result of genome correlation rather than gene correlation. We find that, for this conservative model, genome correlations have only a small effect on the performance of the method, given a reasonable number of lineages. In fact, when 66 genomes are collapsed to as few as 32 lineages based on their phylogenetic distances, the corresponding change in PPV for the same coverage is always less than 1%.

Functional and Evolutionary Modules

Annotating the COG Ontology

COG functional categories provide a low resolution, but fully resolved annotation. Because the COG ontology is a one-to-one gene:functional category map, performance assessment is relatively direct; in particular, PPV = $A_0$ (eq 13).

The number of CE annotated genes by category falls more or less linearly, and is generally concordant with the distribution of previously annotated genes among the categories, the latter having translation (J), amino acid metabolism (E) and energy production and conversion (C) as the largest categories, and signal transduction (T) and cell division (D) as the smallest. Such mimicry is inherent in the CE method. The full set of 4,826 genes, annotated and unannotated, cluster into groups of various sizes (Figure 4). An all-against-all profiling at a threshold of $C^* = 0.55$ returns a 926 gene network fragment, which was processed using the VisANT visual mining tool ([15], visant.bu.edu) to produce Figures 5 and 6. The network contains 677 annotated orthologs, 463 of which are assigned correctly and unambiguously to their COG category. In addition, 249 (= 926 - 677) previously unannotated orthologs are assigned explicitly to one or another category.
Sets of genes assigned to the same COG functional categories are grouped together into meta-nodes (Figure 5B), each containing genes that are classifiable as true or false positives (for annotated genes), or predictions (for unannotated genes) (Figure 5C). In particular, of the 82 genes allocated to category H (coenzyme metabolism), 74 are annotated, 62 of them in category H (PPV = 0.84), and eight others are predictions. Since the PPV is high, the predictions are likely to be correct. All the predictions using COG functional categories can be accessed on the web [16]. A more detailed version of the TP set (Figure 5) reveals two strikingly dense clusters, one with 7 orthologs, the other with 11. All the elements of the latter participate in the porphyrin and chlorophyll metabolism pathway (00860). The cluster is a highly interdependent functional module, and it is also strikingly conserved, as demonstrated by its aligned profiles (Figure 6). The elements in the seven-member cluster are not annotated in KEGG. However, four of them are annotated in GO, and they all share GO category 0006777: molybdopterin cofactor biosynthesis, at depth 8. It therefore appears likely that the remaining 3 COGs are important components of molybdopterin cofactor biosynthesis in one or more genomes. These results indicate the power of CE to uncover evolutionarily conserved, highly specific functional modules, and to reliably assign previously unannotated genes to these modules.

We have already commented on the problem of comparative performance assessment resulting from disparate definitions in the literature. A specific illustration is shown in Figure 7, which compares two different performance measures: PPV, and a measure of accuracy used by [17]. It is evident that PPV provides a much less impressive picture of SGA, showing almost a 2-fold difference in accuracy over a range in coverage.
Cliques, clusters and inference quality

The above results are obtained by assigning genes to categories based on CE or SGA, and then obtaining clusters by displaying links between intra-category genes that meet the threshold condition. This method is useful if the primary goal is to annotate genes, but it is not efficient if the primary goal is to find sets of tightly linked clusters. The simplest procedure for finding clusters is to set a threshold, discard all genes that do not meet it, and connect those whose profiles meet or exceed it. We examine cliques (fully connected clusters) and quasi-cliques [18] in our functional linkage network at different $C^*$ and estimate their predictive power.

As the threshold decreases from $C^* = 0.9$, the number of clusters containing more than 3 nodes increases, peaking at $C^* = 0.66$ ($p^* \approx 10^{-13}$) and then declining as the nodes coalesce into increasingly larger clusters (Figure 8). At $C^* = 0.9$, four clusters are cliques, including a 9-component flagella assembly pathway (Figure 9a). There are nine cliques and quasi-cliques at $C^* = 0.81$ ($p^* \approx 10^{-16}$) and twenty (ranging from size 4 to 19) at $C^* = 0.71$ ($p^* \approx 10^{-14}$). Of the nine fully determined cliques (functions of all genes known), ranging in size from 4 to 10, 8 are homogeneous in COG function, and 5 are homogeneous in both KEGG and COG functions. Clusters of unannotated orthologs begin to appear at $p^* \approx 10^{-16}$. Of the 9 cliques or quasi-cliques with unannotated orthologs that are uncovered at $C^* = 0.71$, one has 12 genes, with 4 unannotated, and 8 annotated in COG category M (membrane biosynthesis) and KEGG pathway lipopolysaccharide biosynthesis.

Figure 9 shows some examples. The fully annotated nine-node clique (homogeneous in the flagella assembly pathway) and six-node clique (homogeneous in the histidine metabolism pathway) clearly qualify as both evolutionary and functional modules. The other three cliques (c)-(e), obtained at $C^* = 0.71$, have both annotated and unannotated nodes.
They are evolutionary, and very likely also functional, modules. The 12-node module (c) has eight of its members in the KEGG lipopolysaccharide metabolism pathway. Since it is fully connected, with all linkage strengths equal to or greater than $C^* = 0.71$, enrichment for the lipopolysaccharide metabolism pathway is very strong, and each of the unknown COGs is almost certainly associated with that function. A very weak enrichment-based lower bound on PPV is 0.78. Similar remarks hold for the six-node clique, which has four genes implicated in the aminosugars metabolism pathway. The smallest clique has two of its 4 genes annotated in different pathways: ubiquinone biosynthesis and oxidative phosphorylation. It therefore appears plausible that the two unannotated COGs are mitochondria related, participating in the latter part of the respiratory chain during ubiquinone biosynthesis. They are also unambiguously assigned to COG category P, suggesting a possible role in transport across the mitochondrial membrane.

Pressure to Reduce Pleiotropy

The premise of profiling is that functionally related pairs tend to be conserved. The tighter the relation (e.g., the deeper the GO level), the greater the correlation between profiles. As functional tightening increases, however, there is pressure not just to co-evolve, but to reduce the pleiotropy of the pair; i.e., the function of each gene tends to become increasingly specialized and restricted to the functions (e.g., pathways) of its partner (Figure 10, upper curve). Such pressure reduces informational entropy, just as physical pressure reduces thermodynamic entropy, all other variables held constant. Hence we see convergence between functional specificity, reflected in a reduction in the average number of paths per linked gene (upper curve), and the increase in the number of correctly predicted paths per gene (lower curve).
Such a tendency toward increased specialization is perhaps most pronounced for biosynthetic pathways [19], as would be expected of processes that are strongly interdependent and forced to co-evolve, even in the presence of adaptive change in more loosely associated genes.

Materials and Methods

The dataset

We adhere to the conventions of the COG database (http://www.ncbi.nlm.nih.gov/COG/) [20, 21] and construct profiles only for genes that occur in at least three lineages. All paralogs are collapsed; i.e., a set of closely related genes in a given lineage is treated as a single entity. The analysis was performed for 4,873 clusters of orthologs (COGs) from 66 fully sequenced microbial genomes in the three domains of life. Accuracy is evaluated against the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.ad.jp/kegg/kegg2.html, [22]); the Gene Ontology Consortium (2000) (http://www.geneontology.org/ [23]); 23 COG broad functional categories; annotations for 6059 Saccharomyces cerevisiae ORFs (SGD http://www.yeastgenome.org/) and 4410 E. coli K12 ORFs (EcoCyc http://ecocyc.org/). Of the 206 biochemical pathways in KEGG, we used 133 (mostly metabolic pathways); in particular those that are generic. These pathways contain a total of 1,368 orthologs. Sets of 20 or more COGs that have the same phylogenetic profile were eliminated from the dataset.

Pathway assignments

Definitions

Eq 3b can be used to assign unannotated genes in a number of ways. Here we explore two relatively simple criteria and rigorously evaluate the reliability of both by quantitative assessments of the full contingency matrix as a function of coverage. In particular, for each target gene we assess true positives (TP), i.e.
the number of categories (KEGG pathways, GO categories, or COG ontology categories) correctly assigned; false positives (FP), the number of categories incorrectly assigned; true negatives (TN), the number of categories to which the target is not assigned and in which it is not annotated; and false negatives (FN), the number of categories to which the target is not assigned and in which it is annotated. The sensitivity (SEN), specificity (SPE), accuracy (ACC) and positive predictive value (PPV) are functions of these four quantities:

$$SEN = \frac{TP}{TP + FN} \qquad (4a)$$

$$SPE = \frac{TN}{TN + FP} \qquad (4b)$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4c)$$

$$PPV = \frac{TP}{TP + FP} \qquad (4d)$$

Related metrics

SPE-ACC

For category allocation, specificity and accuracy will be quantitatively very similar; i.e., true negatives will invariably be much greater than true positives and false negatives, owing to the fact that the vast majority of genes are in a small fraction of all pathways. Consequently we expect $SPE \approx ACC$.

SEN-A0

Whereas SPE and ACC are quantitatively similar, SEN and the fraction of genes that are correctly allocated at least once ($A_0$) are qualitatively similar, as explained below. The similarity is strong enough that they provide the same measure of performance. Hence, of the 5 measures, only three are independent. These are traditionally taken as SEN, SPE and PPV. (A fourth measure, negative predictive value, which adds little to the discussion, is omitted in the interest of brevity.) Performance is measured by their functional dependence on coverage.

These definitions are introduced in terms of a particular gene. Passing to population averaged quantities is in principle direct, although in practice it involves some care because of cross correlations between categories. The final quantity of interest is coverage, defined as the fraction of genes (unannotated or annotated) that can be linked to at least one annotated gene.

Positive Predictive Value

Each measure serves its own purpose.
However, a quantity of natural interest for predictions based on thresholding is the fraction of assignments that are correct; i.e., the positive predictive value. By definition, the population averaged positive predictive value is

$$PPV = \frac{1}{N_a}\sum_{I=1}^{N_a} PPV_I \qquad (5)$$

where $N_a$ is the total number of annotated genes linked at $C^*$, and $PPV_I$, the positive predictive value for target gene $I$, is given by eq 4d. Analogous equations hold for the other measures of performance.

It is instructive to express the PPV as a product of two factors: the fraction of genes that are correctly assigned to at least one functional category ($A_0$), and the average fraction of those assignments that are correct ($A_C$). Let $N_c$ be the number of genes that are assigned correctly to at least one functional category. Then

$$A_0 = \frac{N_c}{N_a}, \qquad A_C = \frac{1}{N_c}\sum_{I=1}^{N_c}\frac{TP_I}{TP_I + FP_I} \qquad (6)$$

$$A_C = \frac{1}{N_c}\sum_{I=1}^{N_c} PPV_I \qquad (7)$$

and the population averaged positive predictive value is

$$PPV = A_0 \cdot A_C \qquad (8)$$

i.e., $A_C$ is the average positive predictive value of genes that are assigned correctly at least once, and $A_0$ is the fraction of annotated genes assigned correctly at least once. Other measures of reliability cannot be similarly decomposed. Although $A_0$ is sometimes used as a measure of PPV (and sometimes referred to as accuracy), in general $A_0$ is a very poor measure of PPV and provides an overly optimistic assessment of performance. We apply and assess two decision criteria, one based on SGA, the other on CE.

Standard Guilt by Association

An unannotated target gene generally meets the threshold condition

$$C(z \mid N, x, y) \ge C^*$$

with multiple genes, and each associated gene typically participates in more than one process. The target is of necessity assigned to all categories of the gene to which it is linked.
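The decomposition of the population averaged PPV into the product of A0 and AC (eqs 5 to 8) can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the function name and the per-gene (TP, FP) counts are hypothetical toy values, not data from the study, and every linked gene is assumed to receive at least one assignment so that its PPV is defined.

```python
def ppv_decomposition(counts):
    """Return (mean PPV, A0, AC) from per-gene (TP, FP) counts.

    counts holds one (TP, FP) pair for each of the N_a annotated, linked genes.
    """
    n_a = len(counts)
    ppv_per_gene = [tp / (tp + fp) for tp, fp in counts]   # eq 4d per gene
    mean_ppv = sum(ppv_per_gene) / n_a                      # eq 5
    # Genes assigned correctly at least once are those with TP > 0.
    correct = [p for p in ppv_per_gene if p > 0]
    n_c = len(correct)
    a0 = n_c / n_a                                          # fraction correct at least once
    ac = sum(correct) / n_c if n_c else 0.0                 # eq 7
    return mean_ppv, a0, ac


if __name__ == "__main__":
    # Hypothetical toy counts: four linked genes with (TP, FP) assignments.
    toy = [(1, 9), (2, 2), (0, 5), (3, 0)]
    mean_ppv, a0, ac = ppv_decomposition(toy)
    print(mean_ppv, a0, ac, a0 * ac)
```

In this toy case three of the four genes are correct at least once, so A0 is high even though most individual assignments are wrong, which is the optimistic bias the text attributes to A0 when it is used in place of PPV.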
In order to develop performance measures, let $i$ be the number of categories that contain the target, $I$, whose function is to be predicted; let $j(I, J)$ be the number of categories that contain a gene $J$ whose profile correlation with $I$ meets the threshold $C^*$; and let $k(I, J)$ denote the number of common categories, where $0 \le k(I, J) \le \min(i, j)$. The target gene is therefore correctly assigned to TP = $k$ categories, and incorrectly assigned to the remaining FP = $j - k$ categories. Also TN = $T - i - j + k$ and FN = $i - k$, where $T = 133$ is the total number of pathways. Consequently, the $PPV_I(J)$ with which gene $I$ is assigned using linked gene $J$ is

$$PPV_I(J) = \frac{k}{j} \qquad (9)$$

Note that the maximum $PPV_I(J)$ is not necessarily 1, but $\min(i, j)/j$. For $j > i$, $PPV_I(J) < 1$, whereas when $i > j$, $PPV_I(J)$ can become 1 when the pathways of $J$ are a subset of those of $I$. The positive predictive value for gene $I$ is obtained by taking sums over all genes to which it is correlated:

$$PPV_I = \frac{\bigcup_{J=1}^{G(I)} k(I, J)}{\bigcup_{J=1}^{G(I)} j(I, J)} = \frac{\bigcup_{J=1}^{N_c(I)} k(I, J)}{\bigcup_{J=1}^{G(I)} j(I, J)} \qquad (10)$$

where $G(I)$ is the number of genes correlated with gene $I$ and $N_c(I)$ is the subset of genes in $G(I)$ that share at least 1 category with gene $I$. Here the union symbol is used instead of a sum to indicate avoidance of double counting when a category has more than a single gene linked to the target gene. $A_C$ is given by substituting eq 10 into eq 7.

Correlation Enrichment

Suppose that the target gene $I$ is correlated with $g_I$ other genes ($C \ge C^*$), and let $m_1, m_2, \ldots, m_r$ be the number of correlated genes in categories $C_1, C_2, \ldots, C_r$, where $r$ is related to $g_I$ by an inequality, the equality holding only when each gene is in one category. Further, let $C'_1, C'_2, \ldots, C'_{T_I}$ denote the $T_I$ categories the target gene is in. For each of the $r$ categories that have 1 or more genes meeting the correlation threshold with $I$, define a weighted sum score, $s_v$:

$$s_v = \sum_{j=1}^{m_v}\left[-\log P_{I, g_j}\right]^{\alpha}, \qquad v = 1, \ldots, r \qquad (11)$$

where $\alpha$ is a positive integer.
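The enrichment score of eq 11 and the ranking step of this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary inputs and the toy gene and category labels are hypothetical; the exponent is written as `alpha`; and the natural logarithm is used because the text does not state a base (a change of base rescales every score by the same factor, so the ranking is unaffected). The defaults alpha = 4 and r0 = 3 follow the values reported in this section.

```python
from collections import defaultdict
from math import log


def rank_categories(link_probs, gene_categories, alpha=4, r0=3):
    """Score each category as the sum of (-log P)**alpha over linked genes in it.

    link_probs: {gene: chance probability P for the target/gene profile pair},
                restricted to genes that meet the correlation threshold.
    gene_categories: {gene: set of categories the gene is annotated in}.
    Returns the top r0 categories and the full score table.
    """
    scores = defaultdict(float)
    for gene, p in link_probs.items():
        weight = (-log(p)) ** alpha          # alpha = 0 reduces to counting genes
        for category in gene_categories.get(gene, ()):
            scores[category] += weight
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return ranked[:r0], dict(scores)


def ce_ppv(assigned, target_categories):
    """Fraction of assigned categories in which the target is annotated."""
    tp = len(set(assigned) & set(target_categories))
    return tp / len(assigned) if assigned else 0.0


if __name__ == "__main__":
    # Hypothetical toy input: three linked genes and their categories.
    probs = {"g1": 1e-12, "g2": 1e-10, "g3": 1e-8}
    cats = {"g1": {"A"}, "g2": {"A", "B"}, "g3": {"C"}}
    top, scores = rank_categories(probs, cats, alpha=4, r0=2)
    print(top, ce_ppv(top, {"A"}))
```

Raising the log-probability to a power before summing lets a few strongly correlated genes outweigh many weakly correlated ones, which is the weighting behavior the section describes.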
Thus a linked pathway is weighted by a combination of the number of genes in the pathway that exceed the threshold and the similarity of those genes to the one being tested. Unweighted ranking, in which only the number of genes is used, is the special case in which the exponent $\alpha$ in eq 11 is 0. Tests using different values of $\alpha$ indicate that $\alpha = 4$ is optimal. $P_{I,g_j}$ is calculated from equation (1) using the profile of gene $I$ and that of gene $g_j$ in category $C_v$. The category scores $s_v$ are ranked in descending order and the target gene is allocated to the top $r_0$ categories. The number of true positives is the size of the intersection between the $T_I$ categories the target is in and these $r_0$ categories. Then $FP = r_0 - TP$, $FN = T_I - TP$, $TN = T - r_0 - T_I + TP$, and

$$PPV_I = \frac{TP}{TP + FP} = \frac{1}{r_0} \sum_{v=1}^{r_0} \sum_{j=1}^{T_I} \delta(C_v, C'_j), \qquad \text{where } \delta(C_v, C'_j) = \begin{cases} 0, & C_v \ne C'_j \\ 1, & C_v = C'_j \end{cases} \qquad (12)$$

and $0 \le PPV_I \le 1$.

An analysis of KEGG and GO indicates that the average number of functional categories per gene is between 2 and 3. It would therefore seem reasonable to take $r_0 = 3$ for KEGG and GO, where a relatively large number of categories is available; i.e. we allocate to at most 3 categories. We use the more stringent condition $r_0 = 1$ for the relatively coarse grained COG ontology. For COG categories, $PPV_I = 1$ or $0$, and

$$PPV = A_0 = \frac{N_c}{N_a} \qquad (13)$$

Finally, it is useful to ask how much each method reduces the uncertainty in a population of unannotated genes. This can be answered by estimating the Shannon information; i.e. the entropy drop [12].

List of abbreviations
SGA: Standard Guilt by Association
CE: Correlation Enrichment
GO: Gene Ontology
KEGG: Kyoto Encyclopedia of Genes and Genomes
COG: Clusters of Orthologous Groups
PPV: Positive Predictive Value
SEN: Sensitivity
SPE: Specificity

Acknowledgements
This work has been supported by NIGMS, NIH (P20 GM66401). Scripts related to this study are available upon request.

References
1. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events.
Nature 1999, 402(6757):86-90.
2. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402(6757):83-86.
3. Yanai I, Derti A, DeLisi C: Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proc Natl Acad Sci U S A 2001, 98(14):7940-7945.
4. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 1999, 96(6):2896-2901.
5. Date SV, Marcotte EM: Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 2003, 21(9):1055-1062.
6. Gaasterland T, Ragan MA: Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb Comp Genomics 1998, 3(4):199-217.
7. Huynen M, Snel B, Lathe W, 3rd, Bork P: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 2000, 10(8):1204-1210.
8. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285-4288.
9. Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19(12):1524-1530.
10. Relationship between mutual information MI and probability P. [http://visant.bu.edu/jiewu/MI.htm].
11. Aravind L: Guilt by association: contextual information in genome analysis. Genome Res 2000, 10(8):1074-1077.
12. Entropy Change in the unannotated gene population. [http://visant.bu.edu/jiewu/entropy.htm].
13. McDermott J, Samudrala R: Enhanced functional information from predicted protein networks. Trends Biotechnol 2004, 22(2):60-62; discussion 62-63.
14. Lee I, Date SV, Adai AT, Marcotte EM: A probabilistic functional network of yeast genes.
Science 2004, 306(5701):1555-1558.
15. Hu Z, Mellor J, Wu J, DeLisi C: VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics 2004, 5(1):17.
16. COG predictions using CE. [http://visant.bu.edu/jiewu/COGpredictions.htm].
17. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302(5643):249-255.
18. At the thresholds in [16], only two clusters are not cliques. The least cliquish node in one has a clustering coefficient of 0.91, and the least cliquish node in the other has a clustering coefficient of 0.74, where the clustering coefficient of a node is the total number of observed links between its neighbors normalized by the maximum possible number of links between neighbors.
19. Snel B, Huynen MA: Quantifying modularity in the evolution of biomolecular systems. Genome Res 2004, 14(3):391-397.
20. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN et al: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4(1):41.
21. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29(1):22-28.
22. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28(1):27-30.
23. Creating the gene ontology resource: design and implementation. Genome Res 2001, 11(8):1425-1433.

Figure Legends

Figure 1. Measures of performance as a function of coverage using SGA to allocate unknown genes to KEGG pathways.

Figure 2. Performance as a function of coverage with allocation using CE. The measures are defined in Material and Methods.

Figure 3. PPV as a function of GO category from broadest (level 3) to most specific (level 10).
The labels appearing along the lowest curve, $C = 0.2$ ($P = 10^{-4}$), are fractional coverages; i.e. the number of annotated genes linked to at least one other gene, normalized by the total number of annotated genes at the depth shown. The other curves display only the numerators, the denominators being unchanged.

Figure 4. An all-against-all screen shot, at $C^* = 0.55$, of the 4286 orthologs in the COG database using SGA. 926 genes (677 annotated; 249 unannotated) are linked to at least one annotated gene. Each gene is unambiguously assigned to a unique COG functional category. Of these 677 annotated genes, 463 are correctly assigned. In total, 1843 of the 4286 COGs are unannotated in the COG classification. (A) Complete 926-gene network. (B) Meta-network of genes from (A). Each group represents a set of genes allocated to a COG functional category. (C) Detail of functional category H, coenzyme metabolism. (D) Of the 926 linked genes, 82 are in category H. 62 of them are true positives (green) and 8 are predictions (red). The remaining 12 are annotated in a different functional category and are therefore putative false positives. The minimum PPV for category H is therefore 62/74 = 0.84; the averaged PPV for all categories is 68%. Refer to the COG web site for definitions of categories. Coverage = 62/142 = 44%, neglecting predicted genes.

Figure 5. Expanded view of the true positive and prediction cluster in Figure 4C, showing two strikingly dense clusters of size 11 and 7, respectively.

Figure 6. Phylogenetic profiles of an 11-member cluster of orthologs across 66 genomes uncovered by CE. Green represents absence and red, presence.

Figure 7. Comparison of accuracy against coverage of COG ontology allocations, illustrating the effect of definition on assessment of performance. The method in both cases is SGA. The upper curve is the number of links between gene pairs having the same function (COG category H) normalized by the number of links formed by genes with that function [17].
The lower curve uses PPV (correctly predicted pathways normalized by total predicted pathways) as the accuracy measure.

Figure 8. Number of clusters (size >= 3) as a function of $C^*$.

Figure 9. Evolutionarily conserved fully connected clusters. Edge coding: black, $C^* = 0.9$ ($P^* = 10^{-18}$); yellow, $C^* = 0.81$ ($P^* = 10^{-16}$); blue, $C^* = 0.71$ ($P^* = 10^{-14}$). Green nodes are correctly annotated; red nodes are unannotated. (a) One of 4 cliques uncovered at $C^* = 0.9$. All genes are known and are in the flagella assembly pathway. (b) One of 9 clusters at $P^* = 10^{-16}$, ranging in size from 4 to 13 nodes. All genes are annotated to the KEGG histidine metabolism pathway and to the COG amino acid metabolism and transport category. (c)-(e) are examples of mixed annotated-unannotated clusters, with the annotated sets homogeneous in function. Lower bounds on PPV for assigned functions are 79% ($C^* = 0.71$) and 89% ($C^* = 0.81$).

Figure 10. Pressure to reduce pleiotropy. The average number of pathways per linked gene decreases for stringent $C^*$, while the average number of correctly predicted pathways per linked gene increases.

Tables and Captions

Table 1. Pathway allocation performance using exactly matching phylogenetic profiles. AA (UU) denotes pairs in which both genes are annotated (unannotated). N is the number of links. G is the number of genes that form those links (unannotated genes in AU). N* is the number of links between genes that share at least one pathway; G* is the number of such genes. PPV, A0, AC, sensitivity and specificity are defined as in Material and Methods.

      N      N*      G     G*
AA    288    254     249   234
AU    271    (239)   159   (149)
UU    1090   NA      603   NA

PPV   A0    AC    SEN   SPC
85%   94%   90%   91%   99%

Table 2. Performance evaluation for different methods using entropy change. G is the number of genes that can be assigned to at least one pathway. Gc is the number of genes assigned correctly to at least one pathway. Gp is the average number of pathways assigned per gene.
Gpc is the average number of correct pathways for correctly assigned genes. H is the entropy drop in the system. Numbers in parentheses are estimated from annotated genes. A and U denote annotated and unannotated genes.

Method/Threshold           G      Gc       Gp      Gpc      H
Identical profiling   A    249    234      2.03    1.53     1277.15
Identical profiling   U    159    (149)    2.26    (1.70)
SGA, 2                A    1355   1321     107.4   2.71     2777.63
SGA, 2                U    2897   (2824)   84.6    (2.13)
SGA, 4                A    1273   1136     62.81   2.45     2537.98
SGA, 4                U    2170   (1936)   45.74   (1.78)
SGA, 8                A    801    634      15.76   1.71     1979.48
SGA, 8                U    723    (572)    17.64   (1.91)
CE, 0                 A    1368   765      3.0     1.29     6238.50
CE, 0                 U    2918   (1632)   3.0     (1.29)
CE, 8                 A    801    532      3.0     1.33     2128.69
CE, 8                 U    723    (480)    3.0     (1.33)
CE, 15                A    178    168      3.0     1.49     151.0
CE, 15                U    27     (25)     3.0     (1.49)