SUPPLEMENTARY DATA of Assignment of Orphan Enzymes to Bacterial Metabolic Pathways: an application of genomic location information

Table S1. The genomic start and end positions of 139 folding domain boundaries of E. coli K12.

ID   Start    End      ID   Start    End      ID   Start    End
1    4638329  4638425  48   3309190  3309437  95   1986569  1986740
2    4626570  4626878  49   3242763  3243126  96   1959819  1959996
3    4609026  4609419  50   3202611  3202716  97   1873600  1873697
4    4597464  4597718  51   3169855  3169901  98   1821309  1821539
5    4552384  4552599  52   3161503  3161737  99   1755681  1755745
6    4482303  4482463  53   3155472  3155672  100  1727018  1727111
7    4422409  4422539  54   3151445  3151585  101  1723944  1724047
8    4417648  4418003  55   3090850  3090959  102  1667616  1667723
9    4404008  4404213  56   3079657  3079935  103  1570069  1570431
10   4380341  4380666  57   3057347  3057775  104  1506766  1506858
11   4333613  4333717  58   3031635  3031679  105  1447042  1447100
12   4254489  4254660  59   2938121  2938165  106  1338118  1338267
13   4251911  4252066  60   2929832  2929887  107  1287847  1287897
14   4213232  4213501  61   2924218  2924330  108  1285749  1286310
15   4205555  4211257  62   2920122  2920557  109  1276841  1277180
16   4173049  4173967  63   2890601  2890679  110  1268242  1268391
17   4172070  4172099  64   2889920  2890236  111  1262723  1262937
18   4164308  4170080  65   2870842  2871036  112  1250061  1250289
19   4156246  4156513  66   2828800  2828797  113  1246599  1246919
20   4152870  4153024  67   2820661  2820730  114  1194174  1194346
21   4151121  4151719  68   2812755  2812905  115  1164908  1165308
22   4131529  4131858  69   2802483  2802837  116  1143590  1143725
23   4130290  4130639  70   2752786  2752918  117  1124549  1124785
24   4106537  4106857  71   2720635  2720749  118  1121830  1121936
25   4099453  4099713  72   2712252  2712461  119  1078105  1078528
26   4056057  4056430  73   2687528  2687693  120  992457   992500
27   4054362  4054648  74   2650309  2650516  121  939943   940182
28   3962252  3962388  75   2640866  2641151  122  879080   879077
29   3937242  3937208  76   2590984  2591094  123  785908   786066
30   3911691  3911853  77   2547428  2547668  124  783108   783105
31   3851011  3851576  78   2537605  2537739  125  702834   703167
32   3848749  3848825  79   2531402  2531786  126  695499   696736
33   3825314  3825483  80   2494587  2494943  127  674006   674241
34   3812410  3812517  81   2446462  2446628  128  628936   629117
35   3738981  3739132  82   2435873  2435972  129  609311   609477
36   3728788  3729154  83   2420623  2420671  130  507304   507442
37   3727394  3727466  84   2371300  2371294  131  466536   466636
38   3723341  3723436  85   2289169  2289380  132  429700   429829
39   3671336  3671385  86   2263317  2263472  133  395511   395863
40   3575615  3575754  87   2244791  2245085  134  330720   331595
41   3544221  3544581  88   2241674  2241932  135  223408   229167
42   3523445  3523611  89   2230750  2230900  136  174882   175107
43   3497223  3497470  90   2176586  2176843  137  149601   149715
44   3489642  3489747  91   2175226  2175534  138  141967   142008
45   3431582  3431712  92   2166330  2166736  139  134212   134388
46   3371598  3371720  93   2063788  2064329
47   3348481  3348711  94   2022615  2022659

Results S1: Applications on each assigned enzyme through self-rank validation

To evaluate the effectiveness of the proposed method, we first performed self-rank validation on the assigned enzymes in E. coli as follows. The current formulation does not consider enzymes assigned to multiple genes whose products form a protein complex to catalyze the corresponding reaction, which leaves 416 candidates in total. The self-rank of an enzyme-encoding operon is the rank that our algorithm gives this operon within a candidate operon list that includes the leave-out operon itself (see the detailed framework in Supplementary Figure S2). It is worth mentioning that the candidate list for each leave-out operon used in our model is identified by two enzyme function prediction programs, rather than by trivially taking all proteins without an assigned function (see METHODS AND MATERIALS for how the candidate lists are generated), because the more reliable the candidate list is, the better the metabolic pathway assignment performs.
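As an illustration of the self-rank definition above, the following minimal Python sketch ranks a leave-out operon within its candidate list and summarizes a set of ranks as top-k coverage. The candidate scores, the operon names and the tie-handling rule are assumptions made for this example only; in the actual method the scores would come from the model in formula (4), which is not reproduced here.

```python
from typing import Dict, List


def self_rank(scores: Dict[str, float], left_out: str) -> int:
    """Return the 1-based rank of the leave-out operon in its candidate list.

    `scores` maps each candidate operon (including the leave-out one) to a
    hypothetical assignment score, where a higher score is better.
    Assumption: tied candidates share the best rank among them.
    """
    target = scores[left_out]
    # Rank = 1 + number of candidates that score strictly higher.
    return 1 + sum(1 for s in scores.values() if s > target)


def top_k_coverage(ranks: List[int], k: int) -> float:
    """Fraction of leave-out operons whose self-rank is within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Hypothetical candidate list for one leave-out operon, "opC".
    candidate_scores = {"opA": 0.9, "opB": 0.4, "opC": 0.7}
    print(self_rank(candidate_scores, "opC"))

    # Hypothetical self-ranks collected over several leave-out operons.
    example_ranks = [1, 2, 5, 11]
    for k in (1, 5, 10):
        print(k, top_k_coverage(example_ranks, k))
```

Computing the coverage for k = 1, ..., 10 over all leave-out operons gives a cumulative curve of the kind used to summarize a self-rank validation.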
Then, based on our model in formula (4), we obtain the self-rank of each of the 416 assigned enzymes in E. coli. Figure S1a shows the cumulative coverage across the top-10 predicted enzyme-encoding candidates. We find that 68% of the leave-out operons have rank one, and the top-10 predictions together cover 96% of the assigned operons. As far as we know, ADOMETA, which integrates co-expression and co-evolution information, covers only 60% of the assigned operons in its top-10 predictions for the corresponding reactions [1], indicating a significant improvement in orphan enzyme assignment by ORENGE (Figure S1b). Meanwhile, the improved performance in self-rank validation shows that the genomic location information in our model is a substantial complement to other classical information (see relevant details in Supplementary Result S2 and Figure S3).

Figure S1: (a) The performance of self-rank validation for 416 leave-out pathway holes in E. coli; and (b) comparison of the percentage of currently assigned E. coli genes falling in the top 10 predictions for their corresponding leave-out pathway holes, between ORENGE and ADOMETA.

Figure S2. A schematic diagram for the self-rank validation process.

Result S2:

Based on the above self-rank validation experiments, we found that the more metabolic pathways a leave-out enzyme is involved in, the better its self-rank validation performance. Specifically, only 62% of the leave-out enzymes rank first in the corresponding candidate list if they are involved in one known pathway, while this number rises to 83% for enzymes involved in at least four pathways (Figure S3a). A similar phenomenon appears when we consider the number of relevant operons of each leave-out operon, that is, the operons that belong to its involved pathways and are also located in its specific chromosomal folding domain (Figure S3b).
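The notion of relevant operons can be made concrete with a short Python sketch. It rests on several simplifying assumptions that are not stated in the text: a folding domain is taken to be the interval between two consecutive boundary start positions, the chromosome is treated as circular so that positions before the first boundary fall into the last domain, and each operon is represented by a single coordinate. The three boundary positions are the start values of boundaries 139, 138 and 137 in Table S1; the operon names, coordinates and pathway identifiers are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Set, Tuple

# name -> (genomic coordinate, set of pathway identifiers); hypothetical layout
Operons = Dict[str, Tuple[int, Set[str]]]


def domain_index(pos: int, boundaries: List[int]) -> int:
    """Index of the folding domain containing `pos`.

    `boundaries` is a sorted list of boundary start positions. Domain i spans
    from boundaries[i] up to the next boundary; positions before the first
    boundary wrap around to the last domain (circular-chromosome assumption).
    """
    return (bisect_right(boundaries, pos) - 1) % len(boundaries)


def count_relevant_operons(left_out: str, operons: Operons,
                           boundaries: List[int]) -> int:
    """Count operons sharing a pathway AND a folding domain with `left_out`."""
    pos, pathways = operons[left_out]
    dom = domain_index(pos, boundaries)
    return sum(
        1
        for name, (other_pos, other_pathways) in operons.items()
        if name != left_out
        and other_pathways & pathways                       # shared pathway
        and domain_index(other_pos, boundaries) == dom      # same domain
    )


if __name__ == "__main__":
    boundaries = [134212, 141967, 149601]  # start positions from Table S1
    operons: Operons = {
        "opX": (135000, {"pwy1"}),
        "opY": (140000, {"pwy1", "pwy2"}),  # same domain, shared pathway
        "opZ": (145000, {"pwy1"}),          # shared pathway, other domain
        "opW": (139000, {"pwy3"}),          # same domain, no shared pathway
    }
    print(count_relevant_operons("opX", operons, boundaries))
```

Grouping the leave-out operons by this count is one way to obtain a breakdown like the one in Figure S3b.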
58% of all the 416 enzymes have rank one in the corresponding candidate list if they do not have any relevant operons based on the current metabolic annotation, and this percentage increases by 20% if the enzymes have at least three relevant operons. These observations further support our preliminary hypotheses that (i) the more metabolic pathways an encoding operon is involved in, the narrower its possible genomic location is; and (ii) the relevant operons do help assign the leave-out operons back once the chromosomal folding structure is taken into account.

Figure S3. (a) The relationship between assignment performance and the number of involved pathways of the leave-out operons, where the red, blue and green bars represent the coverage of leave-out operons by top-1, top-5 and top-10 predictions, respectively; and (b) the similar relationship between assignment performance and the number of relevant operons of the leave-out enzymes.

Method S1: Assessment of evolution conservation of genes with their metabolic pathway neighbors

We took E. coli as our target genome and picked one species from each genus of proteobacteria as reference genomes, obtaining 216 genomes in total from NCBI (2011-11-01). Then, for each gene in E. coli, we predicted its orthologous genes in the reference genomes with a recently published program named GOST [2]. The evolution conservation score is defined as follows:

$$EC(x, P) = \max_{y \in P} \{N(x, y)\} \qquad (1)$$

where x is the query gene and P is the pathway that includes gene x; for each gene y in P, N(x, y) is the number of times x and y co-occur in a kind of transcriptional operon package, called an uber-operon [3], across the 216 reference genomes.

Result S3: Comparison of the impacts of different information to the pathway-hole assignment problem

Expression and phylogeny information have been proven to be useful for assigning genes to orphan enzymes [4-6].
Here we systematically compare genomic location information with these two kinds of information in terms of pathway-hole self-rank validation performance (see METHODS AND MATERIALS for how co-expression information is used and Method S1 for co-evolution information). We conclude that each kind of information has its own bias and that location information is the strongest of the three clues for this problem. Figure S4a shows that, when the top-X candidates (X = 1, 2, ..., 10) are picked from each candidate list, location information (the green line) covers clearly more leave-out enzymes than the other two (the yellow and red lines). If we consider only the top-1 candidate as the 'correct' one, 281, 202 and 199 holes are assigned correctly using location, expression and phylogeny information, respectively (Figure S4b). Of the correct cases based on phylogeny and expression, 191 and 171, respectively, are also included in our prediction. In addition, 49 correct cases are unique to location, many more than for the other two (28 and 7).

Figure S4: (a) The comparison of performance of using different information or information combinations; and (b) comparison of correct assignments by location information against expression and phylogeny data.

It is natural to ask what happens if we combine all the association information. The answer is that the performance improves, owing to the complementarity of the three features. Here we normalize the relative strength of the different association scores using a machine learning method, AdaBoost [7]. This method can generate alternating decision trees as classifiers and does not assume independence of the associations, which means that each subsequent classifier depends on the previous classifiers and places more weight on the instances they misclassified.
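The reweighting idea behind AdaBoost can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the setup used in this work: it boosts simple one-feature threshold classifiers (decision stumps) instead of alternating decision trees, labels are +1 for a correct candidate and -1 otherwise, and the three feature values per candidate (location, expression and phylogeny scores) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    """Predict `polarity` if the chosen feature reaches the threshold."""
    feature, threshold, polarity = stump
    return polarity if x[feature] >= threshold else -polarity


def fit_adaboost(X: List[Sequence[float]], y: List[int],
                 rounds: int = 3) -> List[Tuple[float, Stump]]:
    """Fit a weighted ensemble of decision stumps with discrete AdaBoost."""
    n = len(X)
    weights = [1.0 / n] * n
    model: List[Tuple[float, Stump]] = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best, best_err = None, float("inf")
        for feature in range(len(X[0])):
            for threshold in sorted({x[feature] for x in X}):
                for polarity in (1, -1):
                    stump = (feature, threshold, polarity)
                    err = sum(w for w, xi, yi in zip(weights, X, y)
                              if stump_predict(stump, xi) != yi)
                    if err < best_err:
                        best, best_err = stump, err
        # Clamp the error so the classifier weight stays finite.
        err = min(max(best_err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, best))
        # Increase the weight of misclassified instances, then renormalize.
        weights = [w * math.exp(-alpha * yi * stump_predict(best, xi))
                   for w, xi, yi in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model: List[Tuple[float, Stump]], x: Sequence[float]) -> int:
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Hypothetical (location, expression, phylogeny) scores per candidate.
    X = [(0.9, 0.2, 0.1), (0.8, 0.7, 0.3), (0.2, 0.6, 0.9), (0.1, 0.1, 0.2)]
    y = [1, 1, -1, -1]
    model = fit_adaboost(X, y)
    print([predict(model, x) for x in X])
```

Because each round reweights the training instances according to the errors of the previous round, the classifiers are built sequentially and are not independent of one another.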
The performance of the AdaBoost method corresponds to the twisted black lines in Figure S4a, which suggests that combining different information improves the performance, with the different combinations having basically the same effect.

REFERENCES

1. Orth, J.D. and B.O. Palsson, Systematizing the generation of missing metabolic knowledge. Biotechnol Bioeng, 2010. 107(3): p. 403-12.
2. Li, G., et al., A new framework for identifying cis-regulatory motifs in prokaryotes. Nucleic Acids Res, 2011. 39(7): p. e42.
3. Che, D., et al., Detecting uber-operons in prokaryotic genomes. Nucleic Acids Res, 2006. 34(8): p. 2418-27.
4. Khatri, P., M. Sirota, and A.J. Butte, Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 8(2): p. e1002375.
5. Kharchenko, P., et al., Identifying metabolic enzymes with multiple types of association evidence. BMC Bioinformatics, 2006. 7: p. 177.
6. Peregrin-Alvarez, J.M., C. Sanford, and J. Parkinson, The conservation and evolutionary modularity of metabolism. Genome Biol, 2009. 10(6): p. R63.
7. Zhang, T., Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 2004. 32(1): p. 56-85.