Supplementary Data

SUPPLEMENTARY DATA of Assignment of Orphan Enzymes to Bacterial Metabolic
Pathways: an application of genomic location information
Table S1. The genomic start and end positions of 139 folding domain boundaries of E. coli K12.
ID     Start      End
1      4638329    4638425
2      4626570    4626878
3      4609026    4609419
4      4597464    4597718
5      4552384    4552599
6      4482303    4482463
7      4422409    4422539
8      4417648    4418003
9      4404008    4404213
10     4380341    4380666
11     4333613    4333717
12     4254489    4254660
13     4251911    4252066
14     4213232    4213501
15     4205555    4211257
16     4173049    4173967
17     4172070    4172099
18     4164308    4170080
19     4156246    4156513
20     4152870    4153024
21     4151121    4151719
22     4131529    4131858
23     4130290    4130639
24     4106537    4106857
25     4099453    4099713
26     4056057    4056430
27     4054362    4054648
28     3962252    3962388
29     3937242    3937208
30     3911691    3911853
31     3851011    3851576
32     3848749    3848825
33     3825314    3825483
34     3812410    3812517
35     3738981    3739132
36     3728788    3729154
37     3727394    3727466
38     3723341    3723436
39     3671336    3671385
40     3575615    3575754
41     3544221    3544581
42     3523445    3523611
43     3497223    3497470
44     3489642    3489747
45     3431582    3431712
46     3371598    3371720
47     3348481    3348711
48     3309190    3309437
49     3242763    3243126
50     3202611    3202716
51     3169855    3169901
52     3161503    3161737
53     3155472    3155672
54     3151445    3151585
55     3090850    3090959
56     3079657    3079935
57     3057347    3057775
58     3031635    3031679
59     2938121    2938165
60     2929832    2929887
61     2924218    2924330
62     2920122    2920557
63     2890601    2890679
64     2889920    2890236
65     2870842    2871036
66     2828800    2828797
67     2820661    2820730
68     2812755    2812905
69     2802483    2802837
70     2752786    2752918
71     2720635    2720749
72     2712252    2712461
73     2687528    2687693
74     2650309    2650516
75     2640866    2641151
76     2590984    2591094
77     2547428    2547668
78     2537605    2537739
79     2531402    2531786
80     2494587    2494943
81     2446462    2446628
82     2435873    2435972
83     2420623    2420671
84     2371300    2371294
85     2289169    2289380
86     2263317    2263472
87     2244791    2245085
88     2241674    2241932
89     2230750    2230900
90     2176586    2176843
91     2175226    2175534
92     2166330    2166736
93     2063788    2064329
94     2022615    2022659
95     1986569    1986740
96     1959819    1959996
97     1873600    1873697
98     1821309    1821539
99     1755681    1755745
100    1727018    1727111
101    1723944    1724047
102    1667616    1667723
103    1570069    1570431
104    1506766    1506858
105    1447042    1447100
106    1338118    1338267
107    1287847    1287897
108    1285749    1286310
109    1276841    1277180
110    1268242    1268391
111    1262723    1262937
112    1250061    1250289
113    1246599    1246919
114    1194174    1194346
115    1164908    1165308
116    1143590    1143725
117    1124549    1124785
118    1121830    1121936
119    1078105    1078528
120    992457     992500
121    939943     940182
122    879080     879077
123    785908     786066
124    783108     783105
125    702834     703167
126    695499     696736
127    674006     674241
128    628936     629117
129    609311     609477
130    507304     507442
131    466536     466636
132    429700     429829
133    395511     395863
134    330720     331595
135    223408     229167
136    174882     175107
137    149601     149715
138    141967     142008
139    134212     134388
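As an illustration of how the boundary coordinates in Table S1 might be used, the sketch below assigns a genomic position to a folding domain. The function names, the reduction of each boundary to its midpoint and the wrap-around for the circular chromosome are assumptions made for this example, not the procedure described in the paper; only three boundaries from the table are included.

```python
from bisect import bisect_right

# (start, end) of three boundaries copied from Table S1 (IDs 139, 138 and 137).
BOUNDARIES = [(134212, 134388), (141967, 142008), (149601, 149715)]


def domain_index(position, boundaries=BOUNDARIES):
    """Index of the folding domain that contains a genomic position.

    Assumptions (not from the paper): each boundary is reduced to its
    midpoint, a domain is the stretch between two consecutive midpoints,
    and the stretches before the first and after the last midpoint are
    merged into domain 0 because the chromosome is circular.
    """
    midpoints = sorted((start + end) // 2 for start, end in boundaries)
    return bisect_right(midpoints, position) % len(midpoints)


def same_domain(pos_a, pos_b, boundaries=BOUNDARIES):
    """True if both positions fall into the same (hypothetical) domain."""
    return domain_index(pos_a, boundaries) == domain_index(pos_b, boundaries)
```

With the full table loaded into `boundaries`, `same_domain` could serve as the test of whether two operons share a folding domain.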
Result S1: Application to each assigned enzyme through self-rank validation
To evaluate the effectiveness of the proposed method, we first performed self-rank validation on the
assigned enzymes in E. coli, as follows. In the current formulation we do not consider enzymes encoded
by multiple genes, whose products form a protein complex that catalyzes the corresponding reaction;
this leaves 416 candidates in total. The self-rank of an enzyme-encoding operon is the rank that our
algorithm gives to that operon within a candidate operon list that includes the leave-out operon itself
(see the detailed framework in Supplementary Figure S2). It is worth mentioning that the candidate list
for each leave-out operon is produced by two enzyme function prediction programs rather than by
trivially taking all proteins without an assigned function (see METHODS AND MATERIALS for how the
candidate lists are generated), because the more reliable the candidate list is, the better the
metabolic pathway assignment performs. Then, based on our model in formula (4), we obtain the
self-rank of each of the 416 assigned enzymes in E. coli. Figure S1a shows the cumulative coverage
across the top-10 predicted enzyme-encoding candidates.
We find that 68% of the leave-out operons are ranked first and that the top-10 predictions cover
96% of the assigned operons. As far as we know, ADOMETA, which integrates co-expression and
co-evolution information, covers only 60% of the assigned operons in its top-10 predictions for the
corresponding reactions [1], indicating a marked improvement in orphan enzyme assignment by ORENGE
(Figure S1b). Moreover, the improved performance in self-rank validation shows that the genomic
location information in our model is a substantial complement to other classical information (see
Supplementary Result S2 and Figure S3 for details).
Figure S1: (a) the performance of self-rank validation for 416 leave-out pathway holes in E. coli; and (b)
comparison of the percentage of currently assigned E. coli genes, falling in the top 10 predictions for their
corresponding leave-out pathway holes, between ORENGE and ADOMETA.
Figure S2. A schematic diagram for the self-rank validation process.
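The self-rank and the cumulative top-k coverage described in Result S1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy scores and the tie-breaking by insertion order are assumptions, and the scoring model itself (formula (4) of the main text) is not reproduced here.

```python
def self_rank(scores, left_out):
    """Rank (1 = best) of the leave-out operon among its candidates.

    `scores` maps each candidate operon to its score; a higher score is
    assumed to be better, and ties keep their insertion order.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(left_out) + 1


def cumulative_coverage(ranks, max_k=10):
    """Fraction of leave-out operons with self-rank <= k, for k = 1..max_k."""
    total = len(ranks)
    return [sum(r <= k for r in ranks) / total for k in range(1, max_k + 1)]


# Invented example: three candidate operons for one pathway hole.
toy_scores = {"opA": 0.9, "opB": 0.4, "opC": 0.7}
toy_rank = self_rank(toy_scores, "opC")

# Invented self-ranks for five leave-out operons.
toy_coverage = cumulative_coverage([1, 1, 2, 5, 12])
```

Applied to the 416 self-ranks, the first and last entries of the returned list would correspond to the top-1 and top-10 coverage values plotted in Figure S1a.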
Result S2: Based on the self-rank validation experiments above, we found that the more metabolic
pathways a leave-out enzyme is involved in, the better its self-rank validation performance.
Specifically, only 62% of the leave-out enzymes rank first in the corresponding candidate list if they
are involved in one known pathway, whereas this figure rises to 83% for enzymes involved in at
least four pathways (Figure S3a). A similar pattern appears when we consider the number of relevant
operons of each leave-out operon, that is, the operons that belong to its involved pathways and are
also located in its chromosomal folding domain (Figure S3b). Among the 416 enzymes, 58% of those
without any relevant operon under the current metabolic annotation rank first in the corresponding
candidate list, and this percentage increases by 20% for enzymes with at least three relevant operons.
These observations further support our preliminary hypotheses that (i) the more metabolic pathways
constrain an enzyme-encoding operon, the narrower its possible genomic location is; and (ii) the
relevant operons do help to assign the leave-out operons back when the chromosomal folding
structure is taken into account.
Figure S3. (a) the relationship between assignment performance and the number of involved pathways of
the leave-out operons, where the red, blue and green bars represent the coverage of leave-out operons by
top-1, top-5 and top-10 predictions, respectively; and (b) the similar relationship between assignment
performance and the number of the relevant operons of the leave-out enzymes.
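The notion of relevant operons used in Result S2 can be made concrete with a short sketch. The function name and the two lookup tables (`pathways_of`, `domain_of`) are hypothetical, and "located in its folding domain" is read here as "has the same domain index"; the data are invented.

```python
def relevant_operons(target, pathways_of, domain_of):
    """Operons that share a pathway with `target` and lie in the same folding domain.

    pathways_of: dict mapping operon -> set of pathway identifiers
    domain_of:   dict mapping operon -> folding domain index
    """
    return {
        op
        for op in pathways_of
        if op != target
        and pathways_of[op] & pathways_of[target]  # at least one shared pathway
        and domain_of[op] == domain_of[target]     # same folding domain
    }


# Invented annotation for four operons.
toy_pathways = {
    "opA": {"glycolysis", "TCA"},
    "opB": {"TCA"},
    "opC": {"glycolysis"},
    "opD": {"purine"},
}
toy_domains = {"opA": 3, "opB": 3, "opC": 7, "opD": 3}

toy_relevant = relevant_operons("opA", toy_pathways, toy_domains)
```

Here `opC` shares a pathway with `opA` but lies in another domain, and `opD` lies in the same domain but shares no pathway, so only `opB` counts as relevant.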
Method S1: Assessment of the evolutionary conservation of genes with their metabolic pathway
neighbors
We took E. coli as the target genome and picked one species from each genus of proteobacteria as a
reference genome, obtaining 216 genomes in total from NCBI (2011-11-01). Then, for each gene
in E. coli, we predicted its orthologous genes in the reference genomes with a recently published
program named GOST [2], and defined the evolutionary conservation score as follows:
$$EV(x, P) = \max_{y \in P} \{ N(x, y) \} \qquad (1)$$
where x is the query gene and P is a pathway that includes gene x; for each gene y in P, N(x, y)
is the number of times x and y co-occur in a kind of transcriptional operon package, called an
uber-operon [3], across the 216 reference genomes.
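Equation (1) can be illustrated with a small sketch. Several details are assumptions made for the example rather than statements from the paper: uber-operons are given as sets of E. coli gene names (that is, orthologs are already mapped back to the query genome), a gene pair is counted at most once per reference genome, and y = x is excluded from the maximum.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(uber_operons_per_genome):
    """N(x, y): number of reference genomes in which x and y share an uber-operon."""
    counts = Counter()
    for uber_operons in uber_operons_per_genome:
        pairs = set()  # each pair is counted once per genome (assumption)
        for genes in uber_operons:
            pairs.update(combinations(sorted(genes), 2))
        counts.update(pairs)
    return counts


def ev_score(x, pathway, counts):
    """EV(x, P) = max over y in P (y != x) of N(x, y); 0 if P has no other gene."""
    return max(
        (counts[tuple(sorted((x, y)))] for y in pathway if y != x),
        default=0,
    )


# Invented data: three reference genomes, each a list of uber-operons.
toy_genomes = [
    [{"a", "b"}, {"c"}],
    [{"a", "b", "c"}],
    [{"a", "c"}],
]
toy_counts = cooccurrence_counts(toy_genomes)
```

In this toy data N(a, b) = 2, N(a, c) = 2 and N(b, c) = 1, so a pathway containing all three genes gives EV = 2 for gene a.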
Result S3: Comparison of the impact of different types of information on the pathway-hole
assignment problem
Expression and phylogeny information have been shown to be useful for assigning genes to
orphan enzymes [4-6]. Here we systematically compare genomic location information with these
two kinds of information in terms of pathway-hole self-rank validation performance (see
METHODS AND MATERIALS for how co-expression information is used and Method S1 for
co-evolution information). We conclude that each type of information has its own bias and that
location information is the strongest of the three clues for this problem.
Figure S4a shows that, when the top-X (X = 1, 2, ..., 10) candidates are picked from each candidate
list, location information (the green line) covers clearly more leave-out enzymes than the other
two (the yellow and red lines). If we consider only the top-1 candidate as the
‘correct’ one, 281, 202 and 199 holes are assigned correctly using
location, expression and phylogeny information, respectively (Figure S4b). Of the correct cases
based on phylogeny and expression, 191 and 171, respectively, are also recovered by our
location-based prediction. In addition, location yields 49 unique correct cases, far more than
expression and phylogeny (28 and 7, respectively).
Figure S4: (a) The comparison of performance of using different information or information combination;
and (b) comparison of correct assignment by location information against expression and phylogeny data.
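The overlap and "unique correct" counts reported for Figure S4b are set operations on the pathway holes that each information source assigns correctly at rank one. The sketch below shows the computation on invented hole identifiers; the function name and the data are assumptions, and the actual sets behind the reported numbers are not available here.

```python
def unique_correct(correct_by_source):
    """For each source, the holes it assigns correctly that no other source does."""
    unique = {}
    for name, holes in correct_by_source.items():
        others = set().union(
            *(h for other, h in correct_by_source.items() if other != name)
        )
        unique[name] = holes - others
    return unique


# Invented sets of correctly assigned pathway holes (top-1 only).
toy_correct = {
    "location": {1, 2, 3, 4, 5},
    "expression": {2, 3, 6},
    "phylogeny": {3, 4},
}

toy_unique = unique_correct(toy_correct)
toy_shared_with_location = {
    name: holes & toy_correct["location"]
    for name, holes in toy_correct.items()
    if name != "location"
}
```

As a consistency check on the reported figures, 199 - 191 = 8 phylogeny-based and 202 - 171 = 31 expression-based correct cases are not recovered by location, which leaves room for the 7 and 28 unique cases, respectively.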
It is natural to ask what happens if we combine all of the association information. Performance
does improve, owing to the complementarity of the three features. Here we
normalize the relative strengths of the different association scores using a machine learning method,
AdaBoost [7]. This method generates alternating decision trees as classifiers and does not
assume independence of the associations: each subsequent classifier is built in relation to the
previous ones, with emphasis on the instances they misclassified. The
performance of the AdaBoost method corresponds to the twisted black lines in Figure S4a, which
suggests that combining different information improves performance and that the combinations
have basically the same effect.
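The boosting idea described above, in which each new classifier concentrates on the instances its predecessors misclassified, can be sketched in a few lines. Note the substitution: the text uses alternating decision trees, whereas this example uses one-feature decision stumps as a simpler stand-in. The feature layout (location, expression and phylogeny scores), the labels and all function names are invented for illustration.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """+1 if polarity * (row[feature] - threshold) > 0, otherwise -1."""
    return 1 if polarity * (row[feature] - threshold) > 0 else -1


def best_stump(X, y, w):
    """Stump with the lowest weighted error: (error, feature, threshold, polarity)."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(
                    wi
                    for wi, row, yi in zip(w, X, y)
                    if stump_predict(row, feature, threshold, polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost(X, y, rounds=5):
    """Discrete AdaBoost; returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, f, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        # Raise the weight of misclassified instances, lower the others.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, f, thr, pol))
            for wi, row, yi in zip(w, X, y)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def combined_score(model, row):
    """Weighted vote of all stumps; a positive value means 'correct gene'."""
    return sum(a * stump_predict(row, f, t, p) for a, f, t, p in model)


# Invented training data: (location, expression, phylogeny) association scores.
toy_X = [(0.9, 0.2, 0.1), (0.8, 0.7, 0.3), (0.7, 0.1, 0.8),
         (0.3, 0.3, 0.2), (0.2, 0.6, 0.9), (0.1, 0.2, 0.1)]
toy_y = [1, 1, 1, -1, -1, -1]
toy_model = adaboost(toy_X, toy_y)
```

Candidates for a pathway hole could then be ranked by `combined_score`, so that the relative weight of each association score is learned from the data instead of being fixed by hand.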
REFERENCES
1. Orth, J.D. and B.O. Palsson, Systematizing the generation of missing metabolic knowledge. Biotechnol Bioeng, 2010. 107(3): p. 403-12.
2. Li, G., et al., A new framework for identifying cis-regulatory motifs in prokaryotes. Nucleic Acids Res, 2011. 39(7): p. e42.
3. Che, D., et al., Detecting uber-operons in prokaryotic genomes. Nucleic Acids Res, 2006. 34(8): p. 2418-27.
4. Khatri, P., M. Sirota, and A.J. Butte, Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 8(2): p. e1002375.
5. Kharchenko, P., et al., Identifying metabolic enzymes with multiple types of association evidence. BMC Bioinformatics, 2006. 7: p. 177.
6. Peregrin-Alvarez, J.M., C. Sanford, and J. Parkinson, The conservation and evolutionary modularity of metabolism. Genome Biol, 2009. 10(6): p. R63.
7. Zhang, T., Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 2004. 32(1): p. 56-85.