Figure S1.
Flowchart for the construction of directional RNA-seq libraries.
Figure S2. Correlation of the expression levels of all genes between the GAII and HiSeq platforms. Each dot represents a gene. The expression level is the logarithm of the NPKB value. A) Pearson correlation coefficient (PCC) of expression levels for HS15min between the GAII reads and the HiSeq reads. B) PCC of expression levels for HS15min between the GAII reads and the second HiSeq reads. C) PCC of expression levels for M-P4h between the GAII reads and the HiSeq reads. D) PCC of expression levels for M-P4h between the GAII reads and the second HiSeq reads. The duplicates for each sample come from the same biological sample sequenced twice on the HiSeq 2000 platform.
Figure S3. Distribution of the genes with more than the indicated percentage of their length covered by at least one read in the samples generated by Vivancos et al. [1]. Fewer than 60% of the genes have over 50% of their length covered by at least one read.
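The quantity plotted in Figure S3 can be illustrated with a minimal Python sketch. The function names and the toy per-base depths below are hypothetical and are not taken from the study; only the definition (the fraction of a gene's length covered by at least one read) comes from the caption.

```python
def covered_fraction(per_base_depth):
    """Fraction of a gene's positions covered by at least one read."""
    if not per_base_depth:
        return 0.0
    covered = sum(1 for depth in per_base_depth if depth >= 1)
    return covered / len(per_base_depth)


def share_of_genes_above(genes, threshold):
    """Share of genes whose covered fraction exceeds `threshold` (0 to 1)."""
    fractions = [covered_fraction(depths) for depths in genes.values()]
    return sum(1 for f in fractions if f > threshold) / len(fractions)


# Toy per-base read depths for three hypothetical genes (not real data).
toy_genes = {
    "geneA": [0, 0, 3, 5, 2, 0, 0, 0],  # 3 of 8 positions covered
    "geneB": [1, 1, 1, 1, 0, 1, 1, 1],  # 7 of 8 positions covered
    "geneC": [0, 0, 0, 0, 0, 0, 0, 0],  # no coverage
}
toy_share = share_of_genes_above(toy_genes, 0.5)
```

Evaluating `share_of_genes_above` over a grid of thresholds would give one point of the distribution per threshold.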
Figure S4. Distribution of the length of the uniquely mapped reads in the samples.
Figure S5. Position-dependent non-uniform read coverage along the operon b2628-b2627. Note the highly similar patterns of non-uniform coverage under different culture conditions and growth phases. Although no TSS is documented for this operon in RegulonDB, we identified position 2,763,486 as the TSS for the operon in E. coli K12 in five samples: LB, HS15min, M-P0h, M-P2h, and M-P4h.
Track labels in the figure: M-P4h, M-P2h, M-P0h, HS15min, LB. Gene labels: b2627, b2628, b2629.
Figure S6. Detected σ70 binding sites (Pribnow boxes) in the promoter regions of the known TSSs and predicted TSSs. A) Pribnow box found by MEME [3] in 539 of the 1742 known upstream promoter sequences (25 nt). B) Pribnow box found in the [-100, 100] regions of the predicted TSSs appearing in multiple samples at p-value ≥ 0.05. C) Pribnow box found in the [-100, 100] regions of the predicted TSSs appearing in only one sample at p-value ≤ 0.05.
Figure S7. Recovery of TSSs in the H. pylori data by our algorithm. The light vertical line at position 887,519 indicates the TSS of the gene HP0835 in sample SRR031127 determined by dRNA-seq by Sharma et al. [4]; our algorithm made the same prediction.
Gene labels in the figure: HP0834, HP0835.
Figure S8. TSSs determined by dRNA-seq by Sharma et al. [4] that were not recovered by our algorithm. The light vertical lines indicate the positions of the TSSs. A) The determined TSS at 869,888 of the gene HP0187 is located in the body of the upstream gene HP0186. B) The determined TSS at 1,200,759 of the gene HP1138 is located in the body of the upstream gene HP1139.
Track labels in panel A: SRR031130, SRR031128, SRR031127, SRR031126. Track labels in panel B: SRR031130, SRR031129, SRR031128, SRR031127, SRR031126.
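\[
\Pr(X = k) = (1-p)^{k}\,p
\]
\[
E(X) = \sum_{k=0}^{\infty} k\,(1-p)^{k}\,p
     = p\,(1-p)\sum_{k=0}^{\infty} k\,(1-p)^{k-1}
     = -p\,(1-p)\,\frac{d}{dp}\left[\sum_{k=0}^{\infty}(1-p)^{k}\right]
     = -p\,(1-p)\,\frac{d}{dp}\left(\frac{1}{p}\right)
     = \frac{1-p}{p}
\]
\[
p = \frac{1}{1+E(X)}, \qquad 1-p = \frac{E(X)}{E(X)+1}
\]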
Figure S9. Derivation of the transition probabilities $P_{EE}$ and $P_{NN}$. The geometric distribution $\Pr(X = k) = (1-p)^{k}\,p$ is used to model the number of failures before the first success. The length of a run of consecutive expression states E or non-expression states N should follow a geometric distribution. Therefore the probability of staying in the expression state is $P_{EE} = 1 - P_{EN} = \frac{E(X)}{E(X)+1}$, and the probability of transiting from the expression state to the non-expression state is $P_{EN} = \frac{1}{E(X)+1}$. Similar results can be derived for $P_{NN}$ and $P_{NE}$.
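The relation between the mean run length and the transition probabilities can be checked numerically. The following Python sketch is illustrative only: the function names and the mean run length of 999 are assumptions, not values or code from the study.

```python
def geometric_mean_by_summation(p, terms=10_000):
    """Approximate E(X) = sum_k k (1-p)^k p by truncating the series."""
    return sum(k * (1.0 - p) ** k * p for k in range(terms))


def transition_probabilities(mean_run_length):
    """Return (P_stay, P_leave) for a state with geometric run length X.

    With Pr(X = k) = (1-p)^k p, E(X) = (1-p)/p, hence
    P_leave = p = 1 / (E(X) + 1) and P_stay = 1 - p = E(X) / (E(X) + 1).
    """
    if mean_run_length < 0:
        raise ValueError("mean run length must be non-negative")
    p_leave = 1.0 / (mean_run_length + 1.0)
    return 1.0 - p_leave, p_leave


# Hypothetical mean run length of the expression state (not from the paper).
p_ee, p_en = transition_probabilities(999.0)
```

The same function applies to the non-expression state, giving $P_{NN}$ and $P_{NE}$ from the mean length of non-expressed runs.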
Figure S10. Distributions of the lengths of ORFs and intergenic regions in the known operons in E. coli K12. A) Histogram of the lengths of ORFs (bin size = 50 nt). The curve is the geometric distribution with the success probability p = 0.0010483 estimated by the maximum likelihood method. B) Histogram of the lengths of intergenic regions within the known operons in RegulonDB (bin size = 20 nt). The curve is the geometric distribution with the success probability p = 0.0023695 estimated by the maximum likelihood method. C) QQ-plot of the lengths of ORFs against the fitted geometric distribution. D) QQ-plot of the lengths of intergenic regions within the known operons against the fitted geometric distribution. Clearly, unlike the distribution of the lengths of interoperonic regions, the lengths of ORFs cannot be fitted to a geometric distribution.
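For reference, a minimal Python sketch of a maximum likelihood fit of a geometric distribution is given below. It assumes the parameterization $\Pr(X = k) = (1-p)^{k}\,p$ with $k = 0, 1, 2, \ldots$ used in Figure S9; whether the fits in Figure S10 used exactly this parameterization is an assumption, and the function names are hypothetical.

```python
def fit_geometric(lengths):
    """Maximum likelihood estimate of p for Pr(X = k) = (1-p)^k p, k >= 0.

    Setting the derivative of the log-likelihood to zero gives
    p_hat = 1 / (1 + sample mean).
    """
    if not lengths:
        raise ValueError("need at least one observation")
    mean = sum(lengths) / len(lengths)
    return 1.0 / (1.0 + mean)


def implied_mean_length(p):
    """Mean of the geometric distribution with success probability p."""
    return (1.0 - p) / p


# Mean lengths implied by the success probabilities reported in Figure S10.
orf_mean_nt = implied_mean_length(0.0010483)
intergenic_mean_nt = implied_mean_length(0.0023695)
```

Under this parameterization the reported values of p correspond to mean lengths of roughly 953 nt for ORFs and 421 nt for intergenic regions within operons.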
Table S1. Summary of the mapping results

Sample   Platform    Total reads  % Reads having adapter  Total reads after trimming  Uniquely mapped reads  Multiple mapped reads  Reads failed to map  % Unique  % Multiple  % Failed
LB       GAII        32,129,789   16.70%                  31,767,554                  12,856,757             14,956,218             3,954,579            40.47%    47.08%      12.45%
HS15min  GAII+HiSeq  72,868,580   60.29%                  72,586,098                  16,743,042             45,758,784             10,084,272           23.07%    63.04%      13.89%
HS30min  HiSeq       35,042,119   84.22%                  34,979,745                  13,034,034             19,411,877             2,533,834            37.26%    55.49%      7.24%
HS60min  HiSeq       25,930,027   80.37%                  25,905,637                  7,735,369              15,403,470             2,766,798            29.86%    59.46%      10.68%
M-P0h    HiSeq       46,464,309   81.87%                  46,342,018                  14,129,411             27,602,193             4,610,414            30.49%    59.56%      9.95%
M-P2h    HiSeq       67,034,479   76.30%                  66,962,875                  29,581,761             31,717,549             5,663,565            44.18%    47.37%      8.46%
M-P4h    GAII+HiSeq  86,184,479   51.16%                  85,795,131                  29,183,476             44,847,797             11,763,858           34.02%    52.27%      13.71%
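The percentage columns of Table S1 appear to be taken relative to the number of reads remaining after trimming, which equals the sum of the three mapping categories. The short Python sketch below reproduces that arithmetic for the LB sample; the function name is hypothetical, and the interpretation of the denominator is inferred from the numbers rather than stated in the table.

```python
def mapping_percentages(unique, multiple, failed):
    """Percentages of uniquely mapped, multiply mapped and unmapped reads.

    The denominator is the sum of the three categories, which for the
    LB sample equals the number of reads after adapter trimming.
    """
    total = unique + multiple + failed
    return tuple(round(100.0 * n / total, 2) for n in (unique, multiple, failed))


# LB sample counts from Table S1.
lb_total_after_trimming = 31_767_554
lb_counts = (12_856_757, 14_956_218, 3_954_579)
lb_percentages = mapping_percentages(*lb_counts)
```

For the LB sample this gives the 40.47%, 47.08% and 12.45% listed in the table.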
Table S2. Effect of sequencing depth on the performance of TruHmm using sample M-P4h as an example

Portion of total reads  Total unique mapped nt  Sequencing depth  Sensitivity  Specificity  Accuracy  Precision  F-factor  Total operons  Zero-coverage positions (nt)
10%                     177,394,644             38.23             86.60%       96.00%       89.00%    98.00%     92.00%    2705           5,705,298
20%                     354,676,352             76.44             89.60%       91.00%       90.00%    96.80%     93.00%    2513           4,983,472
40%                     710,081,281             153.05            94.50%       93.20%       93.50%    95.80%     95.20%    2345           4,220,369
80%                     1,419,721,627           306.00            96.60%       97.10%       96.50%    97.30%     96.95%    2133           3,438,913
100%                    1,780,472,931           383.75            97.20%       98.90%       97.70%    99.00%     98.10%    2091           3,193,336
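The sequencing depth column of Table S2 is consistent with dividing the total number of uniquely mapped nucleotides by the genome length. The Python sketch below illustrates this; the genome length of 4,639,675 nt (the E. coli K-12 MG1655 chromosome) is an assumption that is not stated in the table, and the names are hypothetical.

```python
# Assumption: E. coli K-12 MG1655 chromosome length; not stated in the source.
GENOME_LENGTH_NT = 4_639_675


def sequencing_depth(unique_mapped_nt, genome_length=GENOME_LENGTH_NT):
    """Average number of uniquely mapped nucleotides per genome position."""
    return unique_mapped_nt / genome_length


# Total unique mapped nt per portion of reads, from Table S2.
table_s2_mapped_nt = {
    "10%": 177_394_644,
    "20%": 354_676_352,
    "40%": 710_081_281,
    "80%": 1_419_721_627,
    "100%": 1_780_472_931,
}
depths = {portion: round(sequencing_depth(nt), 2)
          for portion, nt in table_s2_mapped_nt.items()}
```

Under this assumption the computed depths agree with the tabulated values to two decimal places.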
Table S3. Performance of TruHmm on the H. pylori dataset evaluated based on operon pairs

Sample     Sensitivity  Specificity  Accuracy  Precision  F-factor
SRR031126  99.81%       85.51%       96.77%    96.24%     97.99%
SRR031127  99.62%       90.85%       97.77%    97.60%     98.60%
SRR031128  99.57%       94.96%       98.63%    98.73%     99.15%
SRR031129  99.80%       95.19%       99.02%    99.02%     99.41%
SRR031130  97.83%       98.21%       97.91%    99.56%     98.69%
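The F-factor values in Table S3 are consistent with the harmonic mean of sensitivity and precision. The following Python sketch shows that relation; the function name is hypothetical, and the definition is inferred from the tabulated numbers rather than quoted from the methods.

```python
def f_factor(sensitivity, precision):
    """Harmonic mean of sensitivity and precision (same units as the inputs)."""
    if sensitivity + precision == 0:
        return 0.0
    return 2.0 * sensitivity * precision / (sensitivity + precision)


# (sensitivity %, precision %) per sample, from Table S3.
table_s3 = {
    "SRR031126": (99.81, 96.24),
    "SRR031127": (99.62, 97.60),
    "SRR031130": (97.83, 99.56),
}
f_values = {sample: round(f_factor(s, p), 2) for sample, (s, p) in table_s3.items()}
```

Because the harmonic mean is dominated by the smaller of the two inputs, a high F-factor requires both sensitivity and precision to be high.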
Table S4. Performance of TruHmm on the H. pylori dataset evaluated based on the entire operon structure

Sample     Sensitivity  Specificity  Accuracy  Precision  F-factor
SRR031126  99.74%       86.51%       95.40%    93.81%     96.68%
SRR031127  99.51%       89.58%       96.46%    95.57%     97.50%
SRR031128  99.59%       95.24%       98.14%    97.67%     98.62%
SRR031129  99.67%       95.37%       98.59%    98.46%     99.06%
SRR031130  98.67%       98.47%       98.61%    99.33%     99.00%
Table S5. Comparison of the parameters trained on the E. coli and H. pylori datasets using a window size of 11 nt and the leave-one-out strategy

Dataset         lambda  p-Expression  p-Nonexpression
E. coli (LB)    7.597   0.000502      0.005773
H. pylori (ML)  7.83    0.000359      0.004541
Table S6. Specificity of predicted TSSs in the five samples of the H. pylori dataset

Sample     Treatment  Specificity
SRR031126  ML         78.43%
SRR031127  AS         70.53%
SRR031128  PL         76.66%
SRR031129  AG         74.00%
SRR031130  HU         65.93%
Table S7. Summary of assembled operons in the samples

                                 LB (OD=0.87)  HS15min  HS30min  HS60min  M-P0h  M-P2h  M-P4h
# Genes expressed                4,314         4,366    4,340    4,222    4,296  4,395  4,420
# Hypothetical proteins          29            29       31       27       28     30     32
# Operons                        2,131         2,247    2,635    2,865    2,452  2,339  2,091
# Multi-gene operons             875           915      853      732      825    933    933
# Consistent operons             1,064         1,086    1,081    1,049    1,055  1,098  1,105
# Consistent multi-gene operons  207           207      207      206      207    206    207
# Alternative operons            1,065         1,160    1,552    1,815    1,396  1,239  981
Table S8. Reconstruction of alternative phn operons

Sample        Operons/suboperons
LB (OD=0.87)  phnC, phnD, phnH, phnK, phnL, phnM, phnNOP
HS15min       phnCD, phnE, phnGHIJK, phnMNOP
HS30min       phnC, phnDE, phnGH, phnI, phnJ, phnK, phnL, phnM, phnNOP
HS60min       phnC, phnD, phnG, phnH, phnK, phnNOP
M-P0h         phnCDE, phnFGH, phnI, phnJK, phnLMNOP
M-P2h         phnCDEFGHIJKLMNOP
M-P4h         phnCDEFGHIJKLMNOP
Table S9. Reconstruction of alternative fli operons

Sample        Operons/suboperons
LB (OD=0.87)  fliFGHIJKLMNOPQR
HS15min       fliFGHIJK, fliLMN, fliOPQ, fliR
HS30min       fliFGH, fliI, fliJKL, fliMN, fliO, fliQ, fliR
HS60min       fliGH, fliI, fliK, fliL, fliM, fliQ
M-P0h         fliFGHIJKLMN, fliOPQR
M-P2h         fliFGHIJKL, fliMN, fliOP, fliQR
M-P4h         fliFGHIJKLMNO, fliPQR
Table S10. Proportion of the ORFs and intergenic regions with antisense and non-coding RNA transcription

Sample   Genes with asRNA (%)  Intergenic regions with ncRNA (%)
LB       72.34                 51.37
HS15min  86.91                 69.67
HS30min  83.71                 67.99
HS60min  72.23                 54.12
M-P0h    81.87                 59.32
M-P2h    88.54                 70.38
M-P4h    89.54                 72.74
References
1. Vivancos AP, Guell M, Dohm JC, Serrano L, Himmelbauer H: Strand-specific deep sequencing of the transcriptome. Genome Res 2010, 20:989-999.
2. Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Penaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muniz-Rascado L, Martinez-Flores I, Salgado H, et al: RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res 2008, 36:D120-124.
3. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2:28-36.
4. Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermuller J, Reinhardt R, et al: The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 2010, 464:250-255.