file - BioMed Central


Supporting Figures and Tables

Figure S1. Flowchart for the construction of directional RNA-seq libraries.

Figure S2. Correlation of expression levels of all the genes between the GAII and HiSeq platforms. Each dot represents a gene. The expression level is the log of the NPKB value. A) PCC of expression levels for HS15min between GAII reads and HiSeq reads. B) PCC of expression levels for HS15min between GAII reads and the 2nd HiSeq reads. C) PCC of expression levels for M-P4h between GAII reads and HiSeq reads. D) PCC of expression levels for M-P4h between GAII reads and the 2nd HiSeq reads. The duplicates for each sample are from the same biological samples sequenced twice using the HiSeq 2000 platform.

Figure S3. Distribution of the genes with more than the indicated percentage of their length covered by at least one read in the samples generated by Vivancos et al. [1]. Less than 60% of genes have over 50% of their length covered by at least one read.

Figure S4. Distribution of the length of the uniquely mapped reads in the samples.

Figure S5. Position-dependent non-uniform read coverage along the operon b2628-b2627. Note the highly similar patterns of non-uniform coverage under different culture conditions and growth phases. Although no TSS is documented for this operon in RegulonDB, we identified position 2,763,486 as the TSS for the operon in E. coli K12 in five samples: LB, HS15min, M-P0h, M-P2h, and M-P4h. Genes labeled in the figure: b2627, b2628, b2629.

Figure S6. Detected σ70 binding sites (Pribnow boxes) in the promoter regions of the known TSSs and predicted TSSs. A) Pribnow box found by MEME [3] in 539 of the 1742 known upstream promoter sequences (25 nt). B) Pribnow box found in the [-100, 100] regions of the predicted TSSs appearing in multiple samples at p-value ≥ 0.05. C) Pribnow box found in the [-100, 100] regions of the predicted TSSs appearing only in one sample at p-value ≤ 0.05.

Figure S7. Recovery of TSSs in the H. pylori data by our algorithm. The light vertical line at position 887,519 indicates the TSS of the gene HP0835 in sample SRR031127 determined by dRNA-seq by Sharma et al. [4], and our algorithm made the same prediction. Genes labeled in the figure: HP0834, HP0835.

Figure S8. Unrecovered TSSs determined by dRNA-seq by Sharma et al. [4]. The light vertical lines indicate the positions of the TSSs. A) The determined TSS at 869,888 of the gene HP0187 is located in the body of the upstream gene HP0186 (samples shown: SRR031126, SRR031127, SRR031128, SRR031130). B) The determined TSS at 1,200,759 of the gene HP1138 is located in the body of the upstream gene HP1139 (samples shown: SRR031126, SRR031127, SRR031128, SRR031129, SRR031130).

$$\Pr(X = k) = (1 - p)^k p$$

$$E(X) = \sum_{k=0}^{\infty} k (1-p)^k p = p(1-p) \sum_{k=0}^{\infty} k (1-p)^{k-1} = -p(1-p)\,\frac{d}{dp}\Big[\sum_{k=0}^{\infty} (1-p)^k\Big] = -p(1-p)\,\frac{d}{dp}\Big(\frac{1}{p}\Big) = \frac{1-p}{p}$$

$$p = \frac{1}{E(X) + 1}, \qquad 1 - p = \frac{E(X)}{E(X) + 1}$$

Figure S9. Derivation of transition probabilities $P_{EE}$ and $P_{NN}$. The geometric distribution $\Pr(X = k) = (1 - p)^k p$ is used to model the number of failures before the first success. The length of a run of the consecutive expression state E or non-expression state N should follow a geometric distribution. Therefore the probability of staying in the expression state is $P_{EE} = 1 - P_{EN} = \frac{E(X)}{E(X) + 1}$, and the probability to transit from the expression state to the non-expression state is $P_{EN} = \frac{1}{E(X) + 1}$. Similar results can be derived for $P_{NN}$ and $P_{NE}$.
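The relations in Figure S9 can be sketched in a few lines of Python. This is only an illustration of the algebra, not the TruHmm implementation: the function names and the example mean run length of 3 positions are hypothetical, and the series sum is included only as a numerical check that E(X) = (1 - p)/p.

```python
def transition_probs(mean_run_length):
    """Stay/leave probabilities for a state whose run length X has mean E(X).

    Assumes X ~ Geometric(p) counted as failures before the first success,
    so E(X) = (1 - p) / p and the leave probability is p = 1 / (E(X) + 1).
    """
    if mean_run_length < 0:
        raise ValueError("mean run length must be non-negative")
    p_leave = 1.0 / (mean_run_length + 1.0)
    p_stay = mean_run_length / (mean_run_length + 1.0)
    return p_stay, p_leave


def geometric_mean_by_series(p, terms=2_000):
    """Numerically sum k * (1 - p)**k * p as a check on E(X) = (1 - p) / p."""
    return sum(k * (1.0 - p) ** k * p for k in range(terms))


# Hypothetical example: a state whose runs last 3 positions on average.
p_ee, p_en = transition_probs(3.0)
```

With a mean run length of 3, the leave probability is 1/4 and the stay probability is 3/4, and summing the series for p = 0.25 gives back a mean of 3.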

Figure S10. Distributions of the lengths of ORFs and intergenic regions in the known operons in E. coli K12. A: Histogram of the lengths of ORFs (bin size = 50 nt). The curve is the geometric distribution with the success probability p = 0.0010483 estimated by the maximum likelihood method. B: Histogram of the lengths of intergenic regions within the known operons in RegulonDB (bin size = 20 nt). The curve is the geometric distribution with the success probability p = 0.0023695 estimated by the maximum likelihood method. C: QQ-plot of the lengths of ORFs against the fitted geometric distribution. D: QQ-plot for the lengths of intergenic regions within the known operons against the fitted geometric distribution. Clearly, unlike the distribution of the lengths of interoperonic regions, the lengths of ORFs cannot be fitted to a geometric distribution.
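A maximum likelihood fit of a geometric distribution to a list of lengths can be sketched as below. The caption does not say which parameterization was used, so the sketch assumes the failures-before-first-success form Pr(X = k) = (1 - p)^k p from Figure S9, for which the estimate is n / (n + sum of lengths). All function names and the toy input are hypothetical.

```python
def fit_geometric_mle(lengths):
    """MLE of p for Pr(X = k) = (1 - p)**k * p with k = 0, 1, 2, ...

    Setting the derivative of the log-likelihood to zero gives
    p_hat = n / (n + sum(lengths)) = 1 / (1 + mean length).
    """
    n = len(lengths)
    if n == 0:
        raise ValueError("need at least one observation")
    return n / (n + sum(lengths))


def expected_bin_counts(p, n, bin_size, n_bins):
    """Expected histogram counts for n draws, using P(X >= a) = (1 - p)**a."""
    q = 1.0 - p
    return [
        n * (q ** (i * bin_size) - q ** ((i + 1) * bin_size))
        for i in range(n_bins)
    ]


# Toy data (not from the paper): four lengths with mean 2.
p_hat = fit_geometric_mle([0, 1, 2, 5])
curve = expected_bin_counts(0.5, n=100, bin_size=1, n_bins=3)
```

The expected bin counts are what a fitted curve over a histogram would trace; comparing them with observed counts, or using a QQ-plot as in panels C and D, shows how well the geometric model fits.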

Table S1. Summary of the mapping results

Sample    Platform     Total reads   % Reads having adapter   Total reads after trimming   Uniquely mapped reads   Multiple mapped reads   Reads failed to map   % Unique   % Multiple   % Failed
LB        GAII         32,129,789    16.70%                   31,767,554                   12,856,757              14,956,218              3,954,579             40.47%     47.08%       12.45%
HS15min   GAII+HiSeq   72,868,580    60.29%                   72,586,098                   16,743,042              45,758,784              10,084,272            23.07%     63.04%       13.89%
HS30min   HiSeq        35,042,119    84.22%                   34,979,745                   13,034,034              19,411,877              2,533,834             37.26%     55.49%       7.24%
HS60min   HiSeq        25,930,027    80.37%                   25,905,637                   7,735,369               15,403,470              2,766,798             29.86%     59.46%       10.68%
M-P0h     HiSeq        46,464,309    81.87%                   46,342,018                   14,129,411              27,602,193              4,610,414             30.49%     59.56%       9.95%
M-P2h     HiSeq        67,034,479    76.30%                   66,962,875                   29,581,761              31,717,549              5,663,565             44.18%     47.37%       8.46%
M-P4h     GAII+HiSeq   86,184,479    51.16%                   85,795,131                   29,183,476              44,847,797              11,763,858            34.02%     52.27%       13.71%
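The percentage columns of Table S1 appear to be each read category divided by the total reads after trimming. A minimal Python sketch of that arithmetic, using the LB and M-P4h counts from the table (the function and variable names are hypothetical):

```python
def mapping_percentages(unique, multiple, failed):
    """Return (% unique, % multiple, % failed) relative to their sum."""
    total = unique + multiple + failed
    return tuple(100.0 * x / total for x in (unique, multiple, failed))


# Counts copied from Table S1: (unique, multiple, failed, total after trimming).
table_s1 = {
    "LB": (12_856_757, 14_956_218, 3_954_579, 31_767_554),
    "M-P4h": (29_183_476, 44_847_797, 11_763_858, 85_795_131),
}

percentages = {
    sample: mapping_percentages(u, m, f)
    for sample, (u, m, f, _trimmed) in table_s1.items()
}
```

For both samples the three categories add up exactly to the trimmed read total, so the three percentages sum to 100.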

Table S2. Effect of sequencing depth on the performance of TruHmm using sample M-P4h as an example

Portion of total reads   Total unique mapped nt   Sequencing depth   Sensitivity   Specificity   Accuracy   Precision   F-factor   Total operons   Zero-coverage positions (nt)
10%                      177,394,644              38.23              86.60%        96.00%        89.00%     98.00%      92.00%     2705            5,705,298
20%                      354,676,352              76.44              89.60%        91.00%        90.00%     96.80%      93.00%     2513            4,983,472
40%                      710,081,281              153.05             94.50%        93.20%        93.50%     95.80%      95.20%     2345            4,220,369
80%                      1,419,721,627            306.00             96.60%        97.10%        96.50%     97.30%      96.95%     2133            3,438,913
100%                     1,780,472,931            383.75             97.20%        98.90%        97.70%     99.00%      98.10%     2091            3,193,336
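The sequencing depth column of Table S2 is consistent with dividing the total uniquely mapped nucleotides by the genome length. The sketch below assumes a genome length of 4,639,675 nt for E. coli K12; that value is not given in this file and is an assumption, as are the names used.

```python
# Assumption: E. coli K12 genome length in nt (not stated in this document).
GENOME_LENGTH_NT = 4_639_675


def sequencing_depth(total_mapped_nt, genome_length_nt=GENOME_LENGTH_NT):
    """Average number of uniquely mapped nucleotides per genome position."""
    return total_mapped_nt / genome_length_nt


# (total unique mapped nt, reported depth) pairs copied from Table S2.
table_s2 = [
    (177_394_644, 38.23),
    (354_676_352, 76.44),
    (710_081_281, 153.05),
    (1_419_721_627, 306.00),
    (1_780_472_931, 383.75),
]

depths = [sequencing_depth(nt) for nt, _reported in table_s2]
```

Under this assumed genome length, the computed ratios agree with the reported depths to within rounding.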

Table S3. Performance of TruHmm on the H. pylori dataset evaluated based on operon pairs

Sample      Sensitivity   Specificity   Accuracy   Precision   F-factor
SRR031126   99.81%        85.51%        96.77%     96.24%      97.99%
SRR031127   99.62%        90.85%        97.77%     97.60%      98.60%
SRR031128   99.57%        94.96%        98.63%     98.73%      99.15%
SRR031129   99.80%        95.19%        99.02%     99.02%      99.41%
SRR031130   97.83%        98.21%        97.91%     99.56%      98.69%
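The five performance measures can be computed from confusion-matrix counts as sketched below. This file does not define them, so the sketch assumes the standard definitions and takes the F-factor to be the harmonic mean of precision and sensitivity. The counts in the example are hypothetical, not taken from the paper.

```python
def performance(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f_factor": f_factor(precision, sensitivity),
    }


def f_factor(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (assumed definition)."""
    return 2.0 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, for illustration only.
metrics = performance(tp=90, fp=20, tn=80, fn=10)

# Check against the SRR031126 row of Table S3 (precision 96.24%, sensitivity 99.81%).
f_srr031126 = f_factor(0.9624, 0.9981)
```

Under this assumed definition, the precision and sensitivity reported for SRR031126 give an F-factor of about 97.99%, matching the table.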

Table S4. Performance of TruHmm on the H. pylori dataset evaluated based on the entire operon structure

Sample      Sensitivity   Specificity   Accuracy   Precision   F-factor
SRR031126   99.74%        86.51%        95.40%     93.81%      96.68%
SRR031127   99.51%        89.58%        96.46%     95.57%      97.50%
SRR031128   99.59%        95.24%        98.14%     97.67%      98.62%
SRR031129   99.67%        95.37%        98.59%     98.46%      99.06%
SRR031130   98.67%        98.47%        98.61%     99.33%      99.00%

Table S5. Comparison of the parameters trained on the E. coli and H. pylori datasets using a window size of 11 nt and the leave-one-out strategy

Dataset          lambda   p-Expression   p-Nonexpression
E. coli (LB)     7.597    0.000502       0.005773
H. pylori (ML)   7.83     0.000359       0.004541

Table S6. Specificity of predicted TSSs in the five samples of the H. pylori dataset [4].

Sample      Treatment   Specificity
SRR031126   ML          78.43%
SRR031127   AS          70.53%
SRR031128   PL          76.66%
SRR031129   AG          74.00%
SRR031130   HU          65.93%

Table S7. Summary of assembled operons in the samples

                                  LB (OD=0.87)   HS15min   HS30min   HS60min   M-P0h   M-P2h   M-P4h
# Genes expressed                 4,314          4,366     4,340     4,222     4,296   4,395   4,420
# Hypothetical proteins           29             29        31        27        28      30      32
# Operons                         2,131          2,247     2,635     2,865     2,452   2,339   2,091
# Multi-gene operons              875            915       853       732       825     933     933
# Consistent operons              1,064          1,086     1,081     1,049     1,055   1,098   1,105
# Consistent multi-gene operons   207            207       207       206       207     206     207
# Alternative operons             1,065          1,160     1,552     1,815     1,396   1,239   981

Table S8. Reconstruction of alternative phn operons

Sample         Operons/suboperons
LB (OD=0.87)   phnC, phnD, phnH, phnK, phnL, phnM, phnNOP
HS15min        phnCD, phnE, phnGHIJK, phnMNOP
HS30min        phnC, phnDE, phnGH, phnI, phnJ, phnK, phnL, phnM, phnNOP
HS60min        phnC, phnD, phnG, phnH, phnK, phnNOP
M-P0h          phnCDE, phnFGH, phnI, phnJK, phnLMNOP
M-P2h          phnCDEFGHIJKLMNOP
M-P4h          phnCDEFGHIJKLMNOP

Table S9. Reconstruction of alternative fli operons

Sample         Operons/suboperons
LB (OD=0.87)   fliFGHIJKLMNOPQR
HS15min        fliFGHIJK, fliLMN, fliOPQ, fliR
HS30min        fliFGH, fliI, fliJKL, fliMN, fliO, fliQ, fliR
HS60min        fliGH, fliI, fliK, fliL, fliM, fliQ
M-P0h          fliFGHIJKLMN, fliOPQR
M-P2h          fliFGHIJKL, fliMN, fliOP, fliQR
M-P4h          fliFGHIJKLMNO, fliPQR

Table S10. Proportion of the ORFs and intergenic regions having antisense and non-coding RNA transcriptions

Sample    Genes with asRNA (%)   Intergenic regions with ncRNA (%)
LB        72.34                  51.37
HS15min   86.91                  69.67
HS30min   83.71                  67.99
HS60min   72.23                  54.12
M-P0h     81.87                  59.32
M-P2h     88.54                  70.38
M-P4h     89.54                  72.74

References

1. Vivancos AP, Guell M, Dohm JC, Serrano L, Himmelbauer H: Strand-specific deep sequencing of the transcriptome. Genome Res 2010, 20:989-999.

2. Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Penaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muniz-Rascado L, Martinez-Flores I, Salgado H, et al: RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res 2008, 36:D120-124.

3. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2:28-36.

4. Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermuller J, Reinhardt R, et al: The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 2010, 464:250-255.
