file - BioMed Central

Figure S1: A log-log plot of all B2 Ocean metagenome read yields per starting DNA amount. The X-axis
denotes starting DNA amount; the Y-axis is the number of reads in each metagenome, normalized to
starting DNA amount. The three metagenomes in the top left, which gave the most reads per ng of input
DNA, were amplified using the Linker Amplification protocol.
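The trendline printed on the Figure S1 plot (y = 532.67x^(-0.751)) is a power law, which appears as a straight line on log-log axes. The sketch below shows one common way to obtain such a fit, by ordinary least squares on log10-transformed values. How the trendline was actually fitted is not stated here, so this method is an assumption, and the function name `fit_power_law` and the synthetic points are illustrative only.

```python
from math import log10


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on log10(x) vs. log10(y).

    Assumes all x and y values are strictly positive and that the
    x values are not all identical.
    """
    lx = [log10(x) for x in xs]
    ly = [log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = 10 ** (mean_y - b * mean_x)    # intercept in log-log space = log10(a)
    return a, b


# Synthetic points generated from the printed trendline (not the real data).
xs = [0.01, 0.1, 1, 10, 100, 1000]
ys = [532.67 * x ** -0.751 for x in xs]
a, b = fit_power_law(xs, ys)
```

Because the synthetic points lie exactly on the curve, the fit recovers the coefficient and exponent of the printed trendline; real data would scatter around it, as the reported R² of 0.7119 indicates.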
Figure S2: %G+C histogram of several ‘problematic’ and ‘reliable’ libraries, and GC distribution of full
dsDNA bacteriophage genomes for reference.
A) Reliable metagenome and problematic metagenome. Reliable 1000ng Illumina metagenome in
blue, problematic 100ng Illumina metagenome in green. Pearson’s correlation r value of 0.95.
B) Two reliable metagenomes: Illumina 1000ng in blue, 454 1500ng in green. Pearson’s correlation
r value of 0.99.
C) Reference GC distribution within order Caudovirales. Myoviridae family is in blue, Siphoviridae
in green, and Podoviridae in red. Genomic sequences accessed from NCBI April 2012.
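The comparison in panels A and B can be sketched as follows: compute the %G+C of each read, bin the values into a relative-frequency histogram, and correlate two histograms with Pearson's r. This is a minimal illustration, not the procedure used for the figure. The 5% bin width, the handling of ambiguous bases, and all function names are assumptions.

```python
from math import sqrt


def gc_percent(seq):
    """Percent G+C of one sequence, counting only A, C, G and T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # no unambiguous bases to measure
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_histogram(reads, bin_width=5):
    """Relative frequency of reads per %G+C bin (bin width is an assumption)."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for read in reads:
        gc = gc_percent(read)
        if gc is None:
            continue
        # A read with exactly 100% G+C goes into the last bin.
        counts[min(int(gc // bin_width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

Two libraries would then be compared with `pearson_r(gc_histogram(reads_a), gc_histogram(reads_b))`, where `reads_a` and `reads_b` are hypothetical lists of read sequences.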
Figure S3: Distribution of whole-read mean %G+C in the unamplified 454 metagenome (green) and the
Sanger-sequenced fosmid library (blue). The fosmid library is shifted toward high %G+C.
Figure S4: Duplicate frequencies in Experiment 1 metagenomes. Duplicates were identified as exact
matches over the first 50 bp of each read only.
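A minimal sketch of prefix-based duplicate counting is given below. It groups reads by their first 50 bp and reports, for each copy-number bin from 1 to 10+, the fraction of reads that fall in that bin. Whether the plotted frequencies count reads or unique sequences is not stated in the legend, so counting reads is an assumption, as are the function name `duplicate_bins` and the choice to use the whole read as the key when it is shorter than 50 bp.

```python
from collections import Counter


def duplicate_bins(reads, prefix_len=50, max_bin=10):
    """Fraction of reads in each copy-number bin (1, 2, ..., max_bin+).

    Reads are duplicates of each other when their first `prefix_len`
    bases match exactly. Reads shorter than `prefix_len` are keyed on
    their full length (an assumption).
    """
    prefix_counts = Counter(read[:prefix_len].upper() for read in reads)
    bins = [0] * max_bin
    for copies in prefix_counts.values():
        # Every read in a group of size `copies` is assigned to that bin;
        # groups at or above max_bin share the final "10+" bin.
        bins[min(copies, max_bin) - 1] += copies
    total = sum(bins)
    return [b / total for b in bins] if total else [0.0] * max_bin


# Toy example: two reads share a 50 bp prefix, one read is unique.
example = ["A" * 60, "A" * 50 + "C" * 10, "C" * 60]
freqs = duplicate_bins(example)
```

In the toy example, one third of the reads fall in copy-number bin 1 and two thirds in bin 2.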
Figure S5: Heatmap of Pearson’s r pairwise correlation values for artificial duplicate frequencies, as
detected using CD-HIT-454 for 454 and Ion Torrent data and CD-HIT-DUP for Illumina data. Note the
separate keys provided for CD-HIT-454 data and CD-HIT-DUP data.
Figure S6: CD-HIT-454 artificial duplicate frequencies in Experiment 1 metagenomes generated using
454 and Ion Torrent sequencing.
Figure S7: Duplicate frequency minus artificial duplicate frequency for Experiment 1
CD-HIT-454-processed metagenomes.
Figure S8: CD-HIT-DUP artificial duplicate frequencies in Experiment 1 Illumina metagenomes.
Figure S9: Duplicate frequency minus artificial duplicate frequency for Experiment 1
CD-HIT-DUP-processed metagenomes.
Figure S10: Ion Torrent QC length distribution. Raw reads span a wide range of read lengths. After
quality score and read length filtering, reads that pass QC have a much tighter length distribution. Reads
that failed only the quality filter (quality score more than 2 SD below the mean read quality score) are
distributed across the whole range of read lengths, as can be seen by comparing the plot of Passed reads
with the plot of reads that either Passed or Failed Quality filtering. Very few reads failed the ambiguous
nucleotide ‘N’ filter, as seen by comparing the Raw read frequency plot with the plot of reads that either
Passed or Failed Quality or Length filtering, but not N filtering.
Figure S11: Methods for Trimming Illumina Reads. DynamicTrim.pl trimming, used on Illumina data in
this paper, finds the longest contiguous segment above a PHRED threshold score of 20 and trims off
everything else. For example, read 2 is trimmed down to the region from bp 15 to bp 35. This plot also
shows an alternative metric, the BWA score (lighter lines), which trims reads only at the 3’ end, at the
position where the BWA score reaches its maximum (very light lines). The BWA score increases along a
read as long as each bp PHRED score is over the threshold score of 20. Compared with the DynamicTrim.pl
procedure, the BWA metric retains more low-quality data.
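The two trimming strategies described above can be sketched as follows. This is an illustration written from the legend's description only. It is not the code of DynamicTrim.pl or of BWA, and the function names, the inclusive treatment of scores equal to the threshold, and the use of a running sum of (quality minus threshold) as the "BWA score" are assumptions.

```python
def longest_segment_trim(quals, threshold=20):
    """Return (start, end), half-open, of the longest contiguous run of
    bases whose PHRED score is at or above the threshold."""
    best = (0, 0)
    run_start = None
    # A sentinel below the threshold closes a run that reaches the 3' end.
    for i, q in enumerate(list(quals) + [threshold - 1]):
        if q >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


def cumulative_score_trim(quals, threshold=20):
    """Return (0, end): trim only at the 3' end, at the position where the
    running sum of (quality - threshold) reaches its maximum."""
    score = 0
    best_score = 0
    best_end = 0
    for i, q in enumerate(quals):
        score += q - threshold
        if score > best_score:
            best_score = score
            best_end = i + 1
    return (0, best_end)


# Toy quality string with two low-quality bases (indices 2 and 6).
quals = [30, 30, 10, 30, 30, 30, 10, 30]
segment = longest_segment_trim(quals)       # (3, 6)
three_prime = cumulative_score_trim(quals)  # (0, 6)
```

In the toy example the longest-segment approach keeps only bases 3 to 5, whereas the cumulative-score approach keeps bases 0 to 5 and so retains the low-quality base at index 2, which illustrates the difference noted in the legend.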
Figure S1
[Plot. X-axis: Starting DNA Amount (ng), log scale, 0.01 to 1000. Y-axis: QC'd Reads/ng, log scale,
1.E-04 to 1.E+05. Trendline: y = 532.67x^(-0.751), R² = 0.7119.]
Figure S2
Figure S3
Figure S4
[Plot. X-axis: Copy number bin, 1 to 10+. Y-axis: Relative frequency of binned reads, 0 to 1.
Legend (Tech, Pair, Rep, Amp, ng): I F A 14 1000; I R A 14 1000; I F B 14 1000; I R B 14 1000;
I F A 14 100; I R A 14 100; I F B 14 100; I R B 14 100; I F A 18 10; I R A 18 10; 4 F A NA 1500;
4 F B NA 1500; 4 F A 15 10; 4 F B 15 10.]
Figure S5.
Figure S6.
[Plot. X-axis: Copy number bin, 1 to 10+. Y-axis: Relative frequency of binned reads, 0 to 1.
Legend (Tech, Rep, Amp, ng): 4 A 15 10; 4 A 25 0.01; 4 A NA 1500; 4 B 15 10; 4 B 25 0.01;
4 B NA 1500; 4 C 15 10; 4 C 25 0.01; S A NA 15000; T A 5 1000; T B 5 1000.]
Figure S7.
[Plot. X-axis: Copy number bin, 1 to 10+. Y-axis: Relative frequency difference, -0.15 to 0.3.
Legend (Tech, Rep, Amp, ng): 4 A 15 10; 4 A 25 0.01; 4 A NA 1500; 4 B 15 10; 4 B 25 0.01;
4 B NA 1500; 4 C 15 10; 4 C 25 0.01; S A NA 15000; T A 5 1000; T B 5 1000.]
Figure S8.
[Plot. X-axis: Copy number bin, 1 to 10+. Y-axis: Relative frequency of binned reads, 0 to 1.
Legend (Tech, Rep, Amp, ng): I A 14 1000; I A 14 100; I A 18 10; I B 14 1000; I B 14 100.]
Figure S9.
[Plot. X-axis: Copy number bin, 1 to 10+. Y-axis: Relative frequency difference, -0.2 to 0.2.
Legend (Tech, Rep, Amp, ng): I A 14 1000; I A 14 100; I A 18 10; I B 14 1000; I B 14 100.]
Figure S10.
[Plot. X-axis: Read length (bp), 1 to 199. Y-axis: Read frequency, log scale, 1 to 100000.
Series: Pass; Pass or Fail Qual; Pass or Fail Qual or Length; Raw Reads.]
Figure S11.
[Plot. X-axis: Read length (bp), 1 to 96. Y-axis: Quality score / BWA score, log scale, 1 to 10000.
Series: Read 1, Read 2, Read 3; PHRED(20); Read 1, Read 2 and Read 3 BWA score; Read 1, Read 2 and
Read 3 max BWA score.]