Figure S1: A log-log plot of read yield against starting DNA amount for all B2 Ocean metagenomes. The X-axis denotes the starting DNA amount; the Y-axis is the number of reads in each metagenome, normalized to the starting DNA amount. The three metagenomes in the top left, which gave the most reads per ng of input DNA, were amplified using the Linker Amplification protocol.

Figure S2: %G+C histograms of several ‘problematic’ and ‘reliable’ libraries, with the GC distribution of full dsDNA bacteriophage genomes for reference. A) A reliable and a problematic metagenome: reliable 1000 ng Illumina metagenome in blue, problematic 100 ng Illumina metagenome in green. Pearson’s correlation r = 0.95. B) Two reliable metagenomes: Illumina 1000 ng in blue, 454 1500 ng in green. Pearson’s correlation r = 0.99. C) Reference GC distribution within the order Caudovirales: family Myoviridae in blue, Siphoviridae in green, and Podoviridae in red. Genomic sequences were accessed from NCBI in April 2012.

Figure S3: Distribution of whole-read mean %G+C in the unamplified 454 metagenome (green) and in the Sanger-sequenced fosmid library (blue). The fosmid library is shifted toward high %G+C.

Figure S4: Duplicate frequencies in Experiment 1 metagenomes, calculated as exact duplicates over the first 50 bp of each read only.

Figure S5: Heatmap of Pearson’s r pairwise correlation values for artificial duplicate frequencies, as detected using CD-HIT-454 for 454 and Ion Torrent data and CD-HIT-DUP for Illumina data. Note the separate keys provided for CD-HIT-454 data and CD-HIT-DUP data.

Figure S6: CD-HIT-454 artificial duplicate frequencies in Experiment 1 metagenomes generated using 454 and Ion Torrent sequencing.

Figure S7: Duplicate frequency minus artificial duplicate frequency for Experiment 1 CD-HIT-454-processed metagenomes.

Figure S8: CD-HIT-DUP artificial duplicate frequencies in Experiment 1 Illumina metagenomes.
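The copy-number binning behind the duplicate-frequency plots (exact duplicates over the first 50 bp, bins 1 to 10+) can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the function name `copy_number_frequencies` is hypothetical, and two points are assumptions not stated in the legends: frequencies are counted per read rather than per unique sequence, and reads shorter than 50 bp are compared over their full length.

```python
from collections import Counter


def copy_number_frequencies(reads, prefix_len=50, max_bin=10):
    """Relative frequency of reads per copy-number bin (1, 2, ..., max_bin+).

    Assumption: each read is counted once, in the bin given by how many
    reads share its first `prefix_len` bases exactly.
    """
    # Count how many reads share each exact prefix.
    prefix_counts = Counter(read[:prefix_len] for read in reads)
    total = sum(prefix_counts.values())
    bins = [0.0] * max_bin
    for copies in prefix_counts.values():
        # Copy numbers of max_bin or more are pooled in the last bin.
        idx = min(copies, max_bin) - 1
        # All `copies` reads sharing this prefix land in the same bin.
        bins[idx] += copies
    if total == 0:
        return bins
    return [b / total for b in bins]
```

For four reads in which two share their first 50 bp and two are unique, the function returns 0.5 for bin 1 and 0.5 for bin 2. Subtracting the frequencies obtained for reads flagged as artificial duplicates from these values gives the kind of per-bin difference plotted in Figures S7 and S9.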
Figure S9: Duplicate frequency minus artificial duplicate frequency for Experiment 1 CD-HIT-DUP-processed metagenomes.

Figure S10: Ion Torrent QC length distribution. Raw reads span a wide range of read lengths. After quality score and read length filtering, reads that pass QC have a much tighter length distribution. Reads that failed the quality filter only, <2 SD below mean read quality score, are distributed across the whole range of read lengths, as can be seen by comparing the plot of Passed reads to the plot of reads that either Passed or Failed Quality filtering. Reads that failed the ambiguous nucleotide ‘N’ filter were very few in number, as seen by comparing the Raw read frequency plot to the plot of reads that either Passed or Failed Quality or Length filtering, but not N filtering.

Figure S11: Methods for Trimming Illumina Reads. DynamicTrim.pl trimming, used on Illumina data in this paper, finds the longest contiguous segment above a PHRED threshold score of 20, and trims off everything else. For example, read 2 is trimmed down to the region from bp 15 to bp 35. This plot also shows an alternative metric, BWA (lighter lines), which trims reads only at the 3’ end, at the location of the maximum BWA score (very light lines). The BWA score increases along a read as long as each bp PHRED score is over the threshold score of 20. The BWA metric keeps an excess of low-quality data compared to the DynamicTrim.pl procedure.

[Figure S1 plot: X-axis, Starting DNA Amount (ng); Y-axis, QC'd Reads/ng; both axes logarithmic. Fitted trend line: y = 532.67x^-0.751, R² = 0.7119.]

[Figure S4 plot: X-axis, Copy number bin (1 to 10+); Y-axis, Relative frequency of binned reads. Legend keyed by Tech, Pair, Rep, Amp.]

[Figure S6 plot: X-axis, Copy number bin (1 to 10+); Y-axis, Relative frequency of binned reads. Legend keyed by Tech, Rep, Amp, ng.]

[Figure S7 plot: X-axis, Copy number bin (1 to 10+); Y-axis, Relative frequency difference. Legend keyed by Tech, Rep, Amp, ng.]

[Figure S8 plot: X-axis, Copy number bin (1 to 10+); Y-axis, Relative frequency of binned reads. Legend keyed by Tech, Rep, Amp, ng.]

[Figure S9 plot: X-axis, Copy number bin (1 to 10+); Y-axis, Relative frequency difference. Legend keyed by Tech, Rep, Amp, ng.]

[Figure S10 plot: X-axis, Read length (bp); Y-axis, Read frequency (logarithmic). Series: Pass; Pass or Fail Qual; Pass or Fail Qual or Length; Raw Reads.]

[Figure S11 plot: X-axis, Read length (bp); Y-axis, Quality score / BWA score (logarithmic). Series: Read 1 to Read 3; PHRED(20); Read 1 to Read 3 BWA score; Read 1 to Read 3 max BWA score.]
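The two trimming strategies contrasted in the Figure S11 legend can be sketched as below. This is a minimal illustration written from the legend's description, not the DynamicTrim.pl or BWA source code. The function names are hypothetical. "Above the threshold" is taken to mean strictly greater than 20, and the BWA score is modelled as a running sum of (PHRED score minus threshold) along the read; both are assumptions.

```python
def longest_segment_trim(quals, threshold=20):
    """Return (start, end) of the longest contiguous run of bases whose
    scores are all above `threshold`; `end` is exclusive.

    Sketch of the DynamicTrim.pl-style rule as described in the legend.
    """
    best = (0, 0)
    run_start = None
    for i, q in enumerate(quals):
        if q > threshold:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            if i + 1 - run_start > best[1] - best[0]:
                best = (run_start, i + 1)
        else:
            run_start = None  # a low-quality base ends the current run
    return best


def running_score_trim(quals, threshold=20):
    """Return the 3' cut position (exclusive) at which the running sum of
    (score - threshold) is maximal; the 5' end is never trimmed.

    Sketch of the BWA-style metric as described in the legend (assumed form).
    """
    score = 0
    best_score = 0
    cut = 0
    for i, q in enumerate(quals):
        score += q - threshold
        if score > best_score:
            best_score = score
            cut = i + 1
    return cut
```

For the scores `[30, 30, 30, 10, 30, 30, 30, 30, 10, 10]`, `longest_segment_trim` returns `(4, 8)`, while `running_score_trim` returns `8`, so the running-score rule keeps bases 0 to 7, including the low-quality base at index 3. This is the behaviour the legend describes as retaining an excess of low-quality data.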