Aird et al.: Analyzing and Minimizing PCR Amplification Bias in Illumina Sequencing Libraries

Supplementary Figures 1-6

[Figure S1 plot: histograms for P. falciparum, E. coli and R. sphaeroides; x-axis: GC content of 50-bp windows; y-axis: number of 50-bp windows × 10⁻⁶]

Figure S1. Base composition of the three genomes. Histograms of the percent GC in 50-bp windows for the genomes of Plasmodium falciparum, Escherichia coli and Rhodobacter sphaeroides. Most of the DNA in the equimolar composite “PER genome” is AT-rich because the AT-rich P. falciparum genome (23 Mb) is 5 times larger than the GC-rich R. sphaeroides genome (4.6 Mb) and the intermediate E. coli genome (4.6 Mb).

[Figure S2 plot: two panels; x-axis: GC content of amplicon; y-axis: relative abundance]

Figure S2. Testing the optimized enrichment PCR on two different thermocyclers and with long-insert fragment libraries. (a) Shown is the GC-bias profile of a ~180-bp fragment library PCR-amplified by AccuPrime Taq HiFi (long denat., primer extension at 65°C) on the fast-ramping thermocycler #1 (blue line) and on the slow-ramping thermocycler #3 (gray line). (b) Shown are the GC-bias profiles of a short-insert (~180 bp; blue line, same as in a) and of a long-insert (~360 bp) AccuPrime-amplified fragment library (dark blue).

[Figure S3 plot: panels a, b and c; x-axis: GC content of amplicon; y-axis: relative abundance]

Figure S3. Comparing input library and output sequencing data.
A ~400-bp fragment library was amplified with (a) Phusion HF (short denat., fast ramp), (b) Phusion HF (long denat., 2M betaine) or (c) AccuPrime Taq HiFi (long denat., primer extension at 65°C). Shown is the relative abundance of loci in each library as determined by qPCR (filled symbols) and the relative abundance of Illumina sequencing reads covering these loci in one lane of Hi-Seq (open symbols). Data sets were normalized to the average of the two loci closest to 50% GC.

[Figure S4 plot: panels a, b and c; x-axis: GC content of extended locus; y-axis: relative representation; R² labels in the panels, in extracted order: 0.31 and 0.76 (a), 0.11 and 0.80 (b), 0.31 and 0.86 (c)]

Figure S4. Abundance of GC-rich amplicons as a function of the GC content of a ~250-bp window. Shown is the relative abundance in fragment libraries (filled symbols) and the relative sequence coverage (open symbols) of 16 qPCR amplicons ranging from 50 bp to 63 bp in length, plotted over the GC content of the amplicon plus 100 bp on either side. The libraries had ~400-bp inserts and were amplified using (a) Phusion HF (long denat., 2M betaine), (b) AccuPrime Taq HiFi (long denat., primer extension at 65°C), or (c) AccuPrime Taq HiFi (long denat., primer extension at 60°C). The relative sequence coverage of the qPCR amplicons decreases almost linearly with the GC content of the extended loci (R² values from 0.76 to 0.86). When plotted over the GC content of the amplicons proper, the R² values for sequence coverage were lower (0.50) in all three cases. Depending on the PCR conditions, the abundance of qPCR amplicons in the library can increase (a, b) or decrease (c) with the GC content. Sequence coverage of high-GC loci is lower than their abundance in the library due to bias downstream of library construction.
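The normalization used for Figures S3 and S4 (dividing each data set by the average of the two loci closest to 50% GC) and the R² of a straight-line fit can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names `normalize_to_mid_gc` and `r_squared` and all example numbers are hypothetical, and R² is assumed here to be the squared Pearson correlation of an ordinary least-squares line.

```python
def normalize_to_mid_gc(gc_percent, abundance, target=50.0):
    """Divide every abundance by the mean abundance of the two loci
    whose GC content is closest to `target` (50% GC by default)."""
    if len(gc_percent) != len(abundance) or len(gc_percent) < 2:
        raise ValueError("need two equally long lists with at least two loci")
    # Indices of loci sorted by distance of their GC content from the target.
    order = sorted(range(len(gc_percent)),
                   key=lambda i: abs(gc_percent[i] - target))
    reference = (abundance[order[0]] + abundance[order[1]]) / 2.0
    if reference == 0:
        raise ValueError("reference loci have zero abundance")
    return [a / reference for a in abundance]


def r_squared(x, y):
    """R^2 of a simple least-squares line y = a + b*x
    (assumption: equal to the squared Pearson correlation)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical example values, not data from the figures.
example_gc = [30.0, 48.0, 53.0, 70.0]      # %GC of each locus
example_abundance = [2.0, 4.0, 6.0, 1.0]   # arbitrary qPCR or read-count units
example_relative = normalize_to_mid_gc(example_gc, example_abundance)
print(example_relative)
print(r_squared(example_gc, example_relative))
```

Normalizing both the qPCR and the sequencing data to the same two mid-GC loci puts them on a common relative scale, which is what allows the filled and open symbols to be compared in one plot.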
[Figure S5 plot: panels a, b, c and d; x-axis: %GC; y-axis: relative representation]

Figure S5. Comparison of qPCR data and genome-wide sequence data. The conditions for PCR amplification of the ~400-bp fragment library were (a) Phusion HF (short denat., fast ramp), (b) Phusion HF (long denat., 2M betaine), (c) AccuPrime Taq HiFi (long denat., primer extension at 65°C) or (d) AccuPrime Taq HiFi (long denat., primer extension at 60°C). Squiggly lines are the abundance of sequencing reads aligning to just the 36 qPCR amplicon sequences, normalized to the average of the two loci closest to 50% GC. Smooth lines are the ratio of observed to expected (unbiased) average read coverages of 50-bp windows in 2% GC bins, normalized to the average of the bins from 48% to 52% GC.

[Figure S6 plot: panels a, b, c and d; curves labeled AccuPrime 65°C, AccuPrime 60°C and combined (a, c) and Phusion 2M betaine, AccuPrime 60°C and combined (b, d); x-axis: GC content of 50-bp windows; y-axis: relative coverage, log scale (a, b) or linear scale (c, d)]

Figure S6. Pooling sequencing reads from differently amplified libraries. Shown are GC-bias curves for a single lane of Hi-Seq sequencing reads from two differently amplified libraries and composite GC-bias plots for the reads from the same two lanes combined.
The ~400-bp libraries were amplified with AccuPrime Taq HiFi with primer extension at 65°C and at 60°C (a, c), or with Phusion HF (long denat., 2M betaine) and AccuPrime Taq HiFi with primer extension at 60°C (b, d). The y-axis is the mean coverage of 50-bp windows in the “PER” genome having the %GC indicated on the x-axis, divided by the mean coverage of the mid-GC (48-52%) references, plotted on a log10 (a, b) or linear scale (c, d).
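The GC-bias curves of Figures S5 and S6, and the effect of pooling reads from two libraries, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the names `gc_bias_curve` and `pool_coverage` are hypothetical, coverage is assumed to be already computed per 50-bp window, the mid-GC reference is taken as the mean coverage of all windows with 48% to 52% GC inclusive, and pooling is modeled as summing the per-window coverage of the two lanes.

```python
from collections import defaultdict


def gc_bias_curve(gc_percent, coverage, bin_width=2.0, mid_gc=(48.0, 52.0)):
    """Mean coverage per GC bin, divided by the mean coverage of mid-GC windows.

    gc_percent: %GC of each window; coverage: read coverage of each window.
    Returns {lower edge of GC bin: relative coverage}.
    """
    if len(gc_percent) != len(coverage):
        raise ValueError("gc_percent and coverage must have the same length")
    # Reference: windows whose GC lies in the mid-GC range (assumed inclusive).
    mid = [c for g, c in zip(gc_percent, coverage) if mid_gc[0] <= g <= mid_gc[1]]
    if not mid or sum(mid) == 0:
        raise ValueError("no covered windows in the mid-GC reference range")
    reference = sum(mid) / len(mid)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, c in zip(gc_percent, coverage):
        lower_edge = int(g // bin_width) * bin_width  # e.g. 49.0 -> 48.0
        sums[lower_edge] += c
        counts[lower_edge] += 1
    return {edge: (sums[edge] / counts[edge]) / reference for edge in sorted(sums)}


def pool_coverage(coverage_a, coverage_b):
    """Per-window coverage after combining the reads of two libraries
    (assumption: both lists describe the same windows in the same order)."""
    if len(coverage_a) != len(coverage_b):
        raise ValueError("both libraries must cover the same windows")
    return [a + b for a, b in zip(coverage_a, coverage_b)]


# Hypothetical example values, not data from the figures.
window_gc = [20.0, 50.0, 50.0, 80.0]
library_1 = [10.0, 20.0, 20.0, 2.0]   # covers low-GC windows better
library_2 = [2.0, 20.0, 20.0, 10.0]   # covers high-GC windows better
print(gc_bias_curve(window_gc, library_1))
print(gc_bias_curve(window_gc, library_2))
print(gc_bias_curve(window_gc, pool_coverage(library_1, library_2)))
```

Because a 50-bp window can only contain a whole number of G or C bases, its %GC moves in steps of 2%, so 2% bins correspond to one GC count per bin.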