Analyzing and Minimizing PCR Amplification Bias

Aird et al.: Analyzing and Minimizing PCR Amplification Bias in Illumina Sequencing Libraries Page 1 of 6
Supplementary Figures 1-6
[Figure S1 plot. Histograms for P. falciparum, E. coli and R. sphaeroides. x-axis: GC content of 50-bp windows (0% to 80%); y-axis: number of 50-bp windows × 10⁻⁶ (0 to 2).]
Figure S1. Base composition of the three genomes. Histograms of the percent GC in 50-bp windows for
the genomes of Plasmodium falciparum, Escherichia coli and Rhodobacter sphaeroides. Most of the
DNA in the equimolar composite “PER genome” is AT-rich because the AT-rich P. falciparum genome
(23 Mb) is 5 times larger than the GC-rich R. sphaeroides genome (4.6 Mb) and the intermediate E. coli
genome (4.6 Mb).
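The window histogram and the genome-size argument in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the use of non-overlapping windows and the 2% bin width are assumptions made here for the example.

```python
from collections import Counter


def gc_window_histogram(sequence, window=50, bin_width=2):
    """Count non-overlapping windows by percent GC.

    Assumption: windows do not overlap and a trailing partial window is
    dropped; the caption does not say how the windows were tiled.
    Returns a dict mapping the lower edge of each %GC bin to a window count.
    """
    seq = sequence.upper()
    counts = Counter()
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        percent = 100.0 * gc / window
        # Lower edge of the bin; windows at 100% GC go into the top bin.
        bin_start = min(int(percent // bin_width) * bin_width, 100 - bin_width)
        counts[bin_start] += 1
    return dict(counts)


def composite_fraction(genome_sizes_mb, name):
    """Fraction of an equimolar genome mix contributed by one genome.

    With one copy of each genome, each contributes DNA in proportion
    to its size.
    """
    return genome_sizes_mb[name] / sum(genome_sizes_mb.values())


# Genome sizes as stated in the caption (Mb).
sizes = {"P. falciparum": 23.0, "E. coli": 4.6, "R. sphaeroides": 4.6}
pf_share = composite_fraction(sizes, "P. falciparum")  # 23 / 32.2, about 0.71
```

With the sizes given in the caption, P. falciparum accounts for roughly 71% of the composite, which is why the pooled histogram is dominated by AT-rich windows.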
[Figure S2 plots, two panels. x-axis: GC content of amplicon (0% to 100%); y-axis: relative abundance (ticks shown: 100%, 10%, 1%, 0%).]
Figure S2. Testing the optimized enrichment PCR on two different thermocyclers and with long-insert fragment libraries. (a) Shown is the GC-bias profile of a ~180-bp fragment library PCR-amplified by AccuPrime Taq HiFi (long denat., primer extension at 65˚C) on the fast ramping
thermocycler #1 (blue line) and on the slow ramping thermocycler #3 (gray line). (b) Shown are the GC-bias profiles of a short-insert (~180 bp; blue line, same as in a) and of a long-insert (~360 bp) AccuPrime-amplified fragment library (dark blue).
[Figure S3 plots, panels a, b and c. x-axis: GC content of amplicon (0% to 100%); y-axis: relative abundance (ticks shown: 100%, 10%, 1%, 0%).]
Figure S3. Comparing input library and output sequencing data. A ~400 bp fragment library was
amplified with (a) Phusion HF (short denat., fast ramp), (b) Phusion HF (long denat., 2M betaine) or (c)
Accuprime Taq HiFi (long denat., primer extension at 65˚C). Shown is the relative abundance of loci in
each library as determined by qPCR (filled symbols) and the relative abundance of Illumina sequencing
reads covering these loci in one lane of Hi-Seq (open symbols). Data sets were normalized to the average
of the two loci closest to 50% GC.
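The normalization described in this caption can be illustrated with a short Python sketch. The function name, the input layout (a list of (GC percent, abundance) pairs) and the example values are hypothetical; only the rule itself, dividing by the average of the two loci closest to 50% GC, comes from the caption.

```python
def normalize_to_mid_gc(loci, reference_gc=50.0, n_reference=2):
    """Scale abundances so the loci nearest reference_gc average to 1.0.

    loci: list of (gc_percent, abundance) pairs, e.g. qPCR quantities or
    read counts per locus (hypothetical input layout).
    Returns a list of (gc_percent, relative_abundance) in the input order.
    """
    if len(loci) < n_reference:
        raise ValueError("need at least n_reference loci")
    # Pick the n_reference loci whose GC content is closest to the reference.
    closest = sorted(loci, key=lambda item: abs(item[0] - reference_gc))
    baseline = sum(value for _, value in closest[:n_reference]) / n_reference
    if baseline == 0:
        raise ValueError("reference loci have zero abundance")
    return [(gc, value / baseline) for gc, value in loci]


# Made-up values for illustration only.
example = [(10, 5.0), (48, 20.0), (52, 30.0), (80, 2.5)]
relative = normalize_to_mid_gc(example)
```

Applying the same function to the qPCR data and to the read counts puts both on a common scale, so that a value of 100% means "as abundant as the mid-GC loci" in either data set.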
[Figure S4 plots, panels a, b and c. x-axis: GC content of extended locus (50% to 90%); y-axis: relative representation (0% to 200%). R² values printed on the panels: (a) 0.31 and 0.76; (b) 0.11 and 0.80; (c) 0.31 and 0.86.]
Figure S4. Abundance of GC-rich amplicons as a function of the GC content of a ~250-bp window.
Shown is the relative abundance in fragment libraries (filled symbols) and the relative sequence coverage
(open symbols) of 16 qPCR amplicons ranging from 50 bp to 63 bp in length plotted over the GC content
of the amplicon plus 100 bp on either side. The libraries had ~400-bp inserts and were amplified using (a)
Phusion HF (long denat., 2M betaine), (b) Accuprime Taq HiFi (long denat., primer extension at 65˚C), or
(c) Accuprime Taq HiFi (long denat., primer extension at 60˚C). The relative sequence coverage of the
qPCR amplicons decreases almost linearly with GC content of the extended loci (R² values from 0.76 to
0.86). When plotted over the GC content of the amplicons proper, the R² values for sequence coverage
were lower (0.50) in all three cases. Depending on the PCR conditions, the abundance of qPCR amplicons
in the library can increase (a, b) or decrease (c) with the GC-content. Sequence coverage of high-GC loci
is lower than their abundance in the library due to bias downstream of library construction.
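As a rough illustration of the two quantities used in this caption, the sketch below computes the GC content of an amplicon plus flanking sequence and the R² of an ordinary least-squares line. The function names and coordinates are hypothetical, and the caption does not state which fitting procedure produced the reported R² values, so simple linear regression is an assumption here.

```python
def extended_gc_percent(genome, start, end, flank=100):
    """Percent GC of genome[start:end] plus `flank` bp on either side.

    Coordinates are 0-based, end-exclusive (an assumption for this sketch);
    the flanks are clipped at the sequence ends.
    """
    lo = max(0, start - flank)
    hi = min(len(genome), end + flank)
    region = genome[lo:hi].upper()
    if not region:
        raise ValueError("empty region")
    gc = sum(1 for base in region if base in "GC")
    return 100.0 * gc / len(region)


def linear_fit_r_squared(x, y):
    """Least-squares line y = slope * x + intercept and its R² value."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, R² equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

Running the fit once with the extended-locus GC content on the x-axis and once with the GC content of the amplicon alone gives the kind of comparison the caption describes.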
[Figure S5 plots, panels a, b, c and d. x-axis: %GC (0% to 100%); y-axis: relative representation (ticks shown: 100%, 10%, 1%, 0%).]
Figure S5. Comparison of qPCR data and genome-wide sequence data. The conditions for PCR amplification of the ~400-bp fragment library were (a) Phusion HF (short denat., fast ramp), (b) Phusion
HF (long denat., 2M betaine), (c) Accuprime Taq HiFi (long denat., primer extension at 65˚C) or (d)
Accuprime Taq HiFi (long denat., primer extension at 60˚C). Squiggly lines are the abundance of
sequencing reads aligning to just the 36 qPCR amplicon sequences normalized to the average of the two
loci closest to 50% GC. Smooth lines are the ratio of observed to expected (unbiased) average read
coverages of 50-bp windows in 2% GC bins normalized to the average of the bins from 48% to 52% GC.
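The smooth curves described in this caption can be approximated with the sketch below. It assumes per-window coverage values are already available, takes the unbiased expectation to be a constant (which cancels in the normalization), and treats "the bins from 48% to 52% GC" as the two bins starting at 48% and 50%. These choices and all names are assumptions for illustration, not the authors' code.

```python
from collections import defaultdict


def gc_bias_curve(windows, bin_width=2, ref_low=48, ref_high=52):
    """Mean coverage per %GC bin, relative to the mid-GC reference bins.

    windows: iterable of (gc_percent, coverage) pairs, one per 50-bp window.
    Returns a dict mapping the lower edge of each bin to relative coverage.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for gc, coverage in windows:
        # Lower edge of the bin; windows at 100% GC go into the top bin.
        bin_start = min(int(gc // bin_width) * bin_width, 100 - bin_width)
        sums[bin_start] += coverage
        counts[bin_start] += 1
    means = {b: sums[b] / counts[b] for b in sums}
    # Reference bins: lower edge in [ref_low, ref_high).
    reference = [means[b] for b in means if ref_low <= b < ref_high]
    if not reference:
        raise ValueError("no windows in the reference GC range")
    baseline = sum(reference) / len(reference)
    if baseline == 0:
        raise ValueError("reference bins have zero coverage")
    return {b: means[b] / baseline for b in sorted(means)}
```

A flat curve at 1.0 across all bins would indicate no GC bias; values below 1.0 at the extremes correspond to the under-represented AT-rich or GC-rich windows.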
[Figure S6 plots, panels a, b, c and d. Curves labeled AccuPrime 65˚C, AccuPrime 60˚C, Phusion 2M betaine and combined. x-axis: GC content of 50-bp windows (0% to 100%); y-axis: relative coverage, log scale in a and b (ticks shown: 100%, 10%, 1%, 0%), linear scale in c and d (0% to 120%).]
Figure S6. Pooling sequencing reads from differently amplified libraries. Shown are GC-bias curves
for a single lane of Hi-Seq sequencing reads from two differently amplified libraries and composite GC-bias plots for the reads from the same two lanes combined. The ~400-bp libraries were amplified with
AccuPrime Taq HiFi with primer extension at 65˚C and 60˚C (a, c) and Phusion long denat. with 2M
betaine and AccuPrime Taq HiFi with primer extension at 60˚C (b, d). The y-axis is the mean coverage of
50-bp windows in the “PER” genome having the %GC indicated on the x-axis divided by the mean
coverage of the mid-GC (48-52%) references plotted on a log10 (a, b) or linear scale (c, d).
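The pooling idea can be sketched as follows: per-window coverage from two lanes is summed and the same relative-coverage calculation is applied to each lane and to the sum. The function names and the numbers are made up for illustration, and summing coverage without any per-lane weighting is an assumption about how "combined" was computed.

```python
def relative_coverage_by_gc(gc_percent, coverage, ref_low=48, ref_high=52):
    """Mean coverage at each %GC value divided by mean mid-GC coverage.

    gc_percent and coverage are parallel lists, one entry per 50-bp window.
    Windows with ref_low <= %GC <= ref_high form the reference set.
    """
    totals, counts = {}, {}
    ref_total, ref_count = 0.0, 0
    for gc, cov in zip(gc_percent, coverage):
        key = round(gc)
        totals[key] = totals.get(key, 0.0) + cov
        counts[key] = counts.get(key, 0) + 1
        if ref_low <= gc <= ref_high:
            ref_total += cov
            ref_count += 1
    if ref_count == 0 or ref_total == 0:
        raise ValueError("no usable mid-GC reference windows")
    baseline = ref_total / ref_count
    return {gc: (totals[gc] / counts[gc]) / baseline for gc in sorted(totals)}


def pool_lanes(coverage_a, coverage_b):
    """Per-window coverage after combining the reads of two lanes."""
    if len(coverage_a) != len(coverage_b):
        raise ValueError("lanes must cover the same windows")
    return [a + b for a, b in zip(coverage_a, coverage_b)]


# Made-up coverage for three windows at 20%, 50% and 80% GC.
gc = [20, 50, 80]
lane_a = [10.0, 100.0, 1.0]   # hypothetical library that loses high-GC windows
lane_b = [2.0, 100.0, 60.0]   # hypothetical library that loses low-GC windows
curve_a = relative_coverage_by_gc(gc, lane_a)
curve_b = relative_coverage_by_gc(gc, lane_b)
curve_pooled = relative_coverage_by_gc(gc, pool_lanes(lane_a, lane_b))
```

In this toy example the pooled curve lies between the two single-lane curves at each extreme, so its worst-covered GC class is better covered than the worst-covered class of either lane alone.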