Sequencing Errors and Biases

Biological Sequence Analysis
BNFO 691/602 Spring 2013
Mark Reimers
Outline
• Sequencing errors
• Initiation biases
• Quantification biases
• Are biases consistent across samples?
• Compensating biases
Types of mismatches in Illumina data are profoundly asymmetric and biased
[Figure: bar chart of counts (axis 0 to 800,000) by mismatch type, from uniquely mapped tags with a single mismatch. Categories: Any single, G>T, G>A, A>C, G>C, C>A, T>A, T>C, C>T, A>G, A>T, T>G, C>G, Delete T, Delete A, Insert A, Insert T, Delete G, Insert C, Delete C, Insert G.]
Courtesy Thierry-Mieg
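A tally like the one in this figure could be built roughly as follows. This is a minimal sketch, not the method behind the slide: it assumes each tag is already paired with the reference sequence it maps to, that both have the same length (so insertions and deletions are skipped), and that "X>Y" means reference base X was read as Y. The function name `tally_single_mismatches` and the toy data are hypothetical.

```python
from collections import Counter


def tally_single_mismatches(pairs):
    """Count substitution types among (reference, read) pairs that
    differ at exactly one position.

    Assumption: pairs are equal-length, ungapped alignments. Pairs of
    unequal length are skipped because indels need a real aligner.
    """
    counts = Counter()
    for ref, read in pairs:
        if len(ref) != len(read):
            continue
        diffs = [(r, q) for r, q in zip(ref, read) if r != q]
        if len(diffs) == 1:  # keep only tags with a single mismatch
            ref_base, read_base = diffs[0]
            counts[f"{ref_base}>{read_base}"] += 1
    return counts


# Toy (reference, read) pairs, invented for illustration only.
pairs = [
    ("ACGT", "ACTT"),  # G>T
    ("ACGT", "ACAT"),  # G>A
    ("GGCA", "GTCA"),  # G>T
    ("ACGT", "ACGT"),  # perfect match, ignored
    ("ACGT", "TCGA"),  # two mismatches, ignored
]
counts = tally_single_mismatches(pairs)
for mismatch, n in counts.most_common():
    print(mismatch, n)
```

If substitution errors were symmetric, complementary types such as G>T and T>G would have similar counts; comparing them is one way to see the asymmetry the slide describes.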
Position of single mismatch in uniquely mapped tags
[Figure: counts (axis 0 to 60,000) versus position of single mismatch (axis 0 to 36), plotted for sample 1 and sample 2.]
Courtesy Thierry-Mieg
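The position profile in this figure can be sketched in the same spirit. The code below is an illustration under stated assumptions: tags are paired with equal-length reference sequences, positions are counted from 1 at the start of the tag, and the name `mismatch_position_histogram` and the toy pairs are hypothetical.

```python
from collections import Counter


def mismatch_position_histogram(pairs):
    """Histogram of the (1-based) position of the mismatch, using only
    (reference, read) pairs that differ at exactly one position."""
    hist = Counter()
    for ref, read in pairs:
        if len(ref) != len(read):
            continue  # ungapped comparison only in this sketch
        positions = [
            i for i, (r, q) in enumerate(zip(ref, read), start=1) if r != q
        ]
        if len(positions) == 1:
            hist[positions[0]] += 1
    return hist


# Toy data, invented for illustration only.
pairs = [
    ("ACGTAC", "ACGTAA"),  # mismatch at position 6
    ("ACGTAC", "ACGTCC"),  # mismatch at position 5
    ("ACGTAC", "ACGTAG"),  # mismatch at position 6
    ("ACGTAC", "TCGTAC"),  # mismatch at position 1
]
hist = mismatch_position_histogram(pairs)
for position in sorted(hist):
    print(position, hist[position])
```

Running this separately per sample and overlaying the two histograms gives a comparison of the kind shown for sample 1 and sample 2.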
Initiation Biases
Nucleotide frequencies versus position for stringently mapped reads.
Hansen K D et al. Nucl. Acids Res. 2010;38:e131-e131
© The Author(s) 2010. Published by Oxford University Press.
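Nucleotide frequency versus read position, the quantity plotted in this figure, can be computed along these lines. This is a minimal sketch and not the code of Hansen et al: it assumes all reads are given in the same orientation starting at position 0, ignores characters outside A, C, G, T (such as N), and uses a hypothetical function name.

```python
def base_frequencies_by_position(reads, alphabet="ACGT"):
    """Return one dict per read position mapping base -> frequency.

    Only positions covered by every read are reported. Bases outside
    `alphabet` are excluded from the denominator.
    """
    if not reads:
        return []
    length = min(len(read) for read in reads)
    freqs = []
    for i in range(length):
        column = [read[i] for read in reads]
        total = sum(column.count(base) for base in alphabet)
        freqs.append(
            {base: (column.count(base) / total if total else 0.0)
             for base in alphabet}
        )
    return freqs


# Toy reads, invented for illustration only.
reads = ["ACGT", "ACGA", "TCGT", "ACCT"]
freqs = base_frequencies_by_position(reads)
for position, f in enumerate(freqs):
    print(position, f)
```

Without initiation bias, the frequencies at the first few positions would look like those further into the read; a systematic departure at the start is the signature of the bias.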
Start Position Bias is Visible in MT-RNA
Start Position Bias is Consistent Across Samples
Counts per start site in lane 1 vs lane 2
(Marioni et al, Gen Res, 2008)
Quantification Biases
Consistent Technology-Specific Biases
(a) 25-kb region of chromosome 11 amplified by three long-range PCR products (red rectangles).
(b) A heat-map colored matrix displays the correlation of coverage depth across 260 kb of sequence between four samples by three technologies.
From Harismendy et al, Genome Biology 2009
Quantitative Biases
• Not all regions are represented equally
• GC-rich regions are represented more
• Independent of GC, some chromosome regions are represented more
– Euchromatin bias
• Sequence initiation site biases
• 'Mappability' biases: some regions won't have any uniquely mapped tags
GC Bias
• Density of reads depends strongly on GC content of regions
• Most bias seems to come from the PCR reaction
• Newer techniques show less bias, but it is still strong
[Figure: number of reads in 1 kb region versus GC content (%) of 1 kb region.]
From Dohm et al 2008
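The relationship between read density and GC content can be examined by pairing, for each fixed-size window, its GC fraction with the number of reads that start in it. The sketch below is one simple way to do this, not the procedure of Dohm et al. It assumes a single reference sequence held as a string, 0-based read start coordinates, and hypothetical helper names; the toy example uses a small window instead of 1 kb.

```python
def gc_fraction(seq):
    """Fraction of G and C among the A, C, G, T bases of `seq`."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")  # window of only N or other symbols
    return (seq.count("G") + seq.count("C")) / acgt


def bin_gc_and_counts(reference, read_starts, bin_size=1000):
    """Return a list of (gc_fraction, read_count) per complete window."""
    n_bins = len(reference) // bin_size
    counts = [0] * n_bins
    for start in read_starts:
        b = start // bin_size
        if 0 <= b < n_bins:  # ignore starts outside complete windows
            counts[b] += 1
    return [
        (gc_fraction(reference[i * bin_size:(i + 1) * bin_size]), counts[i])
        for i in range(n_bins)
    ]


# Toy reference: one AT-only window followed by one GC-only window.
reference = "AT" * 5 + "GC" * 5
read_starts = [1, 4, 12, 15, 19]
bins = bin_gc_and_counts(reference, read_starts, bin_size=10)
print(bins)
```

Plotting the read count against the GC fraction for all windows gives a scatter of the kind described on this slide.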
GC Bias depends on temperature
• Aird et al (Genome Biology 2011) did systematic tests of the effects of various conditions on GC bias
• They provided protocols that improve GC bias but don't eliminate it
NB. Log scale
Even Best Protocols have Bias
• GC bias in Illumina reads from a 400-bp fragment library amplified using the standard PCR protocol (Phusion HF, short denaturation) on a fast-ramping thermocycler (red squares), Phusion HF with long denaturation and 2M betaine (black triangles), AccuPrime Taq HiFi with long denaturation and primer extension at 65°C (blue diamonds) or 60°C (purple diamonds)
From Aird et al Genome Biology 2011
Biases Are NOT Consistent
• The plot on left shows log-fold changes between RPKM values from two biological replicates (NA11918, NA12761) from the data of Montgomery et al, Nature 2010
• From Hansen et al 2012
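For reference, RPKM (reads per kilobase of transcript per million mapped reads) and a log-fold change between two samples can be computed as sketched below. The RPKM formula is the standard definition; the use of base 2 for the logarithm and the optional `pseudocount` parameter are assumptions made for this illustration, and the numbers are invented.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(value_a, value_b, pseudocount=0.0):
    """log2 of (value_a / value_b).

    `pseudocount` is a hypothetical convenience: with the default of 0,
    a zero in either sample raises an error, so a small positive value
    is needed if zero RPKM values occur.
    """
    return math.log2((value_a + pseudocount) / (value_b + pseudocount))


# Invented counts for one 2,000-bp gene in two samples,
# each with 1,000,000 mapped reads.
rpkm_a = rpkm(200, 2000, 1_000_000)
rpkm_b = rpkm(100, 2000, 1_000_000)
lfc = log2_fold_change(rpkm_a, rpkm_b)
print(rpkm_a, rpkm_b, lfc)
```

If biases were identical in both replicates, they would cancel in the ratio and the log-fold changes would scatter around zero with no trend against sequence features such as GC content.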