Sequencing Errors and Biases
Biological Sequence Analysis BNFO 691/602
Spring 2013
Mark Reimers

Outline
• Sequencing errors
• Initiation biases
• Quantification biases
• Are biases consistent across samples?
• Compensating biases

Types of mismatches in Illumina data are profoundly asymmetric and biased
[Figure: bar chart of counts (axis 0 to 800,000) from uniquely mapped tags with a single mismatch, by mismatch type: any single; G>T, G>A, A>C, G>C, C>A, T>A, T>C, C>T, A>G, A>T, T>G, C>G; delete T, delete A, insert A, insert T, delete G, insert C, delete C, insert G. Courtesy Thierry-Mieg]

Position of single mismatch in uniquely mapped tags
[Figure: count (axis 0 to 60,000) vs. position of single mismatch (0 to 36) for sample 1 and sample 2. Courtesy Thierry-Mieg]

Initiation Biases
[Figure: nucleotide frequencies versus position for stringently mapped reads. Hansen K D et al. Nucl. Acids Res. 2010;38:e131. © The Author(s) 2010. Published by Oxford University Press.]

Start Position Bias is Visible in MT-RNA

Start Position Bias is Consistent Across Samples
[Figure: counts per start site in lane 1 vs lane 2 (Marioni et al, Gen Res, 2008)]

Quantification Biases

Consistent Technology-Specific Biases
(a) 25-kb region of chromosome 11 amplified by three long-range PCR products (red rectangles).
(b) A heat-map-colored matrix displays the correlation of coverage depth across 260 kb of sequence between four samples by three technologies.
From Harismendy et al, Genome Biology 2009

Quantitative Biases
• Not all regions are represented equally
• GC-rich regions are represented more
• Independent of GC, some chromosome regions are represented more
  – Euchromatin bias
• Sequence initiation site biases
• ‘Mappability’ biases – some regions won’t have any uniquely mapped tags

GC Bias
• Density of reads depends strongly on GC content of regions
• Most bias seems to come from the PCR reaction
• Newer techniques show less bias, but it is still strong
[Figure: number of reads in 1 kb region vs. GC content (%) of 1 kb region. From Dohm et al 2008]

GC Bias depends on temperature
• Aird et al (Genome Biology 2011) did systematic tests of the effects of various conditions on GC bias
• They provided protocols that reduce GC bias but don’t eliminate it
NB: log scale

Even Best Protocols have Bias
• GC bias in Illumina reads from a 400-bp fragment library amplified using the standard PCR protocol (Phusion HF, short denaturation) on a fast-ramping thermocycler (red squares), Phusion HF with long denaturation and 2 M betaine (black triangles), AccuPrime Taq HiFi with long denaturation and primer extension at 65°C (blue diamonds) or 60°C (purple diamonds)
From Aird et al, Genome Biology 2011

Biases Are NOT Consistent
• The plot on the left shows log-fold changes between RPKM values from two biological replicates (NA11918, NA12761) from the data of Montgomery et al, Nature 2010
• From Hansen et al 2012
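
The two quantities these slides plot, read count as a function of the GC content of a region and log-fold change between RPKM values of two replicates, can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by Dohm et al, Aird et al or Hansen et al. All function names, the 10% GC bin width and the 0.5 pseudocount are assumptions made for the example; the RPKM formula is the standard definition (reads per kilobase per million mapped reads).

```python
import math


def gc_fraction(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def mean_count_by_gc_bin(windows, bin_width_percent=10):
    """Average read count of windows grouped by GC content.

    windows: iterable of (sequence, read_count) pairs, e.g. 1 kb regions.
    Returns {lower edge of GC bin in percent: mean read count}.
    The bin width is an assumed, illustrative choice.
    """
    sums, sizes = {}, {}
    for seq, count in windows:
        gc = gc_fraction(seq)
        if math.isnan(gc):
            continue  # window contains no unambiguous bases
        # Work in integer percent to avoid floating-point binning errors.
        gc_percent = round(100 * gc)
        bin_start = (gc_percent // bin_width_percent) * bin_width_percent
        sums[bin_start] = sums.get(bin_start, 0) + count
        sizes[bin_start] = sizes.get(bin_start, 0) + 1
    return {b: sums[b] / sizes[b] for b in sums}


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def log2_fold_change(x, y, pseudocount=0.5):
    """log2 ratio of two expression values.

    The pseudocount (an assumed value) keeps the ratio finite when
    either value is zero.
    """
    return math.log2((x + pseudocount) / (y + pseudocount))
```

If GC bias were identical in every sample, it would cancel in the log-fold change between replicates. The last slide's point is that it does not cancel, so a per-sample summary such as `mean_count_by_gc_bin` has to be computed for each sample separately before any correction is attempted.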