Indel

From calling bases to calling variants: Experiences with Illumina data Gerton Lunter Wellcome Trust Centre for Human Genetics This talk  Refresher: Illumina sequencing  QC    What can go wrong Useful QC statistics Read mapping   Comparison of popular read mappers Stampy  Indel and SNP calling  (Some results: 1000 Genomes indel calls) 7x Illumina GA-II 2x Roche 454 WTCHG production (Gb) 700 600 500 400 300 200 100 0 1x Illumina HiSeq 2000 1. Refresher: Illumina sequencing Illumina sequencing Illumina sequencing  8 lanes… x 120 tiles x108 bp x 2 reads… = about 48 Gb raw bp 2. QC Quality issues  Bases are identified by their fluorescent tag   Single base per cycle: reversible terminator chemistry    Not perfect: fraction lags, fraction runs ahead: “Phasing” Limits read length Optimizing yield: cluster density     Overlapping emission spectra Higher densities mean more errors Above an optimum, yield decreases Partly signal processing issue: software improvements Low amounts of initial DNA  Linker-linker hybrids; duplicated reads Overlapping fluorescence spectra  C/A and G/T overlap  (Most common mutations are transitions, A-G and C-T) Rougemont et al, 2008 Refresher: Phred scores  Phred score = 10 log10( probability of error )  10: 10% error probability  20: 1% error probability  30: 0.1% error probability (one in 1,000) 3 = 50%, 7 = 20%  13 = 5%, 17 = 2%  23 = 0.5%, 27 = 0.2%  33 = 0.05%, 37 = 0.02%  Phasing August 2009 June 2010 Cluster density & other improvements August 2009: June 2010: Library complexity, duplicate reads  Some sequences are read several times:    Problem for variant calling    Criterion: both ends of a PE read map to matching location Can occur by chance, but low probability, except for very high coverage Post processing: duplicate removal   Any PCR error will be seen twice: evidence for variant Rate of duplicates is rarely >5%   Low amount of initial material, many PCR copies Optical duplicates; secondary cluster seeding Standard processing step (e.g. Samtools, Picard) Useful statistic:   Duplicate fraction is approximately additive across lanes (same library) 2x duplication fraction ≈ fraction of the library that was sequenced Library complexity, duplicate reads Fraction α of all molecules is sequenced Number of times a PCR copy is sequenced: Poisson(α) n= 0 1 2 3 … Poisson e-α α e-α α2 e-α/2 α3 e-α/6 … 0 1 2 … Duplicates: 0 Expected fraction of duplicates: e-α-1+α As a fraction of all reads sequenced: (e-α-1+α)/α = ½α+… Sequencing QC QC statistics QC statistics - coverage QC statistics – quality scores GATK recalibration tool 3. Read mapping Read mapping  First processing step after sequencing: Read mapping (most times)  Assembly (no reference sequence; specialized analyses)   Quality of mapping determines downstream results Accessible genome  Biases (ref vs. variant)  Sensitivity (divergent reference; snps, indels, SV)  Specificity (calibration of mapping quality)  Read mapper comparison  Read mappers:  Maq  BWA  Eland  Novoalign  Stampy  Criteria:  Sensitivity (overall; divergent reference; variants)  Specificity (mapping quality calibration)  Speed Sensitivity Sensitivity - indels Sensitivity – Divergent reference Specificity – ROC curves ROC - indels Proportion mapped to within 10kb of mate Performance on real data Efficiency Stampy – first part of algorithm read 15 bp subsequence 4 bytes x 229 entry (2 Gb) hash table candidate positions Remove rev-comp symmetry 29 bit word open addressing, cache-friendly Second part: Fast candidate alignment Single-instructionmultiple-data (SIMD), parallel execution Affine gap penalties. Linear-time and constant memory algorithm: DP table in registers. Maximum indel size 15 bp. Third part: Modeling mapping failures  Pseudo-bayesian posterior (using candidates, rather than all mapping positions)  Failure to find the correct candidate (2 or more mismatches in every 15bp subsequence)  Sequence not in reference (is sequence match better than expected best random match?) 4. SNP and indel calling SNP calling  General idea:  Works quite well! Some caveats: Include mapping quality: P(read|g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map)  Mapping errors are dependent: don’t include mapQ<10  Base errors are not uniform (A/C/G/T): assume worst case (all identical)  Assumes no anomalies (seg dups; alignments; indel/SV; …)   Hard problem: be conservative Expected SNP rate (human): 10-3/nt. FPR of 10-5 required for 1% FDR  Filtering is required to achieve good FDR – or all data features must be adequately modeled  Indel calling  General idea:  Differences with SNP calling:  Pseudo-Bayes: cannot consider all possible variants/genotypes Generate large set of candidates Filter using goodness-of-fit test  Illumina reads do not have an explicit indel error model Indel error model Indel error rate (per nt) 1.00E-02 1.00E-03 1.00E-04 1.00E-05 1 2 3 4 5 6 7 8 9 10 11 12 Homopolymer run length 13 14 15 Wrap up  GA-II produces large amounts of good data  Artefacts do occur, keep a look at QC statistics  Choice of mapper influences yield and quality  Variant calling: Bayesian approaches work well  Some assumptions (independence) not met, hard to model  Filtering remains necessary 

Indel

Related documents

Products

Support

Indel

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib