GNUMAP-SNP
Nathan Clement
The University of Texas
Austin, TX, USA

Outline
  - Motivation
  - NGS Issues and Requirements
  - Pair-HMM
  - Memory Optimizations
  - Results
  - Conclusion

Motivation
  - Mutation detection: SNP discovery
      ○ HapMap and resequencing
      ○ Species identification
  - Bisulfite sequencing
      ○ Epigenetic influences
  - RNA editing

Error Rates*

Instrument              Run Time  Mb/run     Bases/read  Primary Error Type  Error Rate (%)
3730xl (Capillary)      2 h       0.06       650         Substitution        0.1-1
454 FLX+                18-20 h   900        700         Indel               1
Illumina HiSeq2000      10 days   ≤ 600,000  100+100     Substitution        ≥0.1
Ion Torrent – 318 chip  2 h       >1000      >100        Indel               ~1
PacBio RS               0.5-2 h   5-10       860-1100    CG Deletions        16

* Data current as of May 2011: Glenn, Travis C, “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol 11, pp 759-769, 2011

Pair-HMM
[Figure: an example pair-wise alignment and its equivalent hidden Markov state sequence. The path runs from Begin to End through match states M (emissions p_AA, p_TT, p_GG, p_CA, p_CC) and gap states G_X and G_Y (emission q), connected by transitions T_MM, T_MG and T_GM.]

Pair-HMM (Mathematics)
  - Match
  - Gap (in both directions)

Pair-HMM (M)

     a     t     a     c     g     a     c     t
a    1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.00  0.68  0.00  0.00  0.00  0.00  0.00  0.00
t    0.00  0.32  0.68  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.32  0.68  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00

Pair-HMM (X)

     a     t     a     c     g     a     c     t
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.31  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM (Y)

     a     t     a     c     g     a     c     t
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.31  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM

     A     C     G     T
a    1.00  0.00  0.00  0.00
g    0.00  0.00  0.68  0.31
t    0.32  0.00  0.00  0.68
a    0.99  0.00  0.00  0.00
g    0.00  0.00  1.00  0.00
a    1.00  0.00  0.00  0.00
c    0.00  1.00  0.00  0.00
c    0.00  1.00  0.00  0.00

Expected Results

CHR   POS      TOT    A     C      G      T      SNP?      PVAL
chrX  1755234  17.00  0.00  0.00   17.00  0.00   N
chrX  1755235  18.00  0.00  18.00  0.00   0.00   N
chrX  1755236  19.00  9.99  0.00   9.00   0.01   Y:g->a/g  2.54e-08
chrX  1755237  19.50  0.00  0.00   0.00   19.50  N
chrX  1755238  19.50  0.00  0.00   19.50  0.00   N
chrX  1755239  46.00  0.01  19.49  0.00   0.00   N

Why Inline SNP Calling?
  - Post-processing
      ○ More disk space, less memory
  - Inline
      ○ Requires more memory
      ○ Less disk space
      ○ Can include specific probabilities for each read

Previous Optimizations
Two methods for speeding up mapping:
  1. Entire genome on one machine
  2. Split memory among different machines
      ○ Must normalize across all genome portions
      ○ MPI reduction

Previous Optimizations
[Figure: Sequence Processing Rate for Memory Allocation. x-axis: # Processes (1 to 256); y-axis: # sequences / second (10 to 2000); series: Expected, Single Machine, Spread Memory.]

Memory Requirements
Human Genome (3 Gb)
  - HashMap ≈ 12 GB
  - 4 bits/character = 1.5 GB
  - 5 floating point values per base (A, C, G, T plus N) = sizeof(float) * 5 * 3 Gb = 60 GB
  - Also stores total for easy computation = sizeof(float) * 3 Gb = 12 GB
  - Total of ≈ 90 GB per run

Three Memory Optimizations
  - Normal (no optimization)
  - Integer discretization
  - Centroid discretization

Integer Discretization
  - Only need one floating point value (for total) and 1 byte/nucleotide
  - “Parts per 255”
  - Biggest hit: going into and out of “integer space”

Integer Discretization
  Step 1: Convert from integer space
  Step 2: Add from r_i to genome
  Step 3: Convert back to integer space

Added from r_i:
          Total  A     C     G     T     N
          1.0    0.00  0.68  0.31  0.01  0.00

Genome (integer space):
          Total  A     C     G     T     N
  before  12.0   231   7     12    3     3
  after   13.0   228   13    11    2     2

Genome (floating point space):
          Total  A     C     G     T     N
  before  12.0   10.9  0.33  0.56  0.15  0.15
  after   13.0   11.6  0.64  0.57

Centroid Discretization
  - Many states not used:
      ○ [255, 255, 255, 255, 255]
      ○ [0, 0, 0, 0, 0]
  - Many states not biologically relevant
      ○ SNP transition (common) vs transversion (not likely)
  - MSA uses this compression to perform fast one-to-many alignment

Centroid Discretization (cont)
  - Benefits:
      ○ Doesn’t waste impossible or infrequently used space
      ○ Much smaller memory footprint
  - Drawbacks:
      ○ Slight overhead in converting between centroid and floating point spaces
      ○ Rounding error (how significant?)

Speed Comparison

Optimization Stats (chrX)

Optimization  Memory   Mem %   Wallclock  TP    FP
Normal        4.76GB   100%    04:25:55   1309  127
CharDisc      2.58GB   54.2%   04:36:58   677   0
CentDisc      2.01GB   42.2%   04:27:29   166   9058

Conclusion
  - For high error rates, the HMM approach is ideal, but requires more memory
  - Distributing the genome across processors doesn’t scale linearly
  - Discretization methods provide good memory reductions (down to 42% of the unoptimized footprint on chrX)
  - Centroid discretization performs poorly in accuracy (166 TP, 9058 FP on chrX)
  - Integer discretization can be used when available memory is low

Questions
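
Backup: Integer Discretization Sketch
The three steps of integer discretization (convert from integer space, add the read's contribution, convert back) can be illustrated with a minimal Python sketch. This is not GNUMAP-SNP code: the function names are hypothetical, and the choice to renormalize by the sum of the five per-base values is an assumption made for illustration. The starting values are taken loosely from the Integer Discretization slide; because the normalization rule is assumed, the bytes produced need not agree with the slide's numbers.

```python
# Hypothetical sketch of "parts per 255" integer discretization.
# Names and the normalization rule are assumptions, not GNUMAP-SNP code.

BASES = ("A", "C", "G", "T", "N")


def to_float_space(total, parts):
    """Step 1: turn 1-byte proportions (parts per 255) into weighted counts."""
    return [total * p / 255.0 for p in parts]


def to_integer_space(values):
    """Step 3: turn weighted counts back into parts per 255 (one byte each)."""
    norm = sum(values)
    if norm == 0.0:
        return [0] * len(values)
    return [round(255.0 * v / norm) for v in values]


def add_read(total, parts, contribution):
    """Run steps 1-3 for one genome position and one read's contribution."""
    values = to_float_space(total, parts)
    values = [v + c for v, c in zip(values, contribution)]  # Step 2: add r_i
    new_total = total + sum(contribution)  # the total stays a single float
    return new_total, to_integer_space(values)


# One genome position: a float total plus five bytes (A, C, G, T, N).
total, parts = 12.0, [231, 7, 12, 3, 3]
contribution = [0.00, 0.68, 0.31, 0.01, 0.00]  # added from one read r_i

new_total, new_parts = add_read(total, parts, contribution)
print(new_total, dict(zip(BASES, new_parts)))
```

Each position costs one float plus five bytes instead of six floats, at the price of a conversion in each direction on every update and a rounding step when returning to integer space.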