GNUMAP-SNP Nathan Clement The University of Texas Austin, TX, USA

Outline
• Motivation
• NGS Issues and Requirements
• Pair-HMM Memory Optimizations
• Results
• Conclusion

Motivation
Mutation Detection:
• SNP discovery
• HapMap and resequencing
Species Identification
• Bisulfite Sequencing
   ○ Epigenetic influences
• RNA editing

Error Rates*
Instrument               Run Time   Mb/run      Bases/read   Primary Error Type   Error Rate (%)
3730xl (Capillary)       2 h        0.06        650          Substitution         0.1-1
454 FLX+                 18-20 h    900         700          Indel                1
Illumina HiSeq2000       10 days    ≤ 600,000   100+100      Substitution         ≥0.1
Ion Torrent – 318 chip   2 h        >1000       >100         Indel                ~1
PacBio RS                0.5-2 h    5-10        860-1100     CG Deletions         16
* Data current as of May 2011: Glenn, Travis C, “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol 11, pp 759-769, 2011

Pair-HMM
Pair-wise Alignment:
[Figure: example pairwise alignment of two short sequences, with matched columns and gaps.]
Equivalent Hidden Markov State Sequence:
[Figure: state path from Begin to End through match (M) and gap (GX, GY) states, labeled with emission probabilities (p, q) and transition probabilities (T).]

Pair-HMM (Mathematics)
• Match
• Gap (in both directions)

Pair-HMM (M)
       a     t     a     c     g     a     c     t
a    1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.00  0.68  0.00  0.00  0.00  0.00  0.00  0.00
t    0.00  0.32  0.68  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.32  0.68  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00

Pair-HMM (X)
       a     t     a     c     g     a     c     t
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.31  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM (Y)
       a     t     a     c     g     a     c     t
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.31  0.00  0.00  0.00  0.00
g    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM
       A     C     G     T
a    1.00  0.00  0.00  0.00
g    0.00  0.00  0.68  0.31
t    0.32  0.00  0.00  0.68
a    0.99  0.00  0.00  0.00
g    0.00  0.00  1.00  0.00
a    1.00  0.00  0.00  0.00
c    0.00  1.00  0.00  0.00
c    0.00  1.00  0.00  0.00

Expected Results
CHR    POS       TOT     A      C       G       T       SNP?
chrX   1755234   17.00   0.00   0.00    17      0.00    N
chrX   1755235   18.00   0.00   18.00   0.00    0.00    N
chrX   1755236   19.00   9.99   0.00    9.00    0.01    Y:g->a/g
chrX   1755237   19.50   0.00   0.00    0.00    19.50   N
chrX   1755238   19.50   0.00   0.00    19.50   0.00    N
chrX   1755239   46.00   0.01   19.49   0.00    0.00    N
PVAL 2.54e-08

Why Inline SNP Calling?
• Post-Processing
   ○ More disk space, less memory
• Inline
   ○ Requires more memory
   ○ Less disk space
   ○ Can include specific probabilities for each read

Previous Optimizations
• Two methods for speeding up mapping:
   1. Entire genome on one machine
   2. Split memory among different machines
      ○ Must normalize across all genome portions
      ○ MPI reduction

Previous Optimizations
[Figure: Sequence Processing Rate for Memory Allocation. x-axis: # Processes (1 to 256); y-axis: # sequences / second (10 to 2000); series: Expected, Single Machine, Spread Memory.]

Memory Requirements
• Human Genome (3 Gb)
• HashMap ≈ 12 GB
• 4 bits/character = 1.5 GB
• 5 floating-point values per base (A, C, G, T, plus N) = sizeof(float) * 5 * 3 Gb = 60 GB
• Also stores total for easy computation = sizeof(float) * 3 Gb = 12 GB
• Total of ≈ 90 GB per run

Three Memory Optimizations
• Normal (no optimization)
• Integer discretization
• Centroid discretization

Integer Discretization
• Only need one floating-point value (for the total) and 1 byte per nucleotide
• “Parts per 255”
• Biggest hit: going into and out of “integer space”

Integer Discretization
• Step 1: Convert from integer space
• Step 2: Add from ri to genome
• Step 3: Convert back to integer space

Added from ri:
         Total   A      C      G      T      N
         1.0     0.68   0.31   0.01   0.00   0.00

Genome (integer space):
         Total   A      C      G      T      N
before   12.0    231    7      12     3      2
after    13.0    228    13     11     3      2

Genome (floating-point space):
         Total   A      C      G      T      N
before   12.0    10.9   0.33   0.56   0.15
after    13.0    11.6   0.64   0.57   0.15

Centroid Discretization
• Many states not used:
   ○ [255, 255, 255, 255, 255]
   ○ [0, 0, 0, 0, 0]
• Many states not biologically relevant
   ○ SNP transition (common) vs transversion (not likely)
• MSA uses this compression to perform fast one-to-many alignment

Centroid Discretization (cont)
• Benefits
   ○ Doesn’t waste impossible or infrequently used space
   ○ Much smaller memory footprint
• Drawbacks:
   ○ Slight overhead in converting between centroid and floating-point spaces
   ○ Rounding error (how significant?)

Speed Comparison
[Figure: speed comparison chart; content not recoverable from the extracted text.]

Optimization Stats (chrX)
Optimization   Memory    Mem %   Wallclock   TP     FP
Normal         4.76 GB   100%    04:25:55    1309   127
CharDisc       2.58 GB   54.2%   04:36:58    677    0
CentDisc       2.01 GB   42.2%   04:27:29    166    9058

Conclusion
• For high error rates, the HMM approach is ideal, but requires more memory
• Distributing the genome across processors doesn’t scale linearly
• Discretization methods provide good memory reductions (down to 42% of the unoptimized footprint)
• Centroid discretization performs poorly
• Integer discretization can be used when available memory is low

Questions
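The three-step integer discretization update from the slides (convert from integer space, add the read's contribution, convert back) can be sketched roughly as follows. This is a minimal illustration in Python, not GNUMAP-SNP's actual code: the function names, the round-to-nearest rule, and the example numbers are assumptions made for this sketch.

```python
def to_float_space(total, parts):
    """Step 1: expand parts-per-255 bytes into per-base float weights."""
    return [total * p / 255 for p in parts]


def add_read(total, parts, read):
    """Add one read's per-base probabilities to a genome position.

    total -- float sum of all weight seen at this position
    parts -- five bytes (A, C, G, T, N), each a share of `total` in parts per 255
    read  -- five floats contributed by the read (hypothetical input format)
    """
    # Step 1: leave integer space.
    floats = to_float_space(total, parts)
    # Step 2: add the read's contribution base by base.
    floats = [f + r for f, r in zip(floats, read)]
    new_total = total + sum(read)
    if new_total <= 0:
        return 0.0, [0] * len(parts)
    # Step 3: return to integer space. Rounding is where precision is lost,
    # so the bytes need not sum to exactly 255.
    new_parts = [min(255, round(255 * f / new_total)) for f in floats]
    return new_total, new_parts


# A position that has seen 10 units of pure A receives one read supporting C.
total, parts = add_read(10.0, [255, 0, 0, 0, 0], [0.0, 1.0, 0.0, 0.0, 0.0])
print(total, parts)
print(to_float_space(total, parts))
```

Converting the result back to floating point gives roughly 10.008 for A rather than exactly 10, which is the kind of rounding error the discretization trades for storing one byte per nucleotide.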
