Servers, R and Wild Mice
Robert William Davies
Feb 5, 2014

Overview
• 1 – How to be a good server citizen
• 2 – Some useful tricks in R (including ESS)
• 3 – Data processing – my wildmice pipeline

1 – How to be a good server citizen
• Three basic things
– CPU usage: cat /proc/cpuinfo, top, htop
– RAM (memory) usage: top, htop
– Disk IO and space: iostat, df -h

cat /proc/cpuinfo

rwdavies@dense:~$ cat /proc/cpuinfo | head
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6344
stepping        : 0
microcode       : 0x600081c
cpu MHz         : 1400.000
cache size      : 2048 KB
physical id     : 0
rwdavies@dense:~$ cat /proc/cpuinfo | grep processor | wc -l
48

top and htop
• Memory can be in RAM or in swap – swap is BAD
– Ideal swap use is 0 (in this case it is probably residual)
• RAM – 512 GB total, 142 GB in use (the rest free)
• Load average – the average over 1, 5 and 15 minutes – compare it to the 48 cores of this server
• Also note process state – D = uninterruptible sleep, i.e. the process is stuck waiting on IO

iostat
rwdavies@dense:~$ iostat -m -x 2
• Shows per-disk use – here most disks are relatively unused, while one shows high sequential reading (fast!)
• There are also ways to optimize disk use for different IO requirements on a server – ask Winni

Disk space
• Get sizes of directories – du
• Get available disk space – df -h

1 – How to be a good server citizen – Take away
• CPU usage
– Different servers, different philosophies
– At minimum, try for load <= number of cores
• RAM
– High memory jobs can take down a server very easily and will make others mad at you – best to avoid
• Disk IO
– For IO bound jobs you often get better combined throughput from running one or a few jobs than from running many in parallel
– Also, don’t forget to try to avoid clogging up disks
• P.s. a good server uptime is 100%

2 – Some useful tricks in R (including ESS)
• R is a commonly used programming language / statistical environment
• Pros
– (Almost) everyone uses it
– Very easy to use
• Cons
– It’s slow
– It can’t do X
• But! R can be faster, and it might be able to do X! Here I’ll show a few tricks

ESS (Emacs Speaks Statistics)
• Emacs is a general purpose text editor
• Lots of programs exist for using R; ESS is an extension to emacs that lets you run R within emacs
• This lets you analyze data on a server very nicely in a split screen setup – code on the left, an R terminal on the right – with keyboard shortcuts to run your code
– Run a line of code: ctrl-c ctrl-j
– Run a paragraph of code: ctrl-c ctrl-p
– Switch windows: ctrl-x o
• Google “ESS cheat sheet”: http://ess.r-project.org/refcard.pdf (there, C- = ctrl and M- = the option key)

R – mclapply
• lapply – apply a function to the members of a list
• mclapply – do it multicore! (see the sketch below)
• Note there is a spawning cost that depends on the memory of the current R job (workers are forked from it)
• Not 19X faster, due to chromosome size differences – also, I ran this on a 48 core server with a load of 40
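A minimal sketch of the lapply to mclapply switch. The per-chromosome function, data and core count are invented for illustration, not from the talk:

library(parallel)

# Stand-in for a real per-chromosome analysis
analyzeChromosome <- function(chr) {
    x <- rnorm(1e6)  # pretend this is real work
    c(chr = chr, mean = mean(x), sd = sd(x))
}

# Single core
res1 <- lapply(1:19, analyzeChromosome)

# Multicore: the same call plus mc.cores. Each worker is forked from the
# current R process (hence the memory-dependent spawning cost), and
# remember from section 1 to keep load <= number of cores
res2 <- mclapply(1:19, analyzeChromosome, mc.cores = 8)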
/dev/shm
• (This might be an Ubuntu Linux thing?)
• Not really an R thing, but it can be quite useful for R multicore
• /dev/shm uses RAM, but operates like disk for many input / output purposes
• Example: you loop over 2,000 elements, each of which generates an object of size 10 Mb. You can pass that all back to R at once (very slow), or write to /dev/shm and read the files back in (faster) – see the sketch below

ff
• Using save and load is an easy way to interact with data most of the time; ff instead lets you use a variable like a pointer – the data stays on disk, but you index it like a normal in-RAM R object
• Bonus – you can use mclapply to write to different entries without collisions! (writing to the same entries = collisions) – see the second sketch below
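First, the /dev/shm recipe, as a hedged sketch (the directory name, object sizes and aggregation step are invented for illustration):

library(parallel)

# Each worker saves its ~10 Mb result to RAM-backed "disk" instead of
# serializing everything back through mclapply's return value
shmDir <- "/dev/shm/shm_example"
dir.create(shmDir, showWarnings = FALSE)

files <- mclapply(1:2000, function(i) {
    result <- rnorm(1.25e6)  # ~10 Mb of doubles, stand-in output
    f <- file.path(shmDir, paste0("chunk", i, ".RData"))
    save(result, file = f)   # fast - /dev/shm lives in RAM
    f
}, mc.cores = 8)

# Read the pieces back and aggregate, then tidy up - /dev/shm is shared
# RAM, so leftover files annoy everyone else on the server
means <- sapply(unlist(files), function(f) { load(f); mean(result) })
unlink(shmDir, recursive = TRUE)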
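And the ff pattern, again as a sketch: the file name and dimensions are made up, and while I believe ff() accepts vmode, dim, filename and overwrite arguments, treat the exact call as an assumption to check against the ff documentation:

library(ff)
library(parallel)

# A disk-backed 2000 x 100 matrix; the R object is essentially a pointer
# to the backing file, so it is cheap to hold and pass around
out <- ff(vmode = "double", dim = c(2000, 100),
          filename = "/tmp/ff_example.ff", overwrite = TRUE)

# Each forked worker writes its own row - disjoint entries, no collisions
# (two workers writing the same entries would collide)
invisible(mclapply(1:2000, function(i) {
    out[i, ] <- rnorm(100)  # stand-in for one subject's results
    NULL
}, mc.cores = 8))

out[1:2, 1:5]  # slices are read back on demand for downstream analysis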
Rcpp
• Some things R is not very good at – like for loops – can be pushed into c++ with Rcpp
• Simple example of c++ use in R: for a reference genome (say 60,000,000 bases) coded as integers 0 to 3, determine the count of each possible Kmer of size K (ignoring reverse complements for now)
• I write the c++ code as a text vector, then “compile” it – takes ~1–3 seconds for simple (<1,000 line) things – see the sketch below
• You can call fancy R things from c++
• I often just pass in lists, pass out lists
• (Note: upon making this slide I realized there is an easy way to do this using vectors:
table(4^0 * ref[seq(1, refT - K + 1)] + 4^1 * ref[seq(2, refT - K + 2)] + … + 4^(K - 1) * ref[seq(K, refT)])
and then adjusting for NAs)
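A minimal sketch of this pattern (not the talk’s actual code), using Rcpp::cppFunction, which compiles a c++ function supplied as an R character vector:

library(Rcpp)

# Compiled on the fly; takes a few seconds the first time
cppFunction('
IntegerVector countKmers(IntegerVector ref, int K) {
    int refT = ref.size();
    int nKmers = 1;
    for (int k = 0; k < K; k++) nKmers *= 4;  // 4^K possible Kmers
    IntegerVector counts(nKmers);             // zero-initialized
    for (int i = 0; i <= refT - K; i++) {
        int index = 0, mult = 1;
        bool ok = true;
        for (int k = 0; k < K; k++) {
            int b = ref[i + k];
            if (b == NA_INTEGER || b < 0 || b > 3) { ok = false; break; }
            index += mult * b;                // base-4 encoding of the Kmer
            mult *= 4;
        }
        if (ok) counts[index]++;              // windows containing NAs are skipped
    }
    return counts;
}
')

ref <- sample(c(0:3, NA), 1e6, replace = TRUE)  # toy "chromosome"
k3 <- countKmers(ref, 3)                        # counts of all 64 3-mers

The table() one-liner on the slide does the same job in vectorized R, at the cost of building a large temporary vector.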
Complicated example I actually use
• Want to do an EM HMM with 2,000 subjects and up to 500,000 SNPs, with 30 rounds of updating
• Using R – pretty fast, fairly memory efficient
– Use Rcpp to compile c++ forward-backward code from within R
– For each iteration from 1 to 30:
• mclapply over the 2,000 subjects
– run the Rcpp code, write the output to /dev/shm/
• aggregate the results from /dev/shm to get new parameters
– Write the (huge) final output to ff for easy downstream analysis
• The loop structure is sketched below
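A schematic of that loop, assuming hypothetical functions forwardBackward() (standing in for the Rcpp-compiled routine) and updateParams() (the M-step); only the iterate / mclapply / /dev/shm scaffolding is the point here:

library(parallel)

# genotypes: a list with one element per subject; params: model parameters.
# forwardBackward() and updateParams() are placeholders, not real functions
runEM <- function(genotypes, params, nIts = 30, nCores = 16,
                  shmDir = "/dev/shm/em_example") {
    dir.create(shmDir, showWarnings = FALSE)
    n <- length(genotypes)
    for (it in 1:nIts) {
        # E-step: one forward-backward per subject, results parked in RAM-disk
        mclapply(1:n, function(i) {
            res <- forwardBackward(genotypes[[i]], params)
            save(res, file = file.path(shmDir, paste0("subject", i, ".RData")))
            NULL
        }, mc.cores = nCores)
        # M-step: aggregate the per-subject summaries into new parameters
        perSubject <- lapply(1:n, function(i) {
            load(file.path(shmDir, paste0("subject", i, ".RData")))
            res
        })
        params <- updateParams(perSubject)
    }
    unlink(shmDir, recursive = TRUE)
    params
}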
2 – Some useful tricks in R (including ESS) – Take away
• A lot of people complain about R being slow, but it’s really not that slow
– Packages such as Rcpp, ff, multicore, etc. let your code run much faster
– Also, vector or matrix based R is pretty much as fast as anything
• If, 1 month after starting to use R, you are still copying and pasting, stop what you’re doing and take a day to learn ESS or something similar
– If you don’t use R often you can probably ignore this
– (P.s. I copied and pasted for 2 or 3 years before using ESS)

3 – Data processing – my wildmice pipeline
• We have data on 69 mice
• Primary goals of this study
– Recombination
• Build rate maps for different subspecies
• Find motifs
– Population genetics
• Relatedness, history, variation
• Admixture

The samples
[Figure: sample overview laid out on an admixture graph (migration weight scale, 10 s.e.), covering M. m. domesticus, M. m. musculus and M. m. castaneus]
• Wild mice: France N=20 at 10X, Taiwan N=20 at 10X, India N=10 at 30X
• Wild derived inbred strains (WDIS), each N=1 at 40X: Caroli, Famulus, WildSpret, WildDom, WildCast, WildMus
• Classical lab strains: N=13 at 40X
• 6 pops – 20 French, 20 Taiwan, 10 Indian, 17 strains, 1 Famulus, 1 Caroli

Pipeline overview
• bwa aln -q 10
• Stampy --bamkeepgoodreads
• Add read group info
• Merge into library level BAM using Picard MergeSamFiles
• Picard MarkDuplicates
• Merge into sample level BAM
• Use GATK RealignerTargetCreator on each population
• Realign using GATK IndelRealigner per BAM
• Use GATK UnifiedGenotyper on each population to create a list of putative variant sites
• GATK BaseRecalibrator to generate recalibration tables per mouse
• GATK PrintReads to apply the recalibration
• 69 analysis ready BAMs!

Example for 1 Mus caroli (~2.5 Gb genome, ~50X coverage)
• Downloaded 95 GB of gzipped .sra (15 files)
• Turned back into FQs (relatively fast) (30 files)
• bwa – about 2 days on 40 AMD cores (86 GB output, 30 files)
• Merged 30 -> 15 files (215 GB)
• stampy – cluster 3 – about 2–3 days, 1,500 jobs (293 GB output, 1,500 files)
• Merged stampy jobs together, turned into BAMs (220 GB, 15 files)
• Merged library BAMs together, then removed duplicates per library, then merged and sorted into the final BAM (took about 2 days, 1 AMD core) – 1 BAM, 170 GB
• Indel realignment – find intervals – 16 Intel cores, fast (30 mins)
• Apply realignment – 1 Intel core – slower – 1 BAM, 170 GB
• BQSR – call a putative set of variants – 16 Intel cores (<2 hours)
• BQSR – generate recalibration tables – 16 Intel cores – 10.2 hours (note – used a relatively new GATK which allows multi-threading for this)
• BQSR – output – 1 Intel core – 37.6 hours – 1 BAM, 231 GB
• NOTE: GATK also has scatter-gather for cluster work – probably worthwhile to investigate if you’re working on a project with 10T+ of data

Wildmice – calling variants
• We made two sets of callsets
– 3 population specific callsets (Indian, French, Taiwanese), principally for estimating recombination rates
• Recombination estimation is FP susceptible – prioritize low error at the expense of sensitivity
– A combined callset – for pop gen
• We used the GATK to call variants and the VQSR to filter them

What is the VQSR? (Variant Quality Score Recalibrator)
• Take the raw callset and split it into known and novel (array, dbSNP, etc.)
• Fit a Gaussian Mixture Model to QC parameters of the known variants
• Keep the novel variants that are close to the GMM; remove those far away
[Figure: VQSR model PDF, training (pos / neg), outcome (filtered / retained) and novelty (novel / known) panels, each plotting HaplotypeScore against QD, with a lod colour scale from -4 to 4]
• Ti/Tv – expect ~2.15 genome wide, higher in genic regions

Benchmarking callsets
It’s a good idea to benchmark your callsets and decide on the one with the parameters that suit the needs of your project (like sensitivity (finding everything) vs specificity (being right)).

Population  Training        Sensitivity  HetsInHomE  chrXHetE  nSNPs       TiTv  arrayCon  arraySen
French      Array Filtered  95           0.64        1.97      12,957,830  2.20  99.08     94.02
French      Array Filtered  97           0.72        2.28      14,606,149  2.19  99.07     96.01
French      Array Filtered  99           1.12        3.62      17,353,264  2.16  99.06     98.09
French      Array Not Filt  95           2.06        5.82      18,071,593  2.14  99.07     96.58
French      Array Not Filt  97           2.97        8.24      19,369,816  2.10  99.07     98.01
French      Array Not Filt  99           6.11        15.73     22,008,978  2.01  99.06     99.20
French      17 Strains      95           1.29        3.89      16,805,717  2.14  99.07     93.49
French      17 Strains      97           2.20        6.52      18,547,713  2.11  99.07     96.49
French      17 Strains      99           4.19        11.63     20,843,679  2.04  99.06     98.62
French      Hard Filters    NA           5.36        16.37     19,805,592  2.06  99.09     96.96

• Sensitivity – you set this – how much of your training set do you want to recover
• HetsInHomE – look at homozygous regions in the mouse – how many hets do you see?
• chrXHetE – look at chromosome X in males – how many hets do you see?
• nSNPs – number of SNPs
• TiTv – transition / transversion ratio – expect ~2.15 for real SNPs, ~0.5 for false positives
• arrayCon – concordance with array genotypes
• arraySen – sensitivity for polymorphic array sites
• We chose a dataset for recombination rate estimation with a low error rate but still a good number of SNPs

Population  Training        Sensitivity  HetsInHomE  chrXHetE  nSNPs        TiTv  arrayCon  arraySen
Taiwan      Array Not Filt  95           2.05        11.20     36,344,063   2.12  NA        NA
Taiwan      Array Not Filt  97           2.87        14.67     39,183,932   2.10  NA        NA
Taiwan      Array Not Filt  99           6.34        25.57     42,864,322   2.05  NA        NA
Taiwan      17 Strains      95           1.83        10.32     29,748,456   2.11  NA        NA
Taiwan      17 Strains      97           2.16        11.20     34,112,325   2.11  NA        NA
Taiwan      17 Strains      99           3.66        15.80     39,549,666   2.08  NA        NA
Taiwan      Hard Filters    NA           6.11        19.44     33,692,857   2.04  NA        NA
Indian      Array Not Filt  95           1.11        1.80      66,190,390   2.18  NA        NA
Indian      Array Not Filt  97           1.59        2.57      71,134,757   2.16  NA        NA
Indian      Array Not Filt  99           3.70        5.56      78,220,348   2.11  NA        NA
Indian      17 Strains      95           0.67        1.16      57,674,209   2.18  NA        NA
Indian      17 Strains      97           1.09        1.63      65,981,654   2.17  NA        NA
Indian      17 Strains      99           2.63        3.31      75,103,886   2.13  NA        NA
Indian      Hard Filters    NA           5.41        72.61     78,487,616   2.10  NA        NA
All         Array Not Filt  95           1.90        8.95      140,827,810  2.04  99.07     96.74
All         Array Not Filt  97           2.38        13.99     160,447,255  2.03  99.07     98.20
All         Array Not Filt  99           4.52        22.73     184,977,157  1.99  99.06     99.36

• Some of the datasets are extremely big
• Combined datasets allow us to better evaluate differences between populations
• Notes – VQSR sensitivity is not always “calibrated”
• Note: be VERY skeptical of others’ claims about sensitivity and specificity that depend on NGS – different filtering on different datasets can often explain a lot

Inbreeding
• French and Taiwanese mice are very inbred; not so for the Indian mice
[Figure: per-mouse homozygosity (red) along chromosome 19 (position in Mbp) for France, Taiwan and India]
• Huge Taiwan and French bottlenecks; India OK

Admixture
• Recent admixture is visible in the French and Taiwanese populations
[Figure: admixture graph (migration weight scale, 10 s.e.) of Caroli, Famulus, WildSpret, WildDom, WildCast, WildMus, Classical, France, Taiwan and India]
• Admixture / introgression is common

Recombination
• Our Domesticus hotspots are enriched in an already known Domesticus motif
• French hotspots are cold in Taiwan and vice-versa
• Broad scale correlation is conserved between subspecies, like in humans vs chimps
[Figure: Pearson correlation of recombination rates between populations (0.1–0.6) against window size (0–5,000 kb)]

Conclusions
• 1 – Don’t crash the server
• 2 – There are tricks to make R faster
• 3 – Sequencing data is big, slow and unwieldy. But it can tell us a lot

Acknowledgements
• Simon Myers – supervisor
• Jonathan Flint, Richard Mott – collaborators
• Oliver Venn – recombination work for the wild mice
• Kiran Garimella – GATK guru
• Cai Na – pre-processing pipeline
• Winni Kretzschmar – ESS, and many other things he does that I copy
• Amelie Baud, Binnaz Yalcin, Xiangchao Gan and many others for the wild mice