Some hints and tips for bioinformatics in the real world
Servers, R, online reports and an example
Robert William Davies
Feb 5, 2015

Outline
• 1 – How to be a good server citizen
• 2 – Some useful tricks in R (including ESS)
• 3 – Project reporting using github + knitr
• 4 – NGS pipeline example – wild mice

1 – How to be a good server citizen
• Server throughput is affected by
  – CPU usage: cat /proc/cpuinfo, top or htop
  – RAM: top and htop
  – Swap space: when your computer doesn't have enough RAM for open jobs, it puts some on the hard disk. This is BAD
  – Disk input/output and space: iostat, df, du

Check CPU information using cat /proc/cpuinfo

    rwdavies@dense:~$ cat /proc/cpuinfo | head
    processor       : 0
    vendor_id       : AuthenticAMD
    cpu family      : 21
    model           : 2
    model name      : AMD Opteron(tm) Processor 6344
    stepping        : 0
    microcode       : 0x600081c
    cpu MHz         : 1400.000
    cache size      : 2048 KB
    physical id     : 0
    rwdavies@dense:~$ cat /proc/cpuinfo | grep processor | wc -l
    48

Check RAM amount + usage and CPU usage using top and htop
• RAM – 512GB total, 142GB in use (the rest free)
• Load average – averaged over 1, 5 and 15 minutes
• 48 cores

Check disk use using iostat

    rwdavies@dense:~$ iostat -m -x 2

• In the example output, one disk was relatively unused while another showed high sequential reading (fast!)
• Also note the process state in top and htop – D = limited by IO
• There are also ways to optimize disk use for different IO requirements on a server – ask Warren Kretschmar

Check disk usage using du and df
• du – get sizes of directories
  – -h, --human-readable: print sizes in human readable format (e.g., 1K 234M 2G)
  – -s, --summarize: display only a total for each argument
• df – get available disk space for drives

1 – How to be a good server citizen – take away
• CPU usage
  – Different servers / groups have different philosophies
  – In general, aim for load <= number of cores
• RAM
  – High-memory jobs can take down a server very easily by pushing RAM into swap, and will make others very mad at you – best to avoid
• Disk IO
  – For IO-bound jobs you often get better combined throughput from running one or a few jobs than from running many in parallel. Test to determine what is best for you
  – Also, try to avoid clogging up disks

2 – Some useful tricks in R (including ESS)
• R is a commonly used programming language / statistical environment
• Pros
  – (Almost) everyone uses it (especially in genetics), so it's very easy to use for collaborations
  – Very easy to learn and use
• Cons
  – It's "slow"
  – It can't do X
• But! R can be faster, and it might be able to do X! Here I'll show a few tricks

R editors – ESS (Emacs Speaks Statistics)
• There are many R editors I won't talk about here (RStudio comes to mind)
• Emacs is a general-purpose text editor. There is an extension to emacs called ESS that allows you to use R within emacs
• This lets you analyze data on a server very nicely, using a split-screen environment (code on the left, an R terminal on the right) and keyboard shortcuts to run your code
  – Running a line of code: ctrl-c ctrl-j
  – Running a paragraph of code: ctrl-c ctrl-p
  – Switching windows: ctrl-x o
• It's easy to find cheat sheets for editors like emacs+ESS – Google "ESS cheat sheet": http://ess.r-project.org/refcard.pdf (C- = ctrl, M- = option key)

Rsamtools
• R package that gives you basic access to BAM files. Useful if you want to manually interrogate BAM files
• For example, get the number of reads in an interval, then calculate average mapping quality, etc.

R mclapply
• lapply – apply a function to members of a list
• mclapply – do it multicore! (a sketch follows below)
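To make this concrete, here is a minimal lapply vs mclapply sketch; the per-chromosome function, chromosome list and core count are invented placeholders, not the benchmark actually shown in the talk:

    library(parallel)  # mclapply ships with base R's parallel package

    chromosomes <- paste0("chr", 1:19)  # e.g., the mouse autosomes

    # stand-in for real per-chromosome work (e.g., summarizing a BAM slice)
    summarise_chrom <- function(chr) {
        mean(rnorm(1e6))
    }

    res_serial   <- lapply(chromosomes, summarise_chrom)
    res_parallel <- mclapply(chromosomes, summarise_chrom,
                             mc.cores = 4)  # mind the server load (section 1)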
• Note that there is a spawning cost, which depends on the memory footprint of the current R job
• In my per-chromosome benchmark it was not 19X faster, due to chromosome size differences; also, I ran this on a 48-core server with a load of 40

R ff
• Save R objects (like matrices) to disk in a non-human-readable format. Later, you can reload part of a matrix instead of the whole thing
• Example – for a matrix of 583,937 rows and 106 columns, accessing 1 entry takes 0.01 seconds with ff, versus 2 seconds when you load the whole thing into R first
• Bonus – you can write to different entries in an ff file from different processes! (a sketch follows below)
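A minimal sketch of the ff idea, using the matrix dimensions from the example above; the filename is invented for illustration:

    library(ff)  # install.packages("ff") if needed

    # create a disk-backed matrix; the entries live in the file, not in RAM
    mat <- ff(vmode = "double", dim = c(583937, 106),
              filename = "big_matrix.ff")

    mat[1:10, 1] <- rnorm(10)  # write a few entries
    x <- mat[5, 1]             # read one entry back, without loading the rest

    close(mat)  # flush to disk; open(mat) to use it again later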
R Rcpp
• The only thing I've ever been stuck on running fast in R is long for loops with dependent elements, like when writing an HMM
• Here, I used C++ in R to take a reference genome (size 60M) coded as integers 0 to 3 and calculate the number of K-mers of size K
• I write the C++ as a character vector in R, compile it using R (which takes a few seconds), then call the function as I would any other in R
• Works with multi-variable input and output using lists
• Note that you can call fancy R things from C++ (a sketch follows below)
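Here is a minimal sketch of the K-mer counting idea via Rcpp; this is my reconstruction for illustration, not the code from the talk:

    library(Rcpp)  # install.packages("Rcpp") if needed

    cppFunction('
    NumericVector countKmers(IntegerVector genome, int K) {
        int nKmers = 1;
        for (int j = 0; j < K; j++) nKmers *= 4;  // 4^K possible K-mers
        NumericVector counts(nKmers);             // initialized to zero
        for (int i = 0; i + K <= genome.size(); i++) {
            int code = 0;  // encode the K-mer as a base-4 integer
            for (int j = 0; j < K; j++) code = code * 4 + genome[i + j];
            counts[code] += 1;
        }
        return counts;
    }')

    genome <- sample(0:3, 1e6, replace = TRUE)  # toy stand-in for a reference
    counts <- countKmers(genome, K = 4)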
2 – Some useful tricks in R (including ESS) – take away
• A lot of people complain about R being slow, but it's really not that slow
• Lots of packages exist for speeding up your code, including Rcpp, ff, multicore/parallel, Rsamtools, etc.
• Spend the time finding an editor that works for you (emacs+ESS, vi, RStudio, etc.). It will save you a lot of time as you memorize keyboard shortcuts

3 – Project reporting using knitr + github
• "Robbie, what if you were to use alpha=2 instead of alpha=3? Surely alpha=2 is better"
• "Robbie, why don't you try filtering out X? I think that would improve things"
• "Robbie, can you send me new figures showing the effect of alpha=2?"
• "Sorry, actually, now that I've thought about it, I decided that alpha=3 is better"

What are knitr and github?
• knitr
  – Write R code to automatically generate PDF (via LaTeX) or fancy-HTML (via markdown) reports from results and parameters (a sketch follows below)
  – When results change, your output automatically incorporates those changes!
• github
  – Traditionally used for hosting code, versioning, collaborating, etc.
  – Can also be used to host project output online

Setting up a knitr+github pipeline
• Cons
  – Takes an afternoon to set up
  – Everything takes ~20-60 minutes longer as you write code to put it online
• Pros
  – You can make small changes and easily regenerate all of your downstream plots and tables
  – Everything is neat and organized – less scrambling to find files / code 6+ months later

Real life examples
• My github for one of my projects: https://github.com/rwdavies/hotspotDeath
• Kiran's github for PacBio malaria sequencing: https://github.com/kvg/PacBio/tree/master/reports/FirstLook
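To make the knitr idea concrete, here is a minimal hypothetical sketch (the file name, parameter and plot are invented for illustration; the real reports live in the repositories above). The .Rmd source is shown as R comments:

    # report.Rmd ---------------------------------------------------
    # ```{r setup}
    # alpha <- 3  # the contested parameter, defined exactly once
    # ```
    # ```{r downstream-plot}
    # plot(density(rgamma(1000, shape = alpha)),
    #      main = paste("alpha =", alpha))
    # ```

    # From R: regenerate the whole report after any change to alpha
    library(knitr)
    knit2html("report.Rmd")  # writes report.html with refreshed figures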
Changing a small parameter – real life example
• 2015_01_22 – made a small change to a filtering condition in the middle of the pipeline; the new downstream plot is similar (but better)!
• 2015_01_06 – the earlier version

3 – Project reporting using github + knitr – take away
• Some start-up cost
• Once set up, allows you to very easily modify parameters and re-run analyses
• Easy to return to and look up how you made all your figures, tables, etc.
• I will use this or something similar for every subsequent project I'm involved with

4 – An example of an NGS pipeline – wild mice analysis
• We have data (fastQs) on 69 mice
• We want VCFs (genotypes at SNPs) to build recombination rate maps and to do population-genetic type analyses
• Here I will discuss what the pipeline involved in terms of software + run times

[Figure: tree of the sequenced mouse populations (M. m. domesticus – France, M. m. musculus – Taiwan, M. m. castaneus – India, classical lab strains, and wild-derived samples including Caroli, Famulus, WildSpret, WildDom, WildMus, WildCast), with per-population sample sizes (N=1 to N=20) and coverage (10X to 40X)]
• 6 pops – 20 French, 20 Taiwanese, 10 Indian, 17 lab mice, 1 Famulus, 1 Caroli

The pipeline
• bwa aln -q 10
• Stampy --bamkeepgoodreads
• Add read group info
• Merge into library-level BAMs using Picard MergeSamFiles
• Picard MarkDuplicates
• Merge into sample-level BAMs
• GATK RealignerTargetCreator on each population
• GATK IndelRealigner per BAM
• GATK UnifiedGenotyper on each population, to create a list of putative variant sites
• GATK BaseRecalibrator, to generate recalibration tables per mouse
• GATK PrintReads, to apply the recalibration
• 69 analysis-ready BAMs!

Example for 1 Mus caroli (~2.5 GB genome, ~50X coverage)
• Downloaded 95GB of gzipped .sra (15 files)
• Turned back into fastQs (relatively fast) (30 files)
• bwa – about 2 days on 40 AMD cores (86 GB output, 30 files)
• Merged 30 -> 15 files (215 GB)
• Stampy – cluster 3 – about 2-3 days, 1500 jobs (293 GB output, 1500 files)
• Merged the Stampy jobs together and turned them into BAMs (220 GB, 15 files)
• Merged library BAMs together, removed duplicates per library, then merged and sorted into the final BAM (1 output, took about 2 days on 1 AMD core): 1 BAM, 170 GB
• Indel realignment – finding intervals is fast (16 Intel cores, 30 mins); applying the realignment is slower (1 Intel core): 1 BAM, 170 GB
• BQSR – calling a putative set of variants: 16 Intel cores, <2 hours; generating recalibration tables: 16 Intel cores, 10.2 hours (note – this used a relatively new GATK which allows multi-threading here); applying the recalibration: 1 Intel core, 37.6 hours: 1 BAM, 231 GB
• NOTE: GATK also has scatter-gather for cluster work – probably worthwhile to investigate if you're working on a project with 10T+ of data

Wild mice – calling variants
• We made two sets of callsets using the GATK
  – 3 population-specific callsets (Indian, French, Taiwanese), principally for estimating recombination rates; this use is susceptible to false positives, so we prioritized low error at the expense of sensitivity
  – A combined callset – for pop gen
• We used the GATK to call variants and the VQSR to filter

What is the VQSR? (Variant Quality Score Recalibrator)
• Take the raw callset and split it into known and novel sites (using an array, dbSNP, etc.)
• Fit a Gaussian mixture model to QC parameters (e.g., QD, HaplotypeScore) at the known sites
• Keep the novel sites that are close to the GMM; remove those far away
• Ti/Tv – expect ~2.15 genome-wide, higher in genic regions
[Figure: QD vs HaplotypeScore panels showing the model PDF (lod), the positive/negative training sets, and the filtered/retained outcome, split by novelty (known/novel)]

It's a good idea to benchmark your SNP callsets and pick the one whose parameters suit the needs of your project – e.g., sensitivity (finding everything) vs specificity (being right):

    Population  Training        Sensitivity  HetsInHomE  chrXHetE  nSNPs       TiTv  arrayCon  arraySen
    French      Array Filtered  95           0.64        1.97      12,957,830  2.20  99.08     94.02
    French      Array Filtered  97           0.72        2.28      14,606,149  2.19  99.07     96.01
    French      Array Filtered  99           1.12        3.62      17,353,264  2.16  99.06     98.09
    French      Array Not Filt  95           2.06        5.82      18,071,593  2.14  99.07     96.58
    French      Array Not Filt  97           2.97        8.24      19,369,816  2.10  99.07     98.01
    French      Array Not Filt  99           6.11        15.73     22,008,978  2.01  99.06     99.20
    French      17 Strains      95           1.29        3.89      16,805,717  2.14  99.07     93.49
    French      17 Strains      97           2.20        6.52      18,547,713  2.11  99.07     96.49
    French      17 Strains      99           4.19        11.63     20,843,679  2.04  99.06     98.62
    French      Hard Filters    NA           5.36        16.37     19,805,592  2.06  99.09     96.96

The French and Taiwanese mice are very inbred; not so the Indian mice
[Figure: homozygosity along chromosome 19 (position in Mbp) per mouse for the French, Taiwanese and Indian populations]
• Huge Taiwanese and French bottlenecks; India OK

Recent admixture is visible in the French and Taiwanese populations
[Figure: population graph with migration edges (migration weight 0-0.5, scale bar 10 s.e.) across Caroli, Famulus, WildSpret, WildCast, WildMus, WildDom, the classical strains, France, Taiwan and India]
• Admixture / introgression is common

Recombination results
• Our Domesticus hotspots are enriched for an already known Domesticus motif
• French hotspots are cold in Taiwan and vice-versa
• Broad-scale correlation is conserved between subspecies, as in humans vs chimps
[Figure: Pearson correlation vs window size (kb), 0-5000 kb]

4 – An example of an NGS pipeline – wild mice analysis – take away
• All the stuff involving BAMs is slow. Take care and try to avoid mistakes, but redo analyses if appropriate to fix them
• If you're doing human work, you can probably get away with Ti/Tv for SNP filtering. If not human, try to set up benchmarks to guide SNP calling and filtering
• Boy, do I wish I had used some sort of knitr + github reporting system (for the downstream stuff)

Extra 1 – Useful random linux
• screen – log onto a server and start a "screen" session. You can then disconnect from the server and reconnect at a later time with all your programs still open
• Set up password-less ssh using public-private keys! – Google "password less ssh"

Extra 2 – Give some thought to folder organization
• See William Stafford Noble, "A Quick Guide to Organizing Computational Biology Projects" (PLoS Computational Biology, 2009)
• In brief: keep each project under its own root directory, with a logical top-level organization, chronological organization at the next level, and logical organization below that – e.g., a data directory for fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and src / bin directories for source code and compiled programs
• Remember the paper's version of Murphy's Law: everything you do, you will probably have to do over again

Conclusions
• Please don't crash the server
• Please don't hog the server without reason (especially RAM and disk IO!)
• Consider something like emacs and ESS for quick programming in R
• R is pretty fast if you program it right, and there are lots of packages and tricks to make it faster
• Consider something like iPython or knitr (+/- github) to document your work and auto-generate reports on long projects
• Sequencing data is big, slow and unwieldy. But it is very informative!

Acknowledgements
• Simon Myers – supervisor
• Jonathan Flint, Richard Mott – close collaborators
• Oliver Venn – recombination work for wild mice
• Kiran Garimella – GATK, github
• Cai Na – pre-processing pipeline
• Winni Kretzschmar – ESS, many other things
• Amelie Baud, Binnaz Yalcin, Xiangchao Gan and many others for the wild mice