Some hints and tips for bioinformatics in the real world
Servers, R, online reports and an example
Robert William Davies
Feb 5, 2015
Outline
• 1 – How to be a good server citizen
• 2 – Some useful tricks in R (including ESS)
• 3 – Project reporting using github + knitr
• 4 – NGS pipeline example – wild mice
1 – How to be a good server citizen
• Server throughput is affected by
– CPU usage
• cat /proc/cpuinfo, top or htop
– RAM
• top and htop
– Swap space
• When your computer doesn't have enough RAM for open jobs, it puts some on the hard disk. This is BAD
– Disk input/output (IO) and space
• iostat, df, du
Check CPU information using cat /proc/cpuinfo

rwdavies@dense:~$ cat /proc/cpuinfo | head
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6344
stepping        : 0
microcode       : 0x600081c
cpu MHz         : 1400.000
cache size      : 2048 KB
physical id     : 0

rwdavies@dense:~$ cat /proc/cpuinfo | grep processor | wc -l
48
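If you live in R, you can also cross-check the core count from inside a session; a minimal sketch using the base parallel package:

library(parallel)
detectCores()   # number of logical cores reported by the OS, e.g. 48 on this server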
Check RAM amount + usage and CPU usage using top and htop

[Screenshot of htop: RAM – 512 GB total, 142 GB in use (rest free); load average shown as the average over 1, 5, and 15 minutes; 48 cores]
Check disk use using iostat

rwdavies@dense:~$ iostat -m -x 2

[Screenshot of iostat output: one device relatively unused, another showing high sequential reading (fast!)]

Also note from top and htop: process state D = limited by IO.
There are also ways to optimize disk use for different IO requirements on a server – ask Warren Kretzschmar.
Check disk usage using du and df

du – get sizes of directories
  -h, --human-readable    print sizes in human readable format (e.g., 1K 234M 2G)
  -s, --summarize         display only a total for each argument

df – get available disk space for drives
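A rough equivalent from inside R, if you just want one directory's total (a sketch; "mydir" is a placeholder path):

# sum the sizes of all files under a directory, in GiB
fs <- file.size(list.files("mydir", recursive = TRUE, full.names = TRUE))
sum(fs, na.rm = TRUE) / 1024^3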
1 – How to be a good server citizen – Take away
• CPU usage
– Different servers / groups have different philosophies
– In general, try for load <= number of cores
• RAM
– High-memory jobs can very easily take down a server by pushing RAM to swap, and will make others very mad at you – best to avoid
• Disk IO
– For IO-bound jobs you often get better combined throughput from running one or a few jobs than many in parallel. Test to determine which is best for you
– Also, try to avoid clogging up disks
2 – Some useful tricks in R (including ESS)
• R is a commonly used programming language / statistical environment
• Pros
– (Almost) everyone uses it (especially in Genetics), so it's very easy to use for collaborations
– Very easy to learn and use
• Cons
– It's "slow"
– It can't do X
• But! R can be faster, and it might be able to do X! Here I'll show a few tricks
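First, a word on "slow": much of it is coding style. A toy sketch comparing an explicit loop with the vectorised equivalent:

x <- runif(1e6)
system.time({ s <- 0; for (v in x) s <- s + v })  # explicit loop: slow
system.time(sum(x))                               # vectorised: far faster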
R editors: ESS (Emacs Speaks Statistics)
• There are many R editors I won't talk about here (RStudio comes to mind)
• Emacs is a general-purpose text editor. There exists an extension to emacs called ESS allowing you to use R within emacs
• This allows you to analyze data on a server very nicely, using a split-screen environment and keyboard shortcuts to run your code
[Screenshot: code on the left, an R terminal on the right]
Running a line of code: ctrl-c ctrl-j
Running a paragraph of code: ctrl-c ctrl-p
Switching windows: ctrl-x o
It's easy to find cheat sheets for editors like emacs+ESS
Google: ESS cheat sheet
http://ess.r-project.org/refcard.pdf
C- = ctrl
M- = meta (option/alt key)
Rsamtools
• R package to give you basic access to BAM files. Useful if you want to manually interrogate BAM files
• For example, get the number of reads in an interval, then calculate average mapping quality, etc. (sketched below)
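A minimal sketch of that example (the BAM filename and region are hypothetical):

library(Rsamtools)
library(GenomicRanges)

# pull mapping qualities for reads overlapping an interval
region <- GRanges("chr1", IRanges(1e6, 2e6))
param <- ScanBamParam(which = region, what = "mapq")
res <- scanBam("sample.bam", param = param)[[1]]
length(res$mapq)              # number of reads in the interval
mean(res$mapq, na.rm = TRUE)  # average mapping quality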
R mclapply
• lapply – apply a function to members of a list
• mclapply – do it multicore! (sketch below)
• Note: there is a spawning cost that depends on the memory use of the current R job
[Benchmark screenshot: per-chromosome runtimes with mclapply. Not 19X faster, due to chromosome size differences; also this was run on a 48-core server with a load of 40]
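A minimal sketch of the pattern (the per-chromosome function here is a placeholder):

library(parallel)

chroms <- paste0("chr", 1:19)   # the 19 mouse autosomes
results <- mclapply(chroms, function(chr) {
  # real code would load and process data for this chromosome
  paste("processed", chr)
}, mc.cores = 4)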
R ff
• Save R objects (like matrices) to disk in a non-human-readable format. Later, you can reload part of a matrix instead of the whole thing (see the sketch below)
Example – matrix of 583,937 rows, 106 columns
Accessing 1 entry takes 0.01 seconds with ff, and 2 seconds when you load the whole thing into R first.
Bonus – you can write to different entries in an ff file using different processes!
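A minimal sketch (dimensions from the example above; the filename is arbitrary):

library(ff)

# create a disk-backed matrix; only the touched parts are pulled into RAM
m <- ff(vmode = "double", dim = c(583937, 106), filename = "big_matrix.ff")
m[100, 5] <- 1.23   # write one entry
m[100, 5]           # read it back without loading the full matrix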
R Rcpp
The only thing I've ever been stuck on running fast in R is long for loops with dependent elements, like when writing an HMM.
Here, I used C++ in R to take a reference genome (size 60M) coded as an integer 0 to 3, and calculate the number of K-mers of size K.
I write the C++ as a character vector in R, compile it using R (which takes a few seconds), then call the function as I would any other in R (a toy sketch follows).
Works with multi-variable input and output using lists.
Note that you can call fancy R things from C++.
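A sketch of the pattern with Rcpp::cppFunction (this k-mer counter is my toy reconstruction, not the original code):

library(Rcpp)

cppFunction('
NumericVector countKmers(IntegerVector seq, int K) {
  int nK = 1;
  for (int j = 0; j < K; j++) nK *= 4;        // 4^K possible k-mers
  NumericVector counts(nK);                    // zero-initialized
  for (int i = 0; i + K <= seq.size(); i++) {
    int code = 0;
    for (int j = 0; j < K; j++) code = code * 4 + seq[i + j];
    counts[code] += 1;                         // assumes entries are 0-3
  }
  return counts;
}')

genome <- sample(0:3, 1e5, replace = TRUE)     # toy stand-in for a real genome
head(countKmers(genome, 3))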
2 – Some useful tricks in R (including ESS) – Take away
• A lot of people complain about R being slow, but it's really not that slow
• Lots of packages exist for speeding up your code, including Rcpp, ff, multicore, Rsamtools, etc.
• Spend the time finding an editor that works for you (emacs+ESS, vi, RStudio, etc). It will save you a lot of time as you memorize keyboard shortcuts
3 – Project reporting using knitr+github
• "Robbie, what if you were to use alpha=2 instead of alpha=3? Surely alpha=2 is better"
• "Robbie, why don't you try filtering out X? I think that would improve things"
• "Robbie, can you send me new figures showing the effect of alpha=2?"
• "Sorry, actually now that I've thought about it I decided that alpha=3 is better"
What are knitr and github?
• knitr
– Write R code to automatically generate PDF (via LaTeX) or markdown (fancy HTML) files from results and parameters
– When results change, your output automatically incorporates those changes! (a minimal chunk sketch follows below)
• github
– Traditionally used for hosting code, versioning, collaborating, etc.
– Can also be used to host project output online
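A minimal sketch of a knitr chunk in an R Markdown report (the parameter and the analysis are placeholders); re-knitting the file re-runs every chunk, so all downstream figures update automatically:

```{r alpha-figure}
alpha <- 2                      # the contested parameter lives in one place
x <- rnorm(1000) * alpha        # stand-in for the real analysis
hist(x, main = paste("alpha =", alpha))
```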
Setting up a knitr+github pipeline
• Cons
– Takes an afternoon to set up
– Everything takes ~20-60 minutes longer as you write code to put it online
• Pros
– You can make small changes and easily regenerate all of your downstream plots and tables
– Everything is neat and organized – less scrambling to find files / code 6+ months later
Real life examples
• My github for one of my projects: https://github.com/rwdavies/hotspotDeath
• Kiran's github for PacBio malaria sequencing: https://github.com/kvg/PacBio/tree/master/reports/FirstLook
Real life example – changing a small parameter
[Screenshots of two report versions:]
2015_01_22 – made a small change to a filtering condition in the middle of the pipeline; the new downstream plot is similar (but better)!
2015_01_06 – earlier version
3 – Project reporting using github + knitr – take away
• Some start-up cost
• Once set up, allows you to very easily modify parameters and re-run analyses
• Easy to return to and look up how you made all your figures, tables, etc.
• I will use this or something similar for every subsequent project I'm involved with
4 – An example of an NGS pipeline – wild mice analysis
• We have data (FASTQs) on 69 mice
• We want VCFs (genotypes at SNPs) to build recombination rate maps and to look at population-genetics-type analyses
• Here I will discuss what the pipeline involved in terms of software + run times
[Figure: population tree with migration edges (migration weight scale 0–0.5; scale bar 10 s.e.). Samples and coverages: Caroli (N=1, 40X), Famulus (N=1, 40X), WildSpret (N=1, 40X), WildDom (N=1, 40X), WildMus (N=1, 40X), WildCast (N=10, 30X), Classical lab strains (N=13, 40X), France / M. m. domesticus (N=20, 10X), plus the India (N=10, 30X) and Taiwan (N=20, 10X) populations (M. m. castaneus, M. m. musculus)]
6 pops – 20 French, 20 Taiwan, 10 Indian, 17 lab mice, 1 Fam, 1 Caroli
bwa aln -q 10
Stampy --bamkeepgoodreads
Add read group info
Merge into library-level BAM using Picard MergeSamFiles
Picard MarkDuplicates
Merge into sample-level BAM
Use GATK RealignerTargetCreator on each population
Realign using GATK IndelRealigner per BAM
Use GATK UnifiedGenotyper on each population to create a list of putative variant sites
GATK BaseRecalibrator to generate recalibration tables per mouse
GATK PrintReads to apply recalibration
69 analysis-ready BAMs!
Example for 1 Mus caroli (~2.5 GB genome, ~50X coverage):
Downloaded 95 GB of gzipped .sra (15 files)
Turned back into FQs (relatively fast) (30 files)
bwa – about 2 days at 40 AMD cores (86 GB output, 30 files)
Merged 30 -> 15 files (215 GB)
stampy – cluster 3 – about 2-3 days, 1500 jobs (293 GB output, 1500 files)
Merge stampy jobs together, turn into BAMs (220 GB, 15 files)
Merge library BAMs together, then remove duplicates per library, then merge and sort into final BAM (1 output, took about 2 days, 1 AMD core)
1 BAM, 170 GB
Indel realignment – find intervals – 16 Intel cores, fast (30 mins)
Apply realignment – 1 Intel core – slower
1 BAM, 170 GB
BQSR – call putative set of variants – 16 Intel cores (<2 hours)
BQSR – generate recalibration tables – 16 Intel cores – 10.2 hours (note – used a relatively new GATK which allows multi-threading for this)
BQSR – output – 1 Intel core – 37.6 hours
1 BAM, 231 GB
NOTE: GATK also has scatter-gather for cluster work – probably worthwhile to investigate if you're working on a project with 10T+ data
Wild mice – calling variants
• We made two sets of callsets using the GATK
– 3 population-specific (Indian, French, Taiwanese), principally for estimating recombination rates
• False-positive susceptible – prioritize low error at the expense of sensitivity
– Combined – for pop gen
• We used the GATK to call variants and the VQSR to filter
What is the VQSR? (Variant Quality Score Recalibrator)
Take the raw callset and split it into known and novel variants (array, dbSNP, etc).
Fit a Gaussian mixture model on QC parameters (e.g. QD, HaplotypeScore) of the known variants.
Keep the novel variants that are close to the GMM; remove those far away.

[Figure: model PDF, LOD, training, and novelty panels plotting HaplotypeScore against QD, showing filtered vs retained calls, positive vs negative training sets, and novel vs known variants]

Ti/Tv -> expect ~2.15 genome-wide; higher in genic regions.

It's a good idea to benchmark your SNP callsets and decide on the one with the parameters that suit the needs of your project (like sensitivity (finding everything) vs specificity (being right)).
Population  Training        Sensitivity  HetsInHomE  chrXHetE  nSNPs       TiTv  arrayCon  arraySen
French      Array Filtered  95           0.64        1.97      12,957,830  2.20  99.08     94.02
French      Array Filtered  97           0.72        2.28      14,606,149  2.19  99.07     96.01
French      Array Filtered  99           1.12        3.62      17,353,264  2.16  99.06     98.09
French      Array Not Filt  95           2.06        5.82      18,071,593  2.14  99.07     96.58
French      Array Not Filt  97           2.97        8.24      19,369,816  2.10  99.07     98.01
French      Array Not Filt  99           6.11        15.73     22,008,978  2.01  99.06     99.20
French      17 Strains      95           1.29        3.89      16,805,717  2.14  99.07     93.49
French      17 Strains      97           2.20        6.52      18,547,713  2.11  99.07     96.49
French      17 Strains      99           4.19        11.63     20,843,679  2.04  99.06     98.62
French      Hard Filters    NA           5.36        16.37     19,805,592  2.06  99.09     96.96
French and Taiwanese mice are very inbred; not so for the Indian mice

[Figure: homozygosity (shown in red) along chromosome 19 (position in Mbp, 0-60) for each mouse in the France, Taiwan, and India populations]

Huge Taiwan and French bottleneck; India OK
Recent admixture is visible in the French and Taiwanese populations

[Figure: population tree with migration edges (migration weight scale 0–0.5; scale bar 10 s.e.) for Caroli, WildMus, WildDom, Classical, WildCast, WildSpret, Famulus, France, Taiwan, and India]

Admixture / introgression is common
Our domesticus hotspots are enriched in an already-known domesticus motif

Broad-scale correlation is conserved between subspecies, like in humans vs chimps

[Figure: Pearson correlation (0.1–0.6) between populations as a function of window size (kb, 0–5000)]

French hotspots are cold in Taiwan and vice-versa
4 – An example of an NGS pipeline – wild mice analysis – take away
• All the stuff involving BAMs is slow. Take care and try to avoid mistakes, but redo analyses if appropriate to fix them
• If you're doing human stuff, you can probably get away with Ti/Tv for SNP filtering. If not human, try to set up benchmarks to guide SNP calling and filtering
• Boy do I wish I had used some sort of knitr + github reporting system (for the downstream stuff)
Extra 1 – Useful random linux
• screen
– Log onto a server, start a "screen" session. You can then disconnect from the server and reconnect at a later time with all your programs open
• Set up password-less ssh using public-private keys!
– Google "password less ssh"
Extra 2 – Give some thought to folder organization

See William Stafford Noble, "A Quick Guide to Organizing Computational Biology Projects" (PLoS Computational Biology; Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle).

Two principles from the paper: someone unfamiliar with your project should be able to understand your work from the files alone – and most commonly that "someone" is you, a few months from now, when you no longer remember what you were up to or what conclusions you drew; and, a version of Murphy's Law, everything you do you will probably have to do over again.

Noble keeps each project under a common root directory (the exception being source code or scripts used in multiple projects, which may get project directories of their own), with logical organization at the top level and chronological organization below. At the root of most projects: a data directory for fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries.

[Figure 1 of the paper: directory structure for a sample project, called msms]
Conclusions
• Please don’t crash the server
• Please don’t hog the server without reason (especially
RAM and disk IO!)
• Consider something like emacs and ESS for quick
programming in R
• R is pretty fast if you program it right, and there are
lots of packages and tricks to make it faster
• Consider something like iPython or knitr(+/-github) to
document your work and auto-generate reports on
long projects
• Sequencing data is big, slow and unwieldy. But it is very
informative!
Acknowledgements
• Simon Myers – supervisor
• Jonathan Flint, Richard Mott – close collaborators
• Oliver Venn – recombination work for wild mice
• Kiran Garimella – GATK, github
• Cai Na – pre-processing pipeline
• Winni Kretzschmar – ESS, many other things
• Amelie Baud, Binnaz Yalcin, Xiangchao Gan and many others for the wild mice