R and Bioconductor: open source software for analysing genomics data

Belinda Phipson
1 February 2016
1
Who am I?
• Trained as a statistician
• Moved into Bioinformatics in February 2007
• User of R for the last 10 years
• Contributor and maintainer of R Bioconductor packages
Who I’m not
• A geneticist or biologist
• A software engineer
• Someone who is awesome with “computer stuff”
2
What is R?
“R is a language and environment for
statistical computing and graphics”
3
What is R?
• R is the open source alternative to the commercially
available S software
– Released under the Free Software Foundation’s GNU General Public License
• R is more flexible than other statistical software
– Users can easily define their own functions and create
packages
– C, C++ and Fortran code can be linked
• Runs on a wide variety of platforms
• RStudio: https://www.rstudio.com/ (a talk on its own)
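As a small illustration of the point above that users can easily define their own functions, here is a minimal sketch. The function name `coef_var` and the input values are made up for this example and do not come from the talk.

```r
# Hypothetical helper: coefficient of variation (standard deviation relative to the mean)
coef_var <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

expr <- c(2, 4, 6, NA)          # made-up values, including a missing one
coef_var(expr, na.rm = TRUE)    # the missing value is dropped before computing
```

Once defined, the function behaves like any built-in one, and a collection of such functions plus their help pages is essentially what a package is.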
4
R has become very popular
IEEE Spectrum language popularity rankings, 2014 and 2015
http://r4stats.com/articles/popularity/
5
Most popular tool for “analytics”
The competing statistical software tools are very expensive
Figure: analytics tools used by respondents to the 2015 Rexer Analytics Survey. In this view, each respondent was free to check multiple tools.
http://r4stats.com/articles/popularity/
6
R is a powerful tool
• Command line driven
• Empowers people to do reproducible research
• Open source (also) means that you can learn from
others
• Ability to produce publication quality graphics
• Access to cutting edge techniques
• Access to lots of packages
7
Quick graphics examples
8
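Example of a figure made using base R for publication in Genome Biology
# egDMDV is not defined on the slide; from the code it is presumably a matrix of
# M values with one row per CpG and 443 columns (160 normal, then 283 cancer).
library(missMethyl)  # provides getLeveneResiduals()

par(mar = c(6, 6, 5, 3) + 0.1)
# Panels A and B share the top row; C and D each span a full row
layout(matrix(c(1, 2, 3, 3, 4, 4), ncol = 2, byrow = TRUE))
group <- rep(c(0, 1), c(160, 283))  # 0 = normal, 1 = cancer
design <- model.matrix(~group)

# (A), (B): M values per group for the top DM and top DV CpG
stripchart(egDMDV[2, ] ~ design[, 2], method = "jitter", pch = 16, cex = 0.7,
           col = c(4, 2), group.names = c("Normal", "Cancer"), ylab = "M values",
           vertical = TRUE, cex.axis = 1.5, cex.lab = 2)
title("(A) Top DM CpG", cex.main = 2)
stripchart(egDMDV[1, ] ~ design[, 2], method = "jitter", pch = 16, cex = 0.7,
           col = c(4, 2), group.names = c("Normal", "Cancer"), ylab = "M values",
           vertical = TRUE, cex.axis = 1.5, cex.lab = 2)
title("(B) Top DV CpG", cex.main = 2)

# (C), (D): absolute deviations (Levene residuals) per sample for the same CpGs
par(mar = c(6, 6, 5, 1) + 0.1)
z <- getLeveneResiduals(egDMDV, design, coef = 2)
barplot(z$data[2, ], names.arg = "", col = c(rep(4, 160), rep(2, 283)), xlab = "Samples",
        ylab = "Absolute deviation", cex.lab = 2, cex.axis = 1.5, border = NA)
text(80, 3, labels = "Normal tissue", col = 4, cex = 2)
text(400, 3, labels = "Cancer tissue", col = 2, cex = 2)
title("(C) Top DM CpG", cex.main = 2)
barplot(z$data[1, ], names.arg = "", col = c(rep(4, 160), rep(2, 283)), xlab = "Samples",
        ylab = "Absolute deviation", cex.lab = 2, cex.axis = 1.5, border = NA)
text(80, 3, labels = "Normal tissue", col = 4, cex = 2)
text(380, 3, labels = "Cancer tissue", col = 2, cex = 2)
title("(D) Top DV CpG", cex.main = 2)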
9
Example of awesome plot using the “Gviz” library
Combines 3 different data types
236 lines of R code
Made by Jovana Maksimovic
10
The R community
• Packages
• Melbourne Users of R Network
• Yearly conference
• Where to get help
Slide from Joseph B Rickert
11
R is pretty awesome…
(Except not so much for high dimensional data)
12
Example: what’s going on in a tumour?
Tumour → RNA sample → (lots of steps…) → count data (20 000 rows)
Can this inform us on which drugs to give the patient?
13
Does the expression level of a gene change between cancer and normal samples?
Gene   Group            Expression
DNER   Normal samples   11.15  10.01  10.97  11.03  11.49
DNER   Cancer samples    5.53   4.31   3.39   3.51   3.98
Is this difference statistically significant?
14
The first command I typed:
> ?t.test
starting httpd help server ... done
15
t.test {stats}                                                  R Documentation
Student's t-Test
Description
Performs one and two sample t-tests on vectors of data.
Usage
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
## S3 method for class 'formula'
t.test(formula, data, subset, na.action, ...)
Arguments
x            a (non-empty) numeric vector of data values.
y            an optional (non-empty) numeric vector of data values.
alternative  a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter.
mu           a number indicating the true value of the mean (or difference in means if you are performing a two sample test).
paired       a logical indicating whether you want a paired t-test.
var.equal    a logical variable indicating whether to treat the two variances as being equal.
16
Everything in R is vectorised*
(even scalars)
> dner
TCGA-B0-5709-11 TCGA-CW-5591-11 TCGA-CW-6087-11 TCGA-CW-5585-11 TCGA-CW-5589-11
      11.150319       10.006868       10.970987       11.029617       11.492993
TCGA-B0-5695-01 TCGA-B0-4710-01 TCGA-B4-5836-01 TCGA-B2-4099-01 TCGA-B0-5083-01
       5.528162        4.314067        3.393850        3.508280        3.981251
> group
[1] Normal Normal Normal Normal Normal Cancer Cancer Cancer Cancer Cancer
Levels: Cancer Normal
* Another talk on its own
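A minimal sketch of what "vectorised (even scalars)" means in practice. The values and variable names below are made up for illustration; only base R is used.

```r
x <- 5                       # a "scalar" is just a vector of length one
length(x)
is.vector(x)

counts <- c(10, 200, 3000)   # made-up values
log10(counts)                # the function is applied to every element, no loop needed
scaled <- counts / 10        # arithmetic is element-wise too

labels <- c("Normal", "Cancer", "Cancer")
cancer_counts <- counts[labels == "Cancer"]   # a logical vector used as an index
```

The last line is the same indexing idiom as `dner[group=="Cancer"]` used on the next slide.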
17
> t.test(dner~group)
        Welch Two Sample t-test
data: dner by group
t = -14.864, df = 6.8485, p-value = 1.828e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.869296 -5.700774
sample estimates:
mean in group Cancer mean in group Normal
            4.145122            10.930157
> t.test(dner[group=="Cancer"],dner[group=="Normal"])
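The two calls above can be reproduced from the values printed on the earlier slide. This is a sketch: the numbers and sample names are copied from the `dner` printout, while the way `group` is constructed here is an assumption (the talk does not show how it was created).

```r
# Expression values and sample names as printed on the slide
dner <- c("TCGA-B0-5709-11" = 11.150319, "TCGA-CW-5591-11" = 10.006868,
          "TCGA-CW-6087-11" = 10.970987, "TCGA-CW-5585-11" = 11.029617,
          "TCGA-CW-5589-11" = 11.492993, "TCGA-B0-5695-01" = 5.528162,
          "TCGA-B0-4710-01" = 4.314067,  "TCGA-B4-5836-01" = 3.393850,
          "TCGA-B2-4099-01" = 3.508280,  "TCGA-B0-5083-01" = 3.981251)

# Assumed construction; factor levels are sorted alphabetically: Cancer, Normal
group <- factor(rep(c("Normal", "Cancer"), each = 5))

# Formula interface: Welch two-sample t-test, first level (Cancer) minus second (Normal)
out_formula <- t.test(dner ~ group)

# Equivalent call with two explicit vectors
out_vectors <- t.test(dner[group == "Cancer"], dner[group == "Normal"])

out_formula$estimate    # group means
out_formula$statistic   # t statistic
out_formula$p.value
```

Both forms run the same test; the formula version is usually more convenient when the data and the grouping variable sit side by side.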
18
But I want to test 20 000 genes
Expression matrix (rows: genes; columns: the 10 cancer and normal samples)
Gene1    8.04   7.75   7.06   7.69   7.49   7.96   7.23   7.54   8.07   7.98
Gene2   11.32  10.71  11.51  11.50  11.43  10.92  10.09  10.88  11.29  11.16
Gene3    7.88   6.57   7.46   6.70   7.35   8.46   8.82   6.69   8.63   6.77
Gene4    9.67  10.66   9.78  10.02  10.43  10.64  11.31   9.91  10.46   5.64
Gene5   10.93  12.24  11.15   9.81  11.20  10.61  11.31  11.32   9.78   9.83
Gene6    9.68  10.53   9.21  10.30   9.34  10.87   9.61   9.45   9.83   9.41
Gene7   11.07  10.23  10.34  11.34  11.01  10.98  11.25  10.16  10.92  10.30
Gene8    6.57   6.80   7.09   6.74   6.33   6.03   7.13   7.01   7.67   3.98
Gene9    6.42   6.11   5.96   6.19   5.85   5.76   6.28   6.07   6.97   7.19
Gene10   9.61   9.41   8.43   9.37   9.70   9.24   9.70  10.20   8.47   4.70
20 000 more rows
19
Many functions in base R are
not designed for matrices
> t.test(logCounts[1:10,]~group)
        Welch Two Sample t-test
data: logCounts[1:10, ] by group
t = -2.4845, df = 96.502, p-value = 0.0147
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.6840884 -0.1882764
sample estimates:
mean in group Cancer mean in group Normal
            8.456263             9.392445
Note the degrees of freedom: all 100 values were pooled into a single test instead of one test per gene.
20
You need to do some form of looping
(there are a number of ways to do this)
21
Set up empty vectors to store important statistics
> tstat <- rep(NA,10)
> tstat
[1] NA NA NA NA NA NA NA NA NA NA
> Pval <- rep(NA,10)
> Pval
[1] NA NA NA NA NA NA NA NA NA NA
> for(i in 1:10){
+   out <- t.test(logCounts[i,]~group)
+   tstat[i] <- out$statistic
+   Pval[i] <- out$p.value
+ }
> cbind(tstat,Pval)
           tstat      Pval
[1,]  0.65647357 0.5299499
[2,] -1.65581800 0.1402197
[3,]  1.28976536 0.2444178
[4,] -0.50571919 0.6380099
[5,] -0.96423394 0.3636724
[6,]  0.05651803 0.9563169
[7,] -0.25195820 0.8074337
[8,] -0.51537649 0.6316880
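The for loop above is one of several ways to loop. As a sketch of another, the same per-gene test can be written with `apply()`. The talk's `logCounts` data are not available here, so the matrix below is simulated and the group labels are an assumed layout; the numbers will not match the slide.

```r
set.seed(1)
# Simulated stand-in for logCounts: 10 genes (rows) by 10 samples (columns)
logCounts <- matrix(rnorm(100, mean = 8), nrow = 10,
                    dimnames = list(paste0("Gene", 1:10), NULL))
group <- factor(rep(c("Normal", "Cancer"), each = 5))   # assumed sample layout

# apply() over rows (MARGIN = 1): one Welch t-test per gene,
# each call returning the t statistic and the p-value
res <- t(apply(logCounts, 1, function(y) {
  out <- t.test(y ~ group)
  c(tstat = unname(out$statistic), Pval = out$p.value)
}))

res   # one row per gene, columns tstat and Pval
```

This avoids setting up the empty result vectors by hand, but it still calls `t.test()` once per gene, so it is a tidier loop rather than a fundamentally faster one.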
22
Statistical calculations are
based on matrix algebra
… and matrix calculations in R are a lot faster than running for loops
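To make this concrete, here is a sketch that computes Welch t statistics for every gene at once using whole-matrix operations (`rowMeans()` and `rowSums()`) instead of a loop. The data are simulated, and the helper `row_vars` and the sample layout are assumptions made for this example; this is not how limma fits its models.

```r
set.seed(2)
n_genes <- 20000
X <- matrix(rnorm(n_genes * 10, mean = 8), nrow = n_genes)   # simulated expression matrix
is_cancer <- rep(c(FALSE, TRUE), each = 5)                   # assumed sample layout

# Hypothetical helper: sample variance of each row, using matrix operations only
row_vars <- function(m) rowSums((m - rowMeans(m))^2) / (ncol(m) - 1)

a <- X[, is_cancer]
b <- X[, !is_cancer]

# Welch t statistic for all genes in one step:
# difference in group means divided by its standard error
tstat <- (rowMeans(a) - rowMeans(b)) /
  sqrt(row_vars(a) / ncol(a) + row_vars(b) / ncol(b))

head(tstat)
```

Wrapping this and the equivalent `for` loop over `t.test()` in `system.time()` is an easy way to compare the two approaches on your own machine.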
23
Matrix operations in R: * vs %*%
> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    2    3    4    5
[3,]    1    2    3    4    5
[4,]    1    2    3    4    5
[5,]    1    2    3    4    5
> mat * mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    9   16   25
[2,]    1    4    9   16   25
[3,]    1    4    9   16   25
[4,]    1    4    9   16   25
[5,]    1    4    9   16   25
Multiplies each element together
> mat %*% mat
     [,1] [,2] [,3] [,4] [,5]
[1,]   15   30   45   60   75
[2,]   15   30   45   60   75
[3,]   15   30   45   60   75
[4,]   15   30   45   60   75
[5,]   15   30   45   60   75
Multiplies each row by each column and adds elements together
24
Why don’t R functions do this automatically?
• This problem has not commonly been encountered in
classical statistical applications
• The R Core team is reluctant to make “drastic” changes to
base R
• The Bioconductor project started in 2001 to address the
unique issues facing researchers in bioinformatics
25
Bioinformatics and R: Bioconductor
“Bioconductor is an open source, open development
software project to provide tools for the analysis and
comprehension of high-throughput genomic data. It is
based primarily on the R programming language”
26
First Bioconductor paper published in 2004
27
Goals of the Bioconductor project:
• fostering collaborative development and widespread use of
innovative software
• reducing barriers to entry into interdisciplinary scientific research
• promoting the achievement of remote reproducibility of research
results
28
Bioconductor packages (as of a few days ago)
1. Software (n=1104)
- Statistical methods
2. AnnotationData (n=895)
- where does this gene come from in the genome?
3. ExperimentData (n=257)
- packages that contain data used for illustrative purposes
- e.g. datasets for books
29
Download stats for Bioconductor packages
30
Testing 20 000 genes is easy (and fast)
with the right package
Normal samples and Cancer samples: the same expression matrix as on slide 19 (Gene1 to Gene10 shown, 20 000 more rows)
31
> library(limma)
> design <- model.matrix(~group)
> fit <- lmFit(logCounts,design)
> fit <- eBayes(fit,trend=TRUE)
> topTable(fit,coef=2)
> design
   Int NormalVsCancer
1    1              1
2    1              1
3    1              1
4    1              1
5    1              1
6    1              0
7    1              0
8    1              0
9    1              0
10   1              0
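A sketch of how a design matrix like the one above can be built. The construction of `group` is an assumption; the column names `Int` and `NormalVsCancer` are taken from the slide, and since they differ from what `model.matrix()` produces by default, they were presumably renamed by hand.

```r
# Assumed sample layout: five normal samples followed by five cancer samples
group <- factor(rep(c("Normal", "Cancer"), each = 5))
levels(group)                 # alphabetical, so "Cancer" is the reference level

design <- model.matrix(~ group)
default_names <- colnames(design)
default_names                 # "(Intercept)" "groupNormal"

# Rename to match the slide
colnames(design) <- c("Int", "NormalVsCancer")
design
```

The second column is 1 for normal samples and 0 for cancer samples, so `coef=2` in `topTable()` refers to the normal versus cancer difference.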
> topTable(fit,coef=2)
                     logFC   AveExpr         t      P.Value   adj.P.Val        B
OVCH2|341277      9.169723 -1.567496 10.074333 7.003360e-07 0.008273631 6.168942
SERPINA5|5104     8.864641  3.466592  9.381868 1.419199e-06 0.008273631 5.571076
BMP8A|353500     -3.050923  2.523338 -8.755693 2.788463e-06 0.008273631 4.985163
DDB2|1643        -2.404746  4.612741 -8.652805 3.126878e-06 0.008273631 4.884508
SGK2|10110        2.774716  4.149838  8.596655 3.330052e-06 0.008273631 4.829035
TFAP2A|7020       4.618654  2.034020  8.571366 3.426171e-06 0.008273631 4.803926
AP1M2|10053       3.060636  4.677211  8.278972 4.783436e-06 0.009096147 4.507849
ST6GALNAC2|10610  2.998590  2.368024  8.231874 5.051767e-06 0.009096147 4.459151
DNER|92737        6.813317  1.705048  8.109853 5.825411e-06 0.009096147 4.331658
LOC91316|91316   -2.373354  3.688923 -7.989076 6.718301e-06 0.009096147 4.203558
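The `adj.P.Val` column matters because thousands of genes are tested at once. By default `topTable()` adjusts p-values with the Benjamini-Hochberg method, which base R also offers through `p.adjust()`. The sketch below uses five made-up p-values, not the ones from the slide, whose adjusted values depend on every gene that was tested.

```r
# Made-up raw p-values, sorted for readability
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p_bh <- p.adjust(p_raw, method = "BH")

cbind(p_raw, p_bh)
```

Adjusted values can be tied, as in the `adj.P.Val` column above, because the method forces them to be non-decreasing as the raw p-values increase.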
32
Getting a package into Bioconductor
33
Submitting your package
• Every package goes through a curation process
• Every package must meet certain standards to be
accepted
– http://www.bioconductor.org/developers/packageguidelines/
• Every package must have proper documentation
– Help pages and a user manual are compulsory
34
Every package must compile and build
successfully on multiple platforms:
• Linux
• Windows
• MacOS
(I was getting so frustrated debugging my
package I went out and bought a Mac.)
35
Benefits
• Easy to install Bioconductor packages from within R:
> source("https://bioconductor.org/biocLite.R")
> biocLite("missMethyl")
• Software packages are built daily, so someone will tell
you if your package has broken!
• Bioconductor developers mailing list
• Convenient to distribute your package
• Publishing methods – a Bioconductor package gives
you kudos
36
“How I decide when to trust an R package”
- Simply Statistics, Jeff Leek
Picture from Jeff Leek’s blog post
37
Pros which can also be cons
• Anyone can submit a package
– Coding style is mostly curated; BAD CODE IS NOT
• Dependencies
• It can take a while to get through the curation process
and have your package accepted
Top tip: write a package with a “buddy”
38
Bioconductor community
• Yearly Bioconductor conference which highlights
current developments
– Workshops
– Developer Day
• Bioconductor courses – all materials available online:
https://www.bioconductor.org/help/course-materials/
• Bioconductor is committed to open source
– All licenses are either Artistic 2.0, GPL2 or BSD
39
Where to get help
• Excellent support site
40
Summary
• R and Bioconductor have played a pivotal role in
shaping how Bioinformaticians analyse their data
• Open source is the key to their success
• There is a strong worldwide community of R and
Bioconductor users and developers
• You never stop learning about cool stuff in R!
41
Acknowledgements
MCRI Bioinformatics
- Alicia Oshlack
- Simon Sadedin
- Jovana Maksimovic
- Harriet Dashnow
- Anthony Hawkins
MCRI Statistical Genetics
- Ashley Farlow
WEHI
- Alan Rubin
The internet
42