R and Bioconductor: open source software for analysing genomics data Belinda Phipson 1 February 2016 1 Who am I? Who I’m not • Trained as a statistician • A geneticist or biologist • Moved into Bioinformatics in February 2007 • A software engineer • User of R for the last 10 years • Someone who is awesome with “computer stuff” • Contributor and maintainer of R Bioconductor packages 2 What is R? “R is a language and environment for statistical computing and graphics” 3 What is R? • R is the open source alternative to the commercially available S software – Free Software Foundation’s GNU General Public License • R is more flexible compared to other statistical software – Users can easily define their own functions and create packages – C, C++ and Fortran code can be linked • Runs on a wide variety of platforms • RStudio: https://www.rstudio.com/ (a talk on it’s own) 4 R has become very popular IEEE spectrum language popularity rankings 2015 2014 http://r4stats.com/articles/popularity/5 Most popular tool for “analytics” The competing statistical software tools are very expensive Analytics tools used by respondents to the 2015 Rexer Analytics Survey. In this view, each respondent was free to check multiple tools. 6 http://r4stats.com/articles/popularity/ R is a powerful tool • Command line driven • Empowers people to do reproducible research • Open source (also) means that you can learn from others • Ability to produce publication quality graphics • Access to cutting edge techniques • Access to lots of packages 7 Quick graphics examples 8 Example of a figure made using base R for publication in Genome Biology par(mar=c(6,6,5,3)+0.1) layout(matrix(c(1,2,3,3,4,4),ncol=2, byrow = TRUE)) group<-rep(c(0,1),c(160,283)) design<-model.matrix(~group) stripchart(egDMDV[2,]~design[,2],method="jitter",pch=16,cex=0.7,col=c(4,2), group.names=c("Normal","Cancer"),ylab="M values",vertical=T,cex.axis=1.5,cex.lab=2) title("(A) Top DM CpG",cex.main=2) stripchart(egDMDV[1,]~design[,2],method="jitter",pch=16,cex=0.7,col=c(4,2), group.names=c("Normal","Cancer"),ylab="M values",vertical=T,cex.axis=1.5,cex.lab=2) title("(B) Top DV CpG",cex.main=2) par(mar=c(6,6,5,1)+0.1) z<-getLeveneResiduals(egDMDV,design,coef=2) barplot(z$data[2,],names="",col=c(rep(4,160),rep(2,283)),xlab="Samples", ylab="Absolute deviation",cex.lab=2,cex.axis=1.5,border=NA) text(80,3,labels="Normal tissue",col=4,cex=2) text(400,3,labels="Cancer tissue",col=2,cex=2) title("(C) Top DM CpG",cex.main=2) barplot(z$data[1,],names="",col=c(rep(4,160),rep(2,283)),xlab="Samples", ylab="Absolute deviation",cex.lab=2,cex.axis=1.5,border=NA) text(80,3,labels="Normal tissue",col=4,cex=2) text(380,3,labels="Cancer tissue",col=2,cex=2) title("(D) Top DV CpG",cex.main=2) 9 Example of awesome plot using the “Gviz” library Combines 3 different data types 236 lines of R code Made by Jovana Maksimovic 10 The R community Packages Melbourne Users of R Network Yearly conference Where to get help Slide from Joseph B Rickert 11 R is pretty awesome… (Except not so much for high dimensional data) 12 Example: what’s going on in a tumour? RNA sample Tumour (Lots of steps…) Count data Can this inform us on which drugs to give the patient? 20 000 rows 13 Does the expression level of a gene change between cancer and normal samples? Normal samples DNER 11.15 10.01 10.97 11.03 Cancer samples 11.49 5.53 4.31 3.39 3.51 3.98 Is this difference statistically significant? 14 The first command I typed: > ?t.test starting httpd help server ... done 15 t.test {stats} R Documentation Student's t-Test Description Performs one and two sample t-tests on vectors of data. Usage t.test(x, ...) ## Default S3 method: t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...) ## S3 method for class 'formula' t.test(formula, data, subset, na.action, ...) Arguments x a (non-empty) numeric vector of data values. y an optional (non-empty) numeric vector of data values. alternative a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter. mu a number indicating the true value of the mean (or difference in means if you are performing a two sample test). paired a logical indicating whether you want a paired t-test. var.equal a logical variable indicating whether to treat the two variances as being equal. 16 Everything in R is vectorised* (even scalars) > ?t.test starting httpd help server ... done > dner TCGA-B0-5709-11 TCGA-CW-5591-11 TCGA-CW-6087-11 TCGA-CW-5585-11 TCGA-CW-5589-11 11.150319 10.006868 10.970987 11.029617 11.492993 TCGA-B0-5695-01 TCGA-B0-4710-01 TCGA-B4-5836-01 TCGA-B2-4099-01 TCGA-B0-5083-01 5.528162 4.314067 3.393850 3.508280 3.981251 > group [1] Normal Normal Normal Normal Normal Cancer Cancer Cancer Cancer Cancer Levels: Cancer Normal * Another talk on it’s own 17 > t.test(dner~group) Welch Two Sample t-test data: dner by group t = -14.864, df = 6.8485, p-value = 1.828e-06 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -7.869296 -5.700774 sample estimates: mean in group Cancer mean in group Normal 4.145122 10.930157 > t.test(dner[group=="Cancer"],dner[group=="Normal"]) 18 But I want to test 20 000 genes Cancer samples Normal samples Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 Gene9 Gene10 8.04 11.32 7.88 9.67 10.93 9.68 11.07 6.57 6.42 9.61 7.75 10.71 6.57 10.66 12.24 10.53 10.23 6.80 6.11 9.41 7.06 11.51 7.46 9.78 11.15 9.21 10.34 7.09 5.96 8.43 7.69 11.50 6.70 10.02 9.81 10.30 11.34 6.74 6.19 9.37 7.49 11.43 7.35 10.43 11.20 9.34 11.01 6.33 5.85 9.70 7.96 10.92 8.46 10.64 10.61 10.87 10.98 6.03 5.76 9.24 7.23 10.09 8.82 11.31 11.31 9.61 11.25 7.13 6.28 9.70 7.54 10.88 6.69 9.91 11.32 9.45 10.16 7.01 6.07 10.20 8.07 11.29 8.63 10.46 9.78 9.83 10.92 7.67 6.97 8.47 7.98 11.16 6.77 5.64 9.83 9.41 10.30 3.98 7.19 4.70 20 000 more rows 19 Many functions in base R are not designed for matrices > t.test(logCounts[1:10,]~group) Welch Two Sample t-test data: nC[1:10, ] by group t = -2.4845, df = 96.502, p-value = 0.0147 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.6840884 -0.1882764 sample estimates: mean in group Cancer mean in group Normal 8.456263 9.392445 20 You need to do some form of looping (there are a number of ways to do this) 21 > tstat <- rep(NA,10) > tstat [1] NA NA NA NA NA NA NA NA NA NA Set up empty vectors to store important statistics > Pval <- rep(NA,10) > Pval [1] NA NA NA NA NA NA NA NA NA NA > for(i in 1:10){ + out <- t.test(logCounts[i,]~group) + tstat[i] <- out$statistic + Pval[i] <- out$p.value + } > cbind(tstat,Pval) tstat Pval [1,] 0.65647357 0.5299499 [2,] -1.65581800 0.1402197 [3,] 1.28976536 0.2444178 [4,] -0.50571919 0.6380099 [5,] -0.96423394 0.3636724 [6,] 0.05651803 0.9563169 [7,] -0.25195820 0.8074337 [8,] -0.51537649 0.6316880 22 Statistical calculations are based on matrix algebra … and matrix calculations in R are a lot faster than running for loops 23 Matrix operations in R: * vs %*% > mat [,1] [,2] [,3] [,4] [,5] [1,] 1 2 3 4 5 [2,] 1 2 3 4 5 [3,] 1 2 3 4 5 [4,] 1 2 3 4 5 [5,] 1 2 3 4 5 > mat * mat > mat %*% mat [,1] [,2] [,3] [,4] [,5] [,1] [,2] [,3] [,4] [,5] [1,] 1 4 9 16 25 [1,] 15 30 45 60 75 [2,] 1 4 9 16 25 [2,] 15 30 45 60 75 [3,] 1 4 9 16 25 [3,] 15 30 45 60 75 [4,] 1 4 9 1660 25 75 [4,] 15 30 45 [5,] 1 4 9 1660 25 75 [5,] 15 30 45 > mat [,1] [,2] [,3] [,4] [,5] [1,] 1 2 3 4 5 [2,] 1 2 3 4 5 [3,] 1 2 3 4 5 [4,] 1 2 3 4 5 [5,] 1 2 3 4 5 Multiplies each Multiplies each row by element together each column and adds elements together 24 Why don’t R functions do this automatically? • This problem has not commonly been encountered in classical statistical applications • The R Core team is reluctant to make “drastic” changes to base R • The Bioconductor project started in 2001 to address the unique issues facing researchers in bioinformatics 25 Bioinformatics and R: Bioconductor “Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. It is based primarily on the R programming language” 26 First Bioconductor paper published in 2004 27 Goals of the Bioconductor project: • fostering collaborative development and widespread use of innovative software • reducing barriers to entry into interdisciplinary scientific research • promoting the achievement of remote reproducibility of research results 28 Bioconductor packages (as of a few days ago) 1. Software (n=1104) - Statistical methods 2. AnnotationData (n=895) - where does this gene come from in the genome? 3. ExperimentData (n=257) - packages that contain data used for illustrative purposes - e.g. datasets for books 29 Download stats for Bioconductor packages 30 Testing 20 000 genes is easy (and fast) with the right package Normal samples Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 Gene9 Gene10 8.04 11.32 7.88 9.67 10.93 9.68 11.07 6.57 6.42 9.61 7.75 10.71 6.57 10.66 12.24 10.53 10.23 6.80 6.11 9.41 7.06 11.51 7.46 9.78 11.15 9.21 10.34 7.09 5.96 8.43 7.69 11.50 6.70 10.02 9.81 10.30 11.34 6.74 6.19 9.37 Cancer samples 7.49 11.43 7.35 10.43 11.20 9.34 11.01 6.33 5.85 9.70 7.96 10.92 8.46 10.64 10.61 10.87 10.98 6.03 5.76 9.24 7.23 10.09 8.82 11.31 11.31 9.61 11.25 7.13 6.28 9.70 7.54 10.88 6.69 9.91 11.32 9.45 10.16 7.01 6.07 10.20 8.07 11.29 8.63 10.46 9.78 9.83 10.92 7.67 6.97 8.47 7.98 11.16 6.77 5.64 9.83 9.41 10.30 3.98 7.19 4.70 20 000 more rows 31 > > > > > library(limma) design <- model.matrix(~group) fit <- lmFit(logCounts,design) fit <- eBayes(fit,trend=TRUE) topTable(fit,coef=2) > design Int NormalVsCancer 1 1 1 2 1 1 3 1 1 4 1 1 5 1 1 6 1 0 7 1 0 8 1 0 9 1 0 10 1 0 > topTable(fit,coef=2) logFC AveExpr t P.Value adj.P.Val B OVCH2|341277 9.169723 -1.567496 10.074333 7.003360e-07 0.008273631 6.168942 SERPINA5|5104 8.864641 3.466592 9.381868 1.419199e-06 0.008273631 5.571076 BMP8A|353500 -3.050923 2.523338 -8.755693 2.788463e-06 0.008273631 4.985163 DDB2|1643 -2.404746 4.612741 -8.652805 3.126878e-06 0.008273631 4.884508 SGK2|10110 2.774716 4.149838 8.596655 3.330052e-06 0.008273631 4.829035 TFAP2A|7020 4.618654 2.034020 8.571366 3.426171e-06 0.008273631 4.803926 AP1M2|10053 3.060636 4.677211 8.278972 4.783436e-06 0.009096147 4.507849 ST6GALNAC2|10610 2.998590 2.368024 8.231874 5.051767e-06 0.009096147 4.459151 DNER|92737 6.813317 1.705048 8.109853 5.825411e-06 0.009096147 4.331658 LOC91316|91316 -2.373354 3.688923 -7.989076 6.718301e-06 0.009096147 4.203558 32 Getting a package into Bioconductor 33 Submitting your package • Every package goes through a curation process • Every package must meet certain standards to be accepted – http://www.bioconductor.org/developers/packageguidelines/ • Every package must have proper documentation – Help pages and user manual is compulsory 34 Every package must compile and build successfully on multiple platforms: • Linux • Windows • MacOS (I was getting so frustrated debugging my package I went out and bought a Mac.) 35 Benefits • Easy to install Bioconductor packages from within R: > source("https://bioconductor.org/biocLite.R") > biocLite("missMethyl") • Software packages are built daily, someone will tell you if your package has broken! • Bioconductor developers mailing list • Convenient to distribute your package • Publishing methods – a Bioconductor package gives you kudos 36 “How I decide when to trust an R package” - Simply Statistics, Jeff Leek Picture from Jeff Leek’s blog post 37 Pros which can also be cons • Anyone can submit a package – Coding style is mostly curated, BAD CODE IS NOT • Dependencies • It can take a while to get through the curation process and have your package accepted Top tip: write a package with a “buddy” 38 Bioconductor community • Yearly Bioconductor conference which highlights current developments – Workshops – Developer Day • Bioconductor courses – all materials available online: https://www.bioconductor.org/help/course-materials/ • Bioconductor is committed to open source – All licenses are either Artistic 2.0, GPL2 or BSD 39 Where to get help • Excellent support site 40 Summary • R and Bioconductor has played a pivotal role in shaping how Bioinformaticians analyse their data • Open source is the key to their success • There is a strong worldwide community of R and Bioconductor users and developers • You never stop learning about cool stuff in R! 41 Acknowledgements MCRI Bioinformatics - Alicia Oshlack - Simon Sadedin - Jovana Maksimovic - Harriet Dashnow - Anthony Hawkins MCRI Statistical Genetics - Ashley Farlow WEHI - Alan Rubin The internet 42