Software Tools for Scientific Data Analysis and Visualization: R and Bioconductor
Stowers Science Club
Earl F. Glynn, Scientific Programmer, Bioinformatics
20 May 2005

Topics
• What is “R”?
• What is “Bioconductor”?
• Pros and Cons
• How to get R and Bioconductor?
• Why should a biologist care?

What is “R”?
• “Calculator” and Statistical Analysis Tool: large number of built-in statistical and math functions
• Exploratory Data Analysis Tool: descriptive statistics/graphics
• Graphics and Data Visualization Tool: high-quality, customizable graphics
• Huge Library of Specialty “Packages”: a growing number specifically for microarray data analysis
• Statistical Computing Language/Environment: derivative of “S” and “S-Plus” languages from early 1980s

R Example: R as a graphing “calculator”
Temperature dependence of equilibrium constant
> T <- c(478, 533, 588, 643)   # note: this masks T as shorthand for TRUE
> Keq <- c(210, 73, 31, 16)
> plot(1/T, log(Keq), main = "log(Keq) Vs 1/T")
[Figure: scatter plot titled "log(Keq) Vs 1/T"; x-axis 1/T, y-axis log(Keq)]
Data Source: Hill & Petrucci, General Chemistry (2nd ed), 1999, p. 756

R Example: R as a graphing “calculator”
> fit <- lm(log(Keq) ~ I(1/T))
> fit

Call:
lm(formula = log(Keq) ~ I(1/T))

Coefficients:
(Intercept)       I(1/T)
     -4.726     4810.111

> abline(fit, col="red")
> coefficients(fit)
(Intercept)      I(1/T)
  -4.726161 4810.111351
> summary(fit)
[Figure: the "log(Keq) Vs 1/T" scatter plot with the fitted line added in red]

R Example: R as a graphing “calculator”
Simple Model: log(Keq) = a + b/T
> fit <- lm(log(Keq) ~ I(1/T))
Complex Model: log(Keq) = a + b/T + c*log(T) + d*T + e*T^2
> T <- c(200, 225, 273, 300, 325)
> Keq <- exp(1.2 + 0.3/T - 1.25*log(T) + 0.01*T - 0.0003*T^2)
> fit <- lm(log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))
> fit

Call:
lm(formula = log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))

Coefficients:
(Intercept)       I(1/T)    I(log(T))            T       I(T^2)
     1.2000       0.3000      -1.2500       0.0100      -0.0003

R Example: Exploratory Data Analysis
• Use descriptive statistics to see the “big picture” prior to formal analysis and to examine data quality
• Need techniques that are robust to outliers
  - Measures of Center:
    - Mean (normal distribution)
    - Median (skewed distribution)
  - Measures of Spread:
    - Standard Deviation (SD) (appropriate with Mean); standardize: (X - mean(X)) / sd(X)
    - Median Absolute Deviation (MAD) (appropriate with Median); standardize: (X - median(X)) / mad(X)
    - Interquartile Range (appropriate with Median)

R Example: Exploratory Data Analysis
Tukey’s “Five Number” Summary
Sorted data: 7 12 22 29 37 41 45 48 57 65 73 79 84
Min = 7; Q1 (“Lower Hinge”) = 29; Median = 45; Q3 (“Upper Hinge”) = 65; Max = 84
Interquartile Range (IQR) = Q3 - Q1
> x <- c(79,73,7,12,29,22,65,84,45,41,48,57,37)
> fivenum(x)
[1]  7 29 45 65 84
Source: John W. Tukey, Exploratory Data Analysis, 1977.
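The claim that median-based measures are robust to outliers can be illustrated with a short sketch. It reuses the thirteen values from the five-number-summary slide and appends one invented outlier (8400, standing in for something like a misplaced decimal point); the cutoff of 3.5 on the robust score is a common convention assumed here, not something taken from the slides.

```r
# Data from the five-number-summary slide
x <- c(79, 73, 7, 12, 29, 22, 65, 84, 45, 41, 48, 57, 37)

# Same data plus one hypothetical gross outlier (invented for illustration)
x_out <- c(x, 8400)

# Collect the measures of center and spread named on the slide
center_spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), mad = mad(v), iqr = IQR(v))
}

# Mean and SD change drastically; median, MAD, and IQR barely move
round(rbind(original = center_spread(x),
            with_outlier = center_spread(x_out)), 1)

# Robust standardization from the slide: (X - median(X)) / mad(X)
z_robust <- (x_out - median(x_out)) / mad(x_out)

# Indices of values whose robust score exceeds the assumed cutoff of 3.5
flagged <- which(abs(z_robust) > 3.5)
flagged
```

With the mean/SD standardization, the outlier inflates both the center and the spread it is measured against, which is why the median/MAD pairing is the safer default during a first look at data quality.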
R Example: Exploratory Data Analysis
Visualize the “five-number summary” with a “box and whisker” plot, or simply a “boxplot”: Minimum, Quartile 1, Median, Quartile 3, Maximum
> x <- c(79,73,7,12,29,22,65,84,45,41,48,57,37)
> boxplot(x, main="Five Number Summary")
[Figure: boxplot titled "Five Number Summary" with Min, Q1, Median, Q3, Max, and the IQR labeled]

R Example: Exploratory Data Analysis
> RawData <- read.csv("Complete_Dataset.csv", as.is=TRUE)
> Expression <- log2(data.matrix(RawData[,2:ncol(RawData)]))
> boxplot(data.frame(Expression),
          main="Bozdech 'Complete' Plasmodium Dataset",
          las=VERTICAL<-3, cex.axis=0.7,
          ylab="Log2 Expression Ratio")
[Figure: boxplots for time points TP1 through TP48, titled "Bozdech 'Complete' Plasmodium Dataset"; y-axis Log2 Expression Ratio]

R Example: Exploratory Data Analysis
> # Use Bioconductor package
> library(arrayMagic)
> plot.imageMatrix(Expression, yLabels="",
                   main="Log2 Gene Expression in Plasmodium Dataset")
[Figure: image plot titled "Log2 Gene Expression in Plasmodium Dataset"]

R Example: Statistical Analysis: Evaluate Gene Expression for Periodicity
> ShowSingleOligoProfileByName("i3518_1")
[Figure, four panels for oligo i3518_1 (N = 46): Expression vs Time [hours]; "Time Interval Variability" histogram (Frequency vs log10(delta T)); "Lomb-Scargle Periodogram" (Normalized Power Spectral Density vs Frequency [1/hour], with significance levels from p = 0.05 to p = 1e-06); "Peak Significance" (Probability vs Frequency [1/hour])]
Period at Peak = 45.7 hours; p = 1.48e-008 at Peak

R Example: Statistical Analysis: Evaluate Gene Expression for Periodicity
> ShowSingleOligoProfileByName("j167_5")
[Figure, same four panels for oligo j167_5 (N = 35)]
Period at Peak = 17.8 hours; p = 0.998 at Peak

R Example: Statistical Analysis: Multiple Hypothesis Testing
p.adjust function in R “stats” package
[Figure: "Multiple Testing Correction Methods (Using R's p.adjust methods)"; Log10(p) vs Rank Order of Sorted p Values for the methods bonferroni, holm, hochberg, fdr, and none, with a reference line at α = 0.0001]
fdr = Benjamini and Hochberg’s “False Discovery Rate” Method

R Example: Statistical Analysis: Logic Regression
[Equation: regression model with Boolean terms L1 and L2]
Where L1 and L2 are Boolean expressions. Each L can be represented by a logic tree.
Logic Tree: L = (B ∧ C) ∨ A
Ruczinski, et al. (2003), Logic Regression, Journal of Computational and Graphical Statistics, 12(3), 475-511.
R Package: LogicReg

R Example: Data Visualization
> example(layout)

> set.seed(19)
> library(MASS)
> x <- rnorm(50)
> y <- rnorm(50)
> d <- kde2d(x,y)
> image(d, col=terrain.colors(50))
> contour(d, add=TRUE)

> library(scatterplot3d)
> example(scatterplot3d)

> set.seed(19)
> x <- matrix(rnorm(200),10,20)
> heatmap(x)

> set.seed(19)
> hist(rnorm(100), freq=FALSE)
> curve(dnorm(x), add=TRUE, col="blue")
[Figures: layout example; kernel density image with contours; 3-D scatterplot of Girth, Height, and Volume; heatmap; "Histogram of rnorm(100)" with a normal density curve]

R Example: Data Visualization
Customized Graphics
# plot with error bars
x <- c(1,2,3,4,5)
y <- c(15,9,NA,19,22)
error <- c(3, 4, 1, 2.5, 0.5)

plot(x, y, col="red", type="o", main="R plot", xaxt="n",
     xlab="specimen", ylab="concentration", ylim=c(0,25))

delta <- 0.01 * diff(par("usr")[1:2])
segments(x, y-error, x, y+error)
segments(x-delta, y-error, x+delta, y-error)
segments(x-delta, y+error, x+delta, y+error)

names <- c("ABC","12345","Missing","VeryLongName","XYZ")
text(x, par("usr")[3] - 0.01*diff(par("usr")[3:4]),
     srt=30, adj=1, labels=names, xpd=TRUE)
mtext("subtitle")
[Figure: "R plot" with subtitle "subtitle"; concentration vs specimen, with error bars and rotated x-axis labels ABC, 12345, Missing, VeryLongName, XYZ]

R Example: Data Visualization
Graphics Notes
• R can write graphics as PostScript, PDF, or a variety of other formats.
• In Windows, copy and paste graphics as “metafile” to Word, PowerPoint, or other programs.
• In Windows, enable “History, Recording” in the graphics window, then use PageUp/PageDown to step through graphics.
• In Word, save as “Web page, filtered” to make a web page including GIF graphics with transparency.

~500 R Packages
http://cran.r-project.org/src/contrib/PACKAGES.html
Most packages deal with data analysis, statistics, and visualization.
Caution: Software quality varies. Validate first!

What is “Bioconductor”?
• Open Source Software for Bioinformatics
• Started in Fall 2001 at Harvard
• First Bioconductor Release in May 2002
• ~100 R Packages
• Software categories:
  - Analysis (e.g., “limma” linear models for microarrays)
  - Annotation (e.g., “Data packages”)
  - Database Interaction
  - Graphics & User Interface (e.g., “limmaGUI”)
  - Graphs
  - Pre-processing
  - Ontologies (tools for working with gene ontologies)
• Web: www.bioconductor.org

Bioconductor Example
Limma: linear models for microarrays
library(limma)
# Adapted from ?contrasts.fit
# Simulate gene expression data: 6 microarrays and 20000 genes
# with one gene differentially expressed in first 3 arrays.
# contrasts.fit: Given a linear model fit to microarray data,
# compute estimated coefficients and standard errors for a
# given set of contrasts.
set.seed(71)
M <- matrix(rnorm(20000*6,sd=0.3),20000,6)
M[1,1:3] <- M[1,1:3] + 2

# design matrix corresponds to oneway layout,
# columns are orthogonal
design <- cbind(First3Arrays=c(1,1,1,0,0,0),
                Last3Arrays=c(0,0,0,1,1,1))
fit <- lmFit(M,design=design)

# Would like to consider original two estimates plus
# difference between first 3 and last 3 arrays
contrast.matrix <- cbind(First3=c(1,0), Last3=c(0,1),
                         "Last3-First3"=c(-1,1))
fit2 <- contrasts.fit(fit,contrast.matrix)
fit2 <- eBayes(fit2)

# large values of fit2$t indicate differential expression
results <- classifyTestsF(fit2)
vennDiagram(vennCounts(results))
[Figure: Venn diagram with circles First3, Last3, and Last3-First3; region counts 72, 53, 13, 19, 1, 14, 23, and 19805]

R/Bioconductor
Pros
• Powerful analysis tools
• Command line processing; batch processing
• Graphics-rich software
• Several revisions/year
• Fast (most tasks)
• Free and open source: UNIX/Windows/Apple
• Strong user community
• Help via mailing list
Cons
• Can be quirky
• No “GUI”: difficult to interact with data
• Graphics-poor documentation
• Several revisions/year
• Slow (processing huge datasets)
• “Correct” way to ask
“One of the most intimidating things about R is the seeming endlessness of it.”
Paul E. Johnson, KU Political Science Dept, R-Help, 9 May 2000
www.ku.edu/~pauljohn/R/Rtips.html

How to get R and Bioconductor?
http://bioinfo
http://bioinfo/software/R.htm
... ...

Resources
Comprehensive R Archive Network (CRAN): http://cran.r-project.org
R for Bioinformatics Nov 2005?
SummeR Sessions?

Why should a biologist care?
• Excel has many limitations.
• R can serve as a powerful graphing “calculator.”
• R can easily work with the vectors and matrices that hold microarray data.
• State-of-the-art analysis software is often introduced in published papers using R.
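The point about vectors and matrices can be made concrete with a small sketch on simulated data. Everything here is hypothetical: the matrix size, the gene and array names, and the choice of median-centering are assumptions for illustration, loosely modeled on the simulated data in the limma example, and this is not a substitute for a proper linear-model analysis.

```r
set.seed(71)

# Hypothetical log2 expression matrix: 1000 genes (rows) x 6 arrays (columns)
M <- matrix(rnorm(1000 * 6, sd = 0.3), nrow = 1000, ncol = 6,
            dimnames = list(paste0("gene", 1:1000), paste0("array", 1:6)))

# Spike one gene in the first three arrays, as in the limma example
M[1, 1:3] <- M[1, 1:3] + 2

# Whole-matrix operations, with no explicit loops
array_medians <- apply(M, 2, median)       # one median per array (column)
M_centered <- sweep(M, 2, array_medians)   # subtract each array's median

# Difference in mean expression between arrays 1-3 and arrays 4-6, per gene
diff_means <- rowMeans(M_centered[, 1:3]) - rowMeans(M_centered[, 4:6])

# Name of the gene with the largest absolute difference
top_gene <- names(which.max(abs(diff_means)))
top_gene
```

Each step acts on an entire row or column set at once, which is the kind of operation that becomes awkward in a spreadsheet once the data grow to thousands of genes.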
Acknowledgements
Bioinformatics
Arcady Mushegian, Director
Amy Ubben, Admin
Research: Jie Chen (Visiting Scientist), Frank Emmert-Streib, Galina Glasko, Manisha Goel, Piotr Kozbial, Jing Liu
Support: Mike Coleman (Scientific Programmer), Malcolm Cook (Database Applications), Dan Thomasset (UNIX Admin, IT)

Acknowledgements
Microarrays: Chris Seidel & Karen Zueckert-Gaudenz
Pourquié Lab: Mary-Lee Dequeant & Olivier Pourquié
maps.google.com