And Bioconductor Software Tools for Scientific Data Analysis

advertisement
Software Tools
for
Scientific Data Analysis
and Visualization:
And Bioconductor
Stowers Science Club
Earl F. Glynn
Scientific Programmer
Bioinformatics
20 May 2005
1
Topics
•
•
•
•
•
What is “R”?
What is “Bioconductor”?
Pros and Cons
How to get R and Bioconductor?
Why should a biologist care?
2
What is “R”?
•“Calculator” and Statistical Analysis Tool
large number of built-in statistical and math functions
•Exploratory Data Analysis Tool
descriptive statistics/graphics
•Graphics and Data Visualization Tool
high-quality, customizable graphics
•Huge Library of Specialty “Packages”
a growing number specifically for microarray data analysis
•Statistical Computing Language/Environment
derivative of “S” and “S-Plus” languages from early 1980s
3
R Example
R as a graphing “calculator”
log(Keq) Vs 1/T
3.5
> plot(1/T, log(Keq),
main = "log(Keq) Vs 1/T")
3.0
> Keq <- c(210, 73, 31, 16)
log(Keq)
> T <- c(478, 533, 588, 643)
4.0
4.5
5.0
Temperature dependence of
equilibrium constant
0.0016
0.0017
0.0018
0.0019
0.0020
0.0021
1/T
Data Source: Hill & Petrucci, General Chemistry (2nd ed), 1999, p. 756
4
R Example
R as a graphing “calculator”
> fit <- lm(log(Keq) ~ I(1/T))
> fit
5.0
log(Keq) Vs 1/T
> abline(fit, col="red")
> coefficients(fit)
(Intercept)
I(1/T)
-4.726161 4810.111351
> summary(fit)
4.0
3.5
4810.111
3.0
-4.726
I(1/T)
log(Keq)
Coefficients:
(Intercept)
4.5
Call:
lm(formula = log(Keq) ~ I(1/T))
0.0016
0.0017
0.0018
0.0019
0.0020
0.0021
1/T
5
R Example
R as a graphing “calculator”
Simple Model: log(Keq) = a + b/T
> fit <- lm(log(Keq) ~ I(1/T))
Complex Model: log(Keq) = a + b/T + c*log(T) + d*T + e*T2
> T <- c(200, 225, 273, 300, 325)
> Keq <- exp(1.2 + 0.3/T - 1.25*log(T) + 0.01*T - 0.0003*T^2)
> fit <- lm(log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))
> fit
Call:
lm(formula = log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))
Coefficients:
(Intercept)
1.2000
I(1/T)
0.3000
I(log(T))
-1.2500
T
0.0100
I(T^2)
-0.0003
6
R Example
Exploratory Data Analysis
•Use descriptive statistics to see “big picture” prior
to formal analysis to examine data quality
•Need techniques that are robust to outliers
§Measures of Center:
- Mean (normal distribution)
- Median (skewed distribution)
§Measures of Spread:
- Standard Deviation (SD) (appropriate with Mean)
standardize: (X – mean(X)) / sd(X)
- Median Absolution Deviation (MAD) (appropriate with Median)
standardize: (X – median(X)) / mad(X)
- Interquartile Range (appropriate with Median)
7
R Example
Exploratory Data Analysis
Tukey’s “Five Number” Summary
Min
7
Median
45
12
41
22
37
Max
84
48
79
57
29
Q1
“Lower Hinge”
73
65
Interquartile Range
(IQR)
Q3
“Upper Hinge”
> x <- c(79,73,7,12,29,22,65,84,45,41,48,57,37)
> fivenum(x)
[1] 7 29 45 65 84
Source: John W. Tukey, Exploratory Data Analysis, 1977.
8
R Example
Exploratory Data Analysis
40
60
80
Five Number Summary
Max
Q3
Median
20
Q1
> x <- c(79,73,7,12,
29,22,65,84,45,
41,48,57,37)
> boxplot(x, main=
″Five Number Summary″)
IQR
“box and whisker” plot
or simply a “boxplot”
Min
Visualize “five-number summary” with a boxplot:
Minimum, Quartile 1, Median, Quartile 3, Maximum9
R Example
Exploratory Data Analysis
> RawData <- read.csv("Complete_Dataset.csv", as.is=TRUE)
> Expression <- log2( data.matrix(RawData[,2:ncol(RawData)]))
> boxplot(data.frame(Expression),
main="Bozdech 'Complete' Plasmodium Dataset",
las=VERTICAL<-3, cex.axis=0.7,
ylab="Log2 Expression Ratio")
2
0
-2
-4
-6
TP1
TP2
TP3
TP4
TP5
TP6
TP7
TP8
TP9
TP10
TP11
TP12
TP13
TP14
TP15
TP16
TP17
TP18
TP19
TP20
TP21
TP22
TP23
TP24
TP25
TP26
TP27
TP28
TP29
TP30
TP31
TP32
TP33
TP34
TP35
TP36
TP37
TP38
TP39
TP40
TP41
TP42
TP43
TP44
TP45
TP46
TP47
TP48
-8
Log2 Expression Ratio
4
Bozdech 'Complete' Plasmodium Dataset
10
R Example
Exploratory Data Analysis
> # Use Bioconductor package
> library(arrayMagic)
> plot.imageMatrix (
Expression, yLabels="",
main="Log2 Gene Expression
in Plasmodium Dataset" )
11
R Example
Statistical Analysis:
Evaluate Gene Expression for Periodicity
> ShowSingleOligoProfileByName("i3518_1")
Time Interval Variability
30
0
-2
10
20
Frequency
0
-1
Expression
1
40
i3518_1
N = 46
20
30
40
-1.0
-0.5
0.0
0.5
log10(delta T)
Lomb-Scargle Periodogram
Period at Peak = 45.7 hours
Peak Significance
p = 1.48e-008 at Peak
1.0
Time [hours]
1.0
20
0.8
25
10
10
p = 0.001
0.6
0.0
5
p = 0.01
p = 0.05
0.4
p = 1e-04
0.2
p = 1e-05
Probability
15
p = 1e-06
0
Normalized Power Spectral Density
0
0.00
0.05
0.10
0.15
Frequency [1/hour]
0.20
0.00
0.05
0.10
0.15
Frequency [1/hour]
0.20
12
R Example
Statistical Analysis:
Evaluate Gene Expression for Periodicity
> ShowSingleOligoProfileByName("j167_5")
20
15
0
5
10
Frequency
0.5
0.0
-0.5
Expression
25
Time Interval Variability
1.0
j167_5
N = 35
20
30
40
-1.0
-0.5
0.0
0.5
log10(delta T)
Lomb-Scargle Periodogram
Period at Peak = 17.8 hours
Peak Significance
p = 0.998 at Peak
1.0
Time [hours]
1.0
20
0.8
25
10
p = 0.001
0.6
0.0
5
p = 0.01
p = 0.05
0.4
10
p = 1e-04
Probability
p = 1e-05
0.2
15
p = 1e-06
0
Normalized Power Spectral Density
0
0.00
0.05
0.10
0.15
Frequency [1/hour]
0.20
0.00
0.05
0.10
0.15
Frequency [1/hour]
0.20
13
R Example
Statistical Analysis:
Multiple Hypothesis Testing
M ultiple T esting Correction M ethods
-4
α = 0.0001
-6
Log10(p)
p.adjust function
in R “stats”
package
-2
0
(Using R's p.adjust methods)
-8
bonferroni
holm
hochberg
fdr
none
0
1000
2000
3000
4000
5000
6000
7000
Rank Order of Sorted p Values
fdr = Benjamini and Hochberg’s “False Discovery Rate” Method
14
R Example
Statistical Analysis:
Logic Regression
Where L1 and L2 are Boolean expressions.
Each L can be represented by logic tree.
Logic Tree:
L = (B ∧ C) ∨ A
Ruczinksi, et al, (2003), Logic Regression,
Journal of Computational and Graphical Statistics, 12(3), 475-511.
R Package: LogicRec
15
R Example
Data Visualization
> example(layout)
>
>
>
>
>
>
set.seed(19)
library(MASS)
x <- rnorm(50)
y <- rnorm(50)
d <- kde2d(x,y)
image(d,
col=terrain.colors(50))
-3
0
-2
-1
1
0
1
2
2
3
> contour(d,add=T)
3
-1
> library(scatterplot3d)
> example(scatterplot3d)
-2
2
scatterplot3d - 5
> set.seed(19)
> x <- matrix(rnorm(200),10,20)
> heatmap(x)
-2
5
1
4
2
10
90
85
80
75
70
65
60
8
10
12
14
16
6
5
2
Girth
18
20
22
2
16
11
3
14
19
7
8
20
10
6
13
15
9
16
1
12
18
17
4
9
1
Histogram of rnorm(100)
Height
3
0
> set.seed(19)
> hist(rnorm(100),freq=F)
> curve(dnorm(x), add=T,
col="blue")
10 20 30 40 50 60 70 80
7
Volume
8
-1
0.4
1
0.3
0
0.2
-1
0.1
-2
0.0
-3
-2
-1
0
1
2
3
R Example
Data Visualization
Customized Graphics
R plot
plot(x,y, col="red", type="o",
main="R plot", xaxt="n",
xlab="specimen",
ylab="concentration",
ylim=c(0,25))
15
10
5
delta <- 0.01 * diff(par("usr")[1:2])
segments(x,
y-error, x,
y+error)
segments(x-delta, y-error, x+delta, y-error)
segments(x-delta, y+error, x+delta, y+error)
0
concentration
20
25
subtitle
# plot with error bars
x <- c(1,2,3,4,5)
y <- c(15,9,NA,19,22)
error <- c(3, 4, 1, 2.5, 0.5)
C
AB
12
34
5
Mi
ss
in g
a
gN
on
ry L
Ve
specimen
me
XY
Z
names <c("ABC","12345","Missing","VeryLongName","XYZ")
text(x, par("usr")[3] –
0.01*diff(par("usr")[3:4]),
srt=30, adj=1, labels=names, xpd=TRUE)
mtext("subtitle")
17
R Example
Data Visualization
Graphics Notes
•R creates graphics as postscript, pdf, or in a variety of
other formats.
•In Windows, copy and paste graphics as “metafile” to
Word, PowerPoint, or other programs.
•In Windows, enable “History, Recording” in graphics
window: Use PageUp/PageDown to step through graphics.
•In Word, save as “Web page, filtered” to make web page
including GIF graphics with transparency.
18
~500 R Packages
http://cran.r-project.org/src/contrib/PACKAGES.html
Most packages deal with data analysis, statistics, and visualization.
Caution: Software quality varies. Validate first!
19
What is “Bioconductor”?
•Open Source Software for Bioinformatics
•Started in Fall 2001 at Harvard
•First Bioconductor Release in May 2002
•~100 R Packages
•Software categories:
-Analysis (e.g., “limma” linear models for microarrays)
-Annotation (e.g., “Data packages”)
-Database Interaction
-Graphics & User Interface (e.g., “limmaGUI”)
-Graphs
-Pre-processing
-Ontologies (tools for working with gene ontologies)
•Web: www.bioconductor.org
20
Bioconductor Example
Limma: linear models for microarrays
library(limma)
# Adapted from ?contrasts.fit
# Simulate gene expression data: 6 microarrays and 20000 genes
# with one gene differentially expressed in first 3 arrays.
# contrasts.fit: Given a linear model fit to microarray data,
# compute estimated coefficients and standard errors for a
# given set of contrasts.
set.seed(71)
M <- matrix(rnorm(20000*6,sd=0.3),20000,6)
M[1,1:3] <- M[1,1:3] + 2
# design matrix corresponds to oneway layout,
# columns are orthogonal
design <- cbind(First3Arrays=c(1,1,1,0,0,0),
Last3Arrays=c(0,0,0,1,1,1))
fit <- lmFit(M,design=design)
# Would like to consider original two estimates plus
# difference between first 3 and last 3 arrays
contrast.matrix <- cbind(First3=c(1,0),Last3=c(0,1),
"Last3-First3"=c(-1,1))
fit2 <- contrasts.fit(fit,contrast.matrix)
fit2 <- eBayes(fit2)
# large values of eb$t indicate differential expression
results <- classifyTestsF(fit2)
vennDiagram( vennCounts(results))
First3
72
Last3
53
13
19
1
14
23
Last3-First3 19805
21
R/Bioconductor
Pros
Cons
•Powerful analysis tools
•Command line processing;
Batch processing
•Graphics rich software
•Several revisions/year
•Fast (most tasks)
•Free and open source:
UNIX/Windows/Apple
•Strong user community
•Help via mailing list
•Can be quirky
•No “GUI”: Difficult to
interact with data
•Graphics poor documentation
•Several revisions/year
•Slow (processing huge datasets)
•“Correct” way to ask
“One of the most intimidating things about R
is the seeming endlessness of it.”
Paul E. Johnson, KU Political Science Dept, R-Help, 9 May 2000
www.ku.edu/~pauljohn/R/Rtips.html
22
How to get R and Bioconductor?
http://bioinfo
23
How to get R and Bioconductor?
http://bioinfo/software/R.htm
...
...
24
Resources
Comprehensive R Archive Network (CRAN)
http://cran.r-project.org
R for Bioinformatics
Nov 2005?
SummeR Sessions?
25
Why should a biologist care?
•Excel has many limitations.
•R can serve as powerful graphing
“calculator.”
•R can easily work with vectors and
matrices with microarray data.
•State of the art analysis software
often introduced in published papers
using R.
26
Acknowledgements
Bioinformatics
Arcady Mushegian
Amy Ubben Admin
Research
Jie Chen Visiting Scientist
Frank Emmert-Streib
Galina Glasko
Manisha Goel
Piotr Kozbial
Jing Liu
us
Director
Support
Mike Coleman Scientific Programmer
Malcolm Cook Database Applications
Dan Thomasset UNIX Admin (IT)
27
Acknowledgements
Microarrays
Chris Seidel & Karen Zueckert-Gaudenz
Pourquié Lab
Mary-Lee Dequeant & Olivier Pourquié
maps.google.com
28
Download