BioConductor
Steffen Durinck
Robert Gentleman
Sandrine Dudoit
November 28, 2003
NETTAB Bologna
Outline
• what is R
• what is Bioconductor
• packages
• getting and using Bioconductor
R
• R is a language and environment for
statistical computing and graphics. It is a
GNU project which is similar to the S
language and environment which was
developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John
Chambers and colleagues. R can be
considered as a different implementation
of S.
R
• what sorts of things is R good at?
– there are very many statistical algorithms
– there are very many machine learning
algorithms
– visualization
– it is possible to write scripts that can be
reused
– R is a real computer language
R
• R supports many data technologies
– XML, database integration, SOAP
• R interacts with other languages
– C; FORTRAN; Perl; Python; Java
• R has good visualization capabilities
• R has a very active development
environment
• R is largely platform independent
– Unix; Windows; OSX
Overview of the
Bioconductor Project
Bioconductor
• Bioconductor is an open source and open
development software project for the analysis of
biomedical and genomic data.
• The project was started in the Fall of 2001 and
includes 23 core developers in the US, Europe,
and Australia.
• R and the R package system are used to design
and distribute software.
• Releases
– v 1.0: May 2nd, 2002, 15 packages.
– v 1.1: November 18th, 2002, 20 packages.
– v 1.2: May 28th, 2003, 30 packages.
– v 1.3: October 28th, 2003, 54 packages.
• ArrayAnalyzer: Commercial port of Bioconductor
packages in S-Plus.
Goals
• Provide access to powerful statistical and
graphical methods for the analysis of genomic
data.
• Facilitate the integration of biological metadata
(GenBank, GO, LocusLink, PubMed) in the
analysis of experimental data.
• Allow the rapid development of extensible,
interoperable, and scalable software.
• Promote high-quality documentation and
reproducible research.
• Provide training in computational and statistical
methods.
Bioconductor Packages
Bioconductor packages
• Bioconductor software consists of R add-on
packages.
• An R package is a structured collection of code
(R, C, or other), documentation, and/or data for
performing specific types of analyses.
• E.g. affy, cluster, graph, hexbin packages
provide implementations of specialized statistical
and graphical methods.
Bioconductor packages
Release 1.3, October 28th, 2003
• AnnBuilder: Bioconductor annotation data package builder
• Biobase: Base functions for Bioconductor
• DynDoc: Dynamic document tools
• MAGEML: Handling MAGEML documents
• MeasurementError.cor: Measurement Error model estimate for correlation coefficient
• RBGL: Test interface to boost C++ graph lib
• ROC: Utilities for ROC, with microarray focus
• RdbiPgSQL: PostgreSQL access
• Rdbi: Generic database methods
• Rgraphviz: Provides plotting capabilities for R graph objects
• Ruuid: Provides Universally Unique ID values
• SAGElyzer: A package that deals with SAGE libraries
• SNPtools: Rudimentary structures for SNP data
• affyPLM: Probe Level Models
• affy: Methods for Affymetrix Oligonucleotide Arrays
• affycomp: Graphics Toolbox for Assessment of Affymetrix Expression Measures
• affydata: Affymetrix Data for Demonstration Purpose
• annaffy: Annotation tools for Affymetrix biological metadata
• annotate: Annotation for microarrays
• ctc: Cluster and Tree Conversion
• daMA: Efficient design and analysis of factorial two-colour microarray data
• edd: Expression density diagnostics
• externalVector: Vector objects for R with external storage
• factDesign: Factorial designed microarray experiment analysis
• gcrma: Background Adjustment Using Sequence Information
• genefilter: Filter genes
• geneplotter: Plot microarray data
• globaltest: Global Test
• gpls: Classification using generalized partial least squares
• graph: A package to handle graph data structures
• hexbin: Hexagonal Binning Routines
• limma: Linear Models for Microarray Data
• makecdfenv: CDF Environment Maker
• marrayClasses: Classes and methods for cDNA microarray data
• marrayInput: Data input for cDNA microarrays
• marrayNorm: Location and scale normalization for cDNA microarray data
• marrayPlots: Diagnostic plots for cDNA microarray data
• marrayTools: Miscellaneous functions for cDNA microarrays
• matchprobes: Tools for sequence matching of probes on arrays
• multtest: Multiple Testing Procedures
• ontoTools: Graphs and sparse matrices for working with ontologies
• pamr: Pam: prediction analysis for microarrays
• reposTools: Repository tools for R
• rhdf5: An HDF5 interface for R
• siggenes: Significance and Empirical Bayes Analyses of Microarrays
• splicegear: splicegear
• tkWidgets: R based tk widgets
• vsn: Variance stabilization and calibration for microarray data
• widgetTools: Creates interactive tcltk widgets
Microarray data analysis
• Pre-processing
– .gpr, .Spot, MAGEML: marray, limma, vsn
– CEL, CDF: affy, vsn
• Result of pre-processing: exprSet
• Annotation: annotate, annaffy + metadata packages
• Differential expression: edd, genefilter, limma, multtest, ROC + CRAN
• Graphs & networks: graph, RBGL, Rgraphviz
• Cluster analysis: CRAN (class, cluster, MASS, mva)
• Prediction: CRAN (class, e1071, ipred, LogitBoost, MASS, nnet, randomForest, rpart)
• Graphics: geneplotter, hexbin + CRAN
marray packages
Pre-processing two-color spotted array data:
• diagnostic plots,
• robust adaptive normalization (lowess, loess).
[Figures: maImage, maBoxplot, maPlot + hexbin]
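The intensity-dependent normalization named on this slide can be sketched in base R. This is a minimal illustration on simulated data, not the marray API: the simulated dye bias, the span `f = 0.4`, and all variable names are assumptions made for the example. It fits a lowess curve to the log-ratios M as a function of average log-intensity A and subtracts the fitted trend.

```r
set.seed(1)
n <- 2000

# A: average log2 intensity, M: log2 ratio (red/green) for each spot.
A <- runif(n, 6, 14)
bias <- 0.8 - 0.1 * (A - 6)        # assumed intensity-dependent dye bias
M <- bias + rnorm(n, sd = 0.3)     # no true differential expression

# Robust local regression of M on A (lowess is in the stats package).
fit <- lowess(A, M, f = 0.4)

# lowess returns the fit sorted by A; map it back to the original spot order.
trend <- approx(fit$x, fit$y, xout = A, ties = mean)$y

# Normalized log-ratios: the intensity-dependent trend is removed.
M_norm <- M - trend

c(before = cor(A, M), after = cor(A, M_norm))
```

After normalization the log-ratios should be centred near zero across the whole intensity range, which is what an M versus A diagnostic plot is meant to show.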
affy package
Pre-processing oligonucleotide chip data:
• diagnostic plots,
• background correction,
• probe-level normalization,
• computation of expression measures.
[Figures: plotAffyRNADeg, barplot.ProbeSet, image, plotDensity]
annotate, annaffy, and AnnBuilder
• Assemble and process genomic annotation data from public repositories.
• Build annotation data packages or XML data documents.
• Associate experimental data in real time to biological metadata from web databases such as GenBank, GO, KEGG, LocusLink, and PubMed.
• Process and store query results: e.g., search PubMed abstracts.
• Generate HTML reports of analyses.
Metadata package hgu95av2: mappings between different gene identifiers for hgu95av2 chip.
• AffyID: 41046_s_at
• GENENAME: zinc finger protein 261
• LOCUSID: 9203
• ACCNUM: X95808
• MAP: Xq13.1
• SYMBOL: ZNF261
• PMID: 10486218, 9205841, 8817323
• GO: GO:0003677, GO:0007275, GO:0016021
• + many other mappings
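The idea of one mapping per identifier type, keyed by probe set ID, can be sketched with plain R environments. This is a hypothetical stand-in and not the hgu95av2 package itself: `make_map` and the environment names are invented for the illustration, and the values are the ones shown on the slide.

```r
# Hypothetical helper: turn a named list into a lookup environment.
make_map <- function(entries) {
  env <- new.env()
  for (key in names(entries)) assign(key, entries[[key]], envir = env)
  env
}

probe <- "41046_s_at"

# One environment per mapping, each keyed by the Affymetrix probe set ID.
SYMBOL  <- make_map(list("41046_s_at" = "ZNF261"))
LOCUSID <- make_map(list("41046_s_at" = 9203))
MAP     <- make_map(list("41046_s_at" = "Xq13.1"))
PMID    <- make_map(list("41046_s_at" = c("10486218", "9205841", "8817323")))
GO      <- make_map(list("41046_s_at" = c("GO:0003677", "GO:0007275", "GO:0016021")))

# Look up a single identifier, or collect several mappings for one probe set.
get(probe, envir = SYMBOL)
annotation <- list(
  symbol  = get(probe, envir = SYMBOL),
  locusid = get(probe, envir = LOCUSID),
  map     = get(probe, envir = MAP),
  pmid    = get(probe, envir = PMID),
  go      = get(probe, envir = GO)
)
str(annotation)
```

Keeping each mapping separate means a one-to-many relation (several PubMed or GO identifiers per probe set) needs no special handling.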
MAGEML package
<!DOCTYPE MAGE-ML SYSTEM "D:/DATA/MAGEML/MAGE-ML.dtd">
<MAGE-ML identifier="MAGE-ML:E-SNGR-4">
<QuantitationTypeDimension_assnlist>
<QuantitationTypeDimension identifier="QTD:1">
<QuantitationTypes_assnreflist>
<MeasuredSignal_ref identifier="QT:F635 Median"/>
<MeasuredSignal_ref identifier="QT:F635 Mean"/>
….
→ marray packages (cDNA arrays)
SIGGENES PACKAGE - SAM
[Figure: two panels plotted against delta (0.2 to 1.8). "Delta vs. FDR": FDR (in %). "Delta vs. Significant Genes": number of significant genes.]
multtest package
• Multiple hypothesis testing
• Control the type I error rate, e.g. with the Bonferroni method
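As a small illustration of the Bonferroni method, the sketch below uses `p.adjust` from base R (stats) rather than multtest itself. The p-values and the 0.05 level are made up for the example.

```r
# Hypothetical raw p-values from six per-gene tests.
p <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.7)
alpha <- 0.05

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_bonf <- p.adjust(p, method = "bonferroni")

# Genes declared significant while controlling the family-wise error rate.
rejected <- p_bonf <= alpha

data.frame(raw = p, bonferroni = p_bonf, rejected = rejected)
```

With six tests, only raw p-values at or below 0.05 / 6 survive the adjustment, so four of the six hypotheses here are no longer rejected even though several have raw p-values under 0.05.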
mva package – clustering
[Figure: heatmap]
mva package – principal component analysis
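A minimal sketch of both analyses on a simulated expression matrix follows. The data, group structure, and variable names are assumptions for the illustration. In R 1.9.0 and later the mva functions (`dist`, `hclust`, `prcomp`) are part of the stats package, so no extra package needs to be loaded there.

```r
set.seed(42)

# Simulated expression matrix: 50 genes (rows) by 6 samples (columns).
x <- matrix(rnorm(50 * 6), nrow = 50,
            dimnames = list(paste("gene", 1:50, sep = ""),
                            paste("sample", 1:6, sep = "")))
# Assumed effect: the first 25 genes are up-regulated in samples 4 to 6.
x[1:25, 4:6] <- x[1:25, 4:6] + 3

# Hierarchical clustering of samples on Euclidean distances.
hc <- hclust(dist(t(x)), method = "average")
groups <- cutree(hc, k = 2)

# Principal component analysis with samples as observations.
pca <- prcomp(t(x))
pc1 <- pca$x[, 1]

# Heatmap with row and column dendrograms (drawn only in interactive use).
if (interactive()) heatmap(x)

groups
round(pc1, 2)
```

Both the two-cluster cut and the first principal component should separate samples 1 to 3 from samples 4 to 6 in this simulated setting.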
Getting started
Installation
1. Main R software: download from CRAN
(cran.r-project.org), use latest release,
now 1.8.0.
2. Bioconductor packages: download from
Bioconductor (www.bioconductor.org),
use latest release, now 1.3.
Available for Linux/Unix, Windows, and
Mac OS.
Installation
• After installing R, install Bioconductor
packages using getBioC install script.
• From R
> source("http://www.bioconductor.org/getBioC.R")
> getBioC()
• In general, R packages can be installed
using the function install.packages.
• On Windows, the “Packages” pull-down menus can also be used.
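For packages on CRAN, `install.packages` can be wrapped so that only missing packages are fetched. The helper below is a hypothetical convenience function, not part of R or Bioconductor, and it assumes a CRAN mirror is configured when a download is actually needed.

```r
# Hypothetical helper: install only the packages that cannot be loaded yet.
install_if_missing <- function(pkgs, ...) {
  have <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing <- pkgs[!have]
  if (length(missing) > 0) {
    install.packages(missing, ...)   # contacts the configured CRAN mirror
  }
  invisible(missing)
}

# "stats" ships with R, so this call downloads nothing.
needed <- install_if_missing("stats")
length(needed)
```

The function returns the names of the packages it had to install, which makes it easy to log what changed on a given machine.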
User interaction
• R Command-line
• Widgets. Small-scale graphical user
interfaces (GUI), providing point & click
access for specific tasks.
– E.g. File browsing and selection for data
input, basic analyses.
Widgets
[Figures: reading in phenoData with tkSampleNames, tkphenoData, tkMIAME]
Documentation and help
• R manuals and tutorials: available from the R website or
on-line in an R session.
• R on-line help system: detailed on-line documentation,
available in text, HTML, PDF, and LaTeX formats.
> help.start()
> help(lm)
> ?hclust
> apropos("mean")
> example(hclust)
> demo()
> demo(image)
Short courses
• Bioconductor short courses
– modular training segments on software and
statistical methodology;
– lecture notes, computer labs, and course
packages available on WWW for self-instruction.
Vignettes
• Bioconductor has adopted a new documentation
paradigm, the vignette.
• A vignette is an executable document consisting of a
collection of code chunks and documentation text
chunks.
• Vignettes provide dynamic, integrated, and reproducible
statistical documents that can be automatically updated
if either data or analyses are changed.
• Each Bioconductor package contains at least
one vignette, providing task-oriented
descriptions of the package's functionality.
Vignettes
• HowTo’s: Task-oriented
descriptions of package functionality.
• Executable documents consisting of
documentation text and code chunks.
• Dynamic, integrated, and
reproducible statistical documents.
• Can be used interactively –
vExplorer.
• Generated using Sweave (tools
package).
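To make the idea of an executable document concrete, the sketch below writes a tiny Sweave source file to a temporary directory and weaves it. The file name, chunk label, and document text are made up for the example. In recent R versions `Sweave` is provided by the utils package, which is attached by default, so the unqualified call is used here.

```r
# A minimal vignette source: LaTeX text plus one R code chunk.
rnw_file <- file.path(tempdir(), "demo.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the integers 1 to 10 is computed below.",
  "<<mean-chunk, echo=TRUE>>=",
  "mean(1:10)",
  "@",
  "\\end{document}"
), rnw_file)

# Weave: run the code chunk and write demo.tex with code and output inserted.
old_dir <- setwd(tempdir())
Sweave(rnw_file, quiet = TRUE)
setwd(old_dir)

tex_file <- file.path(tempdir(), "demo.tex")
tex_lines <- readLines(tex_file)
tex_lines
```

Changing the code in the chunk and weaving again regenerates the output in the document, which is the sense in which a vignette can be updated automatically when data or analyses change.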
vExplorer
References
• R www.r-project.org, cran.r-project.org
– software (CRAN);
– documentation;
– newsletter: R News;
– mailing list.
• Bioconductor www.bioconductor.org
– software, data, and documentation (vignettes);
– training materials from short courses;
– mailing list.
• Personal
– sdurinck@esat.kuleuven.ac.be
Acknowledgements
• Robert Gentleman
Department of Biostatistical Science, Dana-Farber
Cancer Institute, Boston
• Sandrine Dudoit
Division of Biostatistics, University of California, Berkeley