PPT - University of Pennsylvania

Making Sense out of Flow Cytometry Data Overload
A crash course in R/Bioconductor and flow cytometry fingerprinting
© 2010 by University of Pennsylvania School of Medicine
Outline
• Background
  ▪ R
  ▪ Bioconductor
• Motivating examples
• Starting R, entering commands
• How to get help
• R fundamentals
  ▪ Sequences and Repeats
  ▪ Characters and Numbers
  ▪ Vectors and Matrices
  ▪ Data Frames and Lists
  ▪ Importing data from spreadsheets
• flowCore
  ▪ Loading flow cytometry (FCS) data
  ▪ gating
  ▪ compensation
  ▪ transformation
  ▪ visualization
• flowFP
  ▪ Binning
  ▪ Fingerprinting
  ▪ Comparing multivariate distributions
• Writing your own functions
• Installing and running R on your computer
• Suggestions for further reading and reference
Background
• R
  ▪ Is an integrated suite of software facilities for data manipulation, simulation, calculation and graphical display.
  ▪ It handles and analyzes data very effectively and it contains a suite of operators for calculations on arrays and matrices.
  ▪ In addition, it has graphical capabilities for very sophisticated graphs and data displays.
  ▪ It is an elegant, object-oriented programming language.
  ▪ Started by Robert Gentleman and Ross Ihaka (hence “R”) in 1995 as a free, independent, open-source implementation of the S programming language (now part of Spotfire)
  ▪ Currently maintained by the R Core development team, an international group of hard-working volunteer developers
http://www.r-project.org
http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
Background
• Bioconductor
  ▪ “Is an open source and open development software project to provide tools for the analysis and comprehension of genomic data.”
  ▪ Goals
    ▪ To provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
    ▪ To provide a common software platform that enables the rapid development and deployment of extensible, scalable, and interoperable software.
    ▪ To further scientific understanding by producing high-quality documentation and reproducible research.
    ▪ To train researchers on computational and statistical methods for the analysis of genomic data.
http://bioconductor.org/overview
A motivating example
I’ve just collected data from a T cell stimulation experiment in a 96-well plate format. I need to gate the data on CD3/CD4. How consistent are the distributions, so that I can establish one set of gates for the whole plate?
Another motivating example
I’m concerned that drawing gates to analyze my data introduces unintended bias. Additionally, since I have multiple data files, drawing multiple gates is time consuming. Can I use R to compute gates and then apply these same objective gating criteria to multiple data files?
Another motivating example
• Autogate lymphocytes and monocytes
• Automatically analyze FMO tubes
Back to the basics
• R is a command-line driven program
  ▪ the prompt is: >
  ▪ you type a command (shown in blue), and R executes the command and gives the answer (shown in black)
Simple example: enter a set of measurements
• Use the function c() to combine terms together
• Create a variable named mfi
• Put the result of c() into mfi using the assignment operator <- (you can also use =)
• The [1] at the start of the printed output is the index of the first element shown on that line; the result itself is a vector
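The steps above can be sketched as follows. The five values are invented purely for illustration; only c(), <- and the variable name mfi come from the slide.

```r
# Hypothetical MFI measurements (values made up for this example)
mfi <- c(512, 430, 618, 475, 390)

mfi            # typing the name prints the vector; the output line starts with [1]
length(mfi)    # number of measurements
mean(mfi)      # average of the measurements
```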
Help, functions, polymorphism
> help (log)
> ?log
> apropos("log")
Vignettes – really good help!
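A short sketch of the help commands above, plus one way to see polymorphism in action. The use of summary() as the polymorphic example and of the grid package for the vignette listing are choices made here for illustration, not taken from the slides.

```r
# Three ways to look for help on log
help(log)                 # open the help page for log()
?log                      # shorthand for help(log)
hits <- apropos("log")    # names of visible objects whose name contains "log"
head(hits)

# Polymorphism: one generic function, different behavior per class of argument
summary(c(1.5, 2.5, 10))            # numeric vector: quartiles and mean
summary(c(TRUE, FALSE, TRUE))       # logical vector: counts of each value
summary(factor(c("a", "b", "a")))   # factor: counts per level

# Vignettes are longer guides shipped with packages
vignette(package = "grid")          # list the vignettes of one installed package
```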
Sequences and Repeats
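A minimal sketch of the base R tools for sequences and repeats, the colon operator, seq() and rep(). The particular values are chosen only for illustration.

```r
1:10                         # the integers 1 through 10
seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00
seq(1, 96, length.out = 4)   # 4 evenly spaced values from 1 to 96

rep("ctrl", 3)               # "ctrl" "ctrl" "ctrl"
rep(c(1, 2), times = 3)      # 1 2 1 2 1 2
rep(c(1, 2), each = 3)       # 1 1 1 2 2 2
```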
Characters and Numbers
• Characters and character strings are enclosed in "" or ''
• Special numbers
  ▪ NA – “Not Available”
  ▪ Inf – “Infinity”
  ▪ NaN – “Not a Number”
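The following sketch shows how character strings and the special values behave in practice. The variable names and values are made up for the example.

```r
well  <- "B08"           # double quotes
stain <- 'CD3'           # single quotes work the same way
nchar(well)              # 3

x <- c(1, NA, 3)
is.na(x)                 # FALSE TRUE FALSE
mean(x)                  # NA: missing values propagate
mean(x, na.rm = TRUE)    # 2

1 / 0                    # Inf
-1 / 0                   # -Inf
0 / 0                    # NaN
is.nan(0 / 0)            # TRUE
is.na(0 / 0)             # TRUE: NaN also counts as missing
```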
Vectors and Matrices
• The subset operator for vectors and matrices is [ ]
• You can extend the length of a vector by assigning to an index beyond its end … but not a matrix
• However, all’s not lost if you want to extend either the columns (with cbind()) … or rows (with rbind())
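A small sketch of subsetting and extending, assuming nothing beyond base R. The numbers are arbitrary, and cbind() and rbind() are the base functions used here to add columns and rows.

```r
v <- c(10, 20, 30)
v[2]                 # 20
v[c(1, 3)]           # 10 30
v[5] <- 50           # assigning past the end extends the vector
v                    # 10 20 30 NA 50

m <- matrix(1:6, nrow = 2)   # filled column by column
m[1, 2]              # 3
m[, 3]               # 5 6

# Assigning outside a matrix's dimensions is an error (subscript out of bounds)
res <- try(m[3, 1] <- 0L, silent = TRUE)
inherits(res, "try-error")   # TRUE

m <- cbind(m, c(7L, 8L))     # add a column: now 2 x 4
m <- rbind(m, 9:12)          # add a row: now 3 x 4
dim(m)                       # 3 4
```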
Data Frames
• A Data Frame is like a matrix, except that the data type in each column need not be the same
  ▪ Often, a Data Frame is created from an Excel spreadsheet: Save As… a tab-delimited text file, then read that file with the function read.table()
Data Frames from spreadsheets
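As a sketch of the spreadsheet workflow, the example below writes a small tab-delimited file to a temporary location (standing in for a file exported from Excel) and reads it back with read.table(). The column names and values are invented for illustration.

```r
# Stand-in for a spreadsheet saved as tab-delimited text (contents are made up)
path <- tempfile(fileext = ".txt")
writeLines(c("well\tstim\tpct_cd4",
             "A01\tnone\t41.2",
             "A02\tSEB\t38.7",
             "A03\tCMV\t44.9"), path)

dat <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(dat)                    # column types are inferred separately per column
dat$pct_cd4                 # extract a column by name
dat[dat$stim != "none", ]   # select rows by a condition
mean(dat$pct_cd4)
```

Setting header = TRUE tells read.table() that the first row holds column names, and sep = "\t" matches the tab-delimited export.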
Lists
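A list can hold elements of different types and lengths under one name. The sketch below uses made-up sample metadata to show the usual ways of creating and accessing one.

```r
# Hypothetical sample description: a string, a character vector and a number
sample_info <- list(id = "B08",
                    channels = c("FSC-H", "SSC-H"),
                    events = 10000)

sample_info$id              # access an element by name
sample_info[["channels"]]   # the same kind of access with [[ ]]
sample_info[[3]]            # access by position
sample_info$gated <- TRUE   # add a new element
names(sample_info)
length(sample_info)         # 4
```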
Handling Flow Cytometry Data: flowCore
• flowCore is a base package that supports reading and manipulation of FCS data files
• The fundamental object that encapsulates the data in an FCS file is a flowFrame
• A container object that holds a collection of flowFrames is called a flowSet
• In the next slides we will go over
  ▪ reading an FCS file
  ▪ gating
  ▪ compensation
  ▪ transformation
  ▪ visualization
Check out the example data
Read an FCS file, summarize the flowFrame
Apply the lymphocyte gate with Subset
• The plot needs to be transformed because it is rendering the linear data in the FCS file
• The data hasn’t been compensated!
• Lines require library(fields)
• Percentages are in summary(fres)$p[1:4]
• Percentages are drawn in the graph with text()
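The reading, gating and transformation steps can be sketched roughly as follows, assuming the flowCore package is installed. The example file name, the channel names and especially the gate limits are assumptions made for illustration (the limits are placeholders, not a tuned lymphocyte gate), and the code is skipped when flowCore is not available.

```r
if (requireNamespace("flowCore", quietly = TRUE)) {
  library(flowCore)

  # An example FCS file assumed to ship with flowCore
  fcs_file <- system.file("extdata", "0877408774.B08", package = "flowCore")
  ff <- read.FCS(fcs_file)          # returns a flowFrame
  print(ff)
  print(summary(ff))

  # Hypothetical rectangular scatter gate; the limits are placeholders
  lymph <- rectangleGate("FSC-H" = c(200, 600), "SSC-H" = c(0, 400),
                         filterId = "lymph")
  fres <- filter(ff, lymph)         # evaluate the gate on the flowFrame
  print(summary(fres))              # counts and proportion inside the gate
  ff_lymph <- Subset(ff, lymph)     # keep only the events inside the gate

  # Transform two fluorescence channels (arcsinh chosen here as one option)
  tl <- transformList(c("FL1-H", "FL2-H"), arcsinhTransform())
  ff_trans <- transform(ff_lymph, tl)
  print(summary(ff_trans))
}
```

Compensation is not shown because it needs a spillover matrix for the instrument; given one, compensate() applies it to a flowFrame or flowSet.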
Fingerprinting Flow Cytometry Data: flowFP
• flowFP
  ▪ Aims to transform flow cytometric data into a form amenable to algorithmic analysis tools
  ▪ Acts as an intermediate step between acquisition of high-throughput FCM data and empirical modeling, machine learning and knowledge discovery
  ▪ Implements ideas from
    Roederer M, Moore W, Treister A, Hardy RR & Herzenberg LA. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry 45:47-55, 2001.
    and
    Rogers WT, Moser AR, Holyst HA, Bantly A, Mohler ER III, Scangas G, and Moore JS. Cytometric Fingerprinting: Quantitative Characterization of Multivariate Distributions. Cytometry 73A: 430-441, 2008.
The basic idea
• Subdivide multivariate space into bins
  ▪ Call this a “model” of the space
• For each flowFrame in a flowSet, count the number of events in each bin in the model
• Flatten the collection of counts for a flowFrame into a 1D feature vector
• Combine all of the feature vectors together into an n x m matrix
  ▪ n = number of flowFrames (instances)
  ▪ m = number of bins in the model (features)
• Also, tag each event with its bin membership
  ▪ facilitates visualization, interpretation
  ▪ can be used for gating
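The idea above can be illustrated with a simplified base R sketch. This is not the flowFP implementation: the splitting rule (cut each bin at the median of its highest-variance parameter), the function names build_model(), bin_tags() and split_step(), and the simulated data are all assumptions made for this example.

```r
# One splitting step: events in bin b move to bin 2b-1 (at or below the cut)
# or to bin 2b (above the cut), using that bin's split parameter and cut value.
split_step <- function(x, tags, s) {
  value <- x[cbind(seq_len(nrow(x)), s$dims[tags])]
  2L * tags - as.integer(value <= s$cuts[tags])
}

# Build a "model": at each level, cut every bin at the median of the
# parameter with the largest variance, and remember the cuts.
build_model <- function(x, n_levels) {
  tags <- rep(1L, nrow(x))
  model <- vector("list", n_levels)
  for (lev in seq_len(n_levels)) {
    n_bins <- 2^(lev - 1)
    dims <- integer(n_bins)
    cuts <- numeric(n_bins)
    for (b in seq_len(n_bins)) {
      sub <- x[tags == b, , drop = FALSE]
      dims[b] <- which.max(apply(sub, 2, var))
      cuts[b] <- median(sub[, dims[b]])
    }
    model[[lev]] <- list(dims = dims, cuts = cuts)
    tags <- split_step(x, tags, model[[lev]])
  }
  model
}

# Tag each event of a sample with its bin membership under a model.
bin_tags <- function(x, model) {
  tags <- rep(1L, nrow(x))
  for (s in model) tags <- split_step(x, tags, s)
  tags
}

set.seed(1)
# Three simulated samples: 1000 events x 2 parameters each (made-up data)
samples <- list(a = matrix(rnorm(2000), ncol = 2),
                b = matrix(rnorm(2000), ncol = 2),
                c = matrix(rnorm(2000, mean = 0.5), ncol = 2))

n_levels <- 3                                  # 2^3 = 8 bins
model <- build_model(samples$a, n_levels)      # model built from sample a
fingerprints <- t(sapply(samples, function(x)
  tabulate(bin_tags(x, model), nbins = 2^n_levels)))
fingerprints                                   # n = 3 samples by m = 8 bins
```

Because the cuts are medians, the sample used to build the model lands in bins of equal count, while a sample with a different distribution (here, sample c) fills the same bins unevenly.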
Probability Binning
[Figure: x-axis “Bin Number”]
> plot (mod, fs)
Class Constructors
• flowFPModel (base class)
  ▪ Consumes a flowFrame or flowSet
  ▪ Produces a model, which is a recipe for subdividing multivariate space
• flowFP
  ▪ Consumes a flowFrame or flowSet, and a flowFPModel
  ▪ Produces a flowFP, which represents the multivariate probability density function as a fingerprint
  ▪ Also tags each event with its bin membership
• flowFPPlex
  ▪ Consumes a collection of flowFPs
  ▪ The flowFPPlex is a container object to facilitate handling large and complex collections of flowFPs
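A rough sketch of how the three constructors fit together, assuming the flowFP package is installed. The example flowSet fs1 is assumed to ship with the package, and the choice of the first two parameters and of nRecursions = 5 is arbitrary; the code is skipped when flowFP is not available.

```r
if (requireNamespace("flowFP", quietly = TRUE)) {
  library(flowFP)
  data(fs1)                         # example flowSet, assumed to ship with flowFP

  params <- colnames(fs1)[1:2]      # use whichever two parameters come first

  # Model: a recipe for subdividing the space spanned by the chosen parameters
  mod <- flowFPModel(fs1, name = "example model",
                     parameters = params, nRecursions = 5)

  # Fingerprint: apply the model to every flowFrame in the flowSet
  fp <- flowFP(fs1, mod)
  print(dim(counts(fp)))            # flowFrames by bins

  # Plex: a container for one or more fingerprints
  plex <- flowFPPlex(list(fp))
  print(plex)
}
```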
Writing Your Own Functions
#
# It's a good idea to comment your code     (comments)
#
myfunc <- function (arg1=10, arg2, ...)     # declaration
{                                           # start of the code block
    # your code goes here
    answer <- log (arg1, base=arg2)         # assignment
    return (answer)                         # return
}
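A brief usage sketch for the function above, repeated here so the example runs on its own. The argument values are chosen only for illustration.

```r
myfunc <- function(arg1 = 10, arg2, ...) {
  answer <- log(arg1, base = arg2)
  return(answer)
}

myfunc(100, 10)              # arguments matched by position: log base 10 of 100
myfunc(arg2 = 2)             # arg1 falls back to its default of 10
myfunc(arg2 = 2, arg1 = 8)   # named arguments may come in any order
```

Since arg2 has no default, calling myfunc() without it fails as soon as log() tries to use the missing value.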
Obtaining R and Bioconductor
• R
  ▪ http://cran.r-project.org/
• Bioconductor
  ▪ http://bioconductor.org/GettingStarted
General Reference Material
• A good beginner’s guide to R
  ▪ http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
• A nice one-page reference card
  ▪ http://cran.r-project.org/doc/contrib/Short-refcard.pdf
• Outstanding summary of R/Bioconductor, with many examples
  ▪ http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#R_favorite
• The definitive reference for writing R extensions (advanced!)
  ▪ http://cran.r-project.org/doc/manuals/R-exts.pdf
• Books
  ▪ William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York, 2002. ISBN 0-387-95457-0.
  ▪ John M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4 (aka “the Green Book”)
Flow-Specific References
• Vignettes
  ▪ http://bioconductor.org/packages/2.6/bioc/vignettes/flowCore/inst/doc/HowTo-flowCore.pdf
  ▪ http://bioconductor.org/packages/2.6/bioc/vignettes/flowViz/inst/doc/filters.pdf
  ▪ http://bioconductor.org/packages/2.6/bioc/vignettes/flowStats/inst/doc/GettingStartedWithFlowStats.pdf
  ▪ http://bioconductor.org/packages/2.6/bioc/vignettes/flowQ/inst/doc/DataQualityAssessment.pdf
  ▪ http://bioconductor.org/packages/2.6/bioc/vignettes/flowFP/inst/doc/flowFP_HowTo.pdf
• Original Articles
  ▪ flowCore
    ▪ Hahne, F., N. LeMeur, et al. (2009). "flowCore: a Bioconductor package for high throughput flow cytometry." BMC Bioinformatics 10: 106.
  ▪ Fingerprinting
    ▪ Rogers, W. T., A. R. Moser, et al. (2008). "Cytometric fingerprinting: quantitative characterization of multivariate distributions." Cytometry A 73(5): 430-41.
    ▪ Rogers, W. T. and H. A. Holyst (2009). "flowFP: A Bioconductor Package for Fingerprinting Flow Cytometric Data." Advances in Bioinformatics 2009(Article ID 193947): 11.
Contact Me!
Wade Rogers
rogersw@mail.med.upenn.edu
267-350-9680 (o)
610-368-5821 (m)