March 20 - Mouse Genome Informatics

advertisement
Differential Gene Expression with
the limma package
20 March 2012
Functional Genomics
Linear regression
• Fit a straight line through a set of points such
that the distance from the points to the line is
minimized
The slope of the line is
adjusted to minimize the
squares of the vertical
distance of the points from
the line.
The line represents the
model, the distances
between the points and the
line are the residuals.
The simple regression
minimizes the sum of the
squares of the residuals…this
is the method of least
squares.
Y = Y0 + β Z
Assume you have a data set of gene
expression in tumor vs normal
tissue.
This is a simple mathematical
expression of what is being
calculated for a linear model.
Y is expression of gene X
Y0 is mean expression of normal tissue, t
β is difference of expression of normal, compared to tumor, tissue
Z is group variable (0 for normal; 1 for tissue)
Multivariate linear regression
Y = Y0 + β Z + ϒ W
Suppose you have another
variable…such as age…you can add
that right in!
Y is expression of gene X
Y0 is mean expression of normal t
β is difference of expression of normal, compared to tumor, tissue
Z is group variable (0 for normal; 1 for tissue)
ϒ = age affect
W = age group
Multivariate linear regression
Y = Y0 + β Z + ϒW + δZ*W
You can ask for differences in gene
expression due to tissue, due to age,
and due to an age by tissue
interaction.
Y is expression of gene X
Y0 is mean expression of normal t
β is difference of expression of normal, compared to tumor, tissue
Z is group variable (0 for normal; 1 for tissue)
ϒ = age affect
W = age group
Add a component to look for age by tissue interaction effects: δZ*W
limma
• R package for differential gene expression that
uses linear modeling for each gene in your
data set
• Expression data will be log-intensity values for
Affy data
• Designed to be used in conjunction with the
affy package
Information on limma
• http://www.statsci.org/smyth/pubs/limmabiocbook-reprint.pdf
• http://www.bioconductor.org/packages/2.3/bi
oc/vignettes/limma/inst/doc/usersguide.pdf
limma checklist
• Assumes you’ve done an experiment and have
CEL files (if you’ve done single color Affy
arrays)
• Assumes you have data/information about the
arrays (Targets)
• Assumes you have normalized your data and
have an exprSet object
Name
MT1
MT2
MT3
WT1
WT2
WT3
FileName
Target
MTP1_Ackerman.CEL
MTP2_Ackerman.CEL
MTP3_Ackerman.CEL
WTP1_Ackerman.CEL
WTP2_Ackerman.CEL
WTP3_Ackerman.CEL
MT
MT
MT
WT
WT
WT
This is my targets file for limma
using the Ackerman data.
Note that I renamed the CEL files
compared to what was originally
in my home directory.
new('exprSet',
exprs = ...., # Object of class matrix
se.exprs = ...., # Object of class matrix
phenoData = ...., # Object of class phenoData
annotation = ...., # Object of class character
description = ...., # Object of class MIAME
notes = ...., # Object of class character
)
ExpressionSet object
slotNames()
Slots
exprs:
Object of class "matrix" The observed expression levels. This is a matrix with columns representing
patients or cases and rows representing genes.
se.exprs:
Object of class "matrix" This is a matrix of the same dimensions as exprs which contains standard
error estimates for the estimated expression levels.
phenoData:
Object of class "phenoData" This is an instance of class phenoData containing the patient (or case)
level data. The columns of the pData slot of this entity represent variables and the rows represent
patients or cases.
annotation
A character string identifying the annotation that may be used for the exprSet instance.
description:
Object of class "MIAME". For compatibility with previous version of this class description can also be
a "character". The clase characterOrMIAME has been defined just for this.
notes:
Object of class "character" Vector of explanatory text
http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/Biobase/html/exprSet-class.html
Running limma
• Need to create an exprSet object using the affy
package
– Or some other method…depends on the array
platform
• Need a design matrix
– Representation of the different RNA targets which
have been hybridized to the array
• Can have a contrast matrix
– Uses information in the design matrix to do
comparisons of interest
– Don’t always need a contrast matrix…..
library(affy)
library(limma)
library(makecdfenv)
Array.CDF = make.cdf.env("MoGene-1_0-st-v1.cdf")
CELData=ReadAffy()
CELData@cdfName="Array.CDF"
slotNames(CELData)
pData(CELData)
eset=rma(CELData)
pData(eset)
strain=c("MT","MT","MT","WT","WT","WT")
design=model.matrix(~factor(strain))
colnames(design)=c("MT","WT")
fit=lmFit(eset,design)
fit=eBayes(fit)
options(digits=2)
topTable(fit, coef=2, n=40, adjust="BH")
Time Series
• Differential gene expression methods don’t work
well for time series
– Assumption of independence of observations doesn’t
hold in time series
• BETR takes correlations/dependencies into
account to detect changes in gene expression
that are sustained over time
• http://bioc.ism.ac.jp/2.5/bioc/html/betr.html
• http://bioc.ism.ac.jp/2.5/bioc/vignettes/betr/inst
/doc/betr.pdf
Running BETR
• Need a data frame that describes the arrays
• Need to specify the conditions/contrasts
betr() function usage and arguments
The file describes a three time
point time series of diaphragm
development.
This annotation file has the list of
CEL files, associates them with a
time point, and indicates which
arrays are replicates (must be an
event number)
In this example, this file is called
“samples3.txt”
These data ARE available in GEO
GSE35243
library(betr)
library(affy)
library(Biobase)
test = read.AnnotatedDataFrame("samples3.txt", sep="\t", quote="")
test.data = ReadAffy(phenoData=test)
norm.data = rma(test.data)
prob.data=betr(eset=norm.data, twoColor=FALSE, twoCondition=NULL,
+timepoint=as.numeric(pData(norm.data)$time),
+replicate=as.character(pData(norm.data)$rep), alpha=0.05)
write.table(prob.data, file=”betr_results.txt”, sep=”\t”)
Next time
• pbx1 assignment…..find location of the probes
in another one of the probesets for zebrafish.
• Read limma documentation
• Run limma on your data set
• Be sure you have your Galaxy account set up
Download