Genomic Profiles of Brain Tissue in Humans and Chimpanzees

advertisement
Genomic Profiles of Brain
Tissue in Humans and
Chimpanzees
Data
http://www.ebi.ac.uk/arrayexpress
The samples are hybridized to Affymetrix Genechip® Human
Genome U95B [HG_U95B] (A-AFFY-1)
http://www.ebi.ac.uk/aerep/dataselection?expid=352682
122
Samples:
3 humans and 3 chimps
7 brain regions
some samples have multiple hybridizations to other Hu arrays
I selected 4 regions, 1 sample per biological replicate per region =
24 arrays
Data
Our set up is this:
cortex
Brain i
cerebellum
caudate
human or chimp
Broca
This is called a split plot design. The cortex and
cerebellum from the same brain are more similar than
cortex and cerebellum from different brains in the same
species. This induces a random effect for biological
replicate. We will not deal with this in the analysis, but
Limma can handle 1 random effect.
Pre-processing
Since I do not have enough memory to read in the
arrays and then normalize, I used
brains=justRMA()
to store the normalized probeset summaries. "affy"
automatically downloaded the HG_U95Av2 cdf to
attach probeset names.
The probeset names can be recovered using
the "slot" geneNames:
gNames=geneNames(brains)
Pre-processing
We do not want to use the Affymetrix control probes in
the analysis, so we should eliminate these.
length(gName)
cbind(12500:12625,gNames[12500:12625])
Pre-processing
We also need the sample information which we will use
to define the treatments. This is stored in the
"phenotype data"
pData(brains)
Only the array names have been stored. To add the
species and brain region information, we need to add
columns to the phenotype data.
brain.exprs=exprs(brains)[1:12558,]
Pre-processing
We also need the sample information which we will use to define the
treatments. This is stored in the "phenotype data"
pData(brains)
Only the array names have been stored. To add the species and brain region
information, we need to add columns to the phenotype data.
pData(brains)=cbind(pData(brains),
species=c(rep("chimp",12),rep("human",12)),
tissue=rep(c("pcortex","caudate","cerebellum",
"broca"),6),
trts=c(rep(c("ca","cb","cc","cd"),3),
rep(("ha","hb","hc","hd",3)))
Running Limma
There are 3 steps to running Limma:
Run the least squares mixed model.
Adjust the t-tests by using an eBayes model on
the variances.
Use multiple comparisons adjustments to select
genes that have statistically significant
differential expression.
Limma Models
A typical model for Y=log(expression) for these data
would be:
Y=m+s+r+(sr)+e
m
s
r
(sr)
e
average over all the genes
species main (average) effect
brain region main effect
interaction
random error
Limma can run this model, but the resulting test statistics
test that all the parameters are 0, including m
Limma Models
Computationally, limma does regression on a design
matrix. If we understand design matrices, we can
better understand how to get the information we need
from limma.
(Here I move to the board and explain indicator variables
and contrasts)
Creating a Design Matrix
with a Formula
factorial=model.matrix(~species*tissue,data=brain.pheno)
Sets up a design matrix with a column of 1's and (0,1) indicator for
the levels of species and tissue.
The levels are put in alphabetical order, and the first category is
omitted.
This is the preferred formulation for what is called a "type 3"
analysis, which is what we teach in Statistics class, but not very
convenient for Limma analysis, which provides a t-test for the
regression coefficients.
Creating a Design Matrix
Generally we are interested in questions like:
Does the mean expression of this gene differ in chimps and
humans?
Does the expression of this gene in the cortex differ between
chimps and humans.
These are most readily expressed as contrasts among means.
What I find most convenient is to start by setting up a design
matrix for the treatments, using the cell means model.
This provides the required estimate of error variance as well as
names for the columns of the design matrix which are useful for
setting up a contrast matrix.
Creating a Design Matrix
design=model.matrix(~trt-1,data=brain.pheno)
fitmeans=lmFit(brain.exprs,design)
We then decide what contrasts we want to do. e.g. We might be
most interested in differences between human and chimp in
each brain region: ca-ha, cb-hb, cc-hc, cd-hd
Or we might want to know if the difference between 2 regions is
the same in human and chimp:
(ca-cb)-(ha-hb)
We can have as many of these differences as we want.
Creating a Design Matrix
I also like to get an F-test which is an overall test of differential
expression.
For this, an set of T-1 independent contrasts will do.
Limma provides an F-test no matter what contrasts are in the
contrast matrix, but this is not a standard test. It corresponds to
the usual ANOVA F-test when T-1 independent contrasts are
provided.
Creating a Design Matrix
e.g. main contrasts of interest are:
average human - average chimp
human - chimp in each region
ave H - ave C = (ha+hb+hc+hd)-(ca+cb+cc+cd)
(-1 -1 -1 -1 1 1 1 1)
difference in a=cortex ha-ca (-1 0 0 0 1 0 0 0)
etc
ave H - ave C = (ha+hb+hc+hd)-(ca+cb+cc+cd)
(-1 -1 -1 -1 1 1 1 1)
difference in a=cortex ha-ca (-1 0 0 0 1 0 0 0)
etc
contrast.matrix=makeContrasts(
main= (trtsha+trtshb+trtshc+trtshd)(trtsca+trtscb+trtscc+trtscd),
cortex=trtsha-trtsca,caudate=trtshbtrtscb,cereb=trtshc-trtscc,
broca=trtshd-trtscd,
Fitting the Contrasts
fit.contrasts=contrasts.fit(fitMean,contrast.matrix)
#eBayes step
efit.contrasts=eBayes(fit.contrasts)
Selecting the Differentially
Expressed Genes
It is nice to look at the p-values before
adjusting for multiple comparisons
par(mfrow=c(2,3))
for (i in 1:5) {
hist(efit.contrasts$p.value[,i],
main=
paste(colnames(efit.contrasts$p.value)[i]))
}
Selecting the Differentially
Expressed Genes
topTable(efit.contrasts,coef="broca",n=20,
adjust="BH")
or look at the qvalues and pi0
library(qvalue)
q.broca=qvalue(efit.contrasts$p.value$broca)
q.broca$pi0
Download