Genomic Profiles of Brain Tissue in Humans and Chimpanzees Data http://www.ebi.ac.uk/arrayexpress The samples are hybridized to Affymetrix Genechip® Human Genome U95B [HG_U95B] (A-AFFY-1) http://www.ebi.ac.uk/aerep/dataselection?expid=352682 122 Samples: 3 humans and 3 chimps 7 brain regions some samples have multiple hybridizations to other Hu arrays I selected 4 regions, 1 sample per biological replicate per region = 24 arrays Data Our set up is this: cortex Brain i cerebellum caudate human or chimp Broca This is called a split plot design. The cortex and cerebellum from the same brain are more similar than cortex and cerebellum from different brains in the same species. This induces a random effect for biological replicate. We will not deal with this in the analysis, but Limma can handle 1 random effect. Pre-processing Since I do not have enough memory to read in the arrays and then normalize, I used brains=justRMA() to store the normalized probeset summaries. "affy" automatically downloaded the HG_U95Av2 cdf to attach probeset names. The probeset names can be recovered using the "slot" geneNames: gNames=geneNames(brains) Pre-processing We do not want to use the Affymetrix control probes in the analysis, so we should eliminate these. length(gName) cbind(12500:12625,gNames[12500:12625]) Pre-processing We also need the sample information which we will use to define the treatments. This is stored in the "phenotype data" pData(brains) Only the array names have been stored. To add the species and brain region information, we need to add columns to the phenotype data. brain.exprs=exprs(brains)[1:12558,] Pre-processing We also need the sample information which we will use to define the treatments. This is stored in the "phenotype data" pData(brains) Only the array names have been stored. To add the species and brain region information, we need to add columns to the phenotype data. pData(brains)=cbind(pData(brains), species=c(rep("chimp",12),rep("human",12)), tissue=rep(c("pcortex","caudate","cerebellum", "broca"),6), trts=c(rep(c("ca","cb","cc","cd"),3), rep(("ha","hb","hc","hd",3))) Running Limma There are 3 steps to running Limma: Run the least squares mixed model. Adjust the t-tests by using an eBayes model on the variances. Use multiple comparisons adjustments to select genes that have statistically significant differential expression. Limma Models A typical model for Y=log(expression) for these data would be: Y=m+s+r+(sr)+e m s r (sr) e average over all the genes species main (average) effect brain region main effect interaction random error Limma can run this model, but the resulting test statistics test that all the parameters are 0, including m Limma Models Computationally, limma does regression on a design matrix. If we understand design matrices, we can better understand how to get the information we need from limma. (Here I move to the board and explain indicator variables and contrasts) Creating a Design Matrix with a Formula factorial=model.matrix(~species*tissue,data=brain.pheno) Sets up a design matrix with a column of 1's and (0,1) indicator for the levels of species and tissue. The levels are put in alphabetical order, and the first category is omitted. This is the preferred formulation for what is called a "type 3" analysis, which is what we teach in Statistics class, but not very convenient for Limma analysis, which provides a t-test for the regression coefficients. Creating a Design Matrix Generally we are interested in questions like: Does the mean expression of this gene differ in chimps and humans? Does the expression of this gene in the cortex differ between chimps and humans. These are most readily expressed as contrasts among means. What I find most convenient is to start by setting up a design matrix for the treatments, using the cell means model. This provides the required estimate of error variance as well as names for the columns of the design matrix which are useful for setting up a contrast matrix. Creating a Design Matrix design=model.matrix(~trt-1,data=brain.pheno) fitmeans=lmFit(brain.exprs,design) We then decide what contrasts we want to do. e.g. We might be most interested in differences between human and chimp in each brain region: ca-ha, cb-hb, cc-hc, cd-hd Or we might want to know if the difference between 2 regions is the same in human and chimp: (ca-cb)-(ha-hb) We can have as many of these differences as we want. Creating a Design Matrix I also like to get an F-test which is an overall test of differential expression. For this, an set of T-1 independent contrasts will do. Limma provides an F-test no matter what contrasts are in the contrast matrix, but this is not a standard test. It corresponds to the usual ANOVA F-test when T-1 independent contrasts are provided. Creating a Design Matrix e.g. main contrasts of interest are: average human - average chimp human - chimp in each region ave H - ave C = (ha+hb+hc+hd)-(ca+cb+cc+cd) (-1 -1 -1 -1 1 1 1 1) difference in a=cortex ha-ca (-1 0 0 0 1 0 0 0) etc ave H - ave C = (ha+hb+hc+hd)-(ca+cb+cc+cd) (-1 -1 -1 -1 1 1 1 1) difference in a=cortex ha-ca (-1 0 0 0 1 0 0 0) etc contrast.matrix=makeContrasts( main= (trtsha+trtshb+trtshc+trtshd)(trtsca+trtscb+trtscc+trtscd), cortex=trtsha-trtsca,caudate=trtshbtrtscb,cereb=trtshc-trtscc, broca=trtshd-trtscd, Fitting the Contrasts fit.contrasts=contrasts.fit(fitMean,contrast.matrix) #eBayes step efit.contrasts=eBayes(fit.contrasts) Selecting the Differentially Expressed Genes It is nice to look at the p-values before adjusting for multiple comparisons par(mfrow=c(2,3)) for (i in 1:5) { hist(efit.contrasts$p.value[,i], main= paste(colnames(efit.contrasts$p.value)[i])) } Selecting the Differentially Expressed Genes topTable(efit.contrasts,coef="broca",n=20, adjust="BH") or look at the qvalues and pi0 library(qvalue) q.broca=qvalue(efit.contrasts$p.value$broca) q.broca$pi0