Practical: Experimental Design
14 October 2008

This practical should be done in R and requires two libraries: smida and nlme. Before starting, type the following in R:

    > library(smida)
    > library(nlme)

1. Optimal design is a matter of allocating the observations to the conditions in such a way that the parameters of interest are estimated “optimally”. In simple microarray terms: if we had a certain number of dual-channel microarrays available, as well as an (unlimited) amount of RNA from several biological conditions of interest, then which conditions should we put on which arrays in order to increase the precision of finding differentially expressed genes? The R function od uses simulated annealing to calculate the A-optimal or D-optimal design for two-channel microarrays.

(a) For calculating an A-optimal design, you should specify:
• nt: number of treatment levels (e.g. number of time points)
• ns: number of slides
Try the function for yourself, e.g. calculate the optimal design for 5 conditions with 5 arrays, ignoring a dye effect:

    od1 <- od(5, 5, dye = FALSE)

Is the answer what you expected?

(b) Are you satisfied with the direction of the arrows in the previous design? As these designs make no attempt to estimate a possible dye effect, the direction of the arrows does not matter in the calculations. However, in practice it might matter. Let’s include a directional “dye” parameter in the model (in order to eliminate it!) and optimize the design with respect to this parameter as well:

    od2a <- od(5, 5, dye = TRUE)

How do the arrows go now? What does this mean for each condition?

(c) The R function plot.od can plot the resulting design object in a circle (method = "circle") or in a nice layout (method = "sammon"). Try using this function:

    plot.od(od2a, method = "sammon")

(d) Are loop designs always optimal?
Find optimal designs for larger numbers of conditions and slides:

    od2b <- od(8, 8, dye = TRUE)
    od2c <- od(9, 9, dye = TRUE)

(e) It is possible to use od for any kind of situation, including large designs such as:

    od3 <- od(15, 45, n.iter = 5000)

(Note 1: The default of 1000 iterations may not be enough. Note 2: dye = TRUE is the default setting and therefore does not have to be specified.) The resulting plot may look rather messy, and it is sometimes easier to redisplay the design via:

    plot.od(od3, method = "sammon", magic = 0.1)
    plot.od(od3, method = "sammon", magic = 2)

(f) Still, this solution may not seem very appealing, as it might be difficult to implement in practice without making mistakes. In that case, we could restrict our search to a subclass of symmetric designs, the so-called interwoven loop designs:

    od4 <- od(15, 45, method = "loop")

2. In what follows we discuss gene expression data from a skin cancer microarray experiment. This data set comes from a two-channel microarray experiment carried out on cancerous and normal (control) skin tissue samples using 4 arrays. It was conducted by Dr. Nighean Barr at the Cancer Research UK Beatson Laboratories in Glasgow. The data can be loaded in R via the data(skin) command.

(a) T-test. The t-test is a standard way to compare the means of two populations. Apply the t-test (t.test) to the first gene. Does it seem that the first gene is important?

    > g1.tt <- t.test(skin$dat[,1] ~ skin$conditions)
    > g1.tt
    > g1.tt$p.value

(b) Use the apply function to apply the t-test to all genes. Use a histogram to get an idea of how many genes can be detected as differentially expressed.

    > getpvals.tt <- function(y, x) { t.test(y ~ x)$p.value }
    > pvals.tt <- apply(skin$dat, 2, getpvals.tt, x = skin$conditions)
    > hist(pvals.tt)

(c) Observations on the same physical spot may be correlated. It is therefore sensible to introduce a (random) effect term to take this additional correlation into account.
Use a random effects model of the following form:

    > library(nlme)
    > sk1 <- NULL
    > sk1$x <- as.factor(skin$conditions)
    > sk1$y <- skin$dat[,1]
    > sk1$array <- rep(1:4, each = 2)
    > sk1$grp <- rep(1, 8)
    > g1.lme <- lme(y ~ x, data = sk1, random = list(grp = pdIdent(~ -1 + array)))
    > summary(g1.lme)
    > summary(g1.lme)$tTable[2,5]    # p-value for the condition effect

Apply this to all the genes and plot the histogram of all the p-values. (Warning: this can take 10 minutes. Moreover, eliminate genes 3791, 5646, 5961, 7257 and 9199, as they have convergence problems.) How many genes seem to be differentially expressed?

    > getpvals <- function(y, x) {
          sk <- NULL
          sk$x <- as.factor(x)
          sk$y <- y
          sk$grp <- rep(1, 8)
          sk$array <- rep(1:4, each = 2)
          g.lme <- lme(y ~ x, data = sk, random = list(grp = pdIdent(~ -1 + array)))
          summary(g.lme)$tTable[2,5]
      }
    > dat <- skin$dat[, -c(3791, 5646, 5961, 7257, 9199)]
    > pvals.lme <- apply(dat, 2, getpvals, x = skin$conditions)
    > hist(pvals.lme)

(d) Use a 5% significance cut-off to determine the differentially expressed genes.

(e) Use a 5% Bonferroni corrected cut-off to determine the differentially expressed genes.

(f) Use the Benjamini and Hochberg procedure to calculate the number of differentially expressed genes at an FDR of 5%.

(g) Discuss which error rate is the more suitable one in these circumstances.

Benjamini & Hochberg FDR procedure. Let $P_{(1)} \le P_{(2)} \le \dots \le P_{(n)}$ be the ordered p-values ($P_{(1)}$ is the smallest), and let $k$ be the largest $g$ ($0 \le g \le n$) for which

$$P_{(g)} \le \frac{g}{n}\,\alpha;$$

then reject all $H_{(g)}$, for $g = 1, 2, \dots, k$, where $H_{(g)}$ is the null hypothesis associated with $P_{(g)}$. This guarantees that the FDR is less than or equal to $\alpha$.
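
The three cut-offs in (d) to (f) can be compared on any vector of p-values. The following is a minimal sketch, not part of the original practical: the helper name bh_reject_count and the simulated vector p_sim are assumptions made for illustration, and the simulated values are not the skin data. The helper applies the step-up rule stated above directly, and the result is compared with base R's p.adjust(..., method = "BH").

```r
# Hypothetical helper (not from smida or nlme): number of hypotheses
# rejected by the Benjamini & Hochberg step-up rule at level alpha.
bh_reject_count <- function(pvals, alpha = 0.05) {
  n <- length(pvals)
  p_sorted <- sort(pvals)                                # P(1) <= ... <= P(n)
  passes <- which(p_sorted <= (seq_len(n) / n) * alpha)  # g with P(g) <= g*alpha/n
  if (length(passes) == 0) 0L else max(passes)           # k = largest such g
}

# Simulated p-values for illustration only (assumption: 900 null genes
# with uniform p-values, 100 genes with small p-values).
set.seed(1)
p_sim <- c(runif(900), rbeta(100, 1, 200))

alpha <- 0.05
n_raw     <- sum(p_sim <= alpha)                            # (d) unadjusted 5% cut-off
n_bonf    <- sum(p_sim <= alpha / length(p_sim))            # (e) Bonferroni cut-off
n_bh      <- bh_reject_count(p_sim, alpha)                  # (f) BH step-up rule
n_bh_padj <- sum(p.adjust(p_sim, method = "BH") <= alpha)   # (f) via base R

print(c(raw = n_raw, bonferroni = n_bonf, bh = n_bh, bh_padjust = n_bh_padj))
```

In the practical itself, pvals.tt or pvals.lme would take the place of p_sim. The Bonferroni count can never exceed the BH count, which in turn can never exceed the unadjusted count, so the three numbers give a quick picture of how conservative each error rate is.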