Practical: Experimental Design 14 October 2008

This practical should be done in R and will require two libraries: smida and nlme.
Before starting, type the following in R:
> library(smida)
> library(nlme)
1. Optimal design is a matter of allocating the observations to the conditions in such
a way that the parameters of interest are estimated as precisely as possible. In simple
microarray terms:
If we had a certain number of dual-channel microarrays available as well as
an (unlimited) amount of RNA from several biological conditions of interest,
then which conditions should we put on which arrays in order to increase the
precision of finding differentially expressed genes?
The R-function od uses simulated annealing to calculate the A-optimal or D-optimal
design for two-channel microarrays.
(a) For calculating an A-optimal design, you should specify:
• nt: number of treatment levels (e.g. number of time points)
• ns: number of slides
Try the function for yourself, e.g. calculate the optimal design for 5 conditions
with 5 arrays, ignoring a dye-effect:
od1 <- od(5, 5, dye = F)
Is the answer what you expected?
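One way to check a candidate design by hand is to compute an A-type criterion directly. The sketch below is only an illustration in base R and does not show how od works internally. It assumes a log-ratio model without dye effect, in which each array measures the difference between two condition effects with unit error variance, and it uses the average variance of all pairwise contrasts as the criterion. The function names design_matrix and avg_contrast_var are hypothetical helpers, not part of smida.

```r
# Array-by-condition design matrix for a two-channel design (hypothetical helper).
# Each row is one array: +1 for the condition on one dye, -1 for the other.
design_matrix <- function(red, green, nt) {
  X <- matrix(0, nrow = length(red), ncol = nt)
  X[cbind(seq_along(red), red)] <- 1
  X[cbind(seq_along(green), green)] <- -1
  X
}

# Average variance (in units of sigma^2) of all pairwise contrasts
# between conditions, assuming no dye effect (hypothetical helper).
avg_contrast_var <- function(X) {
  nt <- ncol(X)
  X1 <- X[, -1, drop = FALSE]            # condition 1 is the baseline
  V <- solve(crossprod(X1))              # covariance of the baseline effects
  pairs <- t(combn(nt, 2))               # all pairs (i, j) with i < j
  C <- matrix(0, nrow(pairs), nt)
  C[cbind(seq_len(nrow(pairs)), pairs[, 1])] <- 1
  C[cbind(seq_len(nrow(pairs)), pairs[, 2])] <- -1
  C1 <- C[, -1, drop = FALSE]            # contrasts in baseline parametrisation
  mean(diag(C1 %*% V %*% t(C1)))
}

# Loop design: 1->2, 2->3, 3->4, 4->5, 5->1
loop <- design_matrix(red = 1:5, green = c(2:5, 1), nt = 5)
# Design with condition 1 as reference on every array (array 1->2 replicated)
star <- design_matrix(red = c(1, 1, 1, 1, 1), green = c(2, 3, 4, 5, 2), nt = 5)

a_loop <- avg_contrast_var(loop)
a_star <- avg_contrast_var(star)
c(loop = a_loop, star = a_star)
```

Under these assumptions the loop design gives the smaller average contrast variance of the two, so it is preferred by this criterion.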
(b) Are you satisfied with the direction of the arrows in the previous design? As
these designs make no attempt to estimate a possible dye-effect, the direction
of the arrows does not really matter in the calculations. However, in practice it
might matter. Let’s include a directional “dye” parameter in the model (in order
to eliminate it!) and optimize the design also with respect to this parameter:
od2a <- od(5, 5, dye = T)
How do the arrows go now? What does this mean for each condition?
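The role of the arrow directions can be illustrated with a small base-R calculation. This is a sketch under stated assumptions, not the computation performed by od: the dye effect is modelled as a constant added to every log-ratio, the error variance is one, and the criterion is the sum of the variances of the condition effects relative to condition 1. The helpers design_matrix and trace_var are hypothetical names.

```r
# Array-by-condition design matrix: +1 for the red channel, -1 for the green one
design_matrix <- function(red, green, nt) {
  X <- matrix(0, nrow = length(red), ncol = nt)
  X[cbind(seq_along(red), red)] <- 1
  X[cbind(seq_along(green), green)] <- -1
  X
}

# Sum of variances of the condition effects (condition 1 as baseline),
# with or without a constant dye effect in the model (hypothetical helper).
trace_var <- function(X, dye = FALSE) {
  Z <- X[, -1, drop = FALSE]
  if (dye) Z <- cbind(dye = 1, Z)        # dye effect enters as an intercept
  V <- solve(crossprod(Z))
  if (dye) V <- V[-1, -1, drop = FALSE]  # keep only the condition effects
  sum(diag(V))
}

# Consistently oriented loop: every condition is once red and once green
loop <- design_matrix(red = 1:5, green = c(2:5, 1), nt = 5)
# Same loop with the last arrow reversed (1->5 instead of 5->1)
flipped <- design_matrix(red = c(1, 2, 3, 4, 1), green = c(2, 3, 4, 5, 5), nt = 5)

colSums(loop)      # all zero: the design is dye-balanced
colSums(flipped)   # conditions 1 and 5 are not balanced over the dyes

loop_nodye <- trace_var(loop, dye = FALSE)
loop_dye   <- trace_var(loop, dye = TRUE)
flip_dye   <- trace_var(flipped, dye = TRUE)
c(loop_nodye = loop_nodye, loop_dye = loop_dye, flip_dye = flip_dye)
```

In the oriented loop the dye column is orthogonal to the condition columns, so estimating the dye effect costs nothing in precision. Reversing a single arrow breaks this balance and inflates the variances.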
(c) The R-function plot.od can plot the resulting design object in a circle (method
= "circle") or in a nice layout (method="sammon"). Try using this function.
plot.od(od2a,method="sammon")
(d) Are loop designs always optimal? Find optimal designs for larger numbers of
conditions and slides:
od2b <- od(8, 8, dye = T)
od2c <- od(9, 9, dye = T)
(e) It is possible to use od for any kind of situation. For example, for large designs
such as:
od3 <- od(15, 45, n.iter=5000)
(Note 1: The default 1000 iterations may not be enough.
Note 2: dye= T is the default setting and therefore does not have to be specified.)
This may look rather messy and sometimes it is easier to redisplay this design
via:
plot.od(od3, method="sammon", magic = 0.1)
plot.od(od3, method="sammon", magic = 2)
(f) Still, this solution may not seem very appealing, as it might be difficult to implement in practice without making mistakes. In that case, we could restrict our
search to a subclass of symmetric designs, the so-called interwoven loop designs.
od4 <- od(15, 45, method="loop")
2. In what follows we discuss gene expression data coming from a skin cancer microarray
experiment. This data set details a two-channel microarray experiment carried out on
cancerous and normal (control) skin tissue samples using 4 arrays. It was conducted
by Dr. Nighean Barr at the Cancer Research UK Beatson Laboratories in Glasgow.
The data can be found in R via the data(skin) command.
(a) T-test. The t-test is a standard way to compare the means of two populations.
Apply the t-test (t.test) to the first gene. Does the first gene seem to be
differentially expressed?
> g1.tt <- t.test(skin$dat[,1]~skin$conditions)
> g1.tt
> g1.tt$p.value
(b) Use the apply function to apply the t-test to all genes. Use a histogram to get
an idea of how many genes can be detected to be differentially expressed.
> getpvals.tt <- function(y,x){
t.test(y~x)$p.value
}
> pvals.tt<-apply(skin$dat,2,getpvals.tt,x=skin$conditions)
> hist(pvals.tt)
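To help read such a histogram, it can be useful to look at p-values from data where the truth is known. The following sketch uses simulated data only, not the skin data. The layout (8 observations per gene, 4 per condition), the labels "cancer" and "normal", the number of genes and the size of the shift are all assumptions made for illustration.

```r
set.seed(1)

n_null <- 900                      # genes with no difference between conditions
n_alt  <- 100                      # genes with a true difference
conditions <- rep(c("cancer", "normal"), each = 4)   # assumed labels

# Rows are observations, columns are genes (same orientation as apply(., 2, .))
sim <- matrix(rnorm(8 * (n_null + n_alt)), nrow = 8)
alt_cols <- n_null + seq_len(n_alt)
# Shift the "cancer" observations of the last n_alt genes by 3 standard deviations
sim[conditions == "cancer", alt_cols] <- sim[conditions == "cancer", alt_cols] + 3

getp <- function(y, x) t.test(y ~ x)$p.value
pvals.sim <- apply(sim, 2, getp, x = conditions)

# Null genes give a roughly flat histogram; the truly different genes
# pile up near zero and produce the peak at the left.
hist(pvals.sim, breaks = 20)
```

The height of the left-hand peak above the flat background gives a rough impression of how many genes are differentially expressed.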
(c) Observations on the same physical spot may be correlated. It is therefore sensible to introduce a (random) effect term to take into account this additional
correlation. Use a random effects model of the following form:
> library(nlme)
> sk1<-NULL
> sk1$x<-as.factor(skin$conditions)
> sk1$y<-skin$dat[,1]
> sk1$array<-rep(1:4,each=2)
> sk1$grp <- rep(1,8)
> g1.lme <- lme(y ~ x, dat=sk1, random=list(grp = pdIdent(~-1+array)))
> summary(g1.lme)
> summary(g1.lme)$tTable[2,5]
Apply this to all the genes (warning: this can take 10 minutes; moreover, eliminate genes 3791, 5646, 5961, 7257 and 9199 as they have convergence problems)
and plot the histogram of all the p-values. How many genes seem to be differentially expressed?
> getpvals <- function(y,x){
sk<-NULL
sk$x<-as.factor(x)
sk$y<-y
sk$grp <- rep(1,8)
sk$array <- rep(1:4,each=2)
g.lme <- lme(y ~ x, dat=sk, random=list(grp = pdIdent(~-1+array)))
summary(g.lme)$tTable[2,5]
}
> dat <- skin$dat[,-c(3791, 5646, 5961, 7257, 9199)]
> pvals.lme <- apply(dat,2,getpvals,x=skin$conditions)
> hist(pvals.lme)
(d) Use a 5% significance cut-off to determine the differentially expressed genes.
(e) Use a 5% Bonferroni corrected cut-off to determine the differentially expressed
genes.
(f) Use the Benjamini and Hochberg procedure to calculate the number of differentially expressed genes at an FDR of 5%.
(g) Discuss which error rate is the more suitable one in these circumstances.
Benjamini & Hochberg FDR procedure. Let $P_{(1)} \le P_{(2)} \le \dots \le P_{(n)}$ be the ordered p-values ($P_{(1)}$ is the smallest), and let k be the largest g (0 ≤ g ≤ n) for which
$$P_{(g)} \le \frac{g\alpha}{n};$$
then reject all $H_{(g)}$, for g = 1, 2, . . . , k, where $H_{(g)}$ is the associated null hypothesis.
This guarantees that the FDR is less than or equal to α.
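The three cut-offs in (d) to (f) can be compared on a small example. The sketch below uses ten made-up p-values rather than the skin data, and bh_reject is a hypothetical helper that implements the step-up rule described above. The last line cross-checks it against p.adjust from base R.

```r
# Benjamini-Hochberg step-up rule: find the largest g with P(g) <= g * alpha / n
# and reject the hypotheses belonging to the g smallest p-values.
bh_reject <- function(p, alpha = 0.05) {
  n <- length(p)
  ord <- order(p)                                   # indices of sorted p-values
  below <- which(p[ord] <= seq_len(n) * alpha / n)  # ranks meeting the bound
  k <- if (length(below) > 0) max(below) else 0
  reject <- rep(FALSE, n)
  if (k > 0) reject[ord[seq_len(k)]] <- TRUE
  reject
}

# Made-up p-values for illustration only
p <- c(0.212, 0.001, 0.06, 0.014, 0.041, 0.205, 0.012, 0.074, 0.042, 0.216)
alpha <- 0.05

n_raw  <- sum(p <= alpha)                  # (d) unadjusted 5% cut-off
n_bonf <- sum(p <= alpha / length(p))      # (e) Bonferroni-corrected cut-off
n_bh   <- sum(bh_reject(p, alpha))         # (f) Benjamini-Hochberg at FDR 5%

# Cross-check with the built-in adjustment
n_bh_check <- sum(p.adjust(p, method = "BH") <= alpha)

c(raw = n_raw, bonferroni = n_bonf, BH = n_bh, BH_check = n_bh_check)
```

In this example the p-value 0.012 lies above its own bound (2 × 0.05 / 10 = 0.01) but is still rejected, because the third smallest p-value, 0.014, lies below its bound of 0.015. This is why the procedure looks for the largest g rather than stopping at the first failure.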