Microarray experimental design Professor Ernst Wit University of Groningen

advertisement
Microarray experimental design
Professor Ernst Wit
University of Groningen
e.c.wit@rug.nl
http://www.math.rug.nl/∼ernst
14 October 2008
Why DOING a microarray experiment?
Obvious, isn’t it?
To discover
• “genes that are differentially expressed between tissues.”
• “a typical genomic profile of a particular disease.”
• “differences between phenotypic similar pathologies.”
• ...
... i.e., to find some differences.
1
Why DESIGNING a microarray experiment?
Yeah, why?
We want to be sure that the the differences we find are
• due to biological differences (e.g. different tissue, different diseases);
• not due to other experimental conditions (e.g. microarray batch);
• not due to chance variation.
Our exp’t should maximize the difference between biological conditions.
2
4 known design principles
1. Vary the conditions of interest, with replicates that are
• as many as possible;
• biological rather than technical replicates;
• as evenly distributed across the conditions of interest.
2. Restriction: Keep conditions not of interest constant.
3. Blocking: Keep track of uninteresting conditions that do vary.
4. Matching: Channels are similar units for effective comparisons.
3
Optimizing assignment of replicates to conditions
We shall consider the following three questions:
1. Pooling experimental material.
2. Dual-channel microarray designs:
(a) loop designs vs reference designs?
(b) dye-swap designs?
3. How to detect differentially expressed genes?
4
A statistical toy model for gene expression
log ygcbaxdpkr = θgc + Bb + Aba + La(x) + Dad(θgc) + Pp + Sag + ²gck + ηgckr .
where
• θgc = true expression for gene g under condition c
• Bb = RNA Batch effect (experimenter, time of day, temperature)
• Aba = array effect (scanning level, pre/post-washing)
• La(x) = location effect (chip, coverslip, washing)
• Dad (θgc ) = dye effect (dye, unequal mixing of mixtures, labelling, intensity)
• Pp = print pin effect
• Sag = spot effect (amount of DNA in the spot printed on slide)
• ²gck = biological variation for individual k
• ηgckr = within-replicate-k variation of a technical replicate r
5
Normalization
The proposed model:
log ygcbaxdpkr = θgc + Bb + Aba + La(x) + Dad(θgc) + Pp + Sag + ²gck + ηgckr .
It is hoped that the structural artifacts,
Bb, Aba, La(x), Dad(θgc), Pp
can be eliminated by means of “normalization”.
This leaves us a model for gene g at condition c on array a for the rth
technical replicate of individual k,
0
log ygcakr = θgc + Sag + ²gck + ηgcakr
,
0
where ηgcakr
includes non-structural artifacts from nuisance effects.
6
Rephrasing our three questions
From the model:
Cy3:
Cy5:
log xikr = θi + S + ²ik + ηikr
log xjkr = θj + S + ²jk + ηjkr ,
We define 3 microarray design problems:
1. How to deal with biological variation ²ik ?
2. How to estimate differential expression δij = θi − θj ?
3. How to deal with bias S?
• Simple answer:“taking log-ratios”
• Slightly less simple: “insert a random effect” (see analysis)
7
Pooling reduces the amount of biological variation
ση2
n,
σ²2
nk
0.020
0.015
0.010
0.005
V(bio)/V(tech) = 1,
C(array)/C(sample) = 7
V(bio)/V(tech) = 2,
C(array)/C(sample) = 7
V(bio)/V(tech) = 1,
C(array)/C(sample) = 4
0.000
Variance of difference estimate
Minimize error V (δ̂) =
+
subject to budget B = n k Cs + n Ca,
• Cs/2 = costs individual sample extraction
• Ca
= cost of a measurement (microarray, dyes, etc)
2
4
6
8
10
number of samples in pool (k)
8
Optimal design in 2-channel microarrays
Question: “Which design is more efficient in estimating the contrasts?”
Reference Design
Loop Design
1
1
5
2
5
2
R
4
3
4
3
Note: each arrow stands for 1 microarray (green dye −→ red dye).
9
Some results from the literature ...
These are optimal even designs
for all the contrasts when the
number of arrays is two more than
the number of conditions.
(Kerr and Churchill 2001; exhaustive search)
The problem with these and similar results
• not of practical interest.
• no indication how this result can be
extended.
10
Use od: optimal (interwoven loop) designs in R
The function
od(nt, ns, optimality = ..., method = "loop")
finds
• L-optimal/D-optimal
• interwoven loop/all designs
• for any number of conditions
• and any number of slides
Can be obtained from smida library at:
http://www.maths.lancs.ac.uk/∼wite/book
Some conclusions...
11
Loop designs are often optimal...
8
4
3
6
1
3
2
1
5
7
7
5
2
Stochastic Search of A−Optimal Design for Contrasts with Simulated Annealing.
Acceptance rate = 36.4 %. Score := Tr[ Inv(X’ X) ] = 28 . Nr of iterations = 1000 .
Nr of arrays = 7 . Nr of conditions = 7
4
6
Stochastic Search of A−Optimal Design for Contrasts with Simulated Annealing.
Acceptance rate = 37.1 %. Score := Tr[ Inv(X’ X) ] = 42 . Nr of iterations = 1000 .
Nr of arrays = 8 . Nr of conditions = 8
L-optimal designs for 7/8 conditions with 7/8 slides, respectively.
12
...but not always
7
6
9
3
4
8
5
1
2
Stochastic Search of A−Optimal Design for Contrasts with Simulated Annealing.
Acceptance rate = 36.3 %. Score := Tr[ Inv(X’ X) ] = 57.5 . Nr of iterations = 1000 .
Nr of arrays = 9 . Nr of conditions = 9
L-optimal design with 9 conditions and 9 microarrays.
13
Algorithm can be slow for many slides and conditions
1
15
2
3
14
13
4
5
12
6
11
10
7
9
8
Trace Score for Contrasts: Tr[ Inv(X’X) ] = 35.8007 .
No of arrays = 45 . No of conditions = 15
The algorithm requires a large number of iterations when
• number of arrays grows
⇒ linear impact
• number of conditions grows ⇒ quadratic impact
14
A Compromise: interwoven loop-designs
1
15
2
3
14
13
4
5
12
6
11
10
7
9
8
An interwoven loop design is defined by:
• number of conditions
(HERE: 15)
• number of loops
(HERE: 3)
• jump size for each loop
(HERE: 1, 4, 6)
15
Advantages of an Interwoven Loop design
27
28
29
23
14
30
1
2
3
28
19
12
9
10
3
20
15
22
1
18
24
8
23
9
10
22
21
5
8
4
7
21
13
30
29
6
25
26
7
5
26
11
6
2
4
27
16
11
20
12
19
24
17
25
13
18
17
16
15
14
• Easy to implement — Compare left with right!
• Highly efficient — Left is only 0.6% more efficient than Right
• Automatic “dye balance” — conditions measured equally often with Cy3/Cy5
16
Dye Swap or Dye Balance?
The simple model needs to be expanded for a possible dye effect:
log y1kr = θ1 + S +
+ ²1k + η1kr
log y2kr = θ2 + S + D + ²2k + η2kr ,
Dye swap designs have been proposed as a way to control the dye effect.
HOWEVER, interwoven loop designs are more efficient!
5
6
7
8
cond.
cond.
cond.
cond.
10
12
14
16
arrays
arrays
arrays
arrays
D
64%
50%
37%
25%
L
80%
75%
69%
62%
Relative D- and L-efficiency of dye-swap versus interwoven loop.
17
Estimation of effects of interest
A nightmare becoming reality...
So, you have let yourself be convinced by a statistician to do a more
complicated microarray design.
It is “optimal”, alright, but how can you estimate the effects of interest?
Aim: a step by step guide to analyze the data.
1. Create a variable with fixed effects of interests
2. Create one or more variables with random nuisance effects.
19
Example: analyze data with mixed effect model
We observe y=(0.62, 1.16, 1.51, 3.05, 2.78, 3.61, 3.61, 4.93, 5.12, 0.62), using the
loop design without biological replication:
Loop Design
1
5
2
4
3
We create:
• conditions = design matrix
• arrays
= random spot effect matrix, belonging to Si
• bio.reps = random biological effect matrix, belonging to ²ik
20
... keeping track of the model indices!
The experiment involves only technical replicates, i.e.:
y1
y2
y3
y4
y5
...
=
=
=
=
=
=
θ1 + S1 + ²1 + η1
θ2 + S1 + ²2 + η2
θ2 + S2 + ²2 + η3
θ3 + S2 + ²3 + η4
θ3 + S3 + ²3 + η5
...
where
• Si the spot effect for array i and
• ²k is the kth biological replicate in the exp’t.
Therefore:
• conditions = (1, 2, 2, 3, 3, 4, 4, 5, 5, 1)
• arrays
= (1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
• bio.reps = (1, 2, 2, 3, 3, 4, 4, 5, 5, 1)
21
THEORY: mixed effect models
The mixed effect models can be written as:
y = Xθ + Zbio²1 + Zarray²2 + η,
where
• y: spot values from the microarray
• X: the (fixed effect) design matrix
• θ: gene expression effects (of interest)
• Z: the random effect design matrix
• ² random effects (of no interest)
22
Analysis of the data in R
For a particular gene, the ten channel expression values are given as:
y=(0.62, 1.16, 1.51, 3.05, 2.78, 3.61, 3.61, 4.93, 5.12, 0.62)
Loop Design
1
5
2
4
3
In R we use the library nlme with the commands:
> ex1<-lme(y∼conditions, data=dat,
random = list(grp = pdBlocked(list(pdIdent(∼ -1 + bioreps),
pdIdent(∼ -1 + arrays)))))
23
Results
Linear mixed-effects model fit by REML
Data: dat
AIC
BIC
logLik
19.02537 15.90087 -1.512684
Random effects:
Composite Structure: Blocked
Block 1: bioreps1, bioreps2, bioreps3, bioreps4, bioreps5
Formula: ~-1 + bioreps | grp
Structure: Multiple of an Identity
bioreps1 bioreps2 bioreps3 bioreps4 bioreps5
StdDev: 0.4363592 0.4363592 0.4363592 0.4363592 0.4363592
Block 2: arrays1, arrays2, arrays3, arrays4, arrays5
Formula: ~-1 + arrays | grp
Structure: Multiple of an Identity
arrays1
arrays2
arrays3
arrays4
arrays5
StdDev:
0.00000502 0.00000502 0.00000502 0.00000502 0.00000502
Residual
StdDev: 0.2315463
Fixed effects: y ~ conditions
Value Std.Error DF t-value p-value
(Intercept) 0.934733 0.4660646 5 2.005587 0.1012
conditions2 1.144075 0.6591148 5 1.735775 0.1431
conditions3 2.112589 0.6591148 5 3.205191 0.0239
conditions4 3.307854 0.6591148 5 5.018631 0.0040
conditions5 4.342635 0.6591148 5 6.588586 0.0012
24
And what does that mean?
Fixed effects: y ~ conditions
Value Std.Error DF t-value p-value
(Intercept) 0.934733 0.4660646 5 2.005587 0.1012
conditions2 1.144075 0.6591148 5 1.735775 0.1431
conditions3 2.112589 0.6591148 5 3.205191 0.0239
conditions4 3.307854 0.6591148 5 5.018631 0.0040
conditions5 4.342635 0.6591148 5 6.588586 0.0012
For this one gene, it means that at a 5% significance cut-off:
• condition 2 is not obviously different from condition 1;
• conditions 3–5 are all significantly different from condition 1.
25
And if you look at many genes simultaneously...?
P-values from 10,000 genes for comparison between condition 1 and 2:
Histogram of pval
Histogram of pval.DE
600
400
Frequency
600
=
400
Frequency
400
300
+
200
0
0
0
100
200
200
Frequency
800
500
1000
800
600
1200
Histogram of pval.NDE
0.0
0.2
0.4
0.6
0.8
1.0
pval.NDE
Not DE p-values
0.0
0.2
0.4
0.6
0.8
pval.DE
DE p-values
0.0
0.2
0.4
0.6
0.8
pval
All p-values
By looking at the right-hand figure, you can deduce:
• total number of (non-)differentially expressed genes
• False Discovery Rate at each arbitrary cut-off
26
1.0
Conclusions
We have met several microarray design issues and concluded:
1. There are several design principles to obey when considering sources
of variation of a microarray experiment.
2. carefully assigning samples to conditions can improve estimation:
• optimal designs are important in order to reduce cost/time
• interwoven loop designs are a good compromise
• pooling can be practically helpful.
3. Mixed effect models (an extension of ANOVA) are excellent tools for
analyzing microarray data.
27
Download