Microarray experimental design Professor Ernst Wit University of Groningen

Microarray experimental design Professor Ernst Wit University of Groningen e.c.wit@rug.nl http://www.math.rug.nl/∼ernst 14 October 2008 Why DOING a microarray experiment? Obvious, isn’t it? To discover • “genes that are differentially expressed between tissues.” • “a typical genomic profile of a particular disease.” • “differences between phenotypic similar pathologies.” • ... ... i.e., to find some differences. 1 Why DESIGNING a microarray experiment? Yeah, why? We want to be sure that the the differences we find are • due to biological differences (e.g. different tissue, different diseases); • not due to other experimental conditions (e.g. microarray batch); • not due to chance variation. Our exp’t should maximize the difference between biological conditions. 2 4 known design principles 1. Vary the conditions of interest, with replicates that are • as many as possible; • biological rather than technical replicates; • as evenly distributed across the conditions of interest. 2. Restriction: Keep conditions not of interest constant. 3. Blocking: Keep track of uninteresting conditions that do vary. 4. Matching: Channels are similar units for effective comparisons. 3 Optimizing assignment of replicates to conditions We shall consider the following three questions: 1. Pooling experimental material. 2. Dual-channel microarray designs: (a) loop designs vs reference designs? (b) dye-swap designs? 3. How to detect differentially expressed genes? 4 A statistical toy model for gene expression log ygcbaxdpkr = θgc + Bb + Aba + La(x) + Dad(θgc) + Pp + Sag + ²gck + ηgckr . where • θgc = true expression for gene g under condition c • Bb = RNA Batch effect (experimenter, time of day, temperature) • Aba = array effect (scanning level, pre/post-washing) • La(x) = location effect (chip, coverslip, washing) • Dad (θgc ) = dye effect (dye, unequal mixing of mixtures, labelling, intensity) • Pp = print pin effect • Sag = spot effect (amount of DNA in the spot printed on slide) • ²gck = biological variation for individual k • ηgckr = within-replicate-k variation of a technical replicate r 5 Normalization The proposed model: log ygcbaxdpkr = θgc + Bb + Aba + La(x) + Dad(θgc) + Pp + Sag + ²gck + ηgckr . It is hoped that the structural artifacts, Bb, Aba, La(x), Dad(θgc), Pp can be eliminated by means of “normalization”. This leaves us a model for gene g at condition c on array a for the rth technical replicate of individual k, 0 log ygcakr = θgc + Sag + ²gck + ηgcakr , 0 where ηgcakr includes non-structural artifacts from nuisance effects. 6 Rephrasing our three questions From the model: Cy3: Cy5: log xikr = θi + S + ²ik + ηikr log xjkr = θj + S + ²jk + ηjkr , We define 3 microarray design problems: 1. How to deal with biological variation ²ik ? 2. How to estimate differential expression δij = θi − θj ? 3. How to deal with bias S? • Simple answer:“taking log-ratios” • Slightly less simple: “insert a random effect” (see analysis) 7 Pooling reduces the amount of biological variation ση2 n, σ²2 nk 0.020 0.015 0.010 0.005 V(bio)/V(tech) = 1, C(array)/C(sample) = 7 V(bio)/V(tech) = 2, C(array)/C(sample) = 7 V(bio)/V(tech) = 1, C(array)/C(sample) = 4 0.000 Variance of difference estimate Minimize error V (δ̂) = + subject to budget B = n k Cs + n Ca, • Cs/2 = costs individual sample extraction • Ca = cost of a measurement (microarray, dyes, etc) 2 4 6 8 10 number of samples in pool (k) 8 Optimal design in 2-channel microarrays Question: “Which design is more efficient in estimating the contrasts?” Reference Design Loop Design 1 1 5 2 5 2 R 4 3 4 3 Note: each arrow stands for 1 microarray (green dye −→ red dye). 9 Some results from the literature ... These are optimal even designs for all the contrasts when the number of arrays is two more than the number of conditions. (Kerr and Churchill 2001; exhaustive search) The problem with these and similar results • not of practical interest. • no indication how this result can be extended. 10 Use od: optimal (interwoven loop) designs in R The function od(nt, ns, optimality = ..., method = "loop") finds • L-optimal/D-optimal • interwoven loop/all designs • for any number of conditions • and any number of slides Can be obtained from smida library at: http://www.maths.lancs.ac.uk/∼wite/book Some conclusions... 11 Loop designs are often optimal... 8 4 3 6 1 3 2 1 5 7 7 5 2 Stochastic Search of A−Optimal Design for Contrasts with Simulated Annealing. Acceptance rate = 36.4 %. Score := Tr[ Inv(X’ X) ] = 28 . Nr of iterations = 1000 . Nr of arrays = 7 . Nr of conditions = 7 4 6 Stochastic Search of A−Optimal Design for Contrasts with Simulated Annealing. Acceptance rate = 37.1 %. Score := Tr[ Inv(X’ X) ] = 42 . Nr of iterations = 1000 . Nr of arrays = 8 . Nr of conditions = 8 L-optimal designs for 7/8 conditions with 7/8 slides, respectively. 12 ...but not always 7 6 9 3 4 8 5 1 2 Stochastic Search of A−Optimal Design for Contrasts with Simulated Annealing. Acceptance rate = 36.3 %. Score := Tr[ Inv(X’ X) ] = 57.5 . Nr of iterations = 1000 . Nr of arrays = 9 . Nr of conditions = 9 L-optimal design with 9 conditions and 9 microarrays. 13 Algorithm can be slow for many slides and conditions 1 15 2 3 14 13 4 5 12 6 11 10 7 9 8 Trace Score for Contrasts: Tr[ Inv(X’X) ] = 35.8007 . No of arrays = 45 . No of conditions = 15 The algorithm requires a large number of iterations when • number of arrays grows ⇒ linear impact • number of conditions grows ⇒ quadratic impact 14 A Compromise: interwoven loop-designs 1 15 2 3 14 13 4 5 12 6 11 10 7 9 8 An interwoven loop design is defined by: • number of conditions (HERE: 15) • number of loops (HERE: 3) • jump size for each loop (HERE: 1, 4, 6) 15 Advantages of an Interwoven Loop design 27 28 29 23 14 30 1 2 3 28 19 12 9 10 3 20 15 22 1 18 24 8 23 9 10 22 21 5 8 4 7 21 13 30 29 6 25 26 7 5 26 11 6 2 4 27 16 11 20 12 19 24 17 25 13 18 17 16 15 14 • Easy to implement — Compare left with right! • Highly efficient — Left is only 0.6% more efficient than Right • Automatic “dye balance” — conditions measured equally often with Cy3/Cy5 16 Dye Swap or Dye Balance? The simple model needs to be expanded for a possible dye effect: log y1kr = θ1 + S + + ²1k + η1kr log y2kr = θ2 + S + D + ²2k + η2kr , Dye swap designs have been proposed as a way to control the dye effect. HOWEVER, interwoven loop designs are more efficient! 5 6 7 8 cond. cond. cond. cond. 10 12 14 16 arrays arrays arrays arrays D 64% 50% 37% 25% L 80% 75% 69% 62% Relative D- and L-efficiency of dye-swap versus interwoven loop. 17 Estimation of effects of interest A nightmare becoming reality... So, you have let yourself be convinced by a statistician to do a more complicated microarray design. It is “optimal”, alright, but how can you estimate the effects of interest? Aim: a step by step guide to analyze the data. 1. Create a variable with fixed effects of interests 2. Create one or more variables with random nuisance effects. 19 Example: analyze data with mixed effect model We observe y=(0.62, 1.16, 1.51, 3.05, 2.78, 3.61, 3.61, 4.93, 5.12, 0.62), using the loop design without biological replication: Loop Design 1 5 2 4 3 We create: • conditions = design matrix • arrays = random spot effect matrix, belonging to Si • bio.reps = random biological effect matrix, belonging to ²ik 20 ... keeping track of the model indices! The experiment involves only technical replicates, i.e.: y1 y2 y3 y4 y5 ... = = = = = = θ1 + S1 + ²1 + η1 θ2 + S1 + ²2 + η2 θ2 + S2 + ²2 + η3 θ3 + S2 + ²3 + η4 θ3 + S3 + ²3 + η5 ... where • Si the spot effect for array i and • ²k is the kth biological replicate in the exp’t. Therefore: • conditions = (1, 2, 2, 3, 3, 4, 4, 5, 5, 1) • arrays = (1, 1, 2, 2, 3, 3, 4, 4, 5, 5) • bio.reps = (1, 2, 2, 3, 3, 4, 4, 5, 5, 1) 21 THEORY: mixed effect models The mixed effect models can be written as: y = Xθ + Zbio²1 + Zarray²2 + η, where • y: spot values from the microarray • X: the (fixed effect) design matrix • θ: gene expression effects (of interest) • Z: the random effect design matrix • ² random effects (of no interest) 22 Analysis of the data in R For a particular gene, the ten channel expression values are given as: y=(0.62, 1.16, 1.51, 3.05, 2.78, 3.61, 3.61, 4.93, 5.12, 0.62) Loop Design 1 5 2 4 3 In R we use the library nlme with the commands: > ex1<-lme(y∼conditions, data=dat, random = list(grp = pdBlocked(list(pdIdent(∼ -1 + bioreps), pdIdent(∼ -1 + arrays))))) 23 Results Linear mixed-effects model fit by REML Data: dat AIC BIC logLik 19.02537 15.90087 -1.512684 Random effects: Composite Structure: Blocked Block 1: bioreps1, bioreps2, bioreps3, bioreps4, bioreps5 Formula: ~-1 + bioreps | grp Structure: Multiple of an Identity bioreps1 bioreps2 bioreps3 bioreps4 bioreps5 StdDev: 0.4363592 0.4363592 0.4363592 0.4363592 0.4363592 Block 2: arrays1, arrays2, arrays3, arrays4, arrays5 Formula: ~-1 + arrays | grp Structure: Multiple of an Identity arrays1 arrays2 arrays3 arrays4 arrays5 StdDev: 0.00000502 0.00000502 0.00000502 0.00000502 0.00000502 Residual StdDev: 0.2315463 Fixed effects: y ~ conditions Value Std.Error DF t-value p-value (Intercept) 0.934733 0.4660646 5 2.005587 0.1012 conditions2 1.144075 0.6591148 5 1.735775 0.1431 conditions3 2.112589 0.6591148 5 3.205191 0.0239 conditions4 3.307854 0.6591148 5 5.018631 0.0040 conditions5 4.342635 0.6591148 5 6.588586 0.0012 24 And what does that mean? Fixed effects: y ~ conditions Value Std.Error DF t-value p-value (Intercept) 0.934733 0.4660646 5 2.005587 0.1012 conditions2 1.144075 0.6591148 5 1.735775 0.1431 conditions3 2.112589 0.6591148 5 3.205191 0.0239 conditions4 3.307854 0.6591148 5 5.018631 0.0040 conditions5 4.342635 0.6591148 5 6.588586 0.0012 For this one gene, it means that at a 5% significance cut-off: • condition 2 is not obviously different from condition 1; • conditions 3–5 are all significantly different from condition 1. 25 And if you look at many genes simultaneously...? P-values from 10,000 genes for comparison between condition 1 and 2: Histogram of pval Histogram of pval.DE 600 400 Frequency 600 = 400 Frequency 400 300 + 200 0 0 0 100 200 200 Frequency 800 500 1000 800 600 1200 Histogram of pval.NDE 0.0 0.2 0.4 0.6 0.8 1.0 pval.NDE Not DE p-values 0.0 0.2 0.4 0.6 0.8 pval.DE DE p-values 0.0 0.2 0.4 0.6 0.8 pval All p-values By looking at the right-hand figure, you can deduce: • total number of (non-)differentially expressed genes • False Discovery Rate at each arbitrary cut-off 26 1.0 Conclusions We have met several microarray design issues and concluded: 1. There are several design principles to obey when considering sources of variation of a microarray experiment. 2. carefully assigning samples to conditions can improve estimation: • optimal designs are important in order to reduce cost/time • interwoven loop designs are a good compromise • pooling can be practically helpful. 3. Mixed effect models (an extension of ANOVA) are excellent tools for analyzing microarray data. 27

Microarray experimental design Professor Ernst Wit University of Groningen

Related documents

Products

Support

Microarray experimental design Professor Ernst Wit University of Groningen

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib