Statistical Issues in the Design of Microarray Experiment
Jean Yee Hwa Yang
University of California, San Francisco
http://www.biostat.ucsf.edu/jean/
Stat 246 Lecture 18, Mar 30, 2004

Different Technologies
• SAGE
• Nylon membrane
• Illumina Bead Array
• GeneChip Affymetrix
• cDNA microarray
• Agilent: Long oligo Ink Jet
• CGH
From W. Gibbs, Scientific American, 2001

Life Cycle
[Flow chart: Biological question; Experimental design; Microarray experiment; Image analysis; Quality Measurement (Pass / Failed); Preprocessing; Normalization; Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation.]

Experimental design
Proper experimental design is needed to ensure that questions of interest can be answered and that this can be done accurately, given experimental constraints, such as cost of reagents and availability of mRNA.

Definition of probe and target
[Figure labels: TARGET: cDNA “A” (Cy5 labeled) and cDNA “B” (Cy3 labeled); PROBE.]

Two main aspects of array design
• Design of the array
• Allocation of mRNA samples to the slides
[Figure, design of the array: Arrayed Library (96 or 384-well plates of bacterial glycerol stocks); Spot as microarray on glass slides.]
[Figure, allocation of samples: cDNA “A” Cy5 labeled and cDNA “B” Cy3 labeled (WT, MT); Hybridization.]

Some aspects of design
1. Layout of the array
• Which sequence to print?
  - Affymetrix: selecting the PM and MM's.
  - cDNA Library: Riken, NIA etc.
  - Selecting oligo probes:
    - Operon (commercial)
    - Agilent (commercial)
    - Illumina (commercial)
    - OligoPicker (Wang & Seed, Harvard)
    - ArrayOligoSelector (Bozdech et al., DeRisi Lab, UCSF)
    - OligoArray 1.0 (Rouillard, Herbert and Zuker)
• Controls: Normalization and quality checks.
• Number: Duplicate or replicate spots within a slide; position.

Commonly asked questions
• Should we put duplicate spots on a slide? [Discussion in Smyth et al. (2003)]
• What should be the percentage of control spots?
• Where should the control spots be placed?
[These relate to preprocessing, such as quality assessment and normalization.]

External controls
[Figures from van de Peppel et al., 2003]

References
• Microarray sample pool.
  - Yang et al. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, Vol. 30, No. 4, e15.
• Titrated external controls.
  - J. van de Peppel et al. (2003). Monitoring global mRNA changes with externally controlled microarray experiments. EMBO Reports 4: 387-393.
• An example of doping controls for microarray.
  - http://genome-www.stanford.edu/turnover/supplement.shtml
• Commercial
  - Lucidea Universal ScoreCard by Amersham. http://www1.amershambiosciences.com/

Some aspects of design
2. Allocation of samples to the slides
A. Types of Samples
  - Replication: technical, biological.
  - Pooled vs individual samples.
  - Pooled vs amplified samples.
  This relates to both Affymetrix and two-color spotted arrays.
B. Different design layout
  - Scientific aim of the experiment.
  - Robustness.
  - Extensibility.
  - Efficiency.
  Taking physical limitations or cost into consideration:
  - the number of slides.
  - the amount of material.

Avoidance of bias
• Conditions of an experiment (mRNA extraction and processing, the reagents, the operators, the scanners and so on) can leave a “global signature” in the resulting expression data.
• Randomization.
• Local control is the general term used for arranging experimental material.
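As a rough illustration of the randomization bullet, the sketch below allocates samples to processing batches (for example days or operators) at random, separately within each condition, so that a batch "signature" is not systematically aligned with the comparison of interest. The function name, the sample labels and the number of batches are hypothetical and not taken from the lecture.

```python
import random


def randomize_to_batches(samples_by_condition, n_batches, seed=0):
    """Randomly assign samples to batches, balancing batches within each condition.

    samples_by_condition: dict mapping a condition label to a list of sample ids
    (hypothetical labels). Returns a dict mapping sample id -> batch index.
    """
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    allocation = {}
    for condition, samples in samples_by_condition.items():
        order = list(samples)
        rng.shuffle(order)  # random order within the condition
        for i, sample in enumerate(order):
            allocation[sample] = i % n_batches  # deal out round-robin
    return allocation


# Hypothetical example: four wild-type and four mutant samples, two batches.
samples = {
    "WT": ["wt1", "wt2", "wt3", "wt4"],
    "MT": ["mt1", "mt2", "mt3", "mt4"],
}
allocation = randomize_to_batches(samples, n_batches=2, seed=42)
```

Dealing the shuffled samples out round-robin within each condition is one simple way to combine randomization with a form of local control; other allocation schemes are equally possible.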
Preparing mRNA samples:
Mouse model → Dissection of tissue → RNA Isolation → Amplification → Probe labelling → Hybridization
[Figure builds mark Biological Replicates and Technical replicates on this pipeline.]

Technical replication - amplification
Olfactory bulb experiment:
• 3 sets of Anterior vs Dorsal performed on different days
• #10 and #12 were from the same RNA isolation and amplification
• #12 and #18 were from different dissections and amplifications
• All 3 data sets were labeled separately before hybridization
Data provided by Dave Lin (Cornell)

Pooling: looking at very small amounts of tissue
Mouse model → Dissection of tissue → RNA Isolation → Pooling → Probe labelling → Hybridization

Pooled vs Individual samples
[Figure: Design 1 and Design 2. Taken from Kendziorski et al. (2003)]

Pooled vs Individual samples
• Pooling is seen as “biological averaging”.
• Trade-off between
  - Cost of performing a hybridization.
  - Cost of the mRNA samples.
• If cost of mRNA samples << cost per hybridization, pooling can help reduce the number of hybridizations.

Pooled vs amplified samples
[Figure: Design A and Design B, each combining amplification and pooling steps between the original samples and the amplified samples.]

Pooled vs Amplified samples
• When there is not enough material from one biological sample to perform one array (chip) hybridization, pooling or amplification is necessary.
• Amplification
  - Introduces more noise.
  - Non-linear amplification (??): different genes amplified at different rates.
  - Able to perform more hybridizations.
• Pooling
  - Fewer replicate hybridizations.

References
• Pooling vs Non-Pooling
  - Han, E.-S., Wu, Y., Bolstad, B., and Speed, T. P.
(2003). A study of the effects of pooling on gene expression estimates using high density oligonucleotide array data. Department of Biological Science, University of Tulsa, February 2003.
  - Kendziorski, C.M., Y. Zhang, H. Lan, and A.D. Attie (2003). The efficiency of mRNA pooling in microarray experiments. Biostatistics 4, 465-477. 7/2003
  - Xuejun Peng, Constance L Wood, Eric M Blalock, Kuey Chu Chen, Philip W Landfield, Arnold J Stromberg (2003). Statistical implications of pooling RNA samples for microarray experiments. BMC Bioinformatics 4:26. 6/2003

Some aspects of design
2. Allocation of samples to the slides
A. Types of Samples
  - Replication: technical, biological.
  - Pooled vs individual samples.
  - Pooled vs amplified samples.
B. Different design layout
  - Scientific aim of the experiment.
  - Robustness.
  - Extensibility.
  - Efficiency.
  Taking physical limitation or cost into consideration:
  - the number of slides.
  - the amount of material.

Graphical representation
• Vertices: mRNA samples
• Edges: hybridization
• Direction: dye assignment
[Figure: an arrow joining a Cy3 sample and a Cy5 sample.]

Graphical representation
• The structure of the graph determines which effects can be estimated and the precision of the estimates.
  - Two mRNA samples can be compared only if there is a path joining the corresponding two vertices.
  - The precision of the estimated contrast then depends on the number of paths joining the two vertices and is inversely related to the length of the paths.
• Direct comparisons within slides yield more precise estimates than indirect ones between slides.

Common reference design
[Figure: T1, T2, ..., Tn-1, Tn, each hybridized with Ref.]
• Experiment for which the common reference design is appropriate:
  - Meaningful biological control (C): identify genes that responded differently / similarly across two or more treatments relative to control.
  - Large scale comparison: to discover tumor subtypes when you have many different tumor samples.
• Advantages:
  - Ease of interpretation.
  - Extensibility: extend the current study, or compare the results from the current study to other array projects.

Experiment for which a number of designs are suitable for use
Time Series
[Figure: alternative designs for time points T1, T2, T3, T4, one of them using a reference sample Ref.]

Experiment for which a number of designs are suitable for use
4 samples
[Figure: alternative designs for samples C, A, B, A.B.]

Comparing 2 classes of estimates: direct vs indirect estimates

The simplest design question: Direct versus indirect comparisons
Two samples (A vs B), e.g. KO vs. WT or mutant vs. WT
• Direct: A and B hybridized together. Estimate: average(log(A/B)), with variance $\sigma^2/2$.
• Indirect: A and B each hybridized with a reference R. Estimate: log(A/R) - log(B/R), with variance $2\sigma^2$.
These calculations assume independence of replicates: the reality is not so simple.

In practice: data
[Figure: Wt, Wt, Mutant, Mutant samples; Method A and Method B.]
The analyses can be done in two ways:
A. Exclude the 2 wt vs. wt arrays (one-sample)
B. Include the 2 wt vs. wt arrays (two-samples)
Data provided by Grant Hartzog (UCSC)

Experimental results
• 5 sets of experiments with similar structure.
• Compare (Y axis):
  A) SE for aveMmt
  B) SE for aveMmt - aveMwt
• Theoretical ratio of (A / B) is 1.6
• Experimental observation is 1.1 to 1.4.
[Plot: SE against sample size.]
Data provided by Matt Callow

Caveat
• The advantage of direct over indirect comparisons was first pointed out by Churchill & Kerr, and in general, we agree with the conclusion. However, you can see in the last MA-plot that the difference is not a factor of 2, as theory predicts.
• Why? Possibly because mRNA from the same extractions (and pools of control or reference material are the norm) gives correlated expression levels. In other words, the assumption of independence between log(T/Ref) and log(C/Ref) is not valid.

Extreme technical replication - labeling
• 3 sets of self-self hybridization (cerebellum vs cerebellum)
• Data 1 and Data 2 were labeled together and hybridized on two slides separately.
• Data 3 was labeled separately.
• Comparing log-ratios between the 3 experiments
[Plots of log-ratios; axis label: Data 1.]
Data provided by Elva Diaz (UC Davis)

Pairs plot of log-intensity
Looking only at constantly expressed genes
[Pairs plot of A, A’, B, B’, where a = log2 A and b = log2 B; Technical replicates.]
Data provided by Grant Hartzog and Todd Burcin from UCSC

a = log2 A and b = log2 B; a’ = log2 A’ and b’ = log2 B’
• $\tau^2 = \mathrm{var}(a)$: variance of log signal.
• $\gamma_1 = \mathrm{cov}(a, b)$: covariance between measurements on samples on the same slide.
• $\gamma_2 = \mathrm{cov}(a, a')$: covariance between measurements on technical replicates from different slides.
• $\gamma_3 = \mathrm{cov}(a, b')$: covariance between measurements on samples which are not technical replicates and not on the same slide.

Implication for design
[Figure: pairs of slides A/B, B/A, A/C, D/C.]
• $\sigma^2 = \mathrm{var}(a - b) = 2(\tau^2 - \gamma_1)$
• $c_0 = \mathrm{cov}(a - b,\; c - d) = 0$
• $c_1 = \mathrm{cov}(a - b,\; a' - c) = \gamma_2 - \gamma_3$ (the two slides share one sample, A)
• $c_2 = \mathrm{cov}(a - b,\; a' - b') = 2(\gamma_2 - \gamma_3) = 2c_1$

Direct vs Indirect - revisited
Two samples (A vs B), e.g. KO vs. WT or mutant vs. WT
• Direct: $y = (a - b) + (a' - b')$, with $\mathrm{Var}(y/2) = \sigma^2/2 + c_1$
• Indirect (via R): $y = (a - r) - (b - r')$, with $\mathrm{Var}(y) = 2\sigma^2 - 2c_1$
• If $\sigma^2 = 2c_1$: efficiency ratio (Indirect / Direct) = 1
• If $c_1 = 0$: efficiency ratio (Indirect / Direct) = 4

Summary
• Create highly correlated reference samples to overcome inefficiency in common reference design.
• Not advocating the use of technical replicates in place of biological replicates for samples of interest.

Linear Models

Gene expression data
Gene expression data on G genes for n hybridizations.
Gene by array data matrix (rows: genes; columns: mRNA samples)

Genes   sample1  sample2  sample3  sample4  sample5  ...
1        0.46     0.30     0.80     1.51     0.90    ...
2       -0.10     0.49     0.24     0.06     0.46    ...
3        0.15     0.74     0.04     0.10     0.20    ...
4       -0.45    -1.03    -0.79    -0.56    -0.32    ...
5       -0.06     1.06     1.35     1.09    -1.09    ...
Gene expression level of gene i in mRNA sample j:
• M = log(Red intensity / Green intensity), or
• Function(PM, MM) using MAS, dChip or RMA

Combining data across arrays
How can we design experiments and combine data across slides to provide accurate estimates of the effects of interest?
Linear models and extensions thereof can be used to effectively combine data across arrays for complex experimental designs.
[Figure: samples C, A, B, AB; Experimental design; Regression analysis.]

Comparing K treatments
(i) Indirect design: A, B, C each hybridized with R
(ii) Direct design: A, B, C hybridized with each other
Question: Which design gives the most precise estimates of the contrasts A-B, A-C, and B-C?

Common reference design
[Figure: A, B, C each hybridized with R.]
Log ratios: y
Model:
  E(y1) = A - R
  E(y2) = B - R
  E(y3) = C - R
Question 1: Estimate the log-ratios between sample A and B (the parameter is a - b).
  Calculate y1 - y2 for every gene.
Question 2: Which genes are differentially expressed in at least one of the samples relative to R?
  Calculate the F-statistic for every gene.

Linear model analysis
Gene by gene analysis
[Figure: direct design joining A, B, C.]
Log ratios: y
Parameters: $\beta = (c - b,\; b - a)'$, where a = log2 A, b = log2 B and c = log2 C
Model: E(y1) = c - b, E(y2) = b - a, E(y3) = a - c, that is

$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = X\beta = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{pmatrix} \begin{pmatrix} c - b \\ b - a \end{pmatrix}$$

$$\hat{\beta} = (X'X)^{-1} X'y, \qquad \mathrm{Cov}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \sigma^2 I_3, \qquad \mathrm{Cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$$

[Table: designs for samples A, P, L]
• I (a) Common reference: number of slides N = 3; ave. variance 2
• I (b) Common reference: number of slides N = 6
• II Direct comparison: number of slides N = 3; ave. variance 0.67
For k = 3, efficiency ratio (Design I(a) / Design II) = 3
In general, efficiency ratio = 2k / (k-1)

Experimental design
• Efficiency can be measured in terms of different quantities
  - number of slides or hybridizations;
  - units of biological material, e.g. amount of mRNA for one channel.

[Table: designs for samples A, P, L]
• I (a) Common reference: number of slides N = 3; ave. variance 2; units of material A=B=C=1
• I (b) Common reference: number of slides N = 6
• I (b) Common reference, by units of material: A=B=C=2; ave. variance 1
• II Direct comparison: number of slides N = 3; ave. variance 0.67; units of material A=B=C=2; ave. variance 0.67
For k = 3, efficiency ratio (Design I(b) / Design II) = 1.5
In general, efficiency ratio = k / (k-1)

2 x 2 factorial experiment: two factors, two levels each
Study the joint effect of two treatments (e.g. drugs), A and B, say, on the gene expression response of tumor cells. There are four possible treatment combinations:
• AB: both treatments are administered;
• A: only treatment A is administered;
• B: only treatment B is administered;
• C: cells are untreated.
[Figure: samples C, A, B, AB; n=12.]

2 x 2 factorial experiment: two factors, two levels each
Example 2
Our interest is in comparing two strains of mice (mutant and wild-type) at two different times, postnatal and adult.
4 possible samples:
  - WT.P1: WT at postnatal
  - WT.P21: WT at adult (effect of time only): +a
  - MT.P1: MT at postnatal (effect of the mutation only): +b
  - MT.P21: MT at adult (effect of both time and the mutation): +a+b+a.b
[Figure: the four samples joined by hybridizations numbered 1 to 4.]

2 x 2 factorial experiment: two factors, two levels each
Another possible design: Use C as a common reference
  y1 = log(A / C) = a + error
  y2 = log(B / C) = b + error
  y3 = log(AB / C) = a + b + ab + error
Estimate (ab) with y3 - y2 - y1.
[Figure: A, B and A.B each hybridized with C.]

2 x 2 factorial experiment
For each gene, consider a linear model for the joint effect of treatments A and B on the expression response.
• C: $\mu$
• A: $\mu + \alpha$
• B: $\mu + \beta$
• AB: $\mu + \alpha + \beta + \gamma$
$\mu$: baseline effect; $\alpha$: treatment A main effect; $\beta$: treatment B main effect; $\gamma$: interaction between treatments A and B.
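For the common reference version of the 2 x 2 factorial (y1, y2, y3 all measured against C), the estimates and their variances can be sketched as below. This is a minimal illustration that assumes one slide per comparison and independent errors with a common variance, an assumption the lecture itself treats as only approximate. The names `ESTIMATORS`, `estimate` and `variance_in_sigma2_units` are hypothetical.

```python
from fractions import Fraction

# Weights that each estimate puts on the observed log-ratios (y1, y2, y3), where
# y1 = log(A/C), y2 = log(B/C), y3 = log(AB/C) come from three separate slides.
ESTIMATORS = {
    "a": (1, 0, 0),     # main effect of A: y1
    "b": (0, 1, 0),     # main effect of B: y2
    "ab": (-1, -1, 1),  # interaction: y3 - y2 - y1
}


def estimate(effect, y):
    """Apply the linear combination for one effect to the log-ratios y."""
    return sum(w * v for w, v in zip(ESTIMATORS[effect], y))


def variance_in_sigma2_units(effect):
    """Variance of the estimate divided by sigma^2, assuming independent
    slides with equal variance (an assumption of this sketch)."""
    return sum(w * w for w in ESTIMATORS[effect])


# Hypothetical log-ratios for one gene.
y = (Fraction(1, 2), Fraction(1, 5), Fraction(1))
interaction = estimate("ab", y)  # 1 - 1/5 - 1/2
```

Under these assumptions the interaction estimate combines three slides, so its variance is three times that of a main effect, which is the price of estimating the interaction only indirectly.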
2 x 2 factorial experiment
[Figure: designs joining samples C, A, B, AB.]
C: $\mu$; A: $\mu + \alpha$; B: $\mu + \beta$; AB: $\mu + \alpha + \beta + \gamma$
• Log-ratio M for hybridization A → AB estimates AB - A = $\beta + \gamma$
• Log-ratio M for hybridization A → B estimates B - A = $\beta - \alpha$

Regression analysis
• For parameters $\theta = (\alpha, \beta, \gamma)'$, define a design matrix X so that $E(M) = X\theta$.
• For each gene, estimate $\theta$ by robust regression, least squares or generalized least squares.
Rows of X for the 12 log-ratios $M_{k1}, M_{k2}$, k = 1, ..., 6 (entries shown in absolute value; the sign of each entry depends on the direction of the hybridization, e.g. B - A = $\beta - \alpha$ gives (-1, 1, 0)):
  M11, M12: (0, 1, 0)
  M21, M22: (1, 1, 1)
  M31, M32: (1, 0, 0)
  M41, M42: (0, 1, 1)
  M51, M52: (1, 0, 1)
  M61, M62: (1, 1, 0)
$$\hat{\theta} = (X'X)^{-1} X'M$$

Experimental design
• In addition to experimental constraints, design decisions should be guided by the knowledge of which effects are of greater interest to the investigator, e.g. which main effects, which interactions.
• The experimenter should thus decide on the comparisons for which the most precision is wanted, and these should be made within slides to the extent possible.

2 x 2 factorial
Designs: I) Indirect; II), III), IV) A balance of direct and indirect
[Figures: each design joins samples C, A, B, A.B.]

                   I)     II)    III)   IV)
# Slides           N=6 (all designs)
Main effect A      0.5    0.67   0.5    NA
Main effect B      0.5    0.43   0.5    0.3
Interaction A.B    1.5    0.67   1      0.67
Table entry: variance
Ref: Glonek & Solomon (2002)

References
- T. P. Speed and Y. H. Yang (2002). Direct versus indirect designs for cDNA microarray experiments. Sankhya: The Indian Journal of Statistics, Vol. 64, Series A, Pt. 3, pp 706-720.
- Y. H. Yang and T. P. Speed (2003). Design and analysis of comparative microarray experiments. In T. P. Speed (ed.), Statistical analysis of gene expression microarray data, Chapman & Hall.
- R. Simon, M. D. Radmacher and K. Dobbin (2002). Design of studies using DNA microarrays. Genetic Epidemiology 23:21-36.
- F. Bretz, J. Landgrebe and E.
Brunner (2003). Efficient design and analysis of two color factorial microarray experiments. Biostatistics.
- G. Churchill (2003). Fundamentals of experimental design for cDNA microarrays. Nature genetics review 32:490-495.
- G. Smyth, J. Michaud and H. Scott (2003). Use of within-array replicate spots for assessing differential expression in microarray experiments. Technical Report, WEHI.
- Glonek, G. F. V., and Solomon, P. J. (2002). Factorial and time course designs for cDNA microarray experiments. Technical Report, Department of Applied Mathematics, University of Adelaide. 10/2002

Acknowledgments
Statistics: Gordon Smyth (WEHI), Natalie Thorne (WEHI), Terry Speed (Berkeley), Sandrine Dudoit (Berkeley), Yuanyuan Xiao (UCSF), Mark Segal (UCSF), Ru Fang Yeh (UCSF)
Biology: David Erle (UCSF), Grant Hartzog (UCSC), Todd Burcin (UCSC), Dave Lin (Cornell), John Ngai (Berkeley), Elva Diaz (UC Davis), Matt Callow (LBL)

Single factor experiment - time course
[Figure: time points T1 to T7 and Ref. Edge types: Pooled reference; Compare to T1; t vs t+1; t vs t+2; t vs t+3.]
Possible designs:
1) All samples vs common pooled reference
2) All samples vs time 0
3) Direct hybridization between times.

Design choices in time series
Columns: t vs t+1 (T1T2, T2T3, T3T4); t vs t+2 (T1T3, T2T4); T1T4; Ave.

Design                              T1T2  T2T3  T3T4  T1T3  T2T4  T1T4  Ave
N=3
A) T1 as common reference           1     2     2     1     2     1     1.5
B) Direct hybridization             1     1     1     2     2     3     1.67
N=4
C) Common reference                 2     2     2     2     2     2     2
D) T1 as common ref + more          .67   .67   1.67  .67   1.67  1     1.06
E) Direct hybridization choice 1    .75   .75   .75   1     1     .75   .83
F) Direct hybridization choice 2    1     .75   1     .75   .75   .75   .83
[Figures: each design drawn on T1, T2, T3, T4, with Ref for design C.]

Linear model analysis
[Figure: design joining samples A, L, P.]
Log ratios: y
Parameters: $\beta = (a - p,\; l - p)'$, where a = log2 A, p = log2 P and l = log2 L
Model (with the third slide oriented so that E(y3) = a - l; the signs below follow from that orientation):

$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = X\beta = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} a - p \\ l - p \end{pmatrix}$$

$$\mathrm{Cov}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \Sigma = \begin{pmatrix} \sigma^2 & c_1 & c_1 \\ c_1 & \sigma^2 & -c_1 \\ c_1 & -c_1 & \sigma^2 \end{pmatrix}$$

$$\hat{\beta} = (X'\Sigma^{-1}X)^{-1} X'\Sigma^{-1} y, \qquad \mathrm{Cov}(\hat{\beta}) = (X'\Sigma^{-1}X)^{-1}$$

In a more general framework
[Figure: designs for samples A, P, L, with variances expressed through w.]
$$w = \frac{c_1}{\sigma^2} = \frac{\gamma_2 - \gamma_3}{2(\tau^2 - \gamma_1)}$$

In a more general framework
[Figure: Designs I, II, III, IV.]
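The generalized least squares formula used in the linear model analysis, Cov(β̂) = (X'Σ⁻¹X)⁻¹, can be checked numerically for small designs. The sketch below is one possible way to do this with exact rational arithmetic from the Python standard library; the helper names (`transpose`, `matmul`, `inverse`, `gls_cov`) and the example value w = 1/4 are assumptions made for illustration, with σ² set to 1.

```python
from fractions import Fraction


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def inverse(m):
    """Invert a small square matrix exactly by Gauss-Jordan elimination."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def gls_cov(X, Sigma):
    """Cov(beta_hat) = (X' Sigma^-1 X)^-1 for design X and Cov(y) = Sigma."""
    Xt = transpose(X)
    return inverse(matmul(matmul(Xt, inverse(Sigma)), X))


# Two slides of A vs B with technical replicates: sigma^2 = 1 and c1 = w = 1/4,
# so the two log-ratios have covariance c2 = 2 * c1.
w = Fraction(1, 4)
direct = gls_cov([[1], [1]], [[1, 2 * w], [2 * w, 1]])

# Three samples A, P, L compared directly on three independent slides (Sigma = I).
loop = gls_cov([[1, 0], [0, 1], [1, -1]],
               [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
```

With these inputs, `direct` agrees with Var(y/2) = σ²/2 + c1 from the direct vs indirect comparison, and the diagonal of `loop` agrees with the average variance of 0.67 quoted for the direct design with three samples.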