Optimal Design for Gene Expression Microarrays

Wun-Yi Shu, Institute of Statistics, National Tsing Hua University
Yang-Chao Wang, Institute of Statistics, National Tsing Hua University

Abstract

We propose a statistical method for constructing an optimal or efficient design for gene expression microarrays using the optimal design theory of linear models. An essential aspect of experimental design is to optimize the efficiency with which contrasts of the unknown parameters are estimated from the observations generated by the design. We derive a selection criterion, the $\phi_0$-efficiency, to measure the goodness of any chosen design. The method works by obtaining the theoretical upper bound of the optimality criterion. In addition, we discuss the connection between an experimental design and its graphical representation, and develop a procedure to construct an efficient design when the number of varieties is large. Furthermore, we discuss the rationality of the model and the relationship between the optimality criteria of the log Ratio model we propose and of the ANOVA model (Kerr et al. (2000)).

1 Introduction

The development of microarray technology produces massive gene expression data sets. A major task for the experimentalist is to understand the structure in these huge data sets. Data generated by a scientific experiment always contain random noise, and the situation is at its worst in biology. Statistical methods must therefore be used to interpret large-scale experimental data accurately. In the last decade, most publications considering statistical problems in the context of microarray experiments focused on techniques of data analysis. Kerr et al. [6] were the first to bring design issues for microarray experiments to attention. Some of the key questions in microarray experiments are: Given limited resources (e.g., a limited number of slides), how can one gain as much information as possible? How does one achieve the goal of the experiment using as few slides as possible?
What is the effect of missing arrays on a design (see Bretz et al. (2003))? An appropriate design makes the estimation of the parameters of interest more precise and the statistical tests more sensitive in detecting significant cases.

Recently, many papers have been published that discuss how to model microarray data in order to describe gene expression levels, and that propose criteria for evaluating the design of an experiment. One of the most representative papers is Kerr et al. [6]. Their group, at the Jackson Laboratory, has a series of papers that discuss and apply this model (Kerr et al. [6, 7, 8], Churchill [1], Oleksiak et al. [10], Wu et al. [16]), which are vital reference materials for biologists interested in gene expression microarrays. Their main idea is to use the classical ANOVA model to describe the data and incomplete block design theory (Raghavarao [13]) to derive their optimality criterion. Nevertheless, they could not provide the theoretical upper bound of their optimality criterion score for an arbitrary number of varieties. Accordingly, the major tasks of our paper are to propose a log Ratio model, which is a more plausible model for microarray data, and to solve the difficult problem of the theoretical upper bound. As a result, if a particular design is given, one can know its efficiency compared with the (unknown) optimal design that uses the same number of slides.

In this paper, we start by establishing the log Ratio model, propose the optimality criterion, and derive the theoretical upper bound. Then we introduce the graphical representation of designs, provide computational results for commonly used optimal designs, and supply a method to construct efficient designs for large numbers of varieties. Moreover, we discuss the connection between the selection of the optimal design and the ultimate hypothesis test.
Finally, we end by discussing the similarities and dissimilarities between the log Ratio and classical ANOVA models, including the rationality of the model and the optimality criteria. Hence, for gene expression microarrays, we discuss all components, from the design before the experiment to the analysis after the experiment. This treatment is more thorough and settles several problems that urgently need to be tackled.

2.1 log Ratio model

Consider a two-color microarray experiment in which V varieties (designated as "treatments" by some authors and called "targets" in the context of hybridization) are compared using S arrays. Because we focus on the design structure, we first consider, for clarity, the case of a single replicate on each array (i.e. n, the number of observations for a particular gene, equals S). In fact, this is equivalent to discussing the design structure with equal replicates on each array (see section 5). Therefore, each spot on the array represents a particular gene. Consider a spot on the array where targets $k$ (labeled with Cy5) and $k'$ (labeled with Cy3) are mixed and hybridized. Associated with it are two quantities R and G, the normalization-corrected intensities of the red (Cy5) and green (Cy3) fluorescence. We assume that the intensities are proportional to the true expression levels of the corresponding gene g, represented by $\mu_{kg}$ and $\mu_{k'g}$ for varieties $k$ and $k'$ respectively. For simplicity, we drop the suffix g in the following discussion and model R and G as follows:

$$R = \alpha_R\,\mu_k\,2^{\varepsilon^{(R)}},\qquad G = \alpha_G\,\mu_{k'}\,2^{\varepsilon^{(G)}},\qquad k \ne k',$$

where $\alpha_R$ and $\alpha_G$ are the proportionality factors of the red and green dyes respectively, and $\varepsilon^{(R)}$ and $\varepsilon^{(G)}$ are random error terms. Then the logarithmic ratio is

$$Y = \log_2\frac{R}{G} = \log_2\frac{\alpha_R}{\alpha_G} + \log_2\mu_k - \log_2\mu_{k'} + \left(\varepsilon^{(R)} - \varepsilon^{(G)}\right) = \delta + \log_2\mu_k - \log_2\mu_{k'} + \varepsilon,$$

where $\varepsilon = \varepsilon^{(R)} - \varepsilon^{(G)}$ and the parameter $\delta = \log_2(\alpha_R/\alpha_G)$ represents the dye effect, i.e. the relative efficiency of the Cy3 and Cy5 dyes.
If we let $\bar\mu = (\mu_1\mu_2\cdots\mu_V)^{1/V}$ be the geometric average of the true expression levels $\mu_1, \mu_2, \dots, \mu_V$ of the V varieties, then

$$Y = \delta + \left(\log_2\mu_k - \log_2\bar\mu\right) - \left(\log_2\mu_{k'} - \log_2\bar\mu\right) + \varepsilon = \delta + \tau_k - \tau_{k'} + \varepsilon,$$

where the effects of interest, $\tau_k = \log_2\mu_k - \log_2\bar\mu$ for $k = 1, \dots, V$, reflect differences in expression for a particular combination of variety and gene that are not explained by the average effects of those varieties and genes. It is easy to see that $\sum_{k=1}^{V}\tau_k = 0$. Among them there are only V-1 independent parameters $\tau_1, \tau_2, \dots, \tau_{V-1}$, and $\tau_V = -(\tau_1 + \tau_2 + \cdots + \tau_{V-1})$. Furthermore, according to the observations, it is reasonable to assume that the errors $\varepsilon_{(g)}$ are independent and normally distributed with mean zero and gene-specific variance $\sigma^2_{(g)}$. This assumption not only captures the general character of microarray data but also avoids imposing a common error variance across genes, which makes the model more plausible.

For each gene, let $Y = (Y_1, \dots, Y_n)^t$ denote the vector whose components are the n normalization-corrected log-ratios obtained from the corresponding spots on the S arrays. The ordered set of independent parameters in the experiment is given by the parameter vector $\beta = (\tau_1, \dots, \tau_{V-1}, \delta)^t$. Therefore, our model can be expressed in linear model form

$$Y = X\beta + \varepsilon,\qquad \varepsilon \sim N(0, \sigma^2 I),$$

where $\varepsilon = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)^t$ is the vector of independent errors and $X = (x_1, x_2, \dots, x_n)^t$ is the $n \times V$ design matrix describing how the targets are paired onto arrays. For instance, if target $k$ (labeled with Cy5) and target $k'$ (labeled with Cy3), $k, k' \ne V$, are mixed and hybridized on array i, then $x_i$ is the vector

$$x_i = (0, \dots, 0, \underset{k\text{-th}}{1}, 0, \dots, 0, \underset{k'\text{-th}}{-1}, 0, \dots, 0, 1)^t,$$

whose first V-1 entries correspond to $\tau_1, \tau_2, \dots, \tau_{V-1}$ and whose last entry corresponds to $\delta$. Another type of $x_i$ occurs when target $k \ne V$ and target $k' = V$ are mixed and hybridized on array i, which is of the following form:

$$x_i = (1, \dots, 1, \underset{k\text{-th}}{2}, 1, \dots, 1, 1)^t \quad\text{if } k' = V \text{ is labeled with Cy3},$$

$$x_i = (-1, \dots, -1, \underset{k\text{-th}}{-2}, -1, \dots, -1, 1)^t \quad\text{if } k' = V \text{ is labeled with Cy5}.$$

Here, $x_i$ is called the design point. The set $\mathcal{X}$ of all possible design points is called the regression range.
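As a concrete illustration of the design points defined above, the following Python sketch builds the rows $x_i$ for a list of arrays. The function names `design_row` and `design_matrix`, and the convention that an array is written as a `(red, green)` pair of variety indices in 1..V (Cy5 first, Cy3 second), are assumptions made for this illustration and are not part of the original method.

```python
def design_row(red, green, V):
    """Design point for one array under the log Ratio model.

    `red` is the variety labeled with Cy5 and `green` the variety labeled
    with Cy3 (both in 1..V). The returned row has V entries: coefficients
    of tau_1..tau_{V-1}, followed by the coefficient 1 of the dye effect.
    """
    if red == green or not (1 <= red <= V and 1 <= green <= V):
        raise ValueError("need two distinct varieties in 1..V")
    row = [0] * V

    def add(variety, sign):
        if variety < V:
            row[variety - 1] += sign
        else:
            # tau_V = -(tau_1 + ... + tau_{V-1})
            for j in range(V - 1):
                row[j] -= sign

    add(red, +1)     # the Cy5 variety enters with +tau
    add(green, -1)   # the Cy3 variety enters with -tau
    row[V - 1] = 1   # dye-effect column
    return row


def design_matrix(arrays, V):
    """Stack the design points of all arrays into the n x V matrix X."""
    return [design_row(red, green, V) for red, green in arrays]
```

For example, with V = 4, `design_row(1, 2, 4)` gives `[1, -1, 0, 1]`, `design_row(2, 4, 4)` gives `[1, 2, 1, 1]` (variety V on Cy3), and `design_row(4, 2, 4)` gives `[-1, -2, -1, 1]` (variety V on Cy5), matching the three forms of $x_i$ listed above.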
The design matrix is constructed from design points that are at the discretion of the experimenter, who can select them from the regression range $\mathcal{X}$. Therefore, a design matrix, or a "design", means a choice of arrangement of targets on arrays.

2.2 Design issue and $\phi_p$-optimality

In the above linear model, the least squares estimator of $\beta$ is $\hat\beta = (X'X)^{-1}X'Y$. An experimental question of interest can be described by a coefficient matrix K. According to the Gauss-Markov theorem, the best linear unbiased estimator for $K'\beta$ is $K'\hat\beta = K'(X'X)^{-1}X'Y$, which has dispersion matrix

$$V(K'\hat\beta) = \sigma^2 K'(X'X)^{-1}K = \frac{\sigma^2}{n}K'M^{-}K = \frac{\sigma^2}{n}C_K^{-1}(M),$$

where $M = \frac{1}{n}X'X = \frac{1}{n}\sum_{i=1}^{n}x_i x_i'$ is called the moment matrix of the design, $M^{-}$ denotes some generalized inverse of M, and $C_K^{-1}(M) = K'M^{-}K$. In order to make the estimator $K'\hat\beta$ less dispersed, we should make $\frac{\sigma^2}{n}C_K^{-1}(M)$ as "small" as possible. This term can be separated into three parts: $\sigma^2$, $n$ and $C_K^{-1}(M)$. The first part, $\sigma^2$, is the variance of the experimental errors. The second part, $n$, is the number of arrays. The third part, $C_K^{-1}(M) = C_K^{-1}\!\left(\frac{1}{n}X'X\right)$, depends on the design. To diminish the dispersion of $K'\hat\beta$, we should decrease the variance of the experimental errors, increase the number of arrays, and select the design structure carefully.

Suppose the number of arrays is given and we have two candidate designs, both of which make $K'\beta$ estimable, with moment matrices $M_1$ and $M_2$ respectively. We prefer the former if $C_K^{-1}(M_1)$ is "smaller" than $C_K^{-1}(M_2)$ in some sense. The optimal design problem is that of choosing a design with the "smallest" $C_K^{-1}(M)$. We therefore need a criterion to measure the "smallness" of the matrix $C_K^{-1}(M)$. Suitable criteria are the $\phi_p$-optimality criteria. For a matrix $A \in PD(s)$, the set of all $s \times s$ positive definite matrices (here $s$ is the number of columns of K, so $s = V$ when $K = I$), $\phi_p(A)$ is defined as follows:

$$\phi_p(A) = \begin{cases} \lambda_{\max}(A) & \text{for } p = \infty,\\ \left(\frac{1}{s}\,\mathrm{trace}\,A^p\right)^{1/p} & \text{for } p \ne 0, \pm\infty,\\ (\det A)^{1/s} & \text{for } p = 0,\\ \lambda_{\min}(A) & \text{for } p = -\infty, \end{cases}$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of A. Let $A = P\Lambda P^t$ be the eigenvalue decomposition.
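Therefore, we obtain

$$A^p = P\Lambda^p P^t = \sum_{j} \lambda_j^p\, p_j p_j^t,\qquad \mathrm{trace}\,A^p = \sum_{j}\lambda_j^p\,\mathrm{trace}\!\left(p_j p_j^t\right) = \sum_{j}\lambda_j^p .$$

The most prominent and commonly used criteria are D-optimality and A-optimality.

Definition 2.2.1 A D-optimal design for $K'\beta$ is a design whose moment matrix $M^*$ attains the supremum of $\phi_0(C_K(M))$. That is,

$$M^* = \arg\sup_{M \in \mathcal{M}} \phi_0(C_K(M)) = \arg\sup_{M \in \mathcal{M}} \left(\det C_K(M)\right)^{1/s} = \arg\inf_{M \in \mathcal{M}} \det C_K^{-1}(M).$$

Definition 2.2.2 An A-optimal design for $K'\beta$ is a design whose moment matrix $M^*$ attains the supremum of $\phi_{-1}(C_K(M))$. That is,

$$M^* = \arg\sup_{M \in \mathcal{M}} \phi_{-1}(C_K(M)) = \arg\sup_{M \in \mathcal{M}} \left(\frac{1}{s}\,\mathrm{trace}\,C_K^{-1}(M)\right)^{-1} = \arg\inf_{M \in \mathcal{M}} \mathrm{trace}\,C_K^{-1}(M).$$

A well-known result that gives a useful direction for obtaining the theoretical upper bound is the Mutual Boundedness theorem (see Pukelsheim 1993): Let M and N be $V \times V$ positive definite matrices, let $M \in \mathcal{M}$ be a competing moment matrix and let $N \in \mathcal{N} = \{A : x^t A x \le 1,\ x \in \mathcal{X}\}$, a cylinder set. Then

$$\sup_{M \in \mathcal{M}} \phi_p(C_K(M)) \le \inf_{N \in \mathcal{N}} \frac{1}{\phi_p^{\infty}(K'NK)},\qquad\text{where } \phi_p^{\infty} = s\,\phi_q,\quad p, q \in [-\infty, 1],\quad p + q = pq.$$

The right-hand side of the inequality, $\inf_{N \in \mathcal{N}} 1/\phi_p^{\infty}(K'NK)$, is a theoretical upper bound. If we can find $M^* \in \mathcal{M}$ and $N^* \in \mathcal{N}$ such that $\phi_p(C_K(M^*)) = 1/\phi_p^{\infty}(K'N^*K)$, then $M^*$ is the moment matrix of an optimal design, and we say that the corresponding design is $\phi_p$-optimal for $K'\beta$.

2.3 The theoretical upper bound in log Ratio model

In this section, we consider the situation where all parameters are equally important. We take K to be I, the identity matrix, and use D-optimality as the criterion. In this case,

$$\frac{1}{\phi_0^{\infty}(K'NK)} = \frac{1}{\phi_0^{\infty}(N)} = \frac{1}{V\,\phi_0(N)} = \frac{1}{V\,(\det N)^{1/V}} = \frac{1}{V}\left(\frac{1}{\lambda_1\lambda_2\cdots\lambda_V}\right)^{1/V},$$

where $\lambda_i$, $i = 1, 2, \dots, V$, are the eigenvalues of N. Geometrically, $(\lambda_1\lambda_2\cdots\lambda_V)^{-1/2}$ is the volume (up to a fixed factor) of the ellipsoid $\{x \in \mathbb{R}^V : x^t N x \le 1\}$. Therefore, finding $\inf_{N \in \mathcal{N}} 1/\phi_0^{\infty}(N)$ amounts to finding an ellipsoid with the minimum volume among the collection of all ellipsoids that cover the regression range $\mathcal{X}$. From the geometry of the regression range $\mathcal{X}$, we consider those ellipsoids that have $(1, \dots, 1, 0)^t$ (with V-1 ones) and $(0, \dots, 0, 1)^t$ (with V-1 zeros) as two of their principal axes.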
To determine the other V-2 axes, we add V-2 independent vectors of the form $(0, \dots, 0, 1, -1, 0, \dots, 0)^t$ to these two vectors and obtain a matrix A whose columns are $(1, \dots, 1, 0)^t$, the V-2 vectors of the form $(0, \dots, 0, 1, -1, 0, \dots, 0)^t$, and $(0, \dots, 0, 1)^t$. We then apply the Gram-Schmidt process to the column vectors of A and reach an orthogonal matrix $P = (p_1, p_2, \dots, p_V)$ with

$$p_1 = \frac{1}{\sqrt{V-1}}(1, \dots, 1, 0)^t,\qquad p_V = (0, \dots, 0, 1)^t,$$

and where the k-th column is

$$p_k = \Big(\tfrac{1}{\sqrt{(k-1)k}}, \dots, \tfrac{1}{\sqrt{(k-1)k}}, \underset{k\text{-th}}{-\sqrt{\tfrac{k-1}{k}}}, 0, \dots, 0\Big)^t,\qquad k = 2, 3, \dots, V-1.$$

Let $N^* = P\Lambda P^t$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_2, \lambda_3)$ is a $V \times V$ diagonal matrix. The diagonal entries $\lambda_1$, $\lambda_2$ and $\lambda_3$ are determined by the constrained minimization problem

$$\min\ \frac{1}{\lambda_1}\left(\frac{1}{\lambda_2}\right)^{V-2}\frac{1}{\lambda_3}\qquad\text{subject to } x^t N^* x \le 1,\ x \in \mathcal{X}.$$

Note that the objective function in this problem is, up to a fixed factor, the squared volume of the ellipsoid $\{x : x^t N^* x \le 1\}$, so minimizing it minimizes the volume. The solution of the above minimization problem is

$$\lambda_1 = \frac{V-1}{2V^2},\qquad \lambda_2 = \frac{V-1}{2V},\qquad \lambda_3 = \frac{1}{V}.$$

Thus,

$$\inf_{N \in \mathcal{N}} \frac{1}{\phi_0^{\infty}(N)} = \inf_{N \in \mathcal{N}} \frac{1}{V(\det N)^{1/V}} = V^{1/V}\left(\frac{2}{V-1}\right)^{\frac{V-1}{V}}$$

is the theoretical upper bound for the optimal design under the log Ratio model, and this upper bound can be reached (see section 3).

THEOREM. Consider comparing the gene expression levels of V varieties in a microarray experiment under the log Ratio model. If all varieties are of equal interest, then the theoretical upper bound of the optimality criterion over all possible experimental designs is

$$\sup_{M \in \mathcal{M}} \phi_0(M) = \inf_{N \in \mathcal{N}} \frac{1}{\phi_0^{\infty}(N)} = V^{1/V}\left(\frac{2}{V-1}\right)^{\frac{V-1}{V}}. \qquad\Box$$

Using this upper bound, we can measure the distance between any chosen design and the optimal design.

2.4 $\phi_p(K)$-efficiency

The appropriate notion of efficiency is the following.

Definition. The $\phi_p(K)$-efficiency of a design is defined by

$$\phi_p(K)\text{-efficiency} = \frac{\phi_p(C_K(M))}{v_p(K)},\qquad\text{where } v_p(K) = \inf_{N \in \mathcal{N}} \frac{1}{\phi_p^{\infty}(K'NK)}.$$

The $\phi_p(K)$-efficiency lies between 0 and 1 (i.e. $0 \le \phi_p(K)\text{-efficiency} \le 1$). Let $\phi_0$-efficiency denote the $\phi_0(I)$-efficiency. Since the upper bound can be reached, the $\phi_0$-efficiency can be rewritten as follows:

$$\phi_0\text{-efficiency} = \frac{\phi_0(M)}{\sup_{M \in \mathcal{M}} \phi_0(M)},\qquad\text{where } \sup_{M \in \mathcal{M}} \phi_0(M) = \inf_{N \in \mathcal{N}} \frac{1}{\phi_0^{\infty}(N)}.$$

We can then use the $\phi_0$-efficiency to evaluate whether a design is efficient. In practical applications, a design whose $\phi_0$-efficiency reaches 0.8 to 0.9 can be considered an efficient design.

3 Construction of an efficient design

Because microarrays are expensive, finding smaller designs that have high efficiency is of great interest. In addition, for larger V, we recommend a method to construct efficient designs.

3.1 Graphical representation of experiment designs and a Search for optimal design

There are important connections between the optimal designs we construct and their graphical representation, so we first explain what the graphical representation means. One method of describing a microarray experiment is to use a directed graph (see Yang, Y. H. and Speed, T. 2002). The following three graphs, which consist of nodes and edges, illustrate three experimental designs:

[Figure: Graph 1, Graph 2 and Graph 3, three example designs]

The nodes correspond to target mRNA samples, and the edges (arrows) correspond to hybridizations between two mRNA samples (varieties). We place the green dye (Cy3) at the tail and the red dye (Cy5) at the head of the arrow. In other words, one arrow represents two mRNA samples that are hybridized together on the same slide. One directed graph represents an entire microarray experimental design and, at the same time, a design matrix under the log Ratio model. Graphs 2 and 3 represent the S=V and S=V+1 optimal designs (their $\phi_0$-efficiency attains the maximum).

[Design matrices X of Graph 2 and Graph 3; the entries are garbled in the source]

On the other hand, we now answer the question left open in section 2. The question is whether the theoretical upper bound is reached when we have enough budget to use an unconstrained number of slides.
Equivalently, the question is whether the following equality holds under the log Ratio model (when the equality holds, $\phi_0$-efficiency = 1):

$$\sup_{M \in \mathcal{M}} \phi_0(M) = \inf_{N \in \mathcal{N}} \frac{1}{\phi_0^{\infty}(N)}.$$

The answer is that when the graphical representation of the experimental design is a complete graph, the theoretical upper bound ($\phi_0$-optimality, D-optimality) is reached. The complete graph is composed of $2\binom{V}{2} = V(V-1)$ arrows. However, when the dye effect can be arranged in a balanced way, so that the graph is an even graph (i.e. every node ends up with the same number of heads and tails of arrows, and varieties are balanced with respect to dye), only half of these runs are needed to reach the upper bound. We illustrate the point with the following figure.

[Figure: three example designs with $\phi_0$-eff = 1, $\phi_0$-eff = 0.9554 and $\phi_0$-eff = 1]

However, as we have already mentioned, microarrays are expensive. To conserve the slide budget, using $\binom{V}{2}$ or $V(V-1)$ slides to construct the optimal design is too extravagant. Hence, experimenters have less interest in the optimal design under an unconstrained number of slides, and are most interested in the optimal design under a fixed number of slides (such as an S=V+k or S=2V design) and in knowing whether the smaller designs are efficient enough. We discuss this problem in the next section.

Recall from the start of this section that one directed graph represents an arrangement of the experimental design and a design matrix at the same time. Searching all possible design matrices is therefore equivalent to searching all possible graph structures. Since all parameters must be estimable, the graph must be connected. The challenge is to determine the graph structure. Here, we search for the optimal design under a fixed number of slides using a program written in C. The results are illustrated as follows:

[Figures: optimal designs found for <V=7, S=10>, <V=9, S=12> and <V=14, S=16>]

Other optimal designs are attached in the appendix. In figure [1], we consider S=V+2 optimal designs because S=V+1 optimal designs are easy to guess in a few trials.
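The statement that a complete graph, or a dye-balanced half of it, attains the upper bound can be checked numerically. The sketch below computes the $\phi_0$-efficiency of a design given as a list of `(red, green)` pairs. It is a minimal illustration, assuming that the bound of the Theorem in section 2.3 reads $V^{1/V}\,(2/(V-1))^{(V-1)/V}$; the function names (`design_row`, `moment_matrix`, `det`, `upper_bound`, `phi0_efficiency`, `complete_design`) are hypothetical helpers introduced here, and the exact-arithmetic determinant is a convenience choice, not the authors' C implementation.

```python
from fractions import Fraction
from itertools import permutations


def design_row(red, green, V):
    """Row x_i for an array with `red` on Cy5 and `green` on Cy3 (1..V)."""
    row = [0] * V
    for variety, sign in ((red, 1), (green, -1)):
        if variety < V:
            row[variety - 1] += sign
        else:  # tau_V = -(tau_1 + ... + tau_{V-1})
            for j in range(V - 1):
                row[j] -= sign
    row[V - 1] = 1  # dye effect
    return row


def moment_matrix(arrays, V):
    """M = X'X / n, kept exact with Fractions."""
    X = [design_row(r, g, V) for r, g in arrays]
    n = len(X)
    return [[Fraction(sum(x[a] * x[b] for x in X), n) for b in range(V)]
            for a in range(V)]


def det(matrix):
    """Exact determinant by Gaussian elimination."""
    m = [row[:] for row in matrix]
    size, result = len(m), Fraction(1)
    for c in range(size):
        pivot = next((r for r in range(c, size) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular: some parameter is not estimable
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, size):
            factor = m[r][c] / m[c][c]
            for k in range(c, size):
                m[r][k] -= factor * m[c][k]
    return result


def upper_bound(V):
    """Assumed form of the section 2.3 bound on phi_0(M)."""
    return V ** (1 / V) * (2 / (V - 1)) ** ((V - 1) / V)


def phi0_efficiency(arrays, V):
    """phi_0(M) / upper bound, with phi_0(M) = det(M)^(1/V)."""
    return float(det(moment_matrix(arrays, V))) ** (1 / V) / upper_bound(V)


def complete_design(V):
    """Complete graph with both dye orientations: V(V-1) arrays."""
    return list(permutations(range(1, V + 1), 2))
```

Under these assumptions, `phi0_efficiency(complete_design(4), 4)` evaluates to 1 up to rounding, and so does the ten-array even design on V = 5 whose arrows join each node i to nodes i+1 and i+2 (mod 5). A design that leaves some parameter inestimable returns 0.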
In figure [2], we provide S=2V optimal designs because these designs have the property of admitting an Eulerian circuit.

3.2 The simple method to obtain the efficient design

No matter how the graph structure of the design is exploited, searching all possible designs by computer is infeasible for large V. For example, suppose we search for the S=2V optimal design by brute force, without considering the structure of the graph. The program becomes simpler, but it is unworkable because the number of candidate design matrices is $\binom{V(V-1)}{2V}$. When V equals 12, this is an astronomical figure, approximately 1.360612e+026. Even if the special properties of the graph structure are taken into account, the search remains computationally infeasible for large V. Therefore we propose two simple methods to obtain designs of high efficiency. With these two methods, we obtain designs that approach the optimal design and that sometimes are the optimal design.

When we carefully observed the behavior of S=2V designs, we found an important property: if the graph structures of two designs are the same, then their algebraic structures are isomorphic. That is to say, they have the same $\phi_0$-eff and variance-covariance matrix. On the other hand, two designs can be algebraically isomorphic even though, in some situations, they cannot be represented by the identical graph structure. An example is the V=6, S=2V optimal design:

[Figure: two graphs of the V=6, S=2V optimal design, $\phi_0$-eff = 0.9834; the $\phi_0$-eff and the variance-covariance matrix are the same for both]

The above two graphs are both optimal designs, but the left one has a more easily traced pattern. It consists of rotations of two graphs (a triangle and a hexagon). We can therefore use this result to construct S=kV efficient designs in general. Let W be the set of factors of V that are less than $\frac{V-1}{2}$, and let Q be the number of elements of W. Then one can pick k factors in $\binom{Q}{k}$ ways.
One combination $w = \{w_1, w_2, \dots, w_k\}$ represents a particular S = kV design, and its graphical representation is constructed by rotating the k polygons with $\frac{V}{w_1}, \frac{V}{w_2}, \dots, \frac{V}{w_k}$ sides $w_1, w_2, \dots, w_k$ times, respectively. However, it is a prerequisite that all parameters are estimable (i.e. $(X'X)^{-1}$ exists). For instance, when V=12, k=2 (an S=2V design), W={1,2,3,4}. We have $\binom{4}{2} = 6$ selections. If we select w={2,3} to construct the design, we obtain the design constructed by rotating the hexagon two times and the square three times. This case is exactly the optimal design (see appendix).

[Figure: two graphs of the V=12, S=2V design with w={2,3}, $\phi_0$-eff = 0.9392; they have the same $\phi_0$-eff and variance-covariance matrix]

In addition, we give S=2V and S=3V designs for V=16 as examples.

[Figure: V=16 designs; w={1,4}, $\phi_0$-eff = 0.9161; w={1,2,4}, $\phi_0$-eff = 0.9459]

The preceding graphs demonstrate that k must be raised appropriately when V increases. In the case of V=16, the $\phi_0$-eff of the S=V optimal design is 0.6951, hence we would consider S=2V, whose $\phi_0$-eff is 0.9161, to be an efficient design. The reason we do not choose the S=3V design is that its improvement over the S=2V case is not substantial.

Further, when V=2p is even, S=kV can be rewritten as

$$S = 2(k-1)\frac{V}{2} + V = j\,\frac{V}{2} + V,\qquad j = 2(k-1).$$

This lets us define the other construction method for S=kV efficient designs. We first construct an S=V loop design. Inside the loop we then draw j vertical symmetric lines and rotate them $\frac{V}{2}$ times, where $k \le \frac{V-1}{4} + 1$. The graph below illustrates the V=10, S=3V design (j = 2(k-1) = 4: draw four vertical symmetric lines and rotate five times). This method also has high efficiency; for example, the designs for V=6, 8, 10 are optimal designs.

[Figure: V=10, S=3V design, $\phi_0$-eff = 0.9834]

Generally speaking, to construct an efficient design, two essential rules must be observed:

[1] Balance all factors to ensure that the effects of interest are not completely confounded with the variation of other effects and that the parameters of interest are estimable.
[2] The number of arrows in the graph structure from any node to any other node has to be as small as possible.

Designs that follow these two essential rules have symmetric graph structures. The two simple methods we propose for constructing S=kV efficient designs satisfy the above requirements and have high efficiency. A further important consequence is that we can search for highly efficient designs within a small search set. This narrows the search from $\binom{V(V-1)}{kV}$ combinations down to $\binom{Q}{k}$ or $\binom{Q}{k}+1$.

4 Data Analysis

Before we analyze gene expression data, we must discuss a point of vital importance: the connection between the selection of the optimal design and the ultimate aim of testing a general linear hypothesis. We give an example to explain the point. If our ultimate aim is to test $H_0: l'\beta = r$, then we should construct the optimal design from $\phi_0(C_l(M))$-optimality, which makes $l'(X'X)^{-1}l$ a minimum. For instance, if we choose the t-statistic to test a hypothesis on $l'\beta$, it is

$$t = \frac{l'\hat\beta - r}{\sqrt{s^2\, l'(X'X)^{-1}l}}.$$

The $\phi_0(C_l(M))$-optimal design makes our statistic more sensitive in detecting the significant case. Hence, in general, if our ultimate aim is to test M hypotheses with simultaneous sensitivity, that is $H_0: \sum_{i=1}^{V} l_{im}\beta_i = r_m$, $m = 1, \dots, M$, or in matrix representation $H_0: K'\beta = r$, we should construct the optimal design from $\phi_0(C_K(M))$-optimality. Accordingly, we should pay attention to the form of the criterion used to select the optimal design, because it corresponds to the estimators of interest and to the ultimate aim of testing a general linear hypothesis.

To be useful in applications, we briefly review the problem of testing a general linear hypothesis and take the experimental design of Oleksiak et al. (2002) to demonstrate how to calculate test statistics under the log Ratio model. Recall that with the normalized log Ratio model, for a specific gene, the general linear model can be written as:

$$Y_{n \times 1} = X_{n \times V}\,\beta_{V \times 1} + \varepsilon_{n \times 1},\qquad Y \sim N(X\beta, \sigma^2 I),\qquad \mathrm{rank}(X) = V.$$
The general linear hypothesis of interest is:

$$H_0: K'\beta = r \quad\text{vs}\quad H_1: K'\beta \ne r,\qquad \mathrm{rank}(K) = V - s.$$

Here V-s means that the M linear hypotheses comprise exactly V-s linear restrictions. When s>0 (the problem reduces to the simple ANOVA case when s=0), we choose an $s \times V$ matrix $G'$ complementary to $K'$ such that the $V \times V$ matrix $\begin{pmatrix} G' \\ K' \end{pmatrix}$ is regular of rank V. Let

$$Y = X\beta + \varepsilon = \tilde X_1\tilde\beta_1 + \tilde X_2\tilde\beta_2 + \varepsilon,\qquad\text{where } (\tilde X_1, \tilde X_2) = X\begin{pmatrix} G' \\ K' \end{pmatrix}^{-1},\quad \begin{pmatrix}\tilde\beta_1 \\ \tilde\beta_2\end{pmatrix} = \begin{pmatrix} G' \\ K' \end{pmatrix}\beta,$$

with $\tilde X_1$ of size $n \times s$ and $\tilde X_2$ of size $n \times (V-s)$. Therefore, we can recast the test as

$$H_0: \tilde\beta_2 = r;\ \tilde\beta_1 \text{ and } \sigma^2 > 0 \text{ arbitrary}\quad\text{vs}\quad H_1: \tilde\beta_2 \ne r;\ \tilde\beta_1 \text{ and } \sigma^2 > 0 \text{ arbitrary},$$

and the test statistic is

$$F = \frac{(b_2 - r)'D(b_2 - r)\,(n - V)}{(Y - Xb)'(Y - Xb)\,\mathrm{rank}(K)},$$

where $b = (X'X)^{-1}X'Y$, $b_2 = D^{-1}\tilde X_2' M_1 Y$, $D = \tilde X_2' M_1 \tilde X_2$ and $M_1 = I - \tilde X_1(\tilde X_1'\tilde X_1)^{-1}\tilde X_1' = I - P_{\tilde X_1}$. Correspondingly, the rejection region of $H_0$ is given by $F \ge F(V-s, n-V)$, the critical value of the F distribution with V-s and n-V degrees of freedom (see Rao & Toutenburg 1999).

In the study of Oleksiak et al. (2002) (for the graphical representation of the design, see their paper), an S=2V (S=30) efficient loop design was used for the microarray experiment, and 15 individuals were examined (5 each from northern and southern populations of Fundulus heteroclitus and 5 from the sister taxon Fundulus grandis) in order to determine the variation in gene expression within and between populations. Their ultimate experimental objective is to test the null hypothesis of no individual variation. Under the null hypothesis, treatment takes one of three values corresponding to the populations (northern F. heteroclitus, southern F. heteroclitus or F. grandis); under the alternative hypothesis, treatment takes 15 distinct values, one for each individual. Therefore, under the log Ratio model, the hypothesis is equivalent to

$$H_0: \tau_1 = \tau_4 = \cdots = \tau_{13},\ \tau_2 = \tau_5 = \cdots = \tau_{14},\ \tau_3 = \tau_6 = \cdots = \tau_{15} \quad\text{vs}\quad H_1: \text{not } H_0,$$

that is, $H_0: K'\beta = r$ vs $H_1: K'\beta \ne r$, where the form of $K'$ refers to the coefficient matrix of section 2.3. Hence, based on the microarray data and the above test rule, we can calculate the F-statistic for each specific gene under the normalized log Ratio model and compare it with the $F_{12,\,n_g - 15}$ distribution.
We can then obtain the p-value of the expression level to detect the significant cases. In fact, numerous hypotheses, each representing a particular genetic meaning, can be tested by constructing the form of $K'$ properly. Comparisons between all pairs of varieties, i.e. all elementary contrasts, are regularly of interest to biologists. For this case the form of $K'$ is the $\binom{V}{2} \times V$ matrix $L'$, whose rows express the elementary contrasts in terms of $\beta$: a row for a pair $k_1, k_2 \ne V$ has the form $(0, \dots, 0, 1, 0, \dots, 0, -1, 0, \dots, 0, 0)$, and a row for a pair involving variety V has the form $(1, \dots, 1, 2, 1, \dots, 1, 0)$ up to sign, with 0 in the last position because the dye effect does not enter the contrast.

Thus, one can test for which genes the elementary contrasts are jointly significant. For specific genes, if the result is significant, a multiple comparison method can be used to trace which elementary contrasts differ significantly. Finally, by testing the hypothesis with a fixed significance threshold, we obtain the particular genes that differ significantly between individuals. We can record those genes, or their total percentage, to explore their genetic meaning or influence as evidence for some characteristic or hypothesis of evolution.

5 Discussion

In this section, our principal aim is to compare the similarities and dissimilarities between the ANOVA model of Kerr et al. (2000) and the log Ratio model we propose. We first discuss the rationality of the log Ratio model from the viewpoint of the ANOVA model, and the viewpoint from which an optimal design is judged. Furthermore, we discuss the relationship between the optimality criteria of the two models.

5.1 The rationality of model establishment and the judgment viewpoint of optimal design

Kerr et al. (2001) built the ANOVA model according to the experimental factors. They identify four basic factors: varieties, genes, dyes and arrays, which are represented by $V_{k(i,j)}$, $G_g$, $D_j$, $A_i$. The fluorescent intensity from array i and dye j, representing variety k and gene g, is denoted by $z_{ijk(i,j)gr}$.
The model is

$$w_{ijk(i,j)gr} = \log_2 z_{ijk(i,j)gr} = \underbrace{\mu + A_i + D_j + (AD)_{ij}}_{\text{global effects}} + \underbrace{G_g + (AG)_{ig} + (DG)_{jg} + (VG)_{k(i,j)g}}_{\text{gene specific effects}} + \varepsilon_{ijk(i,j)gr}.$$

Based on the model of Kerr et al., the log Ratio can be expressed in the form

$$\log_2 \mathrm{Ratio}_{igr} = \log_2\frac{z_{i2k(i,2)gr}}{z_{i1k(i,1)gr}} = (D_2 - D_1) + \left((AD)_{i2} - (AD)_{i1}\right) + \left((DG)_{2g} - (DG)_{1g}\right) + \left((VG)_{k(i,2)g} - (VG)_{k(i,1)g}\right) + \left(\varepsilon_{i2k(i,2)gr} - \varepsilon_{i1k(i,1)gr}\right)$$

$$= \delta + \delta_i + \delta_g + \left[\tau_{k(i,2)g} - \tau_{k(i,1)g}\right] + \varepsilon_{gri} = C_i + \delta_g + \tau_{k_2 g} - \tau_{k_1 g} + \varepsilon_{gr},\qquad\text{where } C_i = \delta + \delta_i.$$

On the other hand, the full log Ratio model for the raw data, without the normalization procedure, can be expressed as

$$M_{gr}^{\text{normalized}} = \frac{1}{a_i}\left(\log_2 \mathrm{Ratio}_{igr}^{(\text{raw})} - \log_2 k_i(A)\right) = \delta_g + \tau_{kg} - \tau_{k'g} + \varepsilon_{gr},$$

or, rewritten,

$$\log_2 \mathrm{Ratio}_{igr}^{(\text{raw})} = \log_2 k_i(A) + a_i\left(\delta_g + \tau_{kg} - \tau_{k'g} + \varepsilon_{gr}\right),$$

where $\mathrm{Ratio}_{igr}^{(\text{raw})} = R_{igr}^{(\text{raw})}/G_{igr}^{(\text{raw})}$, $A = \log_2\sqrt{R_{igr}^{(\text{raw})}G_{igr}^{(\text{raw})}}$, and $R^{(\text{raw})}$ and $G^{(\text{raw})}$ represent the raw red and green fluorescent intensities. The $\log_2 k_i(A)$ term, which means that we have different estimates of the mean for different intensity ranges within array i, can be estimated by the local regression intensity-dependent normalization method (lowess fit); $a_i$ can be estimated by the scale normalization method (median absolute deviation, MAD) (see Yang et al. (2002)). Therefore, the log Ratio of the ANOVA model coincides with the log Ratio model when the scale parameter of the log Ratio model is $a_i = 1$ and the location parameter is $\log_2 k_i(A) = C_i$, which is estimated by the mean of the log ratios of the i-th array:

$$\hat C_i = \hat\delta + \hat\delta_i = \left(\bar w_{\cdot 2} - \bar w_{\cdot 1}\right) + \left(\bar w_{i2} - \bar w_{\cdot 2} - \bar w_{i1} + \bar w_{\cdot 1}\right) = \bar w_{i2} - \bar w_{i1} = \overline{\log_2(R/G)}_{\,i} = \log_2 k_i(A),$$

where $\bar w_{ij}$ is the mean of the log intensities on array i and dye j, and $\bar w_{\cdot j}$ is the mean over all arrays for dye j. Thus the log Ratio model is rational even from the viewpoint of the ANOVA model. Note that setting $a_i = 1$ means that the data are not scale-normalized. Moreover, if we also assume that $\log_2 k_i(A) = C_i$ is a constant estimated by the mean of the log ratios of the i-th array, then the procedure reduces to global normalization (see Yang et al. (2002)).
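The special case described above (scale parameter equal to 1 and a constant location term estimated by the array mean) can be sketched in a few lines of Python. This is a minimal illustration of global normalization only, not of the lowess or MAD procedures; the function name `global_normalize` and the representation of one array as two equal-length lists of raw intensities are assumptions made for this example.

```python
import math
from statistics import mean


def global_normalize(red, green):
    """Globally normalized log2 ratios for the spots of a single array.

    `red` and `green` are the raw Cy5 and Cy3 intensities, spot by spot.
    The location term C_i is estimated by the mean log2 ratio of the
    array, and the scale parameter is fixed at 1 (no scale normalization).
    """
    raw_log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    c_i = mean(raw_log_ratios)          # estimate of C_i for this array
    return [m - c_i for m in raw_log_ratios]
```

For instance, raw intensities `red = [200, 400, 800]` and `green = [100, 100, 100]` have log ratios 1, 2, 3 with mean 2, so the normalized values are -1, 0 and 1, which by construction average to zero on the array.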
In addition, in order to examine the viewpoints from which we and Kerr et al. judge an optimal design, we must first analyze the structure of the design space. It has some essential properties under the ANOVA model:

[1] A is orthogonal to D.
[2] A and V are partially confounded when V>2.
[3] D and V can be completely confounded, partially confounded or orthogonal, depending on the design selected. If D and V are balanced, then D is orthogonal to V.
[4] (DV), (AV) and (AD) are confounded with A, D and V respectively.
[5] G is orthogonal to A, D and V. This property implies that global effects are orthogonal to gene-specific effects.
[6] (VG) and (AG) are partially confounded when V>2, but the log Ratio model we propose solves this critical problem.
[7] The (DG) effects can be estimated and included in the ANOVA model. Further, if D and V are balanced, then (DG) and (VG) are orthogonal. In fact, (DG) and (VG) are orthogonal if and only if the design is an even design.
[8] (DG) is orthogonal to (AG).

On the basis of properties [1]-[8], the above ANOVA model clearly considers all main effects and two-factor interaction terms. Property [7] is the reason why Kerr et al. search for the optimal design within the set of even designs when they use the above model including the (DG) effects. Empirically, the S=kV optimal designs found by the search are even designs, but the S=V+k optimal designs are not necessarily even designs. Nevertheless, it is difficult to proceed under the ANOVA model without the even-design condition, because the derivation of the optimality criterion becomes more complex. We overcome this problem in this paper under the log Ratio model. The principal reason is that the log Ratio model removes the (AG) effects. The analysis of the structure of the design space leads to an important conclusion that deserves mention.
That is, if two factor effects are orthogonal in the design space, then the removal of either factor has no influence on the estimation of the other effect. In other words, we do not lose any information when we estimate the parameters of the other effect.

Based on these features of the design space, we discuss the rationality of the model from the point of view of Kerr et al. By property [5], the global effects of the ANOVA model are orthogonal to the gene-specific effects in the structure of the design space. Hence, if the global effects are removed from the response, this does not affect the information available for estimating the gene-specific effects (this step is exactly the global normalization process). Furthermore, Kerr et al. search for the optimal design within the set of even designs. Then, according to property [7], we can also remove the (DG) effects. Thus, we can rewrite the model as

$$w'_{ijk(i,j)gr} = w_{ijk(i,j)gr} - \text{global effects} - (DG)_{jg} = G_g + (AG)_{ig} + (VG)_{k(i,j)g} + \varepsilon_{ijk(i,j)gr},\qquad i = 1, \dots, n,\ j = 1, 2.$$

For a fixed gene, it can be represented in the form

$$w'_{ikr} = G + (AG)_i + (VG)_k + \varepsilon_{ikr},\qquad\text{i.e.}\qquad w'_{ikr} = \mu + b_i + \tau_k + \varepsilon_{ikr},$$

where $b_i$ and $\tau_k$ can be regarded as block i and treatment k. Owing to the experimental restriction, it is not possible for blocks of size two to accommodate all treatments in each block when the number of treatments is larger than the block size 2. The treatment effects and block effects are therefore partially confounded. This is an incomplete block design, which is why general properties of incomplete block designs can be used to obtain the A-optimality of the parameters of interest under the ANOVA model (see Raghavarao 1971). For the same reason, under the log Ratio model, the effects $\log_2 k_i(A)$ and $a_i$ are actually global effects. Therefore, we can remove the global effects and fix the gene to construct our model, the normalized log Ratio model:

$$M_{gr} = \delta_g + \tau_{k_2 g} - \tau_{k_1 g} + \varepsilon_{gr}.$$

Based on this model, we obtain the $\phi_0$-optimality (D-optimality) by using the optimal design theory of the general linear model (see section 2).
5.2 The relationship of optimal criteria
In this subsection, under the log Ratio model, we discuss the A-optimality criterion that Kerr et al. use to evaluate a design. The criterion minimizes the average variance of all elementary contrasts of interest. In fact, it is a particular $\varphi_p(C_K(M))$-optimality with $p=-1$ and $K=L$:
\[
\binom{V}{2}^{-1} \sum_{k_1<k_2} \operatorname{var}\bigl(\hat\tau_{k_2 g} - \hat\tau_{k_1 g}\bigr)
= \binom{V}{2}^{-1} \operatorname{tr}\bigl(\mathbb{V}(L'\hat\theta)\bigr)
= \sigma^2 \binom{V}{2}^{-1} \operatorname{tr}\bigl(L'(X'X)^{-1}L\bigr)
= \frac{\sigma^2}{n} \binom{V}{2}^{-1} \operatorname{tr}\bigl(L'M^{-1}L\bigr)
= \frac{\sigma^2}{n} \bigl[\varphi_{-1}(C_L(M))\bigr]^{-1},
\]
where $M = X'X/n$. The same average can also be written as
\[
\sigma^2 \binom{V}{2}^{-1} \Bigl[ V \operatorname{tr}\bigl(J'(X'X)^{-1}J\bigr) - E_{1,V}\bigl(J'(X'X)^{-1}J\bigr)E_{V,1} \Bigr]
= \frac{\sigma^2}{n} \binom{V}{2}^{-1} \Bigl[ V^2 \bigl[\varphi_{-1}(C_J(M))\bigr]^{-1} - 0 \Bigr],
\]
from which it follows that
\[
\min_M \binom{V}{2}^{-1} \sum_{k_1<k_2} \operatorname{var}\bigl(\hat\tau_{k_2 g} - \hat\tau_{k_1 g}\bigr)
\iff \max_M \varphi_{-1}(C_L(M))
\iff \max_M \varphi_{-1}(C_J(M)).
\]
Thus, to minimize the variance of all elementary contrasts, it suffices to minimize the variance of the estimators $\hat\tau_{kg}$, $k = 1, \dots, V$. Hence the criterion $\max_M \varphi_{-1}(C_J(M))$ used by Kerr et al. considers only the $\tau_{kg}$ effects, whereas the criterion we propose, $\max_M \varphi_0(C_I(M)) = \max_M \bigl(\det C_I(M)\bigr)^{1/V}$, considers all gene-specific effects, including both the $\mu_g$ and the $\tau_{kg}$ effects. The model and criterion we propose are therefore reasonable and do not conflict with the viewpoint of Kerr et al. Our model also has several merits. First, the theoretical upper bound for the optimal design can be derived under the log Ratio model, because the model trims and streamlines the structure of the design space. Second, the optimality criterion accounts for the variances of all gene-specific effects. Third, normalization procedures such as nonlinear location and scale normalization are more plausible under the log Ratio model than under the ANOVA model. The S=2V optimal designs obtained with the $\varphi_0$-optimality criterion under the log Ratio model are the same as those of Kerr et al. (2000), who use A-optimality under the classical ANOVA model (see Appendix). If the search is constrained to "even" designs, the S=V+2 optimal designs are also the same.
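The algebraic step used in the derivation above, that the sum of the variances of all pairwise contrasts equals $V \operatorname{tr}(\Sigma) - E_{1,V}\,\Sigma\,E_{V,1}$ for the covariance matrix $\Sigma$ of the $V$ estimated variety effects, can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' computation: it handles the non-identifiability of the variety effects with the Moore-Penrose pseudo-inverse (`numpy.linalg.pinv`), works in units of $\sigma^2$, and uses hypothetical function names and two hypothetical three-slide designs.

```python
import itertools

import numpy as np


def tau_covariance(slides, V):
    """Covariance (in units of sigma^2) of the V variety effects for one gene.

    slides: list of (k1, k2) pairs of 0-indexed varieties, one pair per slide.
    Assumption: the over-parametrized model (mu, tau_0, ..., tau_{V-1}) is
    handled with the pseudo-inverse of X'X; pairwise contrasts are estimable
    whenever the design is connected.
    """
    X = np.zeros((len(slides), V + 1))
    X[:, 0] = 1.0
    for r, (k1, k2) in enumerate(slides):
        X[r, 1 + k2] += 1.0
        X[r, 1 + k1] -= 1.0
    G = np.linalg.pinv(X.T @ X)
    return G[1:, 1:]  # block belonging to the tau's


def pairwise_variance_sum(Sigma):
    """Sum over k1 < k2 of var(tau_k2 - tau_k1), computed pair by pair."""
    V = Sigma.shape[0]
    return sum(
        Sigma[a, a] + Sigma[b, b] - 2.0 * Sigma[a, b]
        for a, b in itertools.combinations(range(V), 2)
    )


def trace_form(Sigma):
    """The closed form V * tr(Sigma) - 1' Sigma 1."""
    V = Sigma.shape[0]
    ones = np.ones(V)
    return V * np.trace(Sigma) - ones @ Sigma @ ones


def average_contrast_variance(Sigma):
    """A-type criterion: average variance of all pairwise contrasts."""
    V = Sigma.shape[0]
    return pairwise_variance_sum(Sigma) / (V * (V - 1) / 2)


loop = [(0, 1), (1, 2), (2, 0)]
star_with_swap = [(0, 1), (0, 2), (1, 0)]

S_loop = tau_covariance(loop, 3)
S_star = tau_covariance(star_with_swap, 3)
avg_loop = average_contrast_variance(S_loop)
avg_star = average_contrast_variance(S_star)
print(avg_loop, avg_star)
```

With the pseudo-inverse, the all-ones quadratic form is zero for both designs, so the closed form reduces to $V \operatorname{tr}(\Sigma)$, and the loop design has the smaller average contrast variance of the two.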
Thus, although we and Kerr et al. start from different viewpoints and use different optimality theories to derive criteria for evaluating designs under different models, the results remain consistent.

[Appendix] The $\varphi_0$-optimal designs
Figure [1]. S=V+2 optimal designs; panels V=6, V=7, V=8, V=9, V=10, V=11, V=12, V=13.
Figure [2]. S=2V optimal designs; panels V=5, V=6, V=7, V=8, V=9, V=10, V=11, V=12.

Reference:
[1] Churchill, G. A. (2002). Fundamentals of experimental design for cDNA microarrays. Nature Genetics Supplement 32, 490-495.
[2] Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74, 829-836.
[3] Hwang, F. K. (2001). A complementary survey on double-loop networks. Theoretical Computer Science 263, 211-229.
[4] Ideker, T., Thorsson, V., Siegel, A. F. and Hood, L. E. (2000). Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J. Computat. Graph. Statist. 5, 299-314.
[5] John, P. W. M. (1980). Incomplete Block Designs. Marcel Dekker, New York and Basel.
[6] Kerr, M. K. and Churchill, G. A. (2001). Experimental design for gene expression microarrays. Biostatistics 2, 183-201.
[7] Kerr, M. K. and Churchill, G. A. (2001). Statistical design and the analysis of expression microarray data. Genet. Res. 77, 123-128.
[8] Kerr, M. K., Martin, M. and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology, to appear.
[9] Luenberger, D. G. (1968). Optimization by Vector Space Methods. John Wiley & Sons, New York.
[10] Oleksiak, M. F., Churchill, G. A. and Crawford, D. L. (2002). Variation in gene expression within and among natural populations. Nature Genetics 32, 261-266.
[11] Pukelsheim, F. (1993). Optimal Design of Experiments. John Wiley & Sons, New York.
[12] Quackenbush, J. (2002). Microarray data normalization and transformation. Nature Genetics Supplement 32, 496-501.
[13] Raghavarao, D. (1971). Constructions and Combinatorial Problems in Design of Experiments. Wiley, New York.
[14] Rao, C. R. and Toutenburg, H. (1999). Linear Models. Springer, New York.
[15] Schuchhardt, J., Beule, D., Malik, A., Wolski, E., Eickhoff, H., Lehrach, H. and Herzel, H. (2000). Normalization strategies for cDNA microarrays. Nucleic Acids Research 28, No. 10.
[16] Wu, H., Kerr, M. K., Cui, X. and Churchill, G. A. (2002). MAANOVA: a software package for the analysis of spotted cDNA microarray experiments. To appear.
[17] Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J. and Speed, T. P. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30, No. 4, e15.
[18] Yang, Y. H. and Speed, T. (2002). Design issues for cDNA microarray experiments. Nature Reviews Genetics 3, 579-588.