Optimal Design for Gene Expression Microarrays

Wun-Yi Shu
Institute of Statistics
National Tsing Hua University
Yang-Chao Wang
Institute of Statistics
National Tsing Hua University
Abstract
We propose a statistical method for constructing optimal or efficient designs for gene expression microarrays, based on the optimal design theory of linear models. An essential aim of experimental design is to maximize the efficiency with which contrasts of the unknown parameters are estimated from the observations that the design generates. We derive a selection criterion, $\phi_0$-efficiency, to measure the goodness of any chosen design. The method works by obtaining the theoretical upper bound of the optimality criterion. We also discuss the connection between an experimental design and its graphical representation, and develop a procedure for constructing an efficient design when the number of varieties is large. Finally, we discuss the rationality of the model and the relationship between the optimality criteria of the log Ratio model we propose and the ANOVA model of Kerr et al. (2000).
1 Introduction
The development of microarray technology has produced massive gene expression data sets. A major task for the experimentalist is to understand the structure in these huge data sets. Data generated by a scientific experiment always contain random noise, and the situation is worse in biology. Statistical methods must therefore be used to interpret large-scale experimental data accurately. In the last decade, most publications considering statistical problems in the context of microarray experiments focused on techniques of data analysis. Kerr et al. [6] were the first to bring design issues for microarray experiments to attention. Some of the key questions in microarray experiments are: Given limited resources (e.g., a limited number of slides), how can one gain as much information as possible? How does one achieve the goal of the experiment using as few slides as possible? What is the effect of missing arrays on a design (see Bretz et al. (2003))? An appropriate design makes the estimation of the parameters of interest more precise and the statistical tests more sensitive in detecting significant cases.
Recently, many papers have been published on how to model microarray data to describe gene expression levels and on proper criteria for evaluating an experimental design. One of the most representative is Kerr et al. [6]. Their group, at the Jackson Laboratory, has published a series of papers that discuss and apply this model (Kerr et al. [6, 7, 8], Churchill [1], Oleksiak et al. [10], Wu et al. [16]), which are vital reference materials for biologists interested in gene expression microarrays. Their main idea is to use the classical ANOVA model to describe the data and incomplete block design theory (Raghavarao [13]) to derive their optimality criterion. However, they could not provide the theoretical upper bound of their optimality criterion score for an arbitrary number of varieties. Accordingly, the major tasks of this paper are to propose a log Ratio model, which is a more plausible model for microarray data, and to solve the difficult problem of the theoretical upper bound. As a result, for any given design, one can know its efficiency relative to the (unknown) optimal design using the same number of slides.
In this paper, we start by establishing the log Ratio model, propose the optimality criterion, and derive the theoretical upper bound. We then introduce the graphical representation of designs, provide computational results for commonly used optimal designs, and supply a method for constructing efficient designs for large numbers of varieties. Moreover, we discuss the connection between the selection of an optimal design and the ultimate hypothesis test. Finally, we discuss the similarities and differences between the log Ratio and classical ANOVA models, including the rationality of the model and the optimality criteria. Hence, for gene expression microarrays, we cover all components, from design before the experiment to analysis after it, and address several problems that urgently need to be tackled.
2.1 log Ratio model
Consider a two-color microarray experiment in which V varieties (called "treatments" by some authors and "targets" in the context of hybridization) are compared using S arrays. Because we focus on design structure, we first consider, for clarity, the case of a single replicate on each array (i.e., n, the number of observations for a particular gene, equals S). This is equivalent to discussing the design structure with equal replicates on each array (see section 5). Each spot on the array therefore represents a particular gene. At each spot, targets $k$ (labeled with Cy5) and $k'$ (labeled with Cy3) are mixed and hybridized, and two quantities R and G are recorded: the normalization-corrected intensities of the red (Cy5) and green (Cy3) fluorescence. We assume that the intensities are proportional to the true expression levels, $\mu_{kg}$ and $\mu_{k'g}$ for varieties $k$ and $k'$ respectively, of the corresponding gene g. For simplicity, we drop the suffix g in the following discussion and model R and G as follows:
$$R = \alpha_R\,\mu_k\,2^{\varepsilon^{(R)}}, \qquad G = \alpha_G\,\mu_{k'}\,2^{\varepsilon^{(G)}}, \qquad k \neq k',$$
where $\alpha_R$ and $\alpha_G$ are the proportionality factors of the red and green dyes respectively, and $\varepsilon^{(R)}$ and $\varepsilon^{(G)}$ are random error terms. Then the logarithmic ratio is
$$Y = \log_2\frac{R}{G} = \log_2\frac{\alpha_R}{\alpha_G} + \log_2\frac{\mu_k}{\mu_{k'}} + \left(\varepsilon^{(R)} - \varepsilon^{(G)}\right) = \delta + \log_2\frac{\mu_k}{\mu_{k'}} + \varepsilon,$$
where the parameter $\delta = \log_2(\alpha_R/\alpha_G)$ represents the dye effect, i.e., the relative efficiency of Cy3 and Cy5. If we let $\bar\mu = (\mu_1\mu_2\cdots\mu_V)^{1/V}$ be the geometric average of the true expression levels $\mu_1, \mu_2, \ldots, \mu_V$ of the V varieties, then
$$Y = \delta + \log_2\frac{\mu_k}{\bar\mu} - \log_2\frac{\mu_{k'}}{\bar\mu} + \varepsilon = \delta + \tau_k - \tau_{k'} + \varepsilon,$$
where the effects of interest, $\tau_k = \log_2(\mu_k/\bar\mu)$ for $k = 1, \ldots, V$, reflect differences in expression for a particular variety and gene combination that are not explained by the average effects of those varieties and genes. It is easy to see that
$$\sum_{k=1}^{V}\tau_k = 0.$$
Among them, there are only V-1 independent parameters $\tau_1, \tau_2, \ldots, \tau_{V-1}$, and $\tau_V = -(\tau_1 + \tau_2 + \cdots + \tau_{V-1})$. Furthermore, judging from the observations, it is reasonable to assume that the errors $\varepsilon_{(g)}$ are independent and normally distributed with mean zero and variance $\sigma^2_{(g)}$. This assumption is consistent with the general character of microarray data, and letting the variance depend on the gene avoids imposing a common error variance across genes, which makes the model more plausible.
For each gene, let $Y = (Y_1, \ldots, Y_n)^t$ denote the vector whose components are the n normalization-corrected log-ratios obtained from the corresponding spots on the S arrays. The ordered set of the independent parameters in the experiment is given by the parameter vector $\theta = (\tau_1, \ldots, \tau_{V-1}, \delta)^t$. Therefore, our model can be expressed in linear model form
$$Y = X\theta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),$$
where $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^t$ is the vector of independent errors and $X = (x_1, x_2, \ldots, x_n)^t$ is the $n \times V$ design matrix describing how the targets are paired onto arrays. For instance, if target $k$ (labeled with Cy5) and target $k'$ (labeled with Cy3), $k, k' \neq V$, are mixed and hybridized on array i, then $x_i$ is the vector
$$x_i = (0, \ldots, 0, \underbrace{1}_{k\text{-th}}, 0, \ldots, 0, \underbrace{-1}_{k'\text{-th}}, 0, \ldots, 0, 1)^t,$$
whose first V-1 entries correspond to $\tau_1, \tau_2, \ldots, \tau_{V-1}$ and whose last entry corresponds to $\delta$.
Another type of $x_i$ occurs when target $k \neq V$ (labeled with Cy5/Cy3) and target $k' = V$ (labeled with Cy3/Cy5) are mixed and hybridized on array i. It has the following form:
$$x_i = \begin{cases} (1, \ldots, 1, \underbrace{2}_{k\text{-th}}, 1, \ldots, 1, 1)^t, & \text{if } k' = V \text{ is labeled with Cy3},\\[4pt] (-1, \ldots, -1, \underbrace{-2}_{k\text{-th}}, -1, \ldots, -1, 1)^t, & \text{if } k' = V \text{ is labeled with Cy5}.\end{cases}$$
Here, $x_i$ is called a design point. The set $\mathcal{X}$ of all possible design points is called the regression range. The design matrix is constructed from design points chosen at the discretion of the experimenter, who can select them from the regression range $\mathcal{X}$. Therefore, a design matrix, or a "design", means choosing an arrangement of targets on arrays.
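The mapping from a list of hybridizations to the design matrix can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `design_row` and `design_matrix` are hypothetical, the parameter ordering $(\tau_1, \ldots, \tau_{V-1}, \delta)$ follows the text, and each array is given as a (tail, head) pair assuming the tail variety is labeled with Cy3 and the head variety with Cy5 (the arrow convention used in section 3.1).

```python
import numpy as np

def design_row(cy5, cy3, V):
    """Design point x_i for one array: variety `cy5` on Cy5, `cy3` on Cy3 (labels 1..V)."""
    # Coefficients in the redundant coordinates (tau_1, ..., tau_V): Y = delta + tau_cy5 - tau_cy3.
    z = np.zeros(V)
    z[cy5 - 1] += 1.0
    z[cy3 - 1] -= 1.0
    x = np.empty(V)
    # Substitute tau_V = -(tau_1 + ... + tau_{V-1}) to keep only independent parameters.
    x[:V - 1] = z[:V - 1] - z[V - 1]
    # The last entry is the coefficient of the dye effect delta, always 1.
    x[V - 1] = 1.0
    return x

def design_matrix(arrows, V):
    """Stack the design points of a list of (tail, head) arrows: tail = Cy3, head = Cy5."""
    return np.vstack([design_row(head, tail, V) for tail, head in arrows])

# Illustrative loop design on V = 5 varieties: 1 -> 2 -> 3 -> 4 -> 5 -> 1.
loop = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
X = design_matrix(loop, V=5)
print(X)
```

The rows involving variety V show the second type of design point: the arrow 4 -> 5 gives (-1, -1, -1, -2, 1) and the arrow 5 -> 1 gives (2, 1, 1, 1, 1).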
2.2 Design issue and $\phi_p$-optimality
In the above linear model, the least squares estimator of $\theta$ is $\hat\theta = (X'X)^{-1}X'Y$. An experimental question of interest can be described by a coefficient matrix K. According to the Gauss-Markov theorem, the best linear unbiased estimator of $K'\theta$ is $K'\hat\theta = K'(X'X)^{-1}X'Y$, which has dispersion matrix
$$\mathrm{V}(K'\hat\theta) = \sigma^2 K'(X'X)^{-1}K = \frac{\sigma^2}{n}K'M^{-}K = \frac{\sigma^2}{n}C_K^{-1}(M),$$
where $M = \frac{1}{n}X'X = \frac{1}{n}\sum_{i=1}^{n}x_ix_i'$ is called the moment matrix of the design, $M^{-}$ denotes some generalized inverse of M, and $C_K^{-1}(M) = K'M^{-}K$. To make the estimator $K'\hat\theta$ less dispersed, we should make $\frac{\sigma^2}{n}C_K^{-1}(M)$ as "small" as possible. The term $\frac{\sigma^2}{n}C_K^{-1}(M)$ can be separated into three parts: $\sigma^2$, $n$ and $C_K^{-1}(M)$. The first part, $\sigma^2$, is the variance of the experimental errors. The second part, $n$, is the number of arrays. The third part, $C_K^{-1}(M) = C_K^{-1}(\frac{1}{n}X'X)$, depends on the design. To diminish the dispersion of $K'\hat\theta$, we should decrease the variance of the experimental errors, increase the number of arrays, and select the design structure carefully.
Suppose the number of arrays is given and we have two candidate designs, both of which make $K'\theta$ estimable, with moment matrices $M_1$ and $M_2$ respectively. We prefer the former if $C_K^{-1}(M_1)$ is "smaller" than $C_K^{-1}(M_2)$ in some sense. The optimal design problem is how to choose a design with the "smallest" $C_K^{-1}(M)$. We need a criterion to measure the "smallness" of the matrix $C_K^{-1}(M)$.
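Suitable criteria are the $\phi_p$-optimality criteria. For a matrix $A \in PD(s)$, the set of all $s \times s$ positive definite matrices (here A stands for $C_K(M)$, and $s = V$ when $K = I$), $\phi_p(A)$ is defined as follows:
$$\phi_p(A) = \begin{cases} \lambda_{\max}(A) & \text{for } p = \infty,\\[2pt] \left(\frac{1}{s}\,\mathrm{trace}\,A^p\right)^{1/p} & \text{for } p \neq 0, \pm\infty,\\[2pt] (\det A)^{1/s} & \text{for } p = 0,\\[2pt] \lambda_{\min}(A) & \text{for } p = -\infty,\end{cases}$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of A. Let $A = P^t\Lambda P$ be the eigenvalue decomposition, with $p_j^t$ the j-th row of P. Then we obtain
$$A^p = P^t\Lambda^pP = \sum_{j \le s}\lambda_j^p\,p_jp_j^t \;\Rightarrow\; \mathrm{trace}\,A^p = \sum_{j \le s}\lambda_j^p\,\mathrm{trace}(p_jp_j^t) = \sum_{j \le s}\lambda_j^p.$$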
The most prominent and commonly used criteria are D-optimality and A-optimality.
Definition 2.2.1 A D-optimal design for $K'\theta$ is a design whose moment matrix $M^*$ attains the supremum of $\phi_0(C_K(M))$. That is,
$$M^* = \arg\sup_{M \in \mathcal{M}}\phi_0(C_K(M)) = \arg\sup_{M \in \mathcal{M}}\left(\det C_K(M)\right)^{1/s} = \arg\inf_{M \in \mathcal{M}}\det\left(C_K^{-1}(M)\right).$$
Definition 2.2.2 An A-optimal design for $K'\theta$ is a design whose moment matrix $M^*$ attains the supremum of $\phi_{-1}(C_K(M))$. That is,
$$M^* = \arg\sup_{M \in \mathcal{M}}\phi_{-1}(C_K(M)) = \arg\sup_{M \in \mathcal{M}}\left(\frac{1}{s}\,\mathrm{trace}\,C_K^{-1}(M)\right)^{-1} = \arg\inf_{M \in \mathcal{M}}\mathrm{trace}\left(C_K^{-1}(M)\right).$$
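The matrix means behind these definitions can be evaluated numerically from the eigenvalues. The following is a small sketch under the definition of $\phi_p$ given above; the function name `phi_p` is hypothetical and the input is assumed to be a symmetric positive definite matrix.

```python
import numpy as np

def phi_p(A, p):
    """Matrix mean phi_p of a symmetric positive definite matrix A."""
    lam = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
    if p == np.inf:
        return float(lam.max())          # largest eigenvalue
    if p == -np.inf:
        return float(lam.min())          # smallest eigenvalue (E-criterion)
    if p == 0:
        # (det A)^(1/s), computed through logs for numerical stability (D-criterion).
        return float(np.exp(np.mean(np.log(lam))))
    # ((1/s) trace A^p)^(1/p); p = -1 gives the A-criterion.
    return float(np.mean(lam ** p) ** (1.0 / p))

A = np.diag([1.0, 4.0])
print(phi_p(A, 0), phi_p(A, -1), phi_p(A, 1))
```

For the diagonal example, $\phi_0$ is the geometric mean of the eigenvalues, $\phi_{-1}$ their harmonic mean, and $\phi_1$ their arithmetic mean.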
A well-known result that points the way to the theoretical upper bound is the mutual boundedness theorem (see Pukelsheim 1993):
Let M and N be $V \times V$ positive definite matrices, where $M \in \mathcal{M}$ is a competing moment matrix and $N \in \mathcal{N} \equiv \{A : x^tAx \le 1,\ x \in \mathcal{X}\}$ belongs to a cylinder set. Then
$$\sup_{M \in \mathcal{M}}\phi_p(C_K(M)) \le \inf_{N \in \mathcal{N}}\frac{1}{\phi_p^\infty(K'NK)}, \quad \text{where } \phi_p^\infty = V\phi_q,\ p, q \in [-\infty, 1],\ p + q = pq.$$
The right-hand side of the inequality, $\inf_{N \in \mathcal{N}} 1/\phi_p^\infty(K'NK)$, is a theoretical upper bound. If we can find $M^*$ and $N^*$ such that $\phi_p(C_K(M^*)) = 1/\phi_p^\infty(K'N^*K)$, then $M^*$ is the moment matrix of an optimal design, and we say that the corresponding design is $\phi_p$-optimal for $K'\theta$.
2.3 The theoretical upper bound in log Ratio model
In this section, we consider the situation where all parameters are equally important. We take K to be I, the identity matrix, and use D-optimality as the criterion. In this case,
$$\phi_0^\infty(K'NK) = \phi_0^\infty(N) = V\phi_0(N) = V(\det N)^{1/V} = V\left(\prod_{i=1}^{V}\lambda_i\right)^{1/V},$$
where $\lambda_i$, $i = 1, 2, \ldots, V$, are the eigenvalues of N. Geometrically, $\prod_{i=1}^{V}\lambda_i^{-1/2}$ is the volume (up to a fixed factor) of the ellipsoid $\{x \in \mathbb{R}^V : x^tNx \le 1\}$. Therefore, finding $\inf_{N \in \mathcal{N}} 1/\phi_0^\infty(N)$ amounts to finding an ellipsoid with the minimum volume among the collection of all ellipsoids that cover the regression range $\mathcal{X}$.
From the geometry of the regression range $\mathcal{X}$, we consider those ellipsoids that have $(1, \ldots, 1, 0)^t$ and $(0, \ldots, 0, 1)^t$ as two of their principal axes. To determine the other V-2 axes, we add V-2 independent vectors of the form $(0, \ldots, 0, 1, -1, 0, \ldots, 0)^t$ to them and obtain the following matrix A. We then apply the Gram-Schmidt process to the column vectors of A and reach an orthogonal matrix P:
$$A = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 & 0\\ 1 & -1 & 1 & \cdots & 0 & 0\\ 1 & 0 & -1 & \ddots & \vdots & \vdots\\ \vdots & \vdots & \vdots & \ddots & 1 & 0\\ 1 & 0 & 0 & \cdots & -1 & 0\\ 0 & 0 & 0 & \cdots & 0 & 1 \end{pmatrix} \xrightarrow{\ \text{Gram-Schmidt process}\ } P = (p_1, p_2, \ldots, p_{V-1}, p_V),$$
where the first column is $p_1 = \frac{1}{\sqrt{V-1}}(1, \ldots, 1, 0)^t$, the last column is $p_V = (0, \ldots, 0, 1)^t$, and the k-th column is
$$p_k = \Big(\tfrac{1}{\sqrt{(k-1)k}}, \ldots, \tfrac{1}{\sqrt{(k-1)k}}, \underbrace{-\sqrt{\tfrac{k-1}{k}}}_{k\text{-th}}, 0, \ldots, 0\Big)^t, \quad k = 2, 3, \ldots, V-1.$$
Let $N^* = P\Lambda P^t$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_2, \lambda_3)$ is a $V \times V$ diagonal matrix. The diagonal entries $\lambda_1$, $\lambda_2$ and $\lambda_3$ are to be determined by the constrained minimization problem:
$$\min\ \lambda_1^{-1/2}\left(\lambda_2^{-1/2}\right)^{V-2}\lambda_3^{-1/2}, \quad \text{subject to } x^tN^*x \le 1,\ x \in \mathcal{X}.$$
Note that the objective function in this problem is the volume (up to a factor) of the ellipsoid $\{x : x^tN^*x \le 1\}$. The solution of the above minimization problem is $\lambda_1 = \frac{V-1}{2V^2}$, $\lambda_2 = \frac{V-1}{2V}$, $\lambda_3 = \frac{1}{V}$. Thus,
$$\inf_{N \in \mathcal{N}}\frac{1}{\phi_0^\infty(N)} = \inf_{N \in \mathcal{N}}\frac{1}{V(\det N)^{1/V}} = \binom{V}{2}^{1/V}\Big/\frac{V-1}{2}$$
is the theoretical upper bound for the optimal design under the log Ratio model, and this upper bound can be reached (see section 3).
THEOREM. Consider comparing the gene expression levels of V varieties in a microarray experiment under the log Ratio model. If all varieties are of equal interest, then the theoretical upper bound of the optimality criterion over all possible experimental designs is
$$\sup_{M \in \mathcal{M}}\phi_0(M) = \inf_{N \in \mathcal{N}}\frac{1}{\phi_0^\infty(N)} = \binom{V}{2}^{1/V}\Big/\frac{V-1}{2}. \qquad \square$$
Using this upper bound, we can measure the distance between any chosen design and the optimal design.
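As a check on the quantities in the theorem, the following short verification may help. It assumes the eigenvalues $\lambda_1 = \frac{V-1}{2V^2}$, $\lambda_2 = \frac{V-1}{2V}$, $\lambda_3 = \frac{1}{V}$ and writes each design point as $x = (u^t, 1)^t$ with $u \in \mathbb{R}^{V-1}$; the symbols $u$ and $\mathbf{1}$ (the vector of ones in $\mathbb{R}^{V-1}$) are introduced here only for illustration.

```latex
\begin{align*}
x^t N^* x &= \lambda_1\,\frac{(\mathbf{1}^t u)^2}{V-1}
           + \lambda_2\left(\|u\|^2 - \frac{(\mathbf{1}^t u)^2}{V-1}\right)
           + \lambda_3 . \\[4pt]
% First type of design point: k, k' \neq V
u = e_k - e_{k'}:\quad
  & \mathbf{1}^t u = 0,\ \|u\|^2 = 2
  \;\Rightarrow\; x^t N^* x = 2\lambda_2 + \lambda_3
  = \frac{V-1}{V} + \frac{1}{V} = 1 . \\[4pt]
% Second type of design point: k' = V
u = \pm(\mathbf{1} + e_k):\quad
  & (\mathbf{1}^t u)^2 = V^2,\ \|u\|^2 = V+2
  \;\Rightarrow\; x^t N^* x
  = \lambda_1\frac{V^2}{V-1} + \lambda_2\frac{V-2}{V-1} + \lambda_3
  = \frac{1}{2} + \frac{V-2}{2V} + \frac{1}{V} = 1 . \\[4pt]
\det N^* &= \lambda_1\,\lambda_2^{\,V-2}\,\lambda_3
          = \frac{(V-1)^{V-1}}{2^{\,V-1}\,V^{\,V+1}} , \\[4pt]
\frac{1}{V(\det N^*)^{1/V}}
  &= V^{1/V}\left(\frac{2}{V-1}\right)^{(V-1)/V}
   = \frac{2}{V-1}\binom{V}{2}^{1/V} .
\end{align*}
```

Every design point therefore lies on the boundary of the ellipsoid $\{x : x^tN^*x \le 1\}$, which is what one would expect of a minimum-volume covering ellipsoid. For V = 2 the bound equals 2 and for V = 3 it equals $3^{1/3}$.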
2.4 $\phi_p(K)$-efficiency
The appropriate notion of efficiency is the following.
Definition. The $\phi_p(K)$-efficiency of a design with moment matrix M is defined by
$$\phi_p(K)\text{-efficiency} = \frac{\phi_p(C_K(M))}{v_p(K)}, \quad \text{where } v_p(K) = \inf_{N \in \mathcal{N}}\frac{1}{\phi_p^\infty(K'NK)}.$$
The $\phi_p(K)$-efficiency lies between 0 and 1 (i.e., $0 \le \phi_p(K)\text{-efficiency} \le 1$). Let $\phi_0$-efficiency denote the $\phi_0(I)$-efficiency. Since the upper bound can be reached in this case, the $\phi_0$-efficiency can be rewritten as follows:
$$\phi_0\text{-efficiency} = \frac{\phi_0(M)}{\sup_{M \in \mathcal{M}}\phi_0(M)}, \quad \text{where } \sup_{M \in \mathcal{M}}\phi_0(M) = \inf_{N \in \mathcal{N}}\frac{1}{\phi_0^\infty(N)}.$$
We can then use the $\phi_0$-efficiency to evaluate whether a design is efficient. In practical applications, a design whose $\phi_0$-efficiency reaches 0.8 to 0.9 can be considered an efficient design.
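The efficiency of a concrete design can be computed directly from its design matrix. The sketch below is illustrative only: the function names are hypothetical, designs are given as (tail, head) arrows with the tail on Cy3 and the head on Cy5, and the upper bound is taken in the closed form $\frac{2}{V-1}\binom{V}{2}^{1/V}$ from section 2.3.

```python
import numpy as np
from math import comb

def design_matrix(arrows, V):
    """Rows x_i for (tail, head) arrows; parameters are (tau_1..tau_{V-1}, delta)."""
    X = np.zeros((len(arrows), V))
    for i, (tail, head) in enumerate(arrows):
        z = np.zeros(V)                        # coefficients of tau_1..tau_V
        z[head - 1] += 1.0                     # Cy5 variety
        z[tail - 1] -= 1.0                     # Cy3 variety
        X[i, :V - 1] = z[:V - 1] - z[V - 1]    # tau_V = -(tau_1 + ... + tau_{V-1})
        X[i, V - 1] = 1.0                      # dye effect delta
    return X

def phi0(X):
    """phi_0 of the moment matrix M = X'X / n, i.e. (det M)^(1/V)."""
    n, V = X.shape
    sign, logdet = np.linalg.slogdet(X.T @ X / n)
    return float(np.exp(logdet / V)) if sign > 0 else 0.0

def phi0_upper_bound(V):
    """Theoretical upper bound of phi_0 under the log Ratio model."""
    return 2.0 / (V - 1) * comb(V, 2) ** (1.0 / V)

def phi0_efficiency(arrows, V):
    return phi0(design_matrix(arrows, V)) / phi0_upper_bound(V)

# Complete directed graph on 5 varieties: every ordered pair once, V(V-1) arrays.
V = 5
complete = [(a, b) for a in range(1, V + 1) for b in range(1, V + 1) if a != b]
# Loop design 1 -> 2 -> ... -> V -> 1 with S = V arrays.
loop = [(i, i % V + 1) for i in range(1, V + 1)]

print(round(phi0_efficiency(complete, V), 4))
print(round(phi0_efficiency(loop, V), 4))
```

The complete design attains the bound (efficiency 1 up to rounding), while the loop design trades efficiency for a much smaller number of arrays.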
3 Construction of an efficient design
Because microarrays are expensive, small designs with high efficiency are of great interest. In addition, for larger V, we recommend a method for constructing efficient designs.
3.1 Graphical representation of experimental designs and a search for optimal designs
There are important connections between the construction of an optimal design and its graphical representation, so we first explain what the graphical representation means. One way to describe a microarray experiment is to use a directed graph (see Yang, Y. H. and Speed, T. 2002). The following three graphs, each consisting of nodes and edges, illustrate three experimental designs:
The nodes correspond to target mRNA samples, and the edges (arrows) correspond to hybridizations between two mRNA samples (varieties). The green dye (Cy3) is at the tail and the red dye (Cy5) is at the head of the arrow. In other words, one arrow represents two mRNA samples that are hybridized together on the same slide. One directed graph represents an entire microarray experimental design and, at the same time, a design matrix under the log Ratio model. Graphs 2 and 3 represent the S=V and S=V+1 optimal designs (those whose $\phi_0$-efficiency attains the maximum), whose design matrices are described below:
$$X = \begin{pmatrix} -1 & 1 & 0 & 0 & 1\\ 0 & -1 & 1 & 0 & 1\\ 0 & 0 & -1 & 1 & 1\\ -1 & -1 & -1 & -2 & 1\\ 2 & 1 & 1 & 1 & 1 \end{pmatrix}$$
for the S=V design on V=5 varieties; the design matrix of the S=V+1 design consists of the same five rows together with one further row for the additional array.
On the other hand, we now answer the question left open in section 2. The question is whether the theoretical upper bound is reached when the budget allows an unconstrained number of slides. Equivalently, we ask whether the following equality holds under the log Ratio model (when the equality holds, $\phi_0$-efficiency $= 1$):
$$\sup_{M \in \mathcal{M}}\phi_0(M) = \inf_{N \in \mathcal{N}}\frac{1}{\phi_0^\infty(N)}.$$
The answer is that the theoretical upper bound ($\phi_0$-optimality, i.e., D-optimality) is reached when the graphical representation of the experimental design is a complete graph. The complete directed graph is composed of $2\binom{V}{2} = V(V-1)$ arrows. However, when the dye effect can be arranged in a balanced way, so that the graph is an even graph (i.e., every node ends up with the same number of arrow heads and tails, and the varieties are balanced with respect to dye), only half of these runs are needed to reach the upper bound. We illustrate the point with the following figure.
[Figure: three example designs with $\phi_0$-eff($\theta$) = 1, 0.9554 and 1.]
However, as already mentioned, microarrays are expensive. To conserve the slide budget, using $\binom{V}{2}$ or V(V-1) slides to construct the optimal design is too costly. Experimenters are therefore less interested in the optimal design under an unconstrained number of slides, and most interested in the optimal design under a fixed number of slides (such as an S=V+k or S=2V design) and in knowing whether the smaller designs are efficient enough. We discuss this problem in the next section.
Recall from the start of this section that a directed graph represents both an arrangement of the experimental design and a design matrix. Searching over all possible design matrices is therefore equivalent to searching over all possible graph structures. Since all parameters must be estimable, the graph must be connected. The challenge is to determine the graph structure. Here, we search for the optimal design under a fixed number of slides using a program written in C.
The results are illustrated as follows:
[Figures: optimal designs for <V=7, S=10>, <V=9, S=12> and <V=14, S=16>.]
Other optimal designs are given in the appendix. In figure [1], we consider S=V+2 optimal designs, because S=V+1 optimal designs are easy to guess within a few trials. In figure [2], we provide S=2V optimal designs, because these designs have the property of containing an Eulerian circuit.
3.2 A simple method to obtain an efficient design
Obviously, however the graph structure of the design is taken into account, searching all possible designs by computer is infeasible for large V. For example, suppose we search for the S=2V optimal design by brute force, without exploiting the graph structure. The program becomes simpler, but it is unworkable because the number of possible design matrices is $\binom{V(V-1)}{2V}$. When V equals 12, this is an astronomical figure, approximately 1.360612e+026. Even if the special properties of the graph structure are exploited, the search remains computationally infeasible for large V. We therefore propose two simple methods for obtaining highly efficient designs. With these two methods, we obtain designs that approach the optimal design and are sometimes optimal.
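The size of the brute-force search space can be reproduced with exact integer arithmetic. This is a small illustrative sketch; the function name `brute_force_count` is hypothetical, and the count assumes that S = kV distinct arrows are chosen from the V(V-1) possible directed pairs.

```python
from math import comb

def brute_force_count(V, k=2):
    """Number of ways to choose S = k*V arrows among the V(V-1) ordered pairs."""
    return comb(V * (V - 1), k * V)

for V in (6, 8, 10, 12):
    print(V, f"{brute_force_count(V):.6e}")
```

The V = 12 line can be compared with the figure quoted in the text; the rapid growth with V is what makes exhaustive search impractical.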
When we carefully observed the behavior of S=2V designs, we found an important property: if two designs have the same graph structure, then their algebraic structures are isomorphic; that is, they have the same $\phi_0$-eff($\theta$) and variance-covariance matrix. On the other hand, two designs can be algebraically isomorphic even though they cannot be represented by an identical graph structure. An example is the V=6, S=2V optimal design:
[Figure: two V=6, S=2V designs with $\phi_0$-eff($\theta$) = 0.9834; their $\phi_0$-eff($\theta$) and variance-covariance matrices are the same.]
Both graphs above are optimal designs, but the left one has a more traceable pattern: it consists of rotations of two polygons (a triangle and a hexagon). We can therefore use this result to construct S=kV efficient designs in general.
Let W be the set of factors of V that are less than $\frac{V-1}{2}$, and let Q be the number of elements of W. One can then pick k factors in $\binom{Q}{k}$ ways. Each combination $w = \{w_1, w_2, \ldots, w_k\}$ represents a particular S=kV design, whose graphical representation is constructed by rotating the polygons with $\frac{V}{w_1}, \frac{V}{w_2}, \ldots, \frac{V}{w_k}$ sides $w_1, w_2, \ldots, w_k$ times respectively. It is a prerequisite, however, that all parameters be estimable (i.e., $(X'X)^{-1}$ exists). For instance, when V=12 and k=2 (an S=2V design), W={1,2,3,4} and we have $\binom{4}{2} = 6$ selections. If we select w={2,3}, we obtain the design constructed by rotating the hexagon two times and the square three times. This design coincides with the optimal design (see appendix).
[Figure: V=12, S=2V designs with $\phi_0$-eff($\theta$) = 0.9392; they have the same $\phi_0$-eff($\theta$) and variance-covariance matrix.]
As further examples, we show S=2V and S=3V designs for V=16.
[Figure: V=16 designs: w={1,4} with $\phi_0$-eff($\theta$) = 0.9161, and w={1,2,4} with $\phi_0$-eff($\theta$) = 0.9459.]
The preceding graphs demonstrate that k must be raised appropriately as V increases. For V=16, the $\phi_0$-eff($\theta$) of the S=V optimal design is 0.6951, so we would regard the S=2V design, whose $\phi_0$-eff($\theta$) is 0.9161, as an efficient design. We did not choose the S=3V design because its improvement over the S=2V case is not substantial.
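The rotation construction can be sketched as follows. The reading assumed here is that each factor $w_j$ contributes the V arrows $i \to i + w_j \pmod V$, which form $w_j$ rotated copies of a polygon with $V/w_j$ sides; the function names are hypothetical, and the efficiency uses the closed-form upper bound $\frac{2}{V-1}\binom{V}{2}^{1/V}$.

```python
import numpy as np
from math import comb

def rotation_design(V, w):
    """Arrows (tail, head) with head = tail + step (mod V) for every step in w."""
    return [(i, (i + step - 1) % V + 1) for step in w for i in range(1, V + 1)]

def design_matrix(arrows, V):
    """Rows x_i for (tail, head) arrows: tail on Cy3, head on Cy5."""
    X = np.zeros((len(arrows), V))
    for i, (tail, head) in enumerate(arrows):
        z = np.zeros(V)
        z[head - 1] += 1.0
        z[tail - 1] -= 1.0
        X[i, :V - 1] = z[:V - 1] - z[V - 1]   # eliminate tau_V
        X[i, V - 1] = 1.0                     # dye effect
    return X

def phi0_efficiency(arrows, V):
    """phi_0(M) divided by the theoretical upper bound; 0 if not estimable."""
    X = design_matrix(arrows, V)
    sign, logdet = np.linalg.slogdet(X.T @ X / len(arrows))
    if sign <= 0:
        return 0.0                            # singular: some parameter not estimable
    bound = 2.0 / (V - 1) * comb(V, 2) ** (1.0 / V)
    return float(np.exp(logdet / V)) / bound

# Cases discussed in the text (reported values: 0.9392, 0.6951, 0.9161, 0.9459).
for V, w in [(12, (2, 3)), (16, (1,)), (16, (1, 4)), (16, (1, 2, 4))]:
    print(V, w, round(phi0_efficiency(rotation_design(V, w), V), 4))
```

A combination such as w = (2, 4) for V = 12 yields a disconnected graph, and the function returns 0 because the parameters are then not estimable.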
Further, when V=2p is even, S=kV can be rewritten as $S = 2(k-1)\cdot\frac{V}{2} + V = j\cdot\frac{V}{2} + V$ with $j = 2(k-1)$. This suggests another construction method for S=kV efficient designs. We first construct an S=V loop design. Inside the loop, we draw j vertical symmetric lines and rotate them $\frac{V}{2}$ times, where $k \le \frac{V-1}{4} + 1$. The graph below illustrates the V=10, S=3V design (k=3, so j=2(k-1)=4: draw four vertical symmetric lines and rotate them five times). This method also has high efficiency; for example, the designs it gives for V=6, 8, 10 are optimal.
[Figure: example design with $\phi_0$-eff($\theta$) = 0.9834.]
Generally speaking, to construct an efficient design, two essential rules must be observed: [1] Balance all factors, so that the effects of interest are not completely confounded with the variation of other effects and the parameters of interest are estimable. [2] The number of arrows on the path from any node to any other node in the graph should be as small as possible. Designs that follow these two rules have symmetric graph structures. The two simple methods we propose for constructing S=kV efficient designs satisfy these requirements and achieve high efficiency. A further important consequence is that we can search for highly efficient designs within a small search set: the number of combinations to search is reduced from $\binom{V(V-1)}{kV}$ to $\binom{Q}{k}$ or $\binom{Q}{k}+1$.
4 Data Analysis
Before analyzing the gene expression data, we must discuss something of vital importance: the connection between the selection of the optimal design and the ultimate aim of testing a general linear hypothesis.
We give an example to illustrate the point. If the ultimate aim is to test $H_0: l'\theta = r$, then we should construct the optimal design according to $\phi_0(C_l(M))$-optimality, which minimizes $l'(X'X)^{-1}l$. For instance, the t-statistic for testing a hypothesis on $l'\theta$ is
$$t = \frac{l'\hat\theta - r}{\sqrt{s^2\,l'(X'X)^{-1}l}}.$$
A $\phi_0(C_l(M))$-optimal design makes this statistic more sensitive in detecting significant cases. In general, if the ultimate aim is to test M hypotheses simultaneously and sensitively, that is, $H_0: \sum_{i=1}^{V}l_{im}\theta_i = l_m'\theta = r_m$, $m = 1, \ldots, M$, or in matrix form $H_0: K'\theta = r$, we should construct the optimal design according to $\phi_0(C_K(M))$-optimality. Accordingly, attention should be paid to the form of the optimal design criterion, because it corresponds to the estimators of interest and to the ultimate aim of testing a general linear hypothesis.
To be useful in applications, we briefly review the problem of testing a general linear hypothesis and use the experimental design of Oleksiak et al. (2002) to demonstrate how to calculate the test statistic under the log Ratio model. Recall that under the normalized log Ratio model, for a specific gene, the general linear model can be written as
$$Y_{n \times 1} = X_{n \times V}\,\theta_{V \times 1} + \varepsilon_{n \times 1}, \qquad Y \sim N(X\theta, \sigma^2I), \qquad \mathrm{rank}(X) = V.$$
The general linear hypothesis of interest is
$$H_0: K'\theta = r \quad \text{vs} \quad H_1: K'\theta \neq r, \qquad \mathrm{rank}(K) = V - s.$$
Here V-s means that the M linear hypotheses comprise exactly V-s linear restrictions. When s>0 (the case s=0 reduces to the simple ANOVA case), we choose an $s \times V$ matrix $G'$ complementary to $K'$ such that the $V \times V$ matrix $(G, K)'$ is regular of rank V. Let
$$Y = X\theta + \varepsilon = \tilde X_1\tilde\theta_1 + \tilde X_2\tilde\theta_2 + \varepsilon,$$
where $(\tilde X_1, \tilde X_2) = X\,[(G, K)']^{-1}$, with $\tilde X_1$ of size $n \times s$ and $\tilde X_2$ of size $n \times (V-s)$, and $(\tilde\theta_1', \tilde\theta_2')' = (G, K)'\theta$.
The test can therefore be recast as $H_0: \tilde\theta_2 = r$; $\tilde\theta_1$ and $\sigma^2 > 0$ arbitrary vs $H_1: \tilde\theta_2 \neq r$; $\tilde\theta_1$ and $\sigma^2 > 0$ arbitrary, and the test statistic is
$$F = \frac{(b_2 - r)'D(b_2 - r)\,(n - V)}{(Y - Xb)'(Y - Xb)\,(V - s)},$$
where $b = (X'X)^{-1}X'Y$, $b_2 = D^{-1}\tilde X_2'M_1Y$, $D = \tilde X_2'M_1\tilde X_2$ and $M_1 = I - \tilde X_1(\tilde X_1'\tilde X_1)^{-1}\tilde X_1' = I - P_{\tilde X_1}$. Correspondingly, the rejection region of $H_0$ is given by $F \ge F(V-s, n-V)$, the critical value of the F distribution with $(V-s, n-V)$ degrees of freedom (see Rao & Toutenburg 1999).
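For computation, the F statistic of a general linear hypothesis is often evaluated in the familiar Wald form rather than through the partition into $\tilde X_1$ and $\tilde X_2$. The sketch below uses that standard form as an assumption, not as the procedure of the text; the function name `linear_hypothesis_F` is hypothetical, and the restriction matrix `Kt` (playing the role of $K'$) is assumed to have full row rank.

```python
import numpy as np

def linear_hypothesis_F(X, y, Kt, r):
    """F statistic and degrees of freedom for H0: Kt @ theta = r in y = X theta + eps."""
    n, V = X.shape
    q = int(np.linalg.matrix_rank(Kt))        # number of independent restrictions
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                     # least squares estimate of theta
    resid = y - X @ b
    s2 = resid @ resid / (n - V)              # residual variance estimate
    d = Kt @ b - r                            # estimated departure from H0
    F = d @ np.linalg.solve(Kt @ XtX_inv @ Kt.T, d) / (q * s2)
    return float(F), q, n - V

# Synthetic illustration (made-up data, not from any experiment).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta = np.array([1.0, 2.0, 3.0])
y = X @ theta + 0.01 * rng.normal(size=20)
Kt = np.array([[1.0, -1.0, 0.0]])            # tests theta_1 - theta_2 = r
F_stat, df1, df2 = linear_hypothesis_F(X, y, Kt, np.array([0.0]))
print(F_stat, df1, df2)
```

The returned statistic would then be compared with the F distribution with (df1, df2) degrees of freedom; in the notation of the text, df1 = V - s and df2 = n - V.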
The graphical representation of the design used by Oleksiak et al. (2002) can be found in their paper. They used an S=2V (S=30) efficient loop design for the microarray experiment and examined 15 individuals (5 each from northern and southern populations of Fundulus heteroclitus and 5 from the sister taxon Fundulus grandis) in order to determine the variation in gene expression within and between populations. Their ultimate experimental objective is to test the null hypothesis of no individual variation: under the null hypothesis, treatment takes one of three values corresponding to the populations (northern F. heteroclitus, southern F. heteroclitus or F. grandis), and under the alternative hypothesis, treatment takes 15 distinct values, one for each individual. Under the log Ratio model, the hypothesis is therefore equivalent to
$$H_0: \tau_1 = \tau_4 = \cdots = \tau_{13},\ \tau_2 = \tau_5 = \cdots = \tau_{14},\ \tau_3 = \tau_6 = \cdots = \tau_{15} \quad \text{vs} \quad H_1: \text{not } H_0,$$
that is, $H_0: K'\theta = r$ vs $H_1: K'\theta \neq r$, where the form of $K'$ refers to the matrix (section 2.3). Hence, based on the microarray data and the test rule above, we can calculate the F-statistic for each specific gene under the normalized log Ratio model and compare it with the $F(12, n_g - 15)$ distribution. We then obtain, for each gene, the p-value used to detect significant cases.
In principle, one can formulate numerous hypotheses, each with its own genetic meaning, and test them by constructing $K'$ appropriately. In particular, the comparisons between all pairs of varieties, i.e., all elementary contrasts, are regularly of interest to biologists. The corresponding form of $K'$ is shown below:
$$L' = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0\\ 1 & 0 & -1 & \cdots & 0 & 0\\ \vdots & & & & & \vdots\\ 1 & 0 & 0 & \cdots & -1 & 0\\ 2 & 1 & 1 & \cdots & 1 & 0\\ 0 & 1 & -1 & \cdots & 0 & 0\\ \vdots & & & & & \vdots\\ 1 & 2 & 1 & \cdots & 1 & 0\\ \vdots & & & & & \vdots\\ 1 & 1 & 1 & \cdots & 2 & 0 \end{pmatrix}_{\frac{V(V-1)}{2} \times V}$$
Thus, one can test for which genes the elementary contrasts are jointly significant. For specific genes, if the result is significant, one can use a multiple comparison method to trace which elementary contrasts differ significantly.
Finally, by fixing the significance threshold of the hypothesis test, we obtain the particular genes that differ significantly among individuals. We can record those genes, or their overall percentage, to explore their genetic meaning or to provide evidence for some characteristic or evolutionary hypothesis.
5 Discussion
In this section, we compare the similarities and differences between the ANOVA model of Kerr et al. (2000) and the log Ratio model we propose. We first discuss the rationality of the log Ratio model from the viewpoint of the ANOVA model, together with the viewpoint from which an optimal design is judged. We then discuss the relationship between the optimality criteria of the two models.
5.1 The rationality of model establishment and the judgment viewpoint of
optimal design
Kerr et al. (2001) built the ANOVA model according to the experimental factors. They identify four basic factors: varieties, genes, dyes and arrays, which are represented by $V_{k(i,j)}$, $G_g$, $D_j$ and $A_i$. Let $z_{ijk(i,j)gr}$ denote the fluorescent intensity from array i and dye j, representing variety k and gene g. The model is
$$w_{ijk(i,j)gr} = \log_2 z_{ijk(i,j)gr} = \underbrace{m + A_i + D_j + (AD)_{ij}}_{\text{global effects}} + \underbrace{G_g + (AG)_{ig} + (DG)_{jg} + (VG)_{k(i,j)g}}_{\text{gene-specific effects}} + \varepsilon_{ijk(i,j)gr}.$$
Based on the model of Kerr et al., the log ratio can be expressed in the form
$$\begin{aligned} \log_2(\mathrm{Ratio}_{igr}) &= \log_2\frac{z_{i2k(i,2)gr}}{z_{i1k(i,1)gr}}\\ &= (D_2 - D_1) + \left[(AD)_{i2} - (AD)_{i1}\right] + \left[(DG)_{2g} - (DG)_{1g}\right] + \left[(VG)_{k(i,2)g} - (VG)_{k(i,1)g}\right] + \left(\varepsilon_{i2k(i,2)gr} - \varepsilon_{i1k(i,1)gr}\right)\\ &= \delta + \delta_i + \delta_g + \left[\tau_{k(i,2)g} - \tau_{k(i,1)g}\right] + \varepsilon_{gr(i)}\\ &= C_i + \delta_g + \left(\tau_{k_2g} - \tau_{k_1g}\right) + \varepsilon_{gr}, \end{aligned}$$
where $C_i = \delta + \delta_i$.
On the other hand, the full log Ratio model for the raw data, before the normalization procedure, can be expressed as below:
$$M_{gr}^{(\mathrm{normalized})} = \frac{\log_2\mathrm{Ratio}_{igr} - \log_2\kappa_i(A)}{\nu_i} = \delta_g + \tau_{kg} - \tau_{k'g} + \varepsilon_{gr},$$
or, equivalently,
$$\frac{1}{\nu_i}\log_2(\mathrm{Ratio}_{igr}) = \frac{1}{\nu_i}\log_2\kappa_i(A) + \delta_g + \tau_{kg} - \tau_{k'g} + \varepsilon_{gr},$$
where $\mathrm{Ratio}_{igr} = R^{(\mathrm{raw})}_{igr}/G^{(\mathrm{raw})}_{igr}$, $A = \log_2\sqrt{R^{(\mathrm{raw})}_{igr}G^{(\mathrm{raw})}_{igr}}$, and $R^{(\mathrm{raw})}$ and $G^{(\mathrm{raw})}$ represent the raw red and green fluorescent intensities. The location term $\log_2\kappa_i(A)$, which allows different estimates of the mean for different intensity ranges within array i, can be estimated by the local regression intensity-dependent normalization method (lowess fit), and the scale parameter $\nu_i$ can be estimated by the scale normalization method (median absolute deviation, MAD) (see Yang et al. (2002)).
Therefore, the log ratio derived from the ANOVA model has the same form as the log Ratio model when the scale parameter of the log Ratio model is $\nu_i = 1$ and the location parameter is $\log_2\kappa_i(A) = C_i$, which is estimated by the mean of the log ratios of the i-th array:
$$\hat C_i = \hat\delta + \hat\delta_i = \overline{\log_2\!\left(\frac{z_{i2k(i,2)}}{z_{i1k(i,1)}}\right)} = \overline{\log_2\!\left(\frac{R}{G}\right)}_i = \log_2\kappa_i(A), \quad \forall i,$$
where the bar denotes the average over the spots of array i.
Thus the log Ratio model is rational even from the viewpoint of the ANOVA model. Note that setting $\nu_i = 1$ means that the data are not scale-normalized. Moreover, if we also assume that $\log_2\kappa_i(A) = C_i$ is a constant factor estimated by the mean of the log ratios of the i-th array, then the procedure reduces to global normalization (see Yang et al. (2002)).
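Global normalization as described here, subtracting from each array the mean of its own log ratios, can be sketched as follows. The function name `global_normalize` and the array layout (one row per array, one column per spot) are assumptions made for illustration.

```python
import numpy as np

def global_normalize(R, G):
    """Subtract from each array (row) the mean of its log2(R/G) values.

    R, G: positive intensity arrays of shape (number of arrays, number of spots).
    Returns the normalized log ratios and the estimated constants C_i.
    """
    M = np.log2(R / G)                    # raw log ratios, one row per array
    C = M.mean(axis=1, keepdims=True)     # C_i: mean log ratio of the i-th array
    return M - C, C.ravel()

# Synthetic illustration (made-up intensities).
rng = np.random.default_rng(1)
R = rng.uniform(100.0, 1000.0, size=(3, 50))
G = rng.uniform(100.0, 1000.0, size=(3, 50))
M_norm, C_hat = global_normalize(R, G)
print(C_hat)
```

After this step, every array has mean log ratio zero, which corresponds to taking the scale parameter equal to 1 and a constant location term per array.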
In addition, to compare our criterion for an optimal design with that of Kerr et al.,
we must first analyze the structure of the design space. Under the ANOVA model it has
the following essential properties:
[1] A is orthogonal to D.
[2] A and V are partially confounded when V>2.
[3] D and V can be completely confounded, partially confounded or orthogonal,
depending on the design selected. If D and V are balanced, then D is orthogonal to V.
[4] (DV), (AV) and (AD) are confounded with A, D and V respectively.
[5] G is orthogonal to A, D and V. This property implies that global effects are
orthogonal to gene-specific effects.
[6] (VG) and (AG) are partially confounded when V>2; the log Ratio model we propose
resolves this problem.
[7] The (DG) effects can be estimated and included in the ANOVA model. Further, if D
and V are balanced, then (DG) and (VG) are orthogonal. In fact, (DG) and (VG) are
orthogonal if and only if the design is an even design.
[8] (DG) is orthogonal to (AG).
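The balance condition behind properties [3] and [7] can be checked mechanically. The sketch below assumes that a design is given as a list of (green variety, red variety) pairs, one per array, and that "even" means every variety is labeled with each dye equally often. Both the representation and the function names are illustrative assumptions.

```python
from collections import Counter


def dye_counts(design, n_varieties):
    """Return (green count, red count) for each variety 0..n_varieties-1."""
    green = Counter(g for g, _ in design)
    red = Counter(r for _, r in design)
    return [(green[k], red[k]) for k in range(n_varieties)]


def is_dye_balanced(design, n_varieties):
    """True when every variety appears equally often on each dye."""
    return all(g == r for g, r in dye_counts(design, n_varieties))


# Loop design on 4 varieties: each variety is green once and red once.
loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Reference design without dye swap: variety 0 is always green.
reference = [(0, 1), (0, 2), (0, 3)]

print(is_dye_balanced(loop, 4))       # True
print(is_dye_balanced(reference, 4))  # False
```

For each variety, the red count minus the green count is the inner product of a +1/-1 dye indicator with that variety's indicator, so the check above is equivalent to requiring those inner products to vanish.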
On the basis of properties [1]-[8], the above-mentioned ANOVA model evidently accounts
for all main effects and two-factor interaction terms. Property [7] is the reason why
Kerr et al. search for the optimal design within the set of even designs when they use
the above-mentioned model including the (DG) effects. Empirically, however, the
optimal designs found for S=kV are even designs, whereas the optimal designs for S=V+k
are not necessarily even. Without the even-design condition it is difficult to work
under the ANOVA model, because the derivation of the optimality criterion becomes more
complex. We overcome this problem in this paper under the log Ratio model, principally
because the log Ratio model removes the (AG) effects.
The analysis of the structure of the design space yields an important conclusion: if
two factor effects are orthogonal in the design space, then removing either one does
not influence the estimation of the other. That is, we lose no information when we
estimate the other effect's parameters. Based on these features of the design space,
we discuss the model's rationality from Kerr et al.'s point of view.
By property [5], the global effects of the ANOVA model are orthogonal to the
gene-specific effects in the structure of the design space. Hence, removing the global
effects from the response does not affect the information available for estimating the
gene-specific effects (this step is essentially the global normalization process).
Furthermore, Kerr et al. search for the optimal design within the set of even designs,
so by property [7] we can also remove the (DG) effects. Thus, fixing the gene, we can
rewrite the model as
$$w'_{ijk(i,j)gr} = w_{ijk(i,j)gr} - \text{global effects} - (DG)_{jg} = G_g + (AG)_{ig} + (VG)_{k(i,j)g} + \epsilon_{ijk(i,j)gr}, \quad i = 1, \ldots, n, \; j = 1, 2.$$
For a given gene it can be represented in the form
$$w'_{ikr} = G + (AG)_i + (VG)_k + \epsilon_{ikr} \;\Longleftrightarrow\; w'_{ikr} = \mu + b_i + \tau_k + \epsilon_{ikr},$$
where $b_i$ and $\tau_k$ can be regarded as the effects of block i and treatment k.
Owing to the experimental restriction, blocks of size two cannot accommodate all the
treatments in each block when the number of treatments exceeds the block size 2. The
treatment effects and block effects are therefore partially confounded; that is, the
design is an incomplete block design. This is why Kerr et al. could use general
properties of incomplete block designs to obtain A-optimality for the parameters of
interest under the ANOVA model (see Raghavarao 1971). For the same reason, under the
log Ratio model, the location and scale terms are in fact global effects. Therefore,
we can remove the global effects and fix the gene to construct our model:
The normalized log Ratio model: $M_{gr} = \mu_g + \tau_{k_2 g} - \tau_{k_1 g} + \epsilon_{gr}$.
Based on this model, we can obtain $\phi_0$-optimality (D-optimality) by using the
properties of the optimal theory of the general linear model (see section 2).
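To make the incomplete block view concrete, the sketch below treats each array as a block of size 2 and computes the usual treatment information matrix C = R - NN'/k of a block design, where N is the variety-by-array incidence matrix and R is the diagonal matrix of replications. For k = 2 this matrix equals half the Laplacian of the graph whose edges are the arrays. The helper names and the loop design on four varieties are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def incidence(design, v):
    """N[k][i] = 1 if variety k is hybridized on array (block) i."""
    N = [[0] * len(design) for _ in range(v)]
    for i, (a, b) in enumerate(design):
        N[a][i] = 1
        N[b][i] = 1
    return N


def block_information(design, v, k=2):
    """C = R - N N' / k: information for treatments adjusted for blocks."""
    N = incidence(design, v)
    C = []
    for a in range(v):
        row = []
        for b in range(v):
            nn = sum(N[a][i] * N[b][i] for i in range(len(design)))
            r = nn if a == b else 0  # replication count sits on the diagonal
            row.append(Fraction(r) - Fraction(nn, k))
        C.append(row)
    return C


def laplacian(design, v):
    """Graph Laplacian: varieties are vertices, arrays are edges."""
    L = [[0] * v for _ in range(v)]
    for a, b in design:
        L[a][a] += 1
        L[b][b] += 1
        L[a][b] -= 1
        L[b][a] -= 1
    return L


loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
C = block_information(loop, 4)
L = laplacian(loop, 4)
same = all(C[a][b] == Fraction(L[a][b], 2) for a in range(4) for b in range(4))
print(same)  # True
```

The rows of C sum to zero, which reflects that only contrasts among the treatment effects are estimable once block effects are fitted.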
5.2 The relationship of optimal criteria
In this section, under the log Ratio model, we discuss the A-optimality criterion that
Kerr et al. used to evaluate a design. The criterion minimizes the average variance of
all elementary contrasts of interest. In fact, this criterion is a particular case of
$\phi_p(C_K(M))$-optimality with p = -1 and K = L:
$$
\begin{aligned}
\binom{V}{2}^{-1} \sum_{k_1 < k_2} \mathrm{var}(\hat\tau_{k_2 g} - \hat\tau_{k_1 g})
&= \binom{V}{2}^{-1} \mathrm{tr}\big(\mathrm{Var}(L'\hat\theta)\big)
 = \binom{V}{2}^{-1} \sigma^2 \, \mathrm{tr}\big(L'(X'X)^{-1}L\big) \\
&= \binom{V}{2}^{-1} \frac{\sigma^2}{n} \, \mathrm{tr}\big(L'M^{-1}L\big)
 = \frac{\sigma^2}{n \, \phi_{-1}(C_L(M))}, \\
&= \binom{V}{2}^{-1} \sigma^2 \Big[ V \, \mathrm{tr}\big(J'(X'X)^{-1}J\big)
   - \underbrace{E_{1,V}\big(J'(X'X)^{-1}J\big)E_{V,1}}_{0} \Big]
 = \binom{V}{2}^{-1} \frac{V^2 \sigma^2}{n \, \phi_{-1}(C_J(M))},
\end{aligned}
$$
from which we see that
$$
\min_M \binom{V}{2}^{-1} \sum_{k_1 < k_2} \mathrm{var}(\hat\tau_{k_2 g} - \hat\tau_{k_1 g})
\iff \max_M \phi_{-1}(C_L(M)) \iff \max_M \phi_{-1}(C_J(M)).
$$
Thus, to minimize the variance of all elementary contrasts, we only need to minimize
the total variance of the estimators $\hat\tau_{kg}$, $k = 1, \ldots, V$. Hence the
criterion $\max_M \phi_{-1}(C_J(M))$ used by Kerr et al. considers only the
$\tau_{kg}$ effects, whereas the criterion
$\max_M \phi_0(C_I(M)) = \max_M (\det C_I(M))^{1/V}$ that we propose considers all
gene-specific effects, including both the $\mu_g$ and the $\tau_{kg}$ effects.
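The average variance of the elementary contrasts can be computed directly for a small design. The sketch below is a minimal illustration under stated assumptions: a design is a list of (green variety, red variety) pairs, the normalized log Ratio model has one intercept per gene, the error variance of a log ratio is taken as 1, and a generalized inverse is obtained by adding the all-ones matrix to the singular information matrix. The function names and example designs are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def inverse(A):
    """Gauss-Jordan inverse in exact rational arithmetic (A must be invertible)."""
    n = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def contrast_information(design, v):
    """Information matrix of the variety effects, adjusted for the intercept."""
    n = len(design)
    D = [[0] * v for _ in range(n)]
    for i, (k1, k2) in enumerate(design):
        D[i][k1] -= 1  # green-channel variety
        D[i][k2] += 1  # red-channel variety
    col_sum = [sum(D[i][k] for i in range(n)) for k in range(v)]
    return [[Fraction(sum(D[i][a] * D[i][b] for i in range(n)))
             - Fraction(col_sum[a] * col_sum[b], n)
             for b in range(v)] for a in range(v)]


def average_contrast_variance(design, v):
    """Mean of var(tau_k2 - tau_k1) over all pairs, for a connected design."""
    C = contrast_information(design, v)
    # Rows of C sum to zero; C + (all-ones matrix) is invertible for a connected
    # design, and its inverse agrees with a generalized inverse on contrasts.
    G = inverse([[C[a][b] + 1 for b in range(v)] for a in range(v)])
    pairs = list(combinations(range(v), 2))
    total = sum(G[a][a] + G[b][b] - 2 * G[a][b] for a, b in pairs)
    return total / len(pairs)


loop = [(0, 1), (1, 2), (2, 3), (3, 0)]  # loop on 4 varieties
hub = [(0, 1), (1, 0), (0, 2), (2, 0)]   # dye-swapped pairs against variety 0
print(average_contrast_variance(loop, 4))  # 5/6
print(average_contrast_variance(hub, 3))   # 2/3
```

Only the ranking of designs with the same V and the same number of arrays is meaningful here; the two examples use different V and are shown only to exercise the function.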
The model and criterion we propose are reasonable and do not conflict with Kerr
et al.’s viewpoint. Moreover, our model has several merits. First, we can derive the
theoretical upper bound of the optimal design under the log Ratio model, because the
model trims and streamlines the structure of the design space. Second, the optimality
criterion takes into account the variances of all gene-specific effects. Third, its
normalization processes, such as nonlinear location and scale normalization, can be
considered more plausible than those of the ANOVA model. The S=2V optimal designs
obtained with the $\phi_0$-optimality criterion under the log Ratio model are the same
as those of Kerr et al. (2000), who use A-optimality under the classical ANOVA model
(see appendix). If the search for the optimal design is constrained to “even” designs,
the S=V+2 optimal designs are also the same. Thus, although we and Kerr et al. start
from different viewpoints and use different optimality theory to derive criteria for
evaluating designs under different models, the results remain consistent.
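For a small design the determinant criterion can be evaluated directly. The sketch below assumes a parametrization with one gene intercept and V-1 variety effects measured relative to variety 0 (V parameters in total), forms the moment matrix M = X'X/n, and returns det(M)^(1/V). This parametrization, the design representation as (green variety, red variety) pairs, and all names are assumptions made for illustration; the paper's own information matrix may be defined differently.

```python
from fractions import Fraction


def determinant(A):
    """Determinant by Gaussian elimination in exact rational arithmetic."""
    A = [[Fraction(x) for x in row] for row in A]
    n, det = len(A), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det
        det *= A[col][col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return det


def moment_matrix(design, v):
    """M = X'X / n with rows (1, red indicator - green indicator), variety 0 dropped."""
    n = len(design)
    X = []
    for k1, k2 in design:
        row = [0] * v
        row[k2] += 1  # red-channel variety
        row[k1] -= 1  # green-channel variety
        X.append([1] + row[1:])
    return [[Fraction(sum(x[a] * x[b] for x in X), n) for b in range(v)]
            for a in range(v)]


def phi0(design, v):
    """Determinant criterion det(M)^(1/V) under the assumed parametrization."""
    return float(determinant(moment_matrix(design, v))) ** (1.0 / v)


loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(determinant(moment_matrix(loop, 4)))  # 1/16
```

A design with a singular moment matrix, such as a reference design without dye swap under this parametrization, gets a criterion value of zero, which signals that not all parameters are estimable.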
[Appendix] The $\phi_0$-optimal designs:
S=V+2 optimal design: figure [1] (panels V=6, 7, 8, 9, 10, 11, 12, 13)
S=2V optimal design: figure [2] (panels V=5, 6, 7, 8, 9, 10, 11, 12)
Reference:
[1] Churchill, G. A. (2002). Fundamentals of experimental design for cDNA
microarrays. Nature genetics supplement. vol. 32, 490-495.
[2] Cleveland, W. S. (1979). Robust locally weighted regression and smoothing
scatter plots. J. Amer. Statist. Assoc. 74, 829-836.
[3] Hwang, F. K. (2001). A complementary survey on double-loop networks.
Theoretical computer science, 263, 211-229.
[4] Ideker, T., Thorsson, V., Siegel, A. F. and Hood, L. E. (2000). Testing for
differentially-expressed genes by maximum-likelihood analysis of Microarray
Data. J. Computat. Graph. Statist. 5, 299-314.
[5] John, Peter W. M. (1980). Incomplete Block Designs. New York & Basel /
Marcel Dekker.
[6] Kerr, M. K. and Churchill, G. A. (2001). Experimental design for gene expression
microarrays. Biostatistics 2, 183-201.
[7] Kerr, M. K. and Churchill, G. A. (2001). Statistical design and the analysis of
expression microarray data. Genet. Res. 77, 123-128.
[8] Kerr, M.K., Martin, M., and Churchill, G.A. (2000). Analysis of variance for
gene expression microarray data. Journal of Computational Biology, to appear.
[9] Luenberger, David G. (1968). Optimization by Vector Space Methods. John
Wiley & Sons, New York.
[10] Oleksiak, M. F., Churchill, G. A. and Crawford D. L. (2002). Variation in gene
expression within and among natural populations. Nature genetics.
vol. 32, 261-266.
[11] Pukelsheim, Friedrich (1993). Optimal Design of Experiments. John Wiley &
Sons, New York.
[12] Quackenbush, J. (2002). Microarray data normalization and transformation.
Nature genetics supplement. vol. 32, 496-501.
[13] Raghavarao, D. (1971). Constructions and Combinatorial Problems in Design of
Experiments. New York: Wiley.
[14] Rao, C. R. and Toutenburg, H. (1999). Linear Model. Springer, New York.
[15] Schuchhardt, J., Beule, D., Malik, A., Wolski, E., Eickhoff, H., Lehrach, H., and
Herzel, H. (2000). Normalization strategies for cDNA microarrays. Nucleic
Acids Research, vol. 28, No. 10.
[16] Wu, H., Kerr, M.K., Cui, X. and Churchill, G.A. (2002). MAANOVA: a software
package for the analysis of spotted cDNA microarray experiments. to appear.
[17] Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J. and Speed, T. P.
(2002). Normalization for cDNA microarray data: a robust composite method
addressing single and multiple slide systematic variation. Nucleic Acids Research,
vol. 30, No. 4 e15.
[18] Yang, Y. H. and Speed, T. (2002). Design issues for cDNA microarray
experiments. Nature Reviews Genetics. vol. 3, 579-588.