L10_WritngDesignMatrix.pptx

Lecture 6
Design Matrices and ANOVA and how
this is done in LIMMA
ANOVA: Some Examples
• Is there a difference in the mean hourly wages for three
different ethnic groups?
• Is there a difference in the mean sugar content of five
different brands of cereal?
• Is there a difference between the mutant and wild-type
versions of an organism?
• Is there a dye effect as well as a treatment effect?
• For a time course experiment are there significant
differences in gene expression for the different time
points?
Model for ANOVA
• The general linear model, which applies to ANOVA,
regression and ANCOVA, is written as:
• Y = X b + e
• This is the matrix formulation of the model.
• Y: response vector (observed), n×1
• X: design matrix (observed), n×p
• b: parameter vector (to be estimated), p×1
• e: error vector (unobserved, randomness), n×1
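A minimal R sketch of the matrix formulation, assuming made-up toy data (the numbers and the names X, Y, b_hat are illustrative, not from the lecture). It uses ordinary least squares, which solves (X'X) b = X'Y, to estimate b and shows the dimensions of each piece.

```r
# Toy data (hypothetical): n = 4 observations, p = 2 parameters
X <- cbind(intercept = 1, x = c(0, 1, 2, 3))   # design matrix, n x p
Y <- c(1, 3, 5, 7)                             # response vector, n x 1

# Least-squares estimate of b solves (X'X) b = X'Y
b_hat <- solve(crossprod(X), crossprod(X, Y))  # p x 1
e_hat <- Y - X %*% b_hat                       # residuals, n x 1

dim(X)   # 4 2
b_hat    # intercept 1, slope 2 (the toy data lie exactly on a line)
```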
How to write a Design Matrix
• Consider a data set in which we compare 3 different fertilizers,
A, B and C. For each fertilizer we have two plots of land.
• Data:
Plot   Fertilizer   Yield (tonnes)
1      A            12
2      A            15
3      B            21
4      B            18
5      C            10
6      C             9
Models: cell means model
• We can write this as:
• Yij = mi + eij
• This is the cell-means model
• The corresponding design matrix is:
1  0  0
1  0  0
0  1  0
0  1  0
0  0  1
0  0  1
Each row corresponds to a unit (plot) and each column to a treatment (fertilizer).
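The cell-means design for the fertilizer data can be sketched in R as follows. The variable names (fert, yield, X_cell, fit_cell) are illustrative; the formula `~ 0 + fert` is one common way to ask base R for a design without an intercept.

```r
fert  <- factor(c("A", "A", "B", "B", "C", "C"))
yield <- c(12, 15, 21, 18, 10, 9)

# "~ 0 + fert" drops the intercept, giving one indicator column per fertilizer
X_cell <- model.matrix(~ 0 + fert)
X_cell

# With this design the fitted coefficients are the three fertilizer means
fit_cell <- lm(yield ~ 0 + fert)
coef(fit_cell)   # fertA = 13.5, fertB = 19.5, fertC = 9.5
```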
Model: Factor effect Model
• We can write this as:
• Yij = m + ti + eij
• This is the factor effects model. Here m is the OVERALL mean and the ti
are the differences of each treatment level from the overall mean.
We add the requirement that Σ ti = 0 (the treatment effects sum to zero).
• The corresponding design matrix is:
1   1   0
1   1   0
1   0   1
1   0   1
1  -1  -1
1  -1  -1
Each row corresponds to a unit. The first column is the overall mean and the
other columns are the treatment effects t1 and t2; the last treatment is
expressed in terms of the other treatments (t3 = -t1 - t2).
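A sketch of the same factor effects coding in R, assuming the fertilizer data from the earlier slide. Base R's `contr.sum` imposes the sum-to-zero constraint; the object names are illustrative.

```r
fert  <- factor(c("A", "A", "B", "B", "C", "C"))
yield <- c(12, 15, 21, 18, 10, 9)

# contr.sum imposes the sum-to-zero constraint on the treatment effects
X_eff <- model.matrix(~ fert, contrasts.arg = list(fert = "contr.sum"))
X_eff

fit_eff <- lm(yield ~ fert, contrasts = list(fert = "contr.sum"))
b  <- coef(fit_eff)      # (m, t1, t2)
t3 <- -(b[2] + b[3])     # effect of the last level, recovered from the constraint
```

Here the intercept is the average of the three fertilizer means, and t1, t2 and t3 are the deviations of each fertilizer mean from it.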
Parameter Vectors
• For the cell means model:
b = (m1 m2 m3)'
H0: m1 = m2 = m3
• For the factor effects model:
b = (m t1 t2)'
H0: t1 = t2 = t3 = 0
Usage
• Most of statistics uses the factor effects model because it makes
the hypothesis easier to interpret: the null hypothesis is that
all the treatment effects are 0.
• However, in LIMMA in R we will use the simpler cell-means
model to construct the design matrix, and we then need to define a
contrast matrix.
LIMMA and Design Matrices
• This is what LIMMA says about constructing design Matrices:
• “The package limma uses an approach called linear models to
analyse designed microarray experiments. This approach allows
very general experiments to be analysed just as easily as a simple
replicated experiment.
• The approach requires one or two matrices to be specified. The first
is the design matrix which indicates in effect which RNA samples
have been applied to each array. The second is the contrast matrix
which specifies which comparisons you would like to make between
the RNA samples. For very simple experiments, you may not need
to specify the contrast matrix.”
More on Design Matrices
• The philosophy of the approach is as follows. You have to start by fitting a
linear model to your data which fully models the systematic part of your
data. The model is specified by the design matrix. Each row of the design
matrix corresponds to an array in your experiment and each column
corresponds to a coefficient which is used to describe the RNA sources in
your experiment. With Affymetrix or single-channel data, or with two-color
with a common reference, you will need as many coefficients as you have
distinct RNA sources, no more and no less.
• With direct-design two-color data you will need one fewer coefficient
than you have distinct RNA sources, unless you wish to estimate a
dye-effect for each gene, in which case the number of RNA sources and the
number of coefficients will be the same. Any set of independent
coefficients will do, providing they describe all your treatments. The main
purpose of this step is to estimate the variability in the data, hence the
systematic part needs to be modeled so it can be distinguished from
random variation.
LIMMA: contrasts
• In practice the requirement to have exactly as many
coefficients as RNA sources is too restrictive in terms of
questions you might want to answer. You might be interested
in more or fewer comparisons between the RNA sources.
Hence the contrasts step is provided so that you can take the
initial coefficients and compare them in as many ways as
you want to answer any questions you might have,
regardless of how many or how few these might be.
Writing out Design and Contrast Matrices:
• Example 1:
• This is a one-factor ANOVA with 4 levels.
• The model is Yij = mi + eij, i = 1,…,4, j = 1,…,3.
• Write out the contrast matrix if we were interested in
comparing level 1 to level 2, and level 3 to the mean of levels 1
and 2.
Example 1: Designs and Contrast Matrices
• The design matrix (one row per array, one column per level mean):
array   m1   m2   m3   m4
1       1    0    0    0
2       1    0    0    0
3       1    0    0    0
4       0    1    0    0
5       0    1    0    0
6       0    1    0    0
7       0    0    1    0
8       0    0    1    0
9       0    0    1    0
10      0    0    0    1
11      0    0    0    1
12      0    0    0    1
• The contrast matrix C', so that B = C'D, for comparing
level 1 to level 2, and
level 3 to the mean of levels 1 and 2:
        m1     m2     m3   m4
c1      1      -1     0    0
c2      -1/2   -1/2   1    0
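Example 1 can be sketched in base R as below. The level means in `m` are made-up numbers, used only to show what each row of the contrast matrix computes; the object names are illustrative.

```r
level  <- factor(rep(1:4, each = 3))   # 12 arrays, 3 per level
design <- model.matrix(~ 0 + level)
colnames(design) <- paste0("m", 1:4)

# One row per contrast (C'); columns follow the order of the design columns
Ct <- rbind(c1 = c(1, -1, 0, 0),
            c2 = c(-1/2, -1/2, 1, 0))
colnames(Ct) <- colnames(design)

# Hypothetical level means, used only to show what each contrast computes
m <- c(m1 = 10, m2 = 6, m3 = 9, m4 = 4)
Ct %*% m   # c1 = m1 - m2 = 4;  c2 = m3 - (m1 + m2)/2 = 1
```

Each row of a contrast matrix sums to zero, which is a quick sanity check when writing one by hand.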
Example 2
• This is a two-factor ANOVA with 3 levels for
Factor A and 2 levels for Factor B.
• The model is Yij = mi + bj + eij, i = 1,…,3, j = 1,2.
• Write out the contrast matrix for comparing
Factor A, levels 2 and 3, and Factor B, levels 1
and 2.
Example 2: Design and Contrast Matrix
The Design Matrix
array   a1b1   a1b2   a2b1   a2b2   a3b1   a3b2
1       1      0      0      0      0      0
2       1      0      0      0      0      0
3       0      1      0      0      0      0
4       0      1      0      0      0      0
5       0      0      1      0      0      0
6       0      0      1      0      0      0
7       0      0      0      1      0      0
8       0      0      0      1      0      0
9       0      0      0      0      1      0
10      0      0      0      0      1      0
11      0      0      0      0      0      1
Write out the contrast matrix for comparing:
Factor A, levels 2 and 3
Factor A, levels 1 and 3
Factor B, levels 1 and 2
• Contrast:
C1:  0   0  -1  -1   1   1
C2: -1  -1   0   0   1   1
C3: -1   1  -1   1  -1   1
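One way to avoid sign or position slips is to build the contrasts from the factor levels of each cell rather than typing the numbers. The sketch below does this in base R; the names cells, A, B and C are illustrative, and the labels C1 to C3 follow the order of the three comparisons listed above.

```r
cells <- c("a1b1", "a1b2", "a2b1", "a2b2", "a3b1", "a3b2")
A <- rep(1:3, each = 2)    # level of Factor A for each cell
B <- rep(1:2, times = 3)   # level of Factor B for each cell

C <- rbind(C1 = (A == 3) - (A == 2),   # Factor A: level 3 vs level 2
           C2 = (A == 3) - (A == 1),   # Factor A: level 3 vs level 1
           C3 = (B == 2) - (B == 1))   # Factor B: level 2 vs level 1
colnames(C) <- cells
C
```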
Differential Expressions for Factorial Designs:
Design Matrices and Contrasts, using R.
• Example: the Estrogen data set.
• Let us consider the Estrogen data set, and look at how we use R to
study differential expression using design matrices.
Name       FileName       Target
Abs10.1    low10-1.cel    EstAbsent10
Abs10.2    low10-2.cel    EstAbsent10
Pres10.1   high10-1.cel   EstPresent10
Pres10.2   high10-2.cel   EstPresent10
Abs48.1    low48-1.cel    EstAbsent48
Abs48.2    low48-2.cel    EstAbsent48
Pres48.1   high48-1.cel   EstPresent48
Pres48.2   high48-2.cel   EstPresent48
Description of Experiment
• There are 8 files in all, coming from a 2×2 factorial design,
that is, a design with 2 factors, each at 2 levels. The
study was done to measure the changes in gene expression
for breast cancer patients due to estrogen (two levels,
presence and absence) at two time points (10 hr and 48 hr).
The data for this experiment are available at the Bioconductor website.
Contrasts of Interest
• It is of interest to compare:
1. the effect of estrogen at 10 hours (compare
presence to absence at 10 hours),
2. the effect of estrogen at 48 hours (compare
presence and absence at 48 hours)
3. the effect of time in the absence of estrogen
(compare Absent 10 to Absent 48).
Targets File Method
• There are different ways to do this in R. Let's
use the Targets file method, as we did for the
two-condition comparison before.
• So let's first put together a tab-delimited text
file like the one above. I call it
EstrogenTargets.txt; it gives a name, the
file name and the target containing the factor
level information.
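A sketch of creating and reading such a file with base R. It writes to a temporary directory so it can be run anywhere; in practice the file would sit next to the .cel files, and limma's `readTargets()` is the usual reader. The variable name targets_file is illustrative.

```r
targets_file <- file.path(tempdir(), "EstrogenTargets.txt")

targets <- data.frame(
  Name     = c("Abs10.1", "Abs10.2", "Pres10.1", "Pres10.2",
               "Abs48.1", "Abs48.2", "Pres48.1", "Pres48.2"),
  FileName = c("low10-1.cel", "low10-2.cel", "high10-1.cel", "high10-2.cel",
               "low48-1.cel", "low48-2.cel", "high48-1.cel", "high48-2.cel"),
  Target   = rep(c("EstAbsent10", "EstPresent10", "EstAbsent48", "EstPresent48"),
                 each = 2)
)
write.table(targets, targets_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back as a tab-delimited file with a header row
targets <- read.delim(targets_file, stringsAsFactors = FALSE)
unique(targets$Target)
```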
Design matrix method
• One way to do this in R (to me it's the simplest one in terms of design matrices) is to
write a design matrix using the factor combinations, WITHOUT the intercept term.
• R (at least LIMMA) writes the design matrix as:
EstAbsent10   EstPresent10   EstAbsent48   EstPresent48
1             0              0             0
1             0              0             0
0             1              0             0
0             1              0             0
0             0              1             0
0             0              1             0
0             0              0             1
0             0              0             1
• So our model is Y = X ag + e
Contrast Matrix
• Now to define the contrast we need to look at the transformation:
bg = C' ag
so we define C' as:
C' =
-1   1   0   0
 0   0  -1   1
-1   0   1   0
This will define:
(EstPresent10-EstAbsent10)
(EstPresent48-EstAbsent48)
(EstAbsent48-EstAbsent10)
In R using Targets file
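# targets is assumed to hold the Targets file, e.g. targets=readTargets("EstrogenTargets.txt")
# "~0+" drops the intercept, giving one column per Target (cell-means design)
design=model.matrix(~0+factor(targets$Target,levels=unique(targets$Target)))
colnames(design)=unique(targets$Target)
numParameters=ncol(design)
parameterNames=colnames(design)
# one column per contrast; the values are filled in column by column
contrastMatrix=matrix(c(-1,1,0,0,0,0,-1,1,-1,0,1,0),nrow=ncol(design))
Using the Targets file is efficient if you know how R works, because you don't
have to type in the matrix by hand.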
In R using the design matrix directly
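• design<-matrix(c(1,0,0,0, 1,0,0,0, 0,1,0,0, 0,1,0,0,
0,0,1,0, 0,0,1,0, 0,0,0,1, 0,0,0,1),nrow=8,byrow=TRUE)
• contrastMatrix=matrix(c(-1,1,0,0,0,0,-1,1,-1,0,1,0),nrow=ncol(design))
• R fills matrices column by column by default. The design above is typed one
array (row) at a time, hence byrow=TRUE; each column of contrastMatrix is one contrast.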
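A hedged sketch of how these two matrices might then be used. The expression matrix is simulated noise for 100 hypothetical genes, and the contrast names Est10, Est48 and TimeAbs are made up for illustration. The limma calls (`lmFit`, `contrasts.fit`, `eBayes`, `topTable`) run only if the package is installed.

```r
set.seed(1)
target <- rep(c("EstAbsent10", "EstPresent10", "EstAbsent48", "EstPresent48"),
              each = 2)
f <- factor(target, levels = unique(target))

design <- model.matrix(~ 0 + f)          # cell-means design, no intercept
colnames(design) <- levels(f)

contrastMatrix <- matrix(c(-1, 1, 0, 0,   0, 0, -1, 1,   -1, 0, 1, 0),
                         nrow = ncol(design),
                         dimnames = list(colnames(design),
                                         c("Est10", "Est48", "TimeAbs")))

# Simulated log-expression values: 100 hypothetical genes x 8 arrays
exprs <- matrix(rnorm(100 * 8), nrow = 100)

if (requireNamespace("limma", quietly = TRUE)) {
  fit  <- limma::lmFit(exprs, design)                  # one linear model per gene
  fit2 <- limma::contrasts.fit(fit, contrastMatrix)    # coefficients -> contrasts
  fit2 <- limma::eBayes(fit2)                          # moderated statistics
  print(limma::topTable(fit2, coef = "Est10", number = 5))
}
```

Naming the rows and columns of the contrast matrix makes it easier to check that each contrast lines up with the intended design column.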
An example for Optimal Designs
• Suppose we have 12 arrays in a single-channel framework and
we have 5 conditions that we want to compare.
• Because the design is unbalanced (12 arrays cannot be divided
equally among 5 conditions), it is harder to construct an
orthogonal design here.
• Sometimes people use classes of designs that are already
available and have properties like orthogonality.
• Designs in this class include Margolin designs (fewer than 6
conditions), Plackett-Burman designs and other such designs.
Consider the following Margolin Design:
orthogonal for 6 conditions and 12 arrays
array   col 1   col 2   col 3   col 4   col 5   col 6   col 7
1       1       1       1       1       1       1       1
2       1       1       -1      -1      -1      -1      1
3       1       -1      1       -1      -1      -1      1
4       1       -1      -1      1       1       -1      1
5       1       -1      -1      1       -1      1       1
6       1       -1      -1      -1      1       1       1
7       1       -1      -1      -1      -1      -1      -1
8       1       -1      1       1       1       1       -1
9       1       1       -1      1       1       1       -1
10      1       1       1       -1      -1      1       -1
11      1       1       1       -1      1       -1      -1
12      1       1       1       1       -1      -1      -1
What if I have 5 conditions
• One option is to drop one column and use the design
matrix without that column, which preserves some of the
optimality properties.
• The question is which column to drop.
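• The following R code helps us decide whether we drop
column 2 or 3 or 4. For each candidate it computes the trace and the
determinant of the inverse of A'A for the reduced design.
• Note that matrix() fills a matrix column by column unless byrow=TRUE
is given, so the values below, typed one row of the table at a time, are
not stored in that row layout.
> A<-matrix(c(1,1,1,1,1,1,1,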
+ 1,1,-1,-1,-1,-1,1,
+ 1,-1,1,-1,-1,-1,1,
+ 1,-1,-1,1,1,-1,1,
+ 1,-1,-1,1,-1,1,1,
+ 1,-1,-1,-1,1,1,1,
+ 1,-1,-1,-1,-1,-1,-1,
+ 1,-1,1,1,1,1,-1,
+ 1,1,-1,1,1,1,-1,
+ 1,1,1,-1,-1,1,-1,
+ 1,1,1,-1,1,-1,-1,
+ 1,1,1,1,-1,-1,-1), nrow=12)
> B<-t(A)
> C<-B%*%A
> D<-solve(C)
> det(D)
[1] 5.353961e-07
> sum(diag(D))
[1] 2.065789
> # Drop column 2, 3 or 4 in turn
> A1<-A[,-2]
> A2<-A[,-3]
> A4<-A[,-4]
> A1t=t(A1)
> A2t=t(A2)
> A4t=t(A4)
> a1ta1=A1t%*%A1
> a2ta2=A2t%*%A2
> a4ta4=A4t%*%A4
> # Inverse of A'A for each reduced design
> b1=solve(a1ta1)
> b2=solve(a2ta2)
> b3=solve(a4ta4)
> # Trace of each inverse
> aa1=sum(diag(b1))
> aa2=sum(diag(b2))
> aa4=sum(diag(b3))
Results from dropping columns
> aa1
[1] 1.256966 (trace after dropping col 2)
> aa2
[1] 0.8231631 (trace after dropping col 3)
> aa4
[1] 1.322289 (trace after dropping col 4)
> det(b1)
[1] 3.023413e-06 (determinant after dropping col 2)
> det(b2)
[1] 1.216143e-06 (determinant after dropping col 3)
> det(b3)
[1] 2.941453e-06 (determinant after dropping col 4)
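The repeated steps above can be wrapped in a small helper. This is a sketch: the function name design_criteria and the 4-run example design are hypothetical and are not the Margolin design from the slides. Smaller values of either criterion indicate more precisely estimated coefficients, which is the idea behind A-optimality (trace) and D-optimality (determinant).

```r
# Trace and determinant of (X'X)^-1 for a design matrix X
design_criteria <- function(X) {
  V <- solve(crossprod(X))          # crossprod(X) is t(X) %*% X
  c(trace = sum(diag(V)), det = det(V))
}

# Hypothetical 4-run two-level design with orthogonal columns
X <- cbind(1, c(-1, 1, -1, 1), c(-1, -1, 1, 1))

design_criteria(X)          # trace = 0.75, det = 0.015625
design_criteria(X[, -2])    # after dropping column 2: trace = 0.5, det = 0.0625
```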