T test and ANOVA examples for filtering genes in
microarray experiments
T test
Here is part of a gene expression matrix (with log2 ratios relative to a reference). In
the first 5 arrays, the test sample comes from a normal mouse and in the 2nd 5 arrays,
the test sample comes from a mutant mouse:
                       Normal mice                            Mutant mice
         Array  Array  Array  Array  Array      Array  Array  Array  Array  Array
         no. 1  no. 2  no. 3  no. 4  no. 5      no. 6  no. 7  no. 8  no. 9  no. 10
Gene 1    0.46   0.30   0.80   1.51   0.90      -0.10   0.49   0.24   0.06   0.46
Gene 2    0.15   0.74   0.04   0.10   0.20      -0.45  -1.03  -0.79  -0.56  -0.32
………          …      …      …      …      …          …      …      …      …      …
We want to perform a two-sample T test on each gene to see if it is differentially
expressed between normal and mutant mice. [As mentioned, the 5 repeat experiments
with several arrays used to compare normal/mutant mouse RNA to a reference allow
us to distinguish, to some extent, between differences of intensity measurement due to
gene expression differences between normal and mutant mice and those due to other
factors such as a slightly defective array chip.] We start with gene 1:
We think of the row for gene 1 as consisting of two “samples”, in the statistical sense
(i.e. normal and mutant). We want to compare the samples to see if they have equal means.
We test this hypothesis (that the two means are equal) using a two-sample T test.
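Recall that, when we have a sample X of size n and another sample Y of size m, the T statistic follows a t-distribution with n+m-2 degrees of freedom (if the two means are in fact equal) and is given by

$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{n S_X^2 + m S_Y^2}{n+m-2}}\,\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} = \frac{(\bar{X} - \bar{Y}) / \hat{\sigma}_{\text{joint}}}{\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$$

where $\hat{\sigma}_{\text{joint}}$ is the joint (pooled) estimate of the standard deviation, whose square is an unbiased estimator of the common variance. Here $S_X^2$ and $S_Y^2$ are the sample variances computed with divisors n and m, so that $n S_X^2$ and $m S_Y^2$ are the sums of squared residuals. The computational formula is:

$$\hat{\sigma}_{\text{joint}}^2 = \frac{n S_X^2 + m S_Y^2}{n+m-2} = \frac{\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right) + \left(\sum_{i=1}^{m} Y_i^2 - m\bar{Y}^2\right)}{n+m-2}.$$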
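The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pooled_t_statistic` and the variable names are assumptions made for this example, and the data are the gene 1 values from the table above.

```python
import math

def pooled_t_statistic(x, y):
    """Two-sample T statistic with a joint (pooled) standard deviation.

    Returns the pair (T, degrees of freedom).
    """
    n, m = len(x), len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / m
    # Sum of squares minus correction factor = sum of squared residuals,
    # i.e. n*S_X^2 and m*S_Y^2 in the notation of the text.
    ss_resid_x = sum(v * v for v in x) - n * mean_x ** 2
    ss_resid_y = sum(v * v for v in y) - m * mean_y ** 2
    df = n + m - 2
    sigma_joint = math.sqrt((ss_resid_x + ss_resid_y) / df)
    division_factor = math.sqrt(1 / n + 1 / m)
    t = (mean_x - mean_y) / (sigma_joint * division_factor)
    return t, df

# Gene 1 log2 ratios (arrays 1-5 normal, arrays 6-10 mutant)
normal = [0.46, 0.30, 0.80, 1.51, 0.90]
mutant = [-0.10, 0.49, 0.24, 0.06, 0.46]
t_gene1, df_gene1 = pooled_t_statistic(normal, mutant)
print(t_gene1, df_gene1)
```

The printed T value may differ from a hand calculation in the third decimal place, because the hand calculation rounds intermediate results.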
Both of the samples have size 5, so we have m=n=5 and we are considering a t-distribution with 8 degrees of freedom.
The “division factor” is

$$\sqrt{\frac{1}{n} + \frac{1}{m}} = \sqrt{\frac{2}{5}} = 0.6325.$$
Next we calculate the sample means:

         Normal mice   Mutant mice
            0.46         -0.10
            0.30          0.49
            0.80          0.24
            1.51          0.06
            0.90          0.46
total       3.97          1.15
mean        0.794         0.23

Difference between means = 0.794 - 0.23 = 0.564
Next we work out $\hat{\sigma}_{\text{joint}}$ by first computing the sums of squares.
The sum of squares for the normal mice, $\sum_{i=1}^{n} X_i^2$, is
(0.46)^2+(0.30)^2+(0.80)^2+(1.51)^2+(0.90)^2 = 4.0317.
The correction factor, $n\bar{X}^2$, which we subtract to give n times the normal sample variance (also called
the sum of squared residuals), is
5*(0.794)^2 = 3.1522.
The sum of squares for the mutant mice is likewise 0.5229.
So

$$\hat{\sigma}_{\text{joint}}^2 = \frac{\left[4.0317 - 5(0.794)^2\right] + \left[0.5229 - 5(0.23)^2\right]}{8} = \frac{0.8795 + 0.2584}{8} = 0.1422$$

and $\hat{\sigma}_{\text{joint}}$ = 0.3771.
Thus our T statistic is given by 0.564/(0.3771*0.6325) = 2.365.
Suppose we want to test whether gene 1 is up- or downregulated between normal and
mutant mice at 5% significance (often written as 95%). We look in a T test table
under the t-distribution with 8 degrees of freedom for the critical value of the T
statistic at that level and find that it is 2.306 (see table). Our T statistic exceeds
this, and thus the mean of the mutant sample is different from the mean of the normal
sample at 5% significance. So our gene is significantly differentially expressed in mutant
mice (here downregulated, since the mutant mean is lower than the normal mean).
[In fact we can see that the P value is less than but close to 0.05].
[NB we are doing a two-tailed test here. The T statistic has probability 0.025 of
exceeding 2.306 by chance. It has an additional probability 0.025 of being less than
–2.306, since the t-distribution is symmetrical.]
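As a rough check on the table lookup, the two-tailed P value can be approximated by integrating the t density numerically. The sketch below uses only the Python standard library; the function names `t_pdf` and `two_tailed_p` and the choice of Simpson's rule with 2000 steps are assumptions made for this illustration, not part of the original handout.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Approximate P(|T| > |t|) using Simpson's rule on [0, |t|].

    steps must be even.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        weight = 4 if k % 2 else 2  # Simpson weights: 4 for odd k, 2 for even k
        total += weight * t_pdf(k * h, df)
    area_zero_to_t = total * h / 3
    # By symmetry, the area between -|t| and |t| is twice this value.
    return 1 - 2 * area_zero_to_t

p_critical = two_tailed_p(2.306, 8)  # tabulated 5% critical value
p_gene1 = two_tailed_p(2.365, 8)     # T statistic found for gene 1
print(p_critical, p_gene1)
```

The first value should come out very close to 0.05, and the second slightly below it, in line with the remark that the P value for gene 1 is less than but close to 0.05.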
Gene 2:
Likewise for gene 2, we have

Sample mean (normal) = (0.15 + 0.74 + 0.04 + 0.10 + 0.20)/5 = 0.246
Sample mean (mutant) = (-0.45 - 1.03 - 0.79 - 0.56 - 0.32)/5 = -0.63
Difference between means = 0.246 - (-0.63) = 0.876

Division factor, $\sqrt{1/n + 1/m}$, = 0.6325, as before.

Sum of squares (normal) = 0.15^2 + 0.74^2 + 0.04^2 + 0.10^2 + 0.20^2 = 0.6217
Correction factor, $n\bar{X}^2$, = 5*(0.246)^2 = 0.3026
Sum of squares (mutant) = (-0.45)^2 + (-1.03)^2 + (-0.79)^2 + (-0.56)^2 + (-0.32)^2 = 2.3035
Correction factor (mutant), $m\bar{Y}^2$, = 5*(-0.63)^2 = 1.9845

So the estimated standard deviation, $\hat{\sigma}_{\text{joint}}$, is given by

$$\hat{\sigma}_{\text{joint}} = \sqrt{\frac{[0.6217 - 0.3026] + [2.3035 - 1.9845]}{5+5-2}} = \sqrt{\frac{0.3191 + 0.319}{8}} = 0.2824.$$

So our T statistic is given by

$$T = \frac{0.876}{0.6325 \times 0.2824} = 4.904.$$
Once again we are comparing with a t-distribution with 5+5-2=8 degrees of
freedom. If we test at 5% significance (95% in the notation of the table), there is
clearly a significant difference between the normal and mutant mice, since 4.904 is
well above the critical value of 2.306. If you look at the T table it is clear that
gene 2 has a P value that is <0.01 for the two-tailed T test.
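Since the aim is to filter genes, the same test can be applied to every row of the expression matrix in a loop. The sketch below is one possible way to do this, assuming a hypothetical dictionary `expression` that holds only the two genes shown in this document, and assuming every gene has 5 normal and 5 mutant arrays so that the tabulated critical value 2.306 applies to all rows. It uses `statistics.variance`, which divides by n-1, as an equivalent route to the pooled variance.

```python
import math
from statistics import mean, variance

# Excerpt of the expression matrix: gene -> (normal arrays, mutant arrays)
expression = {
    "Gene 1": ([0.46, 0.30, 0.80, 1.51, 0.90], [-0.10, 0.49, 0.24, 0.06, 0.46]),
    "Gene 2": ([0.15, 0.74, 0.04, 0.10, 0.20], [-0.45, -1.03, -0.79, -0.56, -0.32]),
}

# Two-tailed 5% critical value for 8 degrees of freedom (from the text);
# valid here only because both groups have 5 arrays.
T_CRITICAL = 2.306

def t_statistic(x, y):
    n, m = len(x), len(y)
    # (n - 1) * variance(x) equals the sum of squared residuals,
    # written n * S_X^2 in the text.
    pooled_var = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n + 1 / m))

t_values = {gene: t_statistic(x, y) for gene, (x, y) in expression.items()}
selected = [gene for gene, t in t_values.items() if abs(t) > T_CRITICAL]
print(t_values)
print(selected)
```

Comparing the absolute value of T with the critical value makes this a two-tailed filter, so both up- and downregulated genes are kept.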
Analysis of variance (ANOVA)
We could just perform analysis of variance for each gene of the table above to test for
differences in means of the normal mouse and mutant mouse samples. We should get
the same answers as before for significance values, because ANOVA with two groups
is formally equivalent to the T test. However, to illustrate why we want to use
ANOVA, we will consider data from a second mutant type of mouse, which we will
call “double mutant”. We only manage to get three microarray experiments with
cDNA from this type of mouse.
               Double mutant mice
          Array    Array    Array
          no. 11   no. 12   no. 13
Gene 1    -0.34     0.56     0.33
Gene 2    -1.02    -0.52     0.01
………           …        …        …
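The equivalence between two-group ANOVA and the T test mentioned above can be checked numerically: for two groups, the F statistic equals the square of the T statistic. The sketch below illustrates this on the normal and mutant data for gene 1; the function names are assumptions made for this example.

```python
import math

def t_statistic(x, y):
    """Two-sample T statistic with pooled standard deviation."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss_resid = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    sigma_joint = math.sqrt(ss_resid / (n + m - 2))
    return (mean_x - mean_y) / (sigma_joint * math.sqrt(1 / n + 1 / m))

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups."""
    total_n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / total_n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ss_within = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (total_n - len(groups))
    return ms_between / ms_within

normal = [0.46, 0.30, 0.80, 1.51, 0.90]
mutant = [-0.10, 0.49, 0.24, 0.06, 0.46]

t_squared = t_statistic(normal, mutant) ** 2
f_two_groups = f_statistic([normal, mutant])
print(t_squared, f_two_groups)  # the two values agree up to rounding error
```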
Analysis of variance for gene 1:
Data for gene 1:

Normal           0.46    0.30    0.80    1.51    0.90
Mutant          -0.10    0.49    0.24    0.06    0.46
Double mutant   -0.34    0.56    0.33
We need to calculate
1. the sum of squared residuals within groups
2. the sum of squared residuals between groups:
                Group mean      Sum of squares in group           Sum of squared residuals within group
                $\bar{X}_j$     $\sum_{i=1}^{n_j} X_{ij}^2$       $\sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$
Normal          0.794           4.0317                            0.8795
Mutant          0.23            0.5229                            0.2584
Double mutant   0.1833          0.5381                            0.4373
[The sum of squared residuals = sum of squares – correction factor].
Recall that the “sum of squares within” (which actually means the sum of squared residuals within groups) is

$$SS_w = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2 = \sum (\text{squared data}) - \sum \text{Group size} \times (\text{group mean})^2$$

Normally the latter formula would be easier to use for computations, but we have
already worked out the individual group residuals. SSw is just the sum across groups
of the SS residual within the group, i.e.
SS_w = 0.8795+0.2584+0.4373 = 1.5752.
Grand mean= (5*0.794+5*0.23+3*0.1833)/13=0.4362.
“Sum of squares between” is

$$SS_B = \sum_{j=1}^{m} n_j (\bar{X}_j - \bar{X})^2 = \sum \text{Group size} \times (\text{group mean})^2 - N (\text{Grand mean})^2$$

= 5*(0.794)^2 + 5*(0.23)^2 + 3*(0.1833)^2 - 13*(0.4362)^2 = 1.0445.
We present the values of the statistics required to compute the F statistic in a table:
                   Sum of squared    Degrees of    Mean squared
                   residuals (SS)    freedom       (MS)
Between samples    1.0445            m-1=2         0.5223
Within samples     1.5752            N-m=10        0.1575
Total              2.6197            N-1=12

[NB here m is the no. of groups.]
[Note that an alternative way of working out the within samples sum of squared
residuals is to work out the total and between samples SSs and subtract “between”
from “total”. This is because

$$\sum_{i,j} (X_{ij} - \bar{X})^2 = \sum_{i,j} (X_{ij} - \bar{X}_j)^2 + \sum_{j} n_j (\bar{X}_j - \bar{X})^2.$$

Using the Total SS in the form $\sum_{i,j} X_{ij}^2 - N\bar{X}^2$ is usually the easiest way to compute
these numbers (if we hadn’t already done a T test).]
Thus the F statistic is given by

$$F_{2,10} = \frac{MSB}{MSE} = \frac{\text{Mean squared between}}{\text{Mean squared within}} = \frac{0.5223}{0.1575} = 3.316.$$
We are comparing with an F-distribution with 2 degrees of freedom for the numerator
and 10 degrees of freedom for the denominator. From the table we see that
this is not significant at the 5% level (the critical F value is 4.10). Thus at 5% significance
gene 1 does not vary significantly across the three conditions. The P value is actually 0.079. If we
wanted to know specifically whether a gene varied between the two mutant types, then
we would have to perform pairwise T tests (or ANOVA) on the data.
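[NB the ANOVA F test is non-directional: only the upper tail of the F-distribution is used, but with two groups it corresponds to the two-tailed T test.]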
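The ANOVA calculation for gene 1 can be reproduced with a short Python sketch. The function names are assumptions made for this illustration. The P value helper relies on a closed form for the upper tail of the F-distribution that holds only when the numerator has 2 degrees of freedom, as it does here; it is not a general F-distribution routine.

```python
def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table as a dictionary."""
    sizes = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    total_n = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / total_n
    ss_within = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g)
    ss_between = sum(n * (mu - grand_mean) ** 2 for n, mu in zip(sizes, means))
    df_between = len(groups) - 1       # m - 1
    df_within = total_n - len(groups)  # N - m
    f = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f,
    }

def p_value_numerator_df_2(f, df_within):
    """Upper-tail P value of F; valid ONLY when the numerator has 2 df."""
    return (1 + 2 * f / df_within) ** (-df_within / 2)

normal = [0.46, 0.30, 0.80, 1.51, 0.90]
mutant = [-0.10, 0.49, 0.24, 0.06, 0.46]
double_mutant = [-0.34, 0.56, 0.33]

table = one_way_anova([normal, mutant, double_mutant])
p_gene1 = p_value_numerator_df_2(table["F"], table["df_within"])
print(table)
print(p_gene1)
```

With more than three groups the numerator would have more than 2 degrees of freedom, and the P value would have to come from an F table or a statistics library instead.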

Computational formulae:

Total Sum of Squared Residuals $= \sum_{j,i} X_{ij}^2 - N\bar{X}^2$

Within Groups SS residuals $= \sum_{j,i} X_{ij}^2 - \sum_{j} n_j \bar{X}_j^2$

Between Groups SS residuals $= \sum_{j} n_j \bar{X}_j^2 - N\bar{X}^2$
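The three computational formulae share two building blocks: the sum of squared data and the sum of group size times squared group mean. A minimal sketch, with an assumed function name `ss_computational`, shows how they fit together for the gene 1 data and confirms that Total = Within + Between.

```python
def ss_computational(groups):
    """Total, within-group and between-group SS via the computational formulae."""
    values = [v for g in groups for v in g]
    total_n = len(values)                      # N
    grand_mean = sum(values) / total_n
    sum_sq_data = sum(v * v for v in values)   # sum over all X_ij^2
    # Sum over groups of n_j * (group mean)^2
    sum_group_terms = sum(len(g) * (sum(g) / len(g)) ** 2 for g in groups)
    correction = total_n * grand_mean ** 2     # N * (grand mean)^2
    total = sum_sq_data - correction
    within = sum_sq_data - sum_group_terms
    between = sum_group_terms - correction
    return total, within, between

gene1_groups = [
    [0.46, 0.30, 0.80, 1.51, 0.90],   # normal
    [-0.10, 0.49, 0.24, 0.06, 0.46],  # mutant
    [-0.34, 0.56, 0.33],              # double mutant
]
ss_total, ss_within, ss_between = ss_computational(gene1_groups)
print(ss_total, ss_within, ss_between)
```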