Section 16

Linear constraints in multiple linear regression. Analysis of variance.
Multiple linear regression with general linear constraints. Let us consider a multiple
linear regression Y = Xβ + ε and suppose that we want to test a hypothesis given by a set
of s linear equations, written in matrix form as

H0 : Aβ = c,

where A is an s × p matrix and c is an s × 1 vector. We will assume that s ≤ p and that the
matrix A has rank s. This generalizes two types of hypotheses from the previous lecture, where
we considered only one linear combination of the parameters (the case s = 1) or tested a
hypothesis about all parameters simultaneously (the case s = p).
To test this general hypothesis, a natural idea is to compare how far Aβ̂ is from c, and
to do this we need to find the distribution of Aβ̂. Clearly, this distribution is normal with
mean Aβ and covariance

E[A(β̂ − β)(β̂ − β)^T A^T] = A Cov(β̂) A^T = σ² A(X^T X)^{−1} A^T = σ² D,

where we introduced the notation

D := A(X^T X)^{−1} A^T.

The matrix D is a symmetric positive definite invertible s × s matrix and, therefore, we can
take its square root D^{1/2}. It is easy to check that the covariance of D^{−1/2} A(β̂ − β) is σ²I.
This implies that

\[
\frac{1}{\sigma^2}\bigl|D^{-1/2}A(\hat\beta - \beta)\bigr|^2
= \frac{1}{\sigma^2}\,(A(\hat\beta - \beta))^T D^{-1} A(\hat\beta - \beta) \sim \chi^2_s.
\]
Under the null hypothesis Aβ = c, we get

\[
\frac{1}{\sigma^2}\,(A\hat\beta - c)^T D^{-1}(A\hat\beta - c) \sim \chi^2_s. \tag{16.0.1}
\]
Since nσ̂²/σ² ∼ χ²_{n−p} is independent of β̂, we get

\[
\frac{\dfrac{1}{s\sigma^2}\,(A\hat\beta - c)^T D^{-1}(A\hat\beta - c)}{\dfrac{n\hat\sigma^2}{(n-p)\sigma^2}}
= \frac{n-p}{ns\hat\sigma^2}\,(A\hat\beta - c)^T D^{-1}(A\hat\beta - c) \sim F_{s,n-p}. \tag{16.0.2}
\]
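As an aside (a minimal sketch, not part of the notes), the F statistic in (16.0.2) can be
computed directly in Matlab; the design matrix, the constraint and the data below are made
up for illustration, and ’fcdf’ from the Statistics Toolbox supplies the tail probability.

% Sketch: F test of H0 : A*beta = c in a multiple linear regression.
% The data and the constraint are hypothetical.
n = 50; p = 3; s = 2;
X = [ones(n,1), randn(n,2)];            % design matrix with an intercept
Y = X*[1; 2; 0] + randn(n,1);           % simulated response
A = [0 1 0; 0 0 1]; c = [2; 0];         % H0: beta_2 = 2 and beta_3 = 0
betahat   = (X'*X)\(X'*Y);              % least squares / MLE of beta
sigma2hat = sum((Y - X*betahat).^2)/n;  % MLE of sigma^2 (divides by n)
D = A*((X'*X)\A');                      % D = A (X'X)^{-1} A'
r = A*betahat - c;
F = (n-p)/(n*s*sigma2hat) * (r'*(D\r)); % the statistic in (16.0.2)
pval = 1 - fcdf(F, s, n-p);             % tail probability of F(s, n-p)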
This is enough to test the hypothesis H0. However, in a variety of applications a different,
equivalent representation of (16.0.1) is more useful. It is given in terms of the MLE β̂_A of β that
satisfies the constraint in H0. In other words, β̂_A is obtained by solving

\[
|Y - X\beta|^2 \to \min \quad \text{subject to the constraint } A\beta = c. \tag{16.0.3}
\]

Lemma. If β̂_A is a solution of (16.0.3) then the left hand side of (16.0.1) is equal to

\[
\frac{1}{\sigma^2}\,\bigl|X(\hat\beta_A - \hat\beta)\bigr|^2. \tag{16.0.4}
\]
Proof. First, let us find the constrained MLE β̂_A explicitly. By the method of Lagrange
multipliers we need to solve the system of two equations:

\[
A\beta = c, \qquad \frac{\partial}{\partial\beta}\Bigl(|Y - X\beta|^2 + (A\beta - c)^T\lambda\Bigr) = 0,
\]

where λ is an s × 1 vector of Lagrange multipliers. The second equation is

\[
-2X^T Y + 2X^T X\beta + A^T\lambda = 0.
\]

Solving this for β gives

\[
\hat\beta_A = (X^TX)^{-1}X^TY - \tfrac{1}{2}(X^TX)^{-1}A^T\lambda = \hat\beta - \tfrac{1}{2}(X^TX)^{-1}A^T\lambda.
\]

Since β̂_A must satisfy the linear constraint, we get

\[
c = A\hat\beta_A = A\hat\beta - \tfrac{1}{2}A(X^TX)^{-1}A^T\lambda = A\hat\beta - \tfrac{1}{2}D\lambda.
\]

Solving this for λ, λ = 2D^{−1}(Aβ̂ − c), we get

\[
\hat\beta_A = \hat\beta - (X^TX)^{-1}A^T D^{-1}(A\hat\beta - c)
\]

and, therefore,

\[
X(\hat\beta_A - \hat\beta) = -X(X^TX)^{-1}A^T D^{-1}(A\hat\beta - c).
\]

We can use this formula to compute

\begin{align*}
|X(\hat\beta_A - \hat\beta)|^2
&= \bigl(X(\hat\beta_A - \hat\beta)\bigr)^T X(\hat\beta_A - \hat\beta)\\
&= (A\hat\beta - c)^T \bigl(X(X^TX)^{-1}A^T D^{-1}\bigr)^T X(X^TX)^{-1}A^T D^{-1}(A\hat\beta - c)\\
&= (A\hat\beta - c)^T D^{-1}A(X^TX)^{-1}X^T X(X^TX)^{-1}A^T D^{-1}(A\hat\beta - c)\\
&= (A\hat\beta - c)^T D^{-1}A(X^TX)^{-1}A^T D^{-1}(A\hat\beta - c)\\
&= (A\hat\beta - c)^T D^{-1}DD^{-1}(A\hat\beta - c)\\
&= (A\hat\beta - c)^T D^{-1}(A\hat\beta - c).
\end{align*}

Comparing with (16.0.1) proves the Lemma.
Using (16.0.2) and the Lemma, we get that under the null hypothesis H0:

\[
\frac{n-p}{ns\hat\sigma^2}\,\bigl|X(\hat\beta_A - \hat\beta)\bigr|^2 \sim F_{s,n-p}. \tag{16.0.5}
\]
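Continuing the sketch above (again, not part of the notes), the constrained MLE from the
proof can be computed explicitly and the identity between (16.0.4) and the quadratic form in
(16.0.1) checked numerically:

% Sketch: constrained MLE beta_hat_A and a numerical check of the Lemma.
betaA = betahat - (X'*X)\(A'*(D\r));   % beta_hat - (X'X)^{-1} A' D^{-1} (A*betahat - c)
lhs = sum((X*(betaA - betahat)).^2);   % |X(beta_hat_A - beta_hat)|^2
rhs = r'*(D\r);                        % (A*betahat - c)' D^{-1} (A*betahat - c)
% lhs and rhs agree up to rounding error, as the Lemma asserts.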
There are many different models that are special cases of multiple linear regression,
and many hypotheses about these models can be written as general linear constraints. We
will describe one such model in detail, the one-way layout in analysis of variance. Then we will
describe a couple of other models without going into details, since by then the idea will be clear.
Analysis of variance: one-way layout. Suppose that we are given p independent
samples

Y_{11}, . . . , Y_{1n_1} ∼ N(µ_1, σ²)
. . .
Y_{p1}, . . . , Y_{pn_p} ∼ N(µ_p, σ²)

of sizes n_1, . . . , n_p correspondingly. We assume that the variances of all distributions are equal.
We would like to test the hypothesis that the means of all distributions are equal,

H0 : µ_1 = . . . = µ_p.
This problem is in fact a special case of multiple linear regression and of testing a hypothesis
given by linear equations. We can write

Y_{ki} = µ_k + ε_{ki}, where ε_{ki} ∼ N(0, σ²), for k = 1, . . . , p, i = 1, . . . , n_k.

Let us consider the n × 1 vector, where n = n_1 + . . . + n_p,

Y = (Y_{11}, . . . , Y_{1n_1}, . . . , Y_{p1}, . . . , Y_{pn_p})^T

and the p × 1 parameter vector

µ = (µ_1, . . . , µ_p)^T.
Then we can write all the equations in matrix form,

Y = Xµ + ε,

where X is the following n × p matrix:

\[
X = \begin{pmatrix}
1 & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 1
\end{pmatrix}.
\]
The blocks have n_1, . . . , n_p rows. Basically, the predictor matrix X consists of indicators of
the group to which each observation belongs. The hypothesis H0 can be written in matrix form
as Aµ = 0 for the (p − 1) × p matrix
\[
A = \begin{pmatrix}
1 & 0 & \cdots & 0 & -1\\
0 & 1 & \cdots & 0 & -1\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & 1 & -1
\end{pmatrix}.
\]
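As a small illustration (a sketch with hypothetical group sizes, not from the notes), both
matrices can be built directly in Matlab:

% Sketch: design matrix X and constraint matrix A for a one-way layout.
ns = [4 3 5];                        % hypothetical group sizes n_1, ..., n_p
p  = numel(ns); n = sum(ns);
g  = repelem(1:p, ns);               % group label of each observation
X  = full(sparse(1:n, g, 1, n, p));  % n x p matrix of group indicators
A  = [eye(p-1), -ones(p-1,1)];       % (p-1) x p matrix encoding mu_1 = ... = mu_p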
We need to compute the statistic in (16.0.5), which will have the F_{p−1,n−p} distribution. First of all,

\[
X^TX = \begin{pmatrix}
n_1 & 0 & \cdots & 0\\
0 & n_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & n_p
\end{pmatrix}.
\]

Since µ̂ = (X^T X)^{−1}X^T Y, it is easy to see that for each i ≤ p,

\[
\hat\mu_i = \frac{1}{n_i}\sum_{k=1}^{n_i} Y_{ik} = \bar Y_i,
\]

the average of the ith sample. We also get

\[
\hat\sigma^2 = \frac{1}{n}\,|Y - X\hat\mu|^2 = \frac{1}{n}\sum_{i=1}^{p}\sum_{k=1}^{n_i}(Y_{ik} - \bar Y_i)^2.
\]
To find the MLE µ̂_A under the linear constraint Aµ = 0 we simply need to minimize
|Y − Xµ|² over vectors µ = (µ_1, . . . , µ_1)^T with all coordinates equal. But, obviously, Xµ is
then the n × 1 vector (µ_1, . . . , µ_1)^T, so we need to minimize

\[
\sum_{i=1}^{p}\sum_{k=1}^{n_i}(Y_{ik} - \mu_1)^2 \;\to\; \min_{\mu_1}
\]

and we get

\[
\mu_1 = \frac{1}{n}\sum_{i=1}^{p}\sum_{k=1}^{n_i} Y_{ik} = \bar Y,
\]

the overall average of all the samples. Therefore,

\[
\hat\mu_A - \hat\mu = (\bar Y - \bar Y_1, \ldots, \bar Y - \bar Y_p)^T
\]

and

\[
\bigl|X(\hat\mu_A - \hat\mu)\bigr|^2
= \sum_{i=1}^{p}\sum_{k=1}^{n_i}(\bar Y_i - \bar Y)^2
= \sum_{i=1}^{p} n_i(\bar Y_i - \bar Y)^2.
\]
By (16.0.5), under the null hypothesis H0,

\[
F := \frac{n-p}{p-1}\cdot
\frac{\sum_{i=1}^{p} n_i(\bar Y_i - \bar Y)^2}{\sum_{i=1}^{p}\sum_{k=1}^{n_i}(Y_{ik} - \bar Y_i)^2}
\sim F_{p-1,\,n-p}. \tag{16.0.6}
\]
In order to test H0, we define the decision rule

\[
\delta = \begin{cases} H_0, & F \le c_\alpha,\\ H_1, & F > c_\alpha, \end{cases}
\]

where F_{p−1,n−p}(c_α, +∞) = α. The sum in the numerator of (16.0.6) represents the total
variation of the sample means Ȳ_i of each population around the overall mean Ȳ. The sum in
the denominator represents the total variation of the observations Y_{ik} around their particular
sample means Ȳ_i. This interpretation of the test statistic explains the name: analysis of
variance, or anova.
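As a sketch (not part of the notes), the statistic (16.0.6) and the threshold c_α can be
computed by hand from a cell array of samples; the samples below are simulated, and ’finv’
from the Statistics Toolbox gives the critical value.

% Sketch: one-way ANOVA F statistic (16.0.6) computed by hand.
samples = {randn(10,1)+1, randn(12,1)+1, randn(8,1)+1.5};   % hypothetical groups
p  = numel(samples);
ns = cellfun(@numel, samples); n = sum(ns);
Ybar_i = cellfun(@mean, samples);                 % sample mean of each group
Ybar   = sum(ns(:).*Ybar_i(:))/n;                 % overall mean
SSbetween = sum(ns(:).*(Ybar_i(:) - Ybar).^2);    % numerator sum in (16.0.6)
SSwithin  = sum(cellfun(@(y,m) sum((y - m).^2), samples, num2cell(Ybar_i)));
F = (n-p)/(p-1) * SSbetween/SSwithin;             % the statistic in (16.0.6)
alpha   = 0.05;
c_alpha = finv(1 - alpha, p-1, n-p);              % critical value c_alpha
reject  = F > c_alpha;                            % the decision rule above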
Example. Let us again consider the normal body temperature dataset and perform an anova
test to compare the mean body temperatures of men and women. Previously we tested
this using t-tests and the two-sample KS test. We use the Matlab function

[p,tbl,stats]=anova1([men, women])

where ’men’ and ’women’ are 65 × 1 vectors. For groups of unequal sizes ’anova1’ requires a second
argument with group labels. The output produces a table ’tbl’:
’Source’      ’SS’         ’df’     ’MS’        ’F’         ’Prob>F’
’Columns’     [ 2.7188]    [  1]    [2.7188]    [5.2232]    [0.0239]
’Error’       [66.6262]    [128]    [0.5205]
’Total’       [69.3449]    [129]
’SS’ gives the sums of squares in the numerator of (16.0.6) (’Columns’) and the denominator
(’Error’), and their total. The column ’df’ gives the degrees of freedom p − 1 and n − p.
’MS’ gives the sums of squares normalized by the corresponding degrees of freedom. ’F’ is the
statistic in (16.0.6) and ’Prob>F’ is the p-value corresponding to this F-statistic. Since the
p-value 0.0239 is smaller than the level of significance α = 0.05, we reject the null hypothesis
that the means are equal.
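As a quick arithmetic check (not in the original notes), the table entries are consistent: F is
the ratio of the mean squares and ’Prob>F’ is the upper tail of the F(1, 128) distribution.

% Check of the table entries above.
F = 2.7188/0.5205;            % about 5.2234, matching the ’F’ entry up to rounding
p = 1 - fcdf(5.2232, 1, 128); % about 0.0239, matching the ’Prob>F’ entry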
Analysis of variance: two-way layout. Suppose that we again have samples from
different groups, only now the groups have two categories defined by two factors. For
example, if we want to compare SAT scores in different states but also separate public and
private schools, then we have groups defined by two factors: state and school type. We
consider the following model of the data:

Y_{ijk} = µ_{ij} + ε_{ijk}

for i = 1, . . . , a, j = 1, . . . , b and k = 1, . . . , n_{ij}, i.e. we have a categories of the first factor,
b categories of the second factor and n_{ij} observations in group (i, j). This model is not any
different from one-way anova, simply the groups are indexed by two parameters/factors, and
the estimation of parameters can be carried out as in the one-way anova. However, to test
various hypotheses about the effects of these two factors it is more convenient to write the
model in an equivalent way as follows:

Y_{ijk} = µ + α_i + β_j + γ_{ij} + ε_{ijk},
where we assume that

\[
\sum_{i=1}^{a}\alpha_i = 0,\qquad
\sum_{j=1}^{b}\beta_j = 0,\qquad
\sum_{i=1}^{a}\gamma_{ij} = \sum_{j=1}^{b}\gamma_{ij} = 0.
\]
These constraints define all the parameters uniquely in terms of the original parameters µ_{ij}. The
parameter µ is called the overall mean. The reason we separate the additive effects α_i and β_j of
the two factors from the most general interaction effect γ_{ij} is that it is easier to formulate
various hypotheses in terms of these parameters. For example:
• H0 : α_1 = . . . = α_a = 0 - the additive effect of the first factor is insignificant;
• H0 : β_1 = . . . = β_b = 0 - the additive effect of the second factor is insignificant;
• H0 : all γ_{ij} = 0 - the effect of the interaction of both factors is insignificant, i.e. the
effect of the factors is additive.
The Matlab function ’anova2’ performs a two-way anova if the sizes n_{ij} of all the groups are
equal, i.e. the data are balanced. If the group sizes are different one should use ’anovan’, a
general N-way anova; a sketch of such a call is given below.
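A call to ’anovan’ for unbalanced data might look as follows; this is only a sketch, and the
responses and group labels are made up.

% Sketch: two-way ANOVA with possibly unbalanced groups via 'anovan'.
y  = randn(30,1);                                  % hypothetical responses
gA = repmat({'receptive'; 'nonreceptive'}, 15, 1); % labels of the first factor
gB = repmat({'1'; '1'; '8'}, 10, 1);               % labels of the second factor
% 'interaction' asks for the alpha_i, beta_j and gamma_ij effects of the model above.
[p, tbl] = anovan(y, {gA, gB}, 'model', 'interaction', 'varnames', {'A','B'});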
Analysis of covariance. This is another special case of multiple regression, in which all
groups of data have a continuous predictor variable. The model is:

Y_{ik} = α + α_i + (β + β_i)X_{ik} + ε_{ik}

for i = 1, . . . , a and k = 1, . . . , n_i. We have a groups and n_i observations in the ith group. To
determine the parameters uniquely we assume that

\[
\sum_{i=1}^{a}\alpha_i = 0,\qquad \sum_{i=1}^{a}\beta_i = 0.
\]
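One way to fit such a model outside of ’aoctool’, in newer Matlab releases, is ’fitlm’ with a
categorical group variable and an interaction term. This is only a sketch with made-up data
and hypothetical variable names, not the approach used in the notes.

% Sketch: ANCOVA as a linear model with a categorical group and an interaction.
% The interaction x*g produces group-specific intercepts alpha_i and slopes beta_i.
x = rand(60,1);                                      % continuous predictor
g = categorical(repelem((1:3)', 20));                % hypothetical group labels
y = 2 + 3*x + 0.5*double(g=='2') + 0.3*randn(60,1);  % made-up response
tbl = table(x, g, y);
mdl = fitlm(tbl, 'y ~ x*g');   % fits y = alpha + alpha_i + (beta + beta_i)x
anova(mdl)                     % F tests for 'g', 'x' and the 'x:g' interaction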
Example. (Fruitfly dataset) We consider a dataset from [1] (available on the journal’s
website) and [2]. The experiment consisted of five groups of male fruitflies, 25 male fruitflies in
each group. The males in each group were supplied each day with a different number of either
receptive or non-receptive females.
Group 1: 8 newly inseminated non-receptive females per day;
Group 2: no females;
Group 3: 1 newly inseminated non-receptive female per day;
Group 4: 1 receptive female per day;
Group 5: 8 receptive females per day.
The experiment was designed to test whether increased reproduction results in decreased
longevity, so the lifespan of each male fruitfly was the response variable Y.
One-way anova. Let us start with a one-way anova, i.e. we consider the model

Y_{ik} = µ_i + ε_{ik}, where i = 1, . . . , 5, k = 1, . . . , 25
and test the hypothesis H0 : µ1 = . . . = µ5 . Suppose that ’lifespan1’ is a 25 × 5 matrix such
that each column contains observations from one group. Then running
[p,tbl,stats]=anova1(lifespan1);
produces the boxplot in figure 16.1 and a table ’tbl’:
’Source’      ’SS’             ’df’     ’MS’             ’F’          ’Prob>F’
’Columns’     [1.1939e+004]    [  4]    [2.9848e+003]    [13.6120]    [3.5156e-009]
’Error’       [2.6314e+004]    [120]    [  219.2793]
’Total’       [3.8253e+004]    [124]
The p-value suggests how unlikely the hypothesis H0 is.

Figure 16.1: Boxplot for one-way ANOVA (lifespan values for each column/group, 1 through 5).

The boxplot suggests that the last group’s lifespan is most different from the other four
groups. As a result, we might want to test the hypothesis H0 : µ_1 = . . . = µ_4 that the means
of the first four groups are equal,
[p,tbl,stats]=anova1(lifespan1(:,1:4));
we get the following table
’Source’      ’SS’             ’df’    ’MS’          ’F’         ’Prob>F’
’Columns’     [  988.0800]     [ 3]    [329.3600]    [1.3869]    [0.2515]
’Error’       [2.2798e+004]    [96]    [237.4842]
’Total’       [2.3787e+004]    [99]
and we see that the p-value is 0.2515, so we accept H0 if the level of significance α ≤ p-value.
Two-way anova. Let us now consider the four groups other than the second group (no females)
and test the effects of two factors:
• Factor A: ’receptive’ or ’non-receptive’;
• Factor B: ’1’ or ’8’.
This means that we consider the model

Y_{ijk} = µ + α_i + β_j + γ_{ij} + ε_{ijk}

for i = 1, 2, j = 1, 2 and k = 1, . . . , 25. To use the Matlab function ’anova2’ we arrange
the data into a 50 × 2 matrix ’lifespan2’ such that the two columns represent the two categories of
Factor A, the first 25 rows represent group ’1’ of Factor B and rows 26 through 50 represent
group ’8’ of Factor B. Then
[p,tbl,stats]=anova2(lifespan2,25)
produces (here 25 indicates the number of replicates in one cell) the table
’Source’         ’SS’             ’df’    ’MS’             ’F’          ’Prob>F’
’Columns’        [6.6749e+003]    [ 1]    [6.6749e+003]    [32.3348]    [1.3970e-007]
’Rows’           [1.7223e+003]    [ 1]    [1.7223e+003]    [ 8.3430]    [     0.0048]
’Interaction’    [2.3717e+003]    [ 1]    [2.3717e+003]    [11.4890]    [     0.0010]
’Error’          [1.9817e+004]    [96]    [  206.4308]
’Total’          [3.0586e+004]    [99]
p-values in the last column correspond to three hypotheses:
• H0 : α_1 = α_2 = 0, i.e. the effect of Factor A is insignificant;
• H0 : β_1 = β_2 = 0, i.e. the effect of Factor B is insignificant;
• H0 : γ_{11} = γ_{12} = γ_{21} = γ_{22} = 0, i.e. the effect of the ’interaction’ between Factors A
and B is insignificant.
Small p-values suggest that all these hypotheses should be rejected.
Analysis of covariance. Besides the reproduction factors A and B, another continuous
explanatory variable for longevity was used: the length of the thorax (the division of the body
between the head and the abdomen, i.e. the chest). We are now in the setting of ancova:

Y_{ik} = α + α_i + (β + β_i)X_{ik} + ε_{ik}

for i = 1, . . . , 5 and k = 1, . . . , 25. The analysis of covariance tool in Matlab
aoctool(thorax,lifespan,groups);
produces the following output, figure 16.2:
Figure 16.2: Left column, top to bottom: graph of the fitted line for each group (lifespan
versus thorax length, groups 1 through 5), estimates of the coefficients, and the ancova test
table. Right column: the same under the assumption that all slopes are equal.
We see that the p-value for the ’groups*thorax’ interaction, corresponding to the hypothesis that
all β_i = 0, is equal to 0.9844, which means that we can accept this hypothesis. As a result,
we fit the model with equal slopes for all groups (figure 16.2, right column). The p-values for
’groups’ and ’thorax’, corresponding to the hypotheses that all α_i = 0 and that β = 0, are almost 0,
so we should reject these hypotheses.
References.

[1] Hanley, J. A., and Shapiro, S. H. (1994), "Sexual Activity and the Lifespan of Male
Fruitflies: A Dataset That Gets Attention," Journal of Statistics Education, Volume 2, Number 1.

[2] Partridge, L., and Farquhar, M. (1981), "Sexual Activity and the Lifespan of Male
Fruitflies," Nature, 294, 580-581.