7. Regression Analysis

7.1 Simple linear regression for normal-theory Gauss-Markov models

Model 1:
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
where $\epsilon_i \sim NID(0,\sigma^2)$ for $i = 1,\dots,n$.

Matrix formulation:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
= \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+ \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}
\qquad\text{or}\qquad \mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}.$$

The OLS estimator (b.l.u.e.) for $\boldsymbol{\beta}$ is
$$\mathbf{b} = (X^TX)^{-1}X^T\mathbf{Y} \qquad\text{(when does this exist?)}$$

Here
$$X^TX = \begin{bmatrix} n & \sum_{i=1}^n X_i \\ \sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2 \end{bmatrix}
\qquad\text{and}\qquad
X^T\mathbf{Y} = \begin{bmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_iY_i \end{bmatrix}.$$
Then
$$(X^TX)^{-1} = \frac{1}{n\sum_{i=1}^n X_i^2 - \left(\sum_{i=1}^n X_i\right)^2}
\begin{bmatrix} \sum_{i=1}^n X_i^2 & -\sum_{i=1}^n X_i \\ -\sum_{i=1}^n X_i & n \end{bmatrix}
= \frac{1}{n\sum_{i=1}^n (X_i-\bar{X})^2}
\begin{bmatrix} \sum_{i=1}^n X_i^2 & -n\bar{X} \\ -n\bar{X} & n \end{bmatrix}$$

and
$$\mathbf{b} = (X^TX)^{-1}X^T\mathbf{Y}
= \frac{1}{n\sum_{i=1}^n (X_i-\bar{X})^2}
\begin{bmatrix} \left(\sum_{i=1}^n X_i^2\right)\left(\sum_{i=1}^n Y_i\right) - n\bar{X}\sum_{i=1}^n X_iY_i \\[4pt]
-n\bar{X}\sum_{i=1}^n Y_i + n\sum_{i=1}^n X_iY_i \end{bmatrix}$$

so that
$$\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}
= \begin{bmatrix} \bar{Y} - b_1\bar{X} \\[4pt]
\dfrac{\sum_{i=1}^n (X_i-\bar{X})Y_i}{\sum_{i=1}^n (X_i-\bar{X})^2} \end{bmatrix}.$$
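As a quick check of these closed-form expressions, here is a minimal R sketch on a small synthetic data set (the data values and variable names are illustrative, not from these notes):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.8, 8.2, 9.9)          # synthetic responses
X <- cbind(1, x)                          # design matrix [1 | X]
b <- solve(t(X) %*% X, t(X) %*% y)        # b = (X'X)^{-1} X'Y
b1 <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
t(b); c(b0, b1)                           # identical values
coef(lm(y ~ x))                           # lm() agrees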
Covariance matrix:
$$Var(\mathbf{b}) = Var\left((X^TX)^{-1}X^T\mathbf{Y}\right)
= (X^TX)^{-1}X^T(\sigma^2I)X(X^TX)^{-1}
= \sigma^2(X^TX)^{-1}
= \sigma^2\begin{bmatrix}
\frac{1}{n} + \frac{\bar{X}^2}{\sum_i(X_i-\bar{X})^2} & \frac{-\bar{X}}{\sum_i(X_i-\bar{X})^2} \\[4pt]
\frac{-\bar{X}}{\sum_i(X_i-\bar{X})^2} & \frac{1}{\sum_i(X_i-\bar{X})^2}
\end{bmatrix}$$

Estimate the covariance matrix for $\mathbf{b}$ as
$$S_{\mathbf{b}} = MSE\,(X^TX)^{-1}
\qquad\text{where}\qquad
MSE = SSE/(n-2) = \frac{1}{n-2}\,\mathbf{Y}^T(I-P_X)\mathbf{Y}.$$

Analysis of Variance:
$$\mathbf{Y}^T\mathbf{Y} = \sum_{i=1}^n Y_i^2
= \mathbf{Y}^T(I - P_X + P_X - P_1 + P_1)\mathbf{Y}
= \mathbf{Y}^T(I-P_X)\mathbf{Y} + \mathbf{Y}^T(P_X-P_1)\mathbf{Y} + \mathbf{Y}^TP_1\mathbf{Y}$$
where $\mathbf{Y}^T(I-P_X)\mathbf{Y} = SSE$, $\mathbf{Y}^T(P_X-P_1)\mathbf{Y}$ is the "corrected model" sum of squares (call this $R(\beta_1|\beta_0)$), and $\mathbf{Y}^TP_1\mathbf{Y}$ is the correction for the "mean" (call this $R(\beta_0)$).

(i) By Cochran's Theorem, these three sums of squares are multiples of independent chi-squared random variables.

(ii) By Result 4.7, $\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-2)}$ if the model is correctly specified.
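A small R sketch of this decomposition, again on synthetic data (the projection matrices are built directly from their definitions):

# Verify Y'Y = SSE + R(beta1|beta0) + R(beta0) numerically
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.8, 8.2, 9.9)
n <- length(y)
X  <- cbind(1, x)
PX <- X %*% solve(t(X) %*% X) %*% t(X)   # projection onto C(X)
P1 <- matrix(1/n, n, n)                  # projection onto span{1}
SSE   <- drop(t(y) %*% (diag(n) - PX) %*% y)
R.1g0 <- drop(t(y) %*% (PX - P1) %*% y)  # R(beta1|beta0)
R.0   <- drop(t(y) %*% P1 %*% y)         # R(beta0) = n * mean(y)^2
c(sum(y^2), SSE + R.1g0 + R.0)           # the two totals agree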
Correction for the overall mean:
$$R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y}
= \mathbf{Y}^T(I - (I-P_1))\mathbf{Y}
= \mathbf{Y}^TI\mathbf{Y} - \mathbf{Y}^T(I-P_1)\mathbf{Y}
= \sum_{i=1}^n (Y_i-0)^2 - \sum_{i=1}^n (Y_i-\bar{Y})^2$$
where the second term is the sum of squared residuals from fitting the model $Y_i = \mu + \epsilon_i$. The OLS estimator for $\mu = E(Y_i)$ is
$$\hat{\mu} = (\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{Y} = (n)^{-1}\sum_{i=1}^n Y_i = \bar{Y}.$$

Notation: the reduction in residual sum of squares is
$$R(\beta_{k+1},\dots,\beta_{k+q} \mid \beta_0,\beta_1,\dots,\beta_k)
= \mathbf{Y}^T(I-P_{X_1})\mathbf{Y} - \mathbf{Y}^T(I-P_X)\mathbf{Y},$$
the sum of squared residuals for the smaller model minus the sum of squared residuals for the larger model. Here $X = [\,X_1 \mid X_2\,]$, where $X_1$ contains the columns corresponding to $\beta_0,\beta_1,\dots,\beta_k$ and $X_2$ contains the columns corresponding to $\beta_{k+1},\dots,\beta_{k+q}$.
An alternative formula:
$$R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y}
= \mathbf{Y}^T\mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{Y}
= \left(\sum_{i=1}^n Y_i\right)(n)^{-1}\left(\sum_{i=1}^n Y_i\right)
= n\bar{Y}^2$$
with $df = rank(P_1) = rank(\mathbf{1}) = 1$.

Reduction in the residual sum of squares for regression on $X$:
$$R(\beta_1|\beta_0) = \mathbf{Y}^T(P_X-P_1)\mathbf{Y}
= \mathbf{Y}^T(P_X - I + I - P_1)\mathbf{Y}
= \mathbf{Y}^T(I-P_1 - (I-P_X))\mathbf{Y}
= \mathbf{Y}^T(I-P_1)\mathbf{Y} - \mathbf{Y}^T(I-P_X)\mathbf{Y},$$
the sum of squared residuals for fitting the model $Y_i = \mu + \epsilon_i$ minus the sum of squared residuals for fitting the model $Y_i = \beta_0 + \beta_1X_i + \epsilon_i$.
ANOVA table:

Source of variation       d.f.   Sum of Squares
Regression on X           1      $R(\beta_1|\beta_0) = \mathbf{Y}^T(P_X-P_1)\mathbf{Y}$
Residuals                 n-2    $\mathbf{Y}^T(I-P_X)\mathbf{Y}$
Corrected total           n-1    $\mathbf{Y}^T(I-P_1)\mathbf{Y}$
Correction for the mean   1      $\mathbf{Y}^TP_1\mathbf{Y} = n\bar{Y}^2$

F-tests
From Result 4.7 we have
$$\frac{1}{\sigma^2}R(\beta_0) = \frac{1}{\sigma^2}\mathbf{Y}^TP_1\mathbf{Y} \sim \chi^2_1(\delta^2)$$
where
$$\delta^2 = \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^TP_1X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^TP_1^TP_1X\boldsymbol{\beta}
= \frac{1}{\sigma^2}(P_1X\boldsymbol{\beta})^T(P_1X\boldsymbol{\beta})
= \frac{n}{\sigma^2}(\beta_0 + \beta_1\bar{X})^2.$$

Also use Result 4.7 to show that
$$\frac{1}{\sigma^2}SSE = \frac{1}{\sigma^2}\mathbf{Y}^T(I-P_X)\mathbf{Y} \sim \chi^2_{(n-2)}$$
and use Result 4.8 to show that $\frac{1}{\sigma^2}SSE$ is distributed independently of $\frac{1}{\sigma^2}R(\beta_0) = \frac{1}{\sigma^2}\mathbf{Y}^TP_1\mathbf{Y}$. This follows from
$$(I-P_X)P_1 = \mathbf{0}.$$
Consequently,
$$F = \frac{R(\beta_0)}{MSE} \sim F_{(1,\,n-2)}(\delta^2)$$
and this becomes a central F-distribution when the null hypothesis is true. Hypothesis test: reject $H_0: \beta_0 + \beta_1\bar{X} = 0$ if
$$F = \frac{R(\beta_0)}{MSE} > F_{(1,\,n-2),\,\alpha}.$$

Test the null hypothesis $H_0: \beta_1 = 0$:
$$F = \frac{R(\beta_1|\beta_0)/1}{MSE}
= \frac{[\mathbf{Y}^T(P_X-P_1)\mathbf{Y}]/[1\cdot\sigma^2]}{[\mathbf{Y}^T(I-P_X)\mathbf{Y}]/[(n-2)\sigma^2]}
\sim F_{(1,\,n-2)}(\delta^2)$$
where
$$\delta^2 = \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_1)X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_1)^T(P_X-P_1)X\boldsymbol{\beta}.$$
The null hypothesis is
$$H_0: (P_X-P_1)X\boldsymbol{\beta} = \mathbf{0}.$$
Here
$$(P_X-P_1)X = (P_X-P_1)[\,\mathbf{1} \mid \mathbf{X}\,]
= \left[\,(P_X-P_1)\mathbf{1} \mid (P_X-P_1)\mathbf{X}\,\right]
= \left[\,P_X\mathbf{1}-P_1\mathbf{1} \mid P_X\mathbf{X}-P_1\mathbf{X}\,\right]
= \left[\,\mathbf{1}-\mathbf{1} \mid \mathbf{X}-\bar{X}\mathbf{1}\,\right]
= \begin{bmatrix} 0 & X_1-\bar{X} \\ 0 & X_2-\bar{X} \\ \vdots & \vdots \\ 0 & X_n-\bar{X} \end{bmatrix}.$$

If any $X_i \neq X_j$, then we cannot have both $X_i-\bar{X} = 0$ and $X_j-\bar{X} = 0$. Consequently, if any $X_i \neq X_j$, then
$$(P_X-P_1)X\boldsymbol{\beta} = \mathbf{0} \quad\text{if and only if}\quad \beta_1 = 0.$$
Hence, the null hypothesis is $H_0: \beta_1 = 0$. Note that
$$\delta^2 = \frac{1}{\sigma^2}\,\beta_1^2\sum_{i=1}^n (X_i-\bar{X})^2.$$
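A short R sketch (synthetic data again) computing this F statistic for $H_0: \beta_1 = 0$ directly from the projection matrices and checking it against anova():

# F = R(beta1|beta0) / MSE, built from projections
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.8, 8.2, 9.9)
n <- length(y)
X  <- cbind(1, x)
PX <- X %*% solve(t(X) %*% X) %*% t(X)
P1 <- matrix(1/n, n, n)
R.1g0 <- drop(t(y) %*% (PX - P1) %*% y)
MSE   <- drop(t(y) %*% (diag(n) - PX) %*% y) / (n - 2)
R.1g0 / MSE                      # F statistic with (1, n-2) d.f.
anova(lm(y ~ x))[1, "F value"]   # same value from lm()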
Maximize the power of the F-test for
$$H_0: \beta_1 = 0 \quad\text{vs.}\quad H_A: \beta_1 \neq 0$$
by maximizing
$$\delta^2 = \frac{1}{\sigma^2}\,\beta_1^2\sum_{i=1}^n (X_i-\bar{X})^2.$$

Reparameterize the model:
$$Y_i = \alpha + \beta_1(X_i-\bar{X}) + \epsilon_i$$
with $\epsilon_i \sim NID(0,\sigma^2)$, $i = 1,\dots,n$.

Interpretation of parameters:
$\alpha = E(Y)$ when $X = \bar{X}$;
$\beta_1$ is the change in $E(Y)$ when $X$ is increased by one unit.
Matrix formulation:
$$\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}
= \begin{bmatrix} 1 & X_1-\bar{X} \\ \vdots & \vdots \\ 1 & X_n-\bar{X} \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta_1 \end{bmatrix}
+ \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}
\qquad\text{or}\qquad \mathbf{Y} = W\boldsymbol{\gamma} + \boldsymbol{\epsilon}.$$

Clearly,
$$W = X\begin{bmatrix} 1 & -\bar{X} \\ 0 & 1 \end{bmatrix} = XF
\qquad\text{and}\qquad
X = W\begin{bmatrix} 1 & \bar{X} \\ 0 & 1 \end{bmatrix} = WG.$$

For this reparameterization, the columns of $W$ are orthogonal and
$$W^TW = \begin{bmatrix} n & 0 \\ 0 & \sum_{i=1}^n (X_i-\bar{X})^2 \end{bmatrix},
\qquad
(W^TW)^{-1} = \begin{bmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{\sum_i (X_i-\bar{X})^2} \end{bmatrix},
\qquad
W^T\mathbf{Y} = \begin{bmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n (X_i-\bar{X})Y_i \end{bmatrix}.$$
Then,
$$\hat{\boldsymbol{\gamma}} = \begin{bmatrix} \hat{\alpha} \\ \hat{\beta}_1 \end{bmatrix}
= (W^TW)^{-1}W^T\mathbf{Y}
= \begin{bmatrix} \bar{Y} \\[4pt] \dfrac{\sum_i (X_i-\bar{X})Y_i}{\sum_i (X_i-\bar{X})^2} \end{bmatrix}$$
and
$$Var(\hat{\boldsymbol{\gamma}}) = \sigma^2(W^TW)^{-1}
= \begin{bmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{\sigma^2}{\sum_i (X_i-\bar{X})^2} \end{bmatrix}.$$
Hence, $\bar{Y}$ and $\hat{\beta}_1 = \sum_i (X_i-\bar{X})Y_i / \sum_i (X_i-\bar{X})^2$ are uncorrelated (independent for the normal-theory Gauss-Markov model).

Analysis of variance: Note that
$$P_X = X(X^TX)^{-1}X^T = W(W^TW)^{-1}W^T = P_W$$
and
$$R(\beta_0) + R(\beta_1|\beta_0) + SSE
= \mathbf{Y}^TP_1\mathbf{Y} + \mathbf{Y}^T(P_X-P_1)\mathbf{Y} + \mathbf{Y}^T(I-P_X)\mathbf{Y}
= \mathbf{Y}^TP_1\mathbf{Y} + \mathbf{Y}^T(P_W-P_1)\mathbf{Y} + \mathbf{Y}^T(I-P_W)\mathbf{Y}
= R(\alpha) + R(\beta_1|\alpha) + SSE.$$
The reparameterization does not change the ANOVA table.
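A brief R check (synthetic x values) that the reparameterization leaves the projection matrix, and hence the ANOVA table, unchanged:

# C(X) = C(W), so PX = PW and the ANOVA table is unchanged
x <- c(1, 2, 3, 4, 5)
X <- cbind(1, x)
W <- cbind(1, x - mean(x))
PX <- X %*% solve(t(X) %*% X) %*% t(X)
PW <- W %*% solve(t(W) %*% W) %*% t(W)
all.equal(PX, PW)   # TRUE
crossprod(W)        # diagonal: columns of W are orthogonal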
7.2 Multiple regression analysis for the normal-theory Gauss-Markov model

$$Y_i = \beta_0 + \beta_1X_{1i} + \cdots + \beta_rX_{ri} + \epsilon_i$$
where $\epsilon_i \sim NID(0,\sigma^2)$ for $i = 1,\dots,n$.

Matrix formulation: $\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim N(\mathbf{0},\sigma^2I)$, where
$$X = \begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{r1} \\
1 & X_{12} & X_{22} & \cdots & X_{r2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{rn}
\end{bmatrix}
= [\,\mathbf{1} \mid \mathbf{X}_1 \mid \mathbf{X}_2 \mid \cdots \mid \mathbf{X}_r\,]
\qquad\text{and}\qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{bmatrix}.$$

Suppose $rank(X) = r+1$. Then:

(i) the OLS estimator (b.l.u.e.) for $\boldsymbol{\beta}$ is $\mathbf{b} = (X^TX)^{-1}X^T\mathbf{Y}$;

(ii) $Var(\mathbf{b}) = \sigma^2(X^TX)^{-1}$;

(iii) $\hat{\mathbf{Y}} = X\mathbf{b} = X(X^TX)^{-1}X^T\mathbf{Y} = P_X\mathbf{Y}$;

(iv) $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = (I-P_X)\mathbf{Y}$;

(v) by Result 4.7,
$$\frac{1}{\sigma^2}SSE = \frac{1}{\sigma^2}\mathbf{e}^T\mathbf{e} = \frac{1}{\sigma^2}\mathbf{Y}^T(I-P_X)\mathbf{Y} \sim \chi^2_{(n-r-1)};$$

(vi) $MSE = \frac{SSE}{n-r-1}$ is an unbiased estimator of $\sigma^2$.
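The same quantities, computed directly in R for a small synthetic two-regressor example (so r = 2 here; names and data are illustrative):

# Items (i)-(iv) and (vi) from the matrix formulas
set.seed(3)
x1 <- rnorm(10); x2 <- rnorm(10)
y  <- 1 + 2*x1 - x2 + rnorm(10)
X  <- cbind(1, x1, x2)
b    <- solve(t(X) %*% X, t(X) %*% y)       # (i)   OLS estimator
PX   <- X %*% solve(t(X) %*% X) %*% t(X)
yhat <- PX %*% y                            # (iii) fitted values
e    <- y - yhat                            # (iv)  residuals
MSE  <- sum(e^2) / (length(y) - ncol(X))    # (vi)  estimates sigma^2
cbind(b, coef(lm(y ~ x1 + x2)))             # agrees with lm()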
The reduction in the residual sum of squares obtained by regression on $X_1, X_2, \dots, X_r$ is denoted
$$R(\beta_1,\beta_2,\dots,\beta_r \mid \beta_0)
= \mathbf{Y}^T(P_X-P_1)\mathbf{Y}
= \mathbf{Y}^T(I-P_1)\mathbf{Y} - \mathbf{Y}^T(I-P_X)\mathbf{Y}
= SS_{model}.$$

ANOVA

Source of variation                 d.f.     Sum of squares
Model (regression on X1, ..., Xr)   r        $R(\beta_1,\dots,\beta_r|\beta_0) = \mathbf{Y}^T(P_X-P_1)\mathbf{Y}$
Error (or residuals)                n-r-1    $\mathbf{Y}^T(I-P_X)\mathbf{Y}$
Corrected total                     n-1      $\mathbf{Y}^T(I-P_1)\mathbf{Y}$
Correction for the mean             1        $R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y} = n\bar{Y}^2$

Use Cochran's theorem or Results 4.7 and 4.8 to show that SSE is distributed independently of $R(\beta_1,\dots,\beta_r|\beta_0)$ and that
$$\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-r-1)}
\qquad\text{and}\qquad
\frac{1}{\sigma^2}R(\beta_1,\dots,\beta_r|\beta_0) \sim \chi^2_{(r)}(\delta^2).$$
Then
$$F = \frac{R(\beta_1,\dots,\beta_r|\beta_0)/r}{MSE} \sim F_{(r,\,n-r-1)}(\delta^2)$$
where
$$\delta^2 = \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_1)X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(P_X - I + I - P_1)X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\left[\boldsymbol{\beta}^TX^T(I-P_1)X\boldsymbol{\beta} - \boldsymbol{\beta}^TX^T(I-P_X)X\boldsymbol{\beta}\right].$$
Since $(I-P_X)X$ is a matrix of zeros,
$$\delta^2 = \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(I-P_1)X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(I-P_1)(I-P_1)X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\left[(I-P_1)X\boldsymbol{\beta}\right]^T(I-P_1)X\boldsymbol{\beta}.$$

Note that
$$(I-P_1)X = \left[\,(I-P_1)\mathbf{1} \mid (I-P_1)\mathbf{X}_1 \mid \cdots \mid (I-P_1)\mathbf{X}_r\,\right]
= \left[\,\mathbf{0} \mid \mathbf{X}_1-\bar{X}_1\mathbf{1} \mid \cdots \mid \mathbf{X}_r-\bar{X}_r\mathbf{1}\,\right],$$
so
$$\delta^2 = \frac{1}{\sigma^2}\left[\sum_{j=1}^r \beta_j^2(\mathbf{X}_j-\bar{X}_j\mathbf{1})^T(\mathbf{X}_j-\bar{X}_j\mathbf{1})
+ \sum_{j\neq k}\beta_j\beta_k(\mathbf{X}_j-\bar{X}_j\mathbf{1})^T(\mathbf{X}_k-\bar{X}_k\mathbf{1})\right]
= \frac{1}{\sigma^2}\,\boldsymbol{\beta}_*^T\left[\sum_{i=1}^n (\mathbf{X}_i-\bar{\mathbf{X}})(\mathbf{X}_i-\bar{\mathbf{X}})^T\right]\boldsymbol{\beta}_*$$
where
$$\boldsymbol{\beta}_* = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_r \end{bmatrix},
\qquad
\mathbf{X}_i = \begin{bmatrix} X_{1i} \\ \vdots \\ X_{ri} \end{bmatrix},
\qquad
\bar{\mathbf{X}} = \begin{bmatrix} \bar{X}_1 \\ \vdots \\ \bar{X}_r \end{bmatrix}.$$
If $\sum_{i=1}^n (\mathbf{X}_i-\bar{\mathbf{X}})(\mathbf{X}_i-\bar{\mathbf{X}})^T$ is positive definite, then the null hypothesis corresponding to $\delta^2 = 0$ is
$$H_0: \boldsymbol{\beta}_* = \mathbf{0} \quad(\text{or } \beta_1 = \beta_2 = \cdots = \beta_r = 0).$$
Reject $H_0: \boldsymbol{\beta}_* = \mathbf{0}$ if
$$F = \frac{\mathbf{Y}^T(P_X-P_1)\mathbf{Y}/r}{\mathbf{Y}^T(I-P_X)\mathbf{Y}/(n-r-1)} > F_{(r,\,n-r-1),\,\alpha}.$$

Sequential sums of squares (Type I sums of squares in PROC GLM or PROC REG in SAS). Let
$$X_0 = \mathbf{1},\quad X_1 = [\,\mathbf{1}\mid\mathbf{X}_1\,],\quad X_2 = [\,\mathbf{1}\mid\mathbf{X}_1\mid\mathbf{X}_2\,],\quad \dots,\quad X_r = [\,\mathbf{1}\mid\mathbf{X}_1\mid\cdots\mid\mathbf{X}_r\,]$$
and
$$P_0 = X_0(X_0^TX_0)^{-1}X_0^T,\quad P_1 = X_1(X_1^TX_1)^{-1}X_1^T,\quad \dots,\quad P_r = X_r(X_r^TX_r)^{-1}X_r^T.$$
Then
$$\mathbf{Y}^T\mathbf{Y} = \mathbf{Y}^TP_0\mathbf{Y} + \mathbf{Y}^T(P_1-P_0)\mathbf{Y} + \mathbf{Y}^T(P_2-P_1)\mathbf{Y} + \cdots + \mathbf{Y}^T(P_r-P_{r-1})\mathbf{Y} + \mathbf{Y}^T(I-P_r)\mathbf{Y}$$
$$= R(\beta_0) + R(\beta_1|\beta_0) + R(\beta_2|\beta_0,\beta_1) + \cdots + R(\beta_r|\beta_0,\beta_1,\dots,\beta_{r-1}) + SSE.$$
Use Cochran's theorem to show that these sums of squares are distributed independently of each other and that each $\frac{1}{\sigma^2}R(\beta_i|\beta_0,\dots,\beta_{i-1})$ has a chi-squared distribution with one degree of freedom. Use Result 4.7 to show $\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-r-1)}$.
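A minimal R sketch of the sequential decomposition on a synthetic two-regressor data set (names x1 and x2 are illustrative):

# Sequential (Type I) sums of squares via nested projection matrices
set.seed(42)
x1 <- rnorm(10); x2 <- rnorm(10)
y  <- 1 + 2*x1 + 0.5*x2 + rnorm(10)
n  <- length(y)
proj <- function(X) X %*% solve(t(X) %*% X) %*% t(X)
P0 <- proj(matrix(1, n, 1))
P1 <- proj(cbind(1, x1))
P2 <- proj(cbind(1, x1, x2))
c(R1g0  = drop(t(y) %*% (P1 - P0) %*% y),   # R(beta1|beta0)
  R2g01 = drop(t(y) %*% (P2 - P1) %*% y))   # R(beta2|beta0,beta1)
anova(lm(y ~ x1 + x2))   # reports the same sequential sums of squares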
Then
$$F = \frac{R(\beta_j|\beta_0,\dots,\beta_{j-1})/1}{MSE} \sim F_{(1,\,n-r-1)}(\delta^2)$$
where
$$\delta^2 = \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(P_j-P_{j-1})X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(P_j-P_{j-1})^T(P_j-P_{j-1})X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\left[(P_j-P_{j-1})X\boldsymbol{\beta}\right]^T(P_j-P_{j-1})X\boldsymbol{\beta}.$$
Hence, this is a test of
$$H_0: (P_j-P_{j-1})X\boldsymbol{\beta} = \mathbf{0}
\quad\text{vs}\quad
H_a: (P_j-P_{j-1})X\boldsymbol{\beta} \neq \mathbf{0}.$$

Note that
$$(P_j-P_{j-1})X
= \left[\,(P_j-P_{j-1})\mathbf{1} \mid (P_j-P_{j-1})\mathbf{X}_1 \mid \cdots \mid (P_j-P_{j-1})\mathbf{X}_{j-1} \mid (P_j-P_{j-1})\mathbf{X}_j \mid \cdots \mid (P_j-P_{j-1})\mathbf{X}_r\,\right]
= \left[\,O_{n\times j} \mid (P_j-P_{j-1})\mathbf{X}_j \mid \cdots \mid (P_j-P_{j-1})\mathbf{X}_r\,\right]$$
(the first $j$ columns lie in the column space of $X_{j-1}$, so both projections reproduce them). Then
$$(P_j-P_{j-1})X\boldsymbol{\beta}
= \beta_j(P_j-P_{j-1})\mathbf{X}_j + \sum_{k=j+1}^r \beta_k(P_j-P_{j-1})\mathbf{X}_k$$
and the null hypothesis is
$$H_0: \mathbf{0} = \beta_j(P_j-P_{j-1})\mathbf{X}_j + \sum_{k=j+1}^r \beta_k(P_j-P_{j-1})\mathbf{X}_k.$$
Type II sums of squares in SAS (these are also Type III and Type IV sums of squares for regression problems). From the previous discussion:
$$R(\beta_j \mid \beta_0 \text{ and all other } \beta_k\text{'s}) = \mathbf{Y}^T(P_X-P_{-j})\mathbf{Y}$$
where
$$P_{-j} = X_{-j}(X_{-j}^TX_{-j})^{-}X_{-j}^T$$
and $X_{-j}$ is obtained by deleting the $(j+1)$-th column of $X$. Then
$$F = \frac{\mathbf{Y}^T(P_X-P_{-j})\mathbf{Y}/1}{MSE} \sim F_{(1,\,n-r-1)}(\delta^2)$$
where
$$\delta^2 = \frac{1}{\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_{-j})X\boldsymbol{\beta}
= \frac{1}{\sigma^2}\,\beta_j^2\,\mathbf{X}_j^T(P_X-P_{-j})\mathbf{X}_j.$$
This F-test provides a test of
$$H_0: \beta_j = 0 \quad\text{vs}\quad H_A: \beta_j \neq 0$$
if $(P_X-P_{-j})\mathbf{X}_j \neq \mathbf{0}$.
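In R, Type II sums of squares for a regression model can be read from drop1(), which removes one term at a time given all the others; a minimal sketch with synthetic data (the typeII.SS( ) function developed later in this section computes the same quantities from projection matrices):

# Type II SS: each term's reduction given all other terms in the model
set.seed(9)
x1 <- rnorm(10); x2 <- rnorm(10)
y  <- 1 + 2*x1 + 0.5*x2 + rnorm(10)
drop1(lm(y ~ x1 + x2), test = "F")   # "Sum of Sq" column = Type II SS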
Variable          Type I sums of squares                                            Type II sums of squares
X1                $R(\beta_1|\beta_0) = \mathbf{Y}^T(P_1-P_0)\mathbf{Y}$             $R(\beta_1|\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X-P_{-1})\mathbf{Y}$
X2                $R(\beta_2|\beta_0,\beta_1) = \mathbf{Y}^T(P_2-P_1)\mathbf{Y}$     $R(\beta_2|\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X-P_{-2})\mathbf{Y}$
...               ...                                                                ...
Xr                $R(\beta_r|\beta_0,\dots,\beta_{r-1}) = \mathbf{Y}^T(P_r-P_{r-1})\mathbf{Y}$   $R(\beta_r|\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X-P_{-r})\mathbf{Y}$
Residuals         $SSE = \mathbf{Y}^T(I-P_X)\mathbf{Y}$
Corrected total   $\mathbf{Y}^T(I-P_1)\mathbf{Y}$

When $X_1, X_2, \dots, X_r$ are all uncorrelated, then

(i) $R(\beta_j \mid \beta_0$ and any other subset of $\beta$'s$) = R(\beta_j|\beta_0)$, and there is only one ANOVA table.

(ii) $R(\beta_j|\beta_0) = \hat{\beta}_j^2\sum_{i=1}^n (X_{ji}-\bar{X}_{j\cdot})^2$

(iii) $F = \frac{R(\beta_j|\beta_0)}{MSE} \sim F_{(1,\,n-r-1)}(\delta^2)$ where
$$\delta^2 = \frac{1}{\sigma^2}\,\beta_j^2\sum_{i=1}^n (X_{ji}-\bar{X}_{j\cdot})^2,$$
and this F-statistic provides a test of $H_0: \beta_j = 0$ versus $H_A: \beta_j \neq 0$.
Testable hypotheses

For any testable hypothesis, reject $H_0: C\boldsymbol{\beta} = \mathbf{d}$ in favor of the general alternative $H_A: C\boldsymbol{\beta} \neq \mathbf{d}$ if
$$F = \frac{(C\mathbf{b}-\mathbf{d})^T\left[C(X^TX)^-C^T\right]^{-1}(C\mathbf{b}-\mathbf{d})/m}{\mathbf{Y}^T(I-P_X)\mathbf{Y}/(n-rank(X))} > F_{(m,\,n-rank(X)),\,\alpha}$$
where $m$ = number of rows in $C$ = $rank(C)$ and $\mathbf{b} = (X^TX)^-X^T\mathbf{Y}$.

Confidence interval for an estimable function $\mathbf{c}^T\boldsymbol{\beta}$:
$$\mathbf{c}^T\mathbf{b} \pm t_{(n-rank(X)),\,\alpha/2}\sqrt{MSE\;\mathbf{c}^T(X^TX)^-\mathbf{c}}.$$

Use $\mathbf{c}^T = (0\ 0\ \cdots\ 0\ 1\ 0\ \cdots\ 0)$, with the 1 in the $j$-th position, to construct a confidence interval for $\beta_{j-1}$.

Use $\mathbf{c}^T = (1, x_1, x_2, \dots, x_r)$ to construct a confidence interval for
$$E(Y \mid X_1 = x_1,\dots,X_r = x_r) = \beta_0 + \beta_1x_1 + \cdots + \beta_rx_r.$$
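A hedged R sketch of this F statistic for a single linear constraint (synthetic data; the constraint matrix C here picks out H0: beta1 = beta2, purely as an illustration):

# F test of H0: C beta = d with one constraint (m = 1)
set.seed(7)
x1 <- rnorm(12); x2 <- rnorm(12)
y  <- 1 + 2*x1 + 2*x2 + rnorm(12)
X  <- cbind(1, x1, x2)
n  <- nrow(X)
b  <- solve(t(X) %*% X, t(X) %*% y)
C  <- matrix(c(0, 1, -1), 1, 3)   # tests beta1 - beta2 = 0
d  <- 0
m  <- nrow(C)
SSE   <- sum((y - X %*% b)^2)
num   <- t(C %*% b - d) %*% solve(C %*% solve(t(X) %*% X) %*% t(C)) %*%
         (C %*% b - d) / m
Fstat <- drop(num) / (SSE / (n - ncol(X)))
Fstat
1 - pf(Fstat, m, n - ncol(X))     # p-value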
Prediction Intervals:

Predict a future observation at $X_1 = x_1, \dots, X_r = x_r$; i.e., predict
$$Y = \beta_0 + \beta_1x_1 + \cdots + \beta_rx_r + \epsilon.$$
Estimate the conditional mean $\beta_0 + \beta_1x_1 + \cdots + \beta_rx_r$ as $b_0 + b_1x_1 + \cdots + b_rx_r$, and estimate $\epsilon$ with its mean $E(\epsilon) = 0$. A $(1-\alpha)\,100\%$ prediction interval is
$$(\mathbf{c}^T\mathbf{b} + 0) \pm t_{(n-rank(X)),\,\alpha/2}\sqrt{MSE\left[1 + \mathbf{c}^T(X^TX)^-\mathbf{c}\right]}$$
where $\mathbf{c}^T = (1\ x_1\ \cdots\ x_r)$.
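A minimal R sketch of this interval on synthetic data, checked against predict( ) (the prediction point x = 3.5 is arbitrary):

# 95% prediction interval at a new x, from the formula above
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.8, 8.2, 9.9)
fit <- lm(y ~ x)
n <- length(y)
X <- cbind(1, x)
cvec <- c(1, 3.5)                          # c' = (1, x_new)
MSE  <- sum(resid(fit)^2) / (n - 2)
se   <- sqrt(MSE * (1 + t(cvec) %*% solve(t(X) %*% X) %*% cvec))
est  <- sum(cvec * coef(fit))
est + c(-1, 1) * qt(0.975, n - 2) * drop(se)
predict(fit, data.frame(x = 3.5), interval = "prediction")  # matches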
/* A SAS program to perform a regression
   analysis of the effects of the
   composition of Portland cement on the
   amount of heat given off as the cement
   hardens. Posted as cement.sas */

data set1;
  input run x1 x2 x3 x4 y;
  /* label y  = evolved heat (calories)
           x1 = tricalcium aluminate
           x2 = tricalcium silicate
           x3 = tetracalcium aluminate ferrate
           x4 = dicalcium silicate; */
  cards;
 1  7 26  6 60  78.5
 2  1 29 15 52  74.3
 3 11 56  8 20 104.3
 4 11 31  8 47  87.6
 5  7 52  6 33  95.9
 6 11 55  9 22 109.2
 7  3 71 17  6 102.7
 8  1 31 22 44  72.5
 9  2 54 18 22  93.1
10 21 47  4 26 115.9
11  1 40 23 34  83.8
12 11 66  9 12 113.2
13 10 68  8 12 109.4
run;

/* Regress y on all four explanatory
   variables and check residual plots
   and collinearity diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 x3 x4 / p r ss1 ss2
                          covb collin;
  output out=set2 residual=r
                  predicted=yhat;
run;

proc print data=set1 uniform split='*';
  var y x1 x2 x3 x4;
  label y  = 'Evolved*heat*(calories)'
        x1 = 'Percent*tricalcium*aluminate'
        x2 = 'Percent*tricalcium*silicate'
        x3 = 'Percent*tetracalcium*aluminate*ferrate'
        x4 = 'Percent*dicalcium*silicate';
run;

/* Examine smaller regression models
   corresponding to subsets of the
   explanatory variables */

proc reg data=set1;
  model y = x1 x2 x3 x4 /
        selection=rsquare cp aic
        sbc mse stop=4 best=6;
run;
/* Regress y on two of the explanatory
   variables and check residual plots
   and collinearity diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 / p r ss1 ss2
                    covb collin;
  output out=set2 residual=r
                  predicted=yhat;
run;

/* Use the GLM procedure to identify
   all estimable functions */

proc glm data=set1;
  model y = x1 x2 x3 x4 / ss1 ss2 e1 e2 e p;
run;

The proc print step produces:

       Evolved      Percent      Percent     Percent tetracalcium   Percent
       heat         tricalcium   tricalcium  aluminate              dicalcium
Obs    (calories)   aluminate    silicate    ferrate                silicate
 1        78.5          7           26           6                     60
 2        74.3          1           29          15                     52
 3       104.3         11           56           8                     20
 4        87.6         11           31           8                     47
 5        95.9          7           52           6                     33
 6       109.2         11           55           9                     22
 7       102.7          3           71          17                      6
 8        72.5          1           31          22                     44
 9        93.1          2           54          18                     22
10       115.9         21           47           4                     26
11        83.8          1           40          23                     34
12       113.2         11           66           9                     12
13       109.4         10           68           8                     12

Correlation

Variable        x1        x2        x3        x4         y
x1          1.0000    0.2286   -0.8241   -0.2454    0.7309
x2          0.2286    1.0000   -0.1392   -0.9730    0.8162
x3         -0.8241   -0.1392    1.0000    0.0295   -0.5348
x4         -0.2454   -0.9730    0.0295    1.0000   -0.8212
y           0.7309    0.8162   -0.5348   -0.8212    1.0000
The REG Procedure
Model: MODEL1
Dependent Variable: y

                  Analysis of Variance

Source             DF    Sum of Squares    Mean Square
Model               4        2664.52051      666.13013
Error               8          47.67641        5.95955
Corrected Total    12        2712.19692

Root MSE          2.44122      R-Square    0.9824
Dependent Mean   95.41538      Adj R-Sq    0.9736
Coeff Var         2.55852

                  Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1              63.16602          69.93378       0.90      0.3928
x1           1               1.54305           0.74331       2.08      0.0716
x2           1               0.50200           0.72237       0.69      0.5068
x3           1               0.09419           0.75323       0.13      0.9036
x4           1              -0.15152           0.70766      -0.21      0.8358

Variable    DF      Type I SS    Type II SS
Intercept    1         118353       4.86191
x1           1     1448.75413      25.68225
x2           1     1205.70283       2.87801
x3           1        9.79033       0.09319
x4           1        0.27323       0.27323
                  Collinearity Diagnostics

Number    Eigenvalue    Condition Index
1            4.11970            1.00000
2            0.55389            2.72721
3            0.28870            3.77753
4            0.03764           10.46207
5         0.00006614          249.57825

                  Collinearity Diagnostics
          ---------------Proportion of Variation---------------
Number    Intercept          x1         x2         x3         x4
1          0.000005     0.00037    0.00002    0.00021    0.00036
2          8.812E-8     0.01004    0.00001    0.00266    0.00010
3          3.060E-7    0.000581    0.00032    0.00159    0.00168
4          0.000127     0.05745    0.00278    0.04569    0.00088
5           0.99987     0.93157    0.99687    0.94985    0.99730

       Dep Var     Predicted     Std Error       Student
Obs    y           Value         Mean Predict    Residual    Cook's D
 1      78.5000       78.4929          1.8109     0.00432       0.000
 2      74.3000       72.8005          1.4092       0.752       0.057
 3     104.3000      105.9744          1.8543      -1.054       0.303
 4      87.6000       89.3333          1.3265      -0.846       0.060
 5      95.9000       95.6360          1.4598       0.135       0.002
 6     109.2000      105.2635          0.8602       1.723       0.084
 7     102.7000      104.1289          1.4791      -0.736       0.063
 8      72.5000       75.6760          1.5604      -1.692       0.395
 9      93.1000       91.7218          1.3244       0.672       0.038
10     115.9000      115.6010          2.0431       0.224       0.023
11      83.8000       81.8034          1.5924       1.079       0.172
12     113.2000      112.3007          1.2519       0.429       0.013
13     109.4000      111.6675          1.3454      -1.113       0.108

[A character plot of the student residuals on a -2 to 2 scale, printed between the Student Residual and Cook's D columns, is omitted here.]
The REG Procedure
Model: MODEL1
R-Square Selection Method

Regression Models for Dependent Variable: y

Number in
Model     R-Square        AIC        SBC    Variables in Model
1           0.6744    58.8383   59.96815   x4
1           0.6661    59.1672   60.29712   x2
1           0.5342    63.4964   64.62630   x1
1           0.2860    69.0481   70.17804   x3
------------------------------------------------------
2           0.9787    25.3830   27.07785   x1 x2
2           0.9726    28.6828   30.37766   x1 x4
2           0.9353    39.8308   41.52565   x3 x4
2           0.8470    51.0247   52.71951   x2 x3
2           0.6799    60.6172   62.31201   x2 x4
2           0.5484    65.0933   66.78816   x1 x3
------------------------------------------------------
3           0.9824    24.9187   27.17852   x1 x2 x4
3           0.9823    24.9676   27.22742   x1 x2 x3
3           0.9814    25.6553   27.91511   x1 x3 x4
3           0.9730    30.4953   32.75514   x2 x3 x4
------------------------------------------------------
4           0.9824    26.8933   29.71808   x1 x2 x3 x4
This output was produced by the e option in the model statement of the GLM procedure. It indicates that all five regression parameters are estimable.

The GLM Procedure
General Form of Estimable Functions

Effect       Coefficients
Intercept    L1
x1           L2
x2           L3
x3           L4
x4           L5

This output was produced by the e1 option in the model statement of the GLM procedure. It describes the null hypotheses that are tested with the sequential Type I sums of squares.

Type I Estimable Functions

             ----------------Coefficients----------------
Effect       x1           x2           x3           x4
Intercept    0            0            0            0
x1           L2           0            0            0
x2           0.6047*L2    L3           0            0
x3           -0.8974*L2   0.0213*L3    L4           0
x4           -0.6984*L2   -1.0406*L3   -1.0281*L4   L5

Type II Estimable Functions

             ----Coefficients----
Effect       x1    x2    x3    x4
Intercept    0     0     0     0
x1           L2    0     0     0
x2           0     L3    0     0
x3           0     0     L4    0
x4           0     0     0     L5
> # The commands are posted as: cement.spl
> #
> # The data file is stored under the name
> # cement.dat. It has variable names on the
> # first line. We will enter the data into
> # a data frame.

> cement <- read.table("cement.txt", header=T)
> cement
   run X1 X2 X3 X4     Y
1    1  7 26  6 60  78.5
2    2  1 29 15 52  74.3
3    3 11 56  8 20 104.3
4    4 11 31  8 47  87.6
5    5  7 52  6 33  95.9
6    6 11 55  9 22 109.2
7    7  3 71 17  6 102.7
8    8  1 31 22 44  72.5
9    9  2 54 18 22  93.1
10  10 21 47  4 26 115.9
11  11  1 40 23 34  83.8
12  12 11 66  9 12 113.2
13  13 10 68  8 12 109.4

> # Compute correlations and round the results
> # to four significant digits
> round(cor(cement[-1]),4)
        X1      X2      X3      X4       Y
X1  1.0000  0.2286 -0.8241 -0.2454  0.7309
X2  0.2286  1.0000 -0.1392 -0.9730  0.8162
X3 -0.8241 -0.1392  1.0000  0.0295 -0.5348
X4 -0.2454 -0.9730  0.0295  1.0000 -0.8212
Y   0.7309  0.8162 -0.5348 -0.8212  1.0000
> # Create a scatterplot matrix with smooth
> # curves. Unix users should first use
> # motif( ) to open a graphics window
> points.lines <- function(x, y)
+ {
+   points(x, y)
+   lines(loess.smooth(x, y, 0.90))
+ }
> par(din=c(7,7),pch=18,mkh=.15,cex=1.2,lwd=3)
> pairs(cement[ ,-1], panel=points.lines)

[Scatterplot matrix of X1, X2, X3, X4, and Y with loess smooth curves]

> # Fit a linear regression model (Venables
> # and Ripley, Chapter 6)

> cement.out <- lm(Y~X1+X2+X3+X4, cement)
> summary(cement.out)

Call: lm(formula = Y ~ X1+X2+X3+X4, data=cement)
Residuals:
    Min     1Q Median     3Q   Max
 -3.176 -1.674  0.264  1.378 3.936

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 63.1660    69.9338  0.9032   0.3928
         X1  1.5431     0.7433  2.0759   0.0716
         X2  0.5020     0.7224  0.6949   0.5068
         X3  0.0942     0.7532  0.1250   0.9036
         X4 -0.1515     0.7077 -0.2141   0.8358

Residual standard error: 2.441 on 8 d.f.
Multiple R-Squared: 0.9824
F-statistic: 111.8 on 4 and 8 degrees of freedom,
the p-value is 4.707e-007
Correlation of Coefficients:
   (Intercept)     X1     X2     X3
X1     -0.9678
X2     -0.9978 0.9510
X3     -0.9769 0.9861 0.9624
X4     -0.9983 0.9568 0.9979 0.9659

> anova(cement.out)
Analysis of Variance Table
Response: Y
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value  Pr(F)
X1         1  1448.754 1448.754 243.0978 0.0000
X2         1  1205.703 1205.703 202.3144 0.0000
X3         1     9.790    9.790   1.6428 0.2358
X4         1     0.273    0.273   0.0458 0.8358
Residuals  8    47.676    5.960

> # Create a function to evaluate an orthogonal
> # projection matrix. Then create a function
> # to compute type II sums of squares.
> # This uses the ginv( ) function in the MASS
> # library, so you must attach the MASS library
> library(MASS)

> #=======================================
> # project( )
> #---------------------------------------
> # calculate orthogonal projection matrix
> #=======================================
> project <- function(X)
+ { X%*%ginv(crossprod(X))%*%t(X) }
> #=======================================
> #========================================
> # typeII.SS( )
> #----------------------------------------
> # calculate Type II sum of squares
> #
> # input  lmout = object made by the
> #                lm( ) function
> #        y     = dependent variable
> #========================================
> typeII.SS <- function(lmout,y)
+ {
+   # generate the model matrix
+   model <- model.matrix(lmout)
+   # create list of parameter names
+   par.name <- dimnames(model)[[2]]
+   # compute number of parameters
+   n.par <- dim(model)[2]
+   # Compute residual mean square
+   SS.res <- deviance(lmout)
+   df2    <- lmout$df.resid
+   MS.res <- SS.res/df2
+   result <- NULL   # store results
+   # Compute Type II SS
+   for (i in 1:n.par) {
+     A      <- project(model)-project(model[,-i])
+     SS.II  <- t(y) %*% A %*% y
+     df1    <- qr(project(model))$rank -
+               qr(project(model[ ,-i]))$rank
+     MS.II  <- SS.II/df1
+     F.stat <- MS.II/MS.res
+     p.val  <- 1-pf(F.stat,df1,df2)
+     temp   <- cbind(df1,SS.II,MS.II,F.stat,p.val)
+     result <- rbind(result,temp)
+   }
+   result<-rbind(result,c(df2,SS.res,MS.res,NA,NA))
+   dimnames(result)<-list(c(par.name,"Residual"),
+     c("Df","Sum of Sq","Mean Sq","F Value","Pr(F)"))
+   cat("Analysis of Variance (TypeII Sum of Squares)\n")
+   round(result,6)
+ }
> #==========================================
> typeII.SS(cement.out, cement$Y)
Analysis of Variance (TypeII Sum of Squares)
         Df  Sum of Sq   Mean Sq  F Value    Pr(F)
(Inter.)  1   4.861907  4.861907 0.815818 0.392790
X1        1  25.682254 25.682254 4.309427 0.071568
X2        1   2.878010  2.878010 0.482924 0.506779
X3        1   0.093191  0.093191 0.015637 0.903570
X4        1   0.273229  0.273229 0.045847 0.835810
Residual  8  47.676412  5.959551       NA       NA
> # Venables and Ripley have supplied functions
> # studres( ) and stdres( ) to compute
> # studentized and standardized residuals.
> # You must attach the MASS library before
> # using these functions.

> cement.res <- cbind(cement$Y,cement.out$fitted,
+                     cement.out$resid,
+                     studres(cement.out),
+                     stdres(cement.out))
> dimnames(cement.res) <- list(cement$run,
+     c("Response","Predicted","Residual",
+       "Stud. Res.","Std. Res."))
> round(cement.res,4)
   Response Predicted Residual Stud. Res. Std. Res.
1      78.5   78.4929   0.0071     0.0040    0.0043
2      74.3   72.8005   1.4995     0.7299    0.7522
3     104.3  105.9744  -1.6744    -1.0630   -1.0545
4      87.6   89.3333  -1.7333    -0.8291   -0.8458
5      95.9   95.6360   0.2640     0.1264    0.1349
6     109.2  105.2635   3.9365     2.0324    1.7230
7     102.7  104.1289  -1.4289    -0.7128   -0.7358
8      72.5   75.6760  -3.1760    -1.9745   -1.6917
9      93.1   91.7218   1.3782     0.6472    0.6721
10    115.9  115.6010   0.2990     0.2100    0.2237
11     83.8   81.8034   1.9966     1.0919    1.0790
12    113.2  112.3007   0.8993     0.4061    0.4291
13    109.4  111.6675  -2.2675    -1.1326   -1.1131
> # Produce plots for model diagnostics including
> # Cook's D. Unix users should first use motif()
> # to open a graphics window
> par(mfrow=c(3,2))
> plot(cement.out)

> # Search for a simpler model
> cement.stp <- step(cement.out,
+     scope=list(upper = ~X1 + X2 + X3 + X4,
+                lower = ~ 1), trace=F)
[Diagnostic plots from plot(cement.out): residuals vs. fitted values, sqrt(abs(Residuals)) vs. fitted values, response vs. fitted values, a normal Q-Q plot of the residuals, an r-f spread plot, and Cook's distance by observation index; a few observations (e.g., 6, 8, and 13) are labeled as extreme.]

> cement.stp$anova
Stepwise Model Path
Analysis of Deviance Table

Initial Model:
Y ~ X1 + X2 + X3 + X4

Final Model:
Y ~ X1 + X2

    Step Df  Deviance Resid. Df Resid. Dev      AIC
1                             8   47.67641 107.2719
2   - X3  1  0.093191         9   47.76960  95.4460
3   - X4  1  9.970363        10   57.73997  93.4973