7. Regression Analysis

7.1 Simple linear regression for normal-theory Gauss-Markov models

Model:

    Y_i = β_0 + β_1 X_i + ε_i,   where ε_i ~ NID(0, σ^2), for i = 1, ..., n.

The OLS estimator (b.l.u.e.) for β is

    b = (X^T X)^{-1} X^T Y      ← when does this exist?
Matrix formulation:

    [ Y_1 ]   [ 1  X_1 ]             [ ε_1 ]
    [ Y_2 ] = [ 1  X_2 ] [ β_0 ]  +  [ ε_2 ]
    [  :  ]   [ :   :  ] [ β_1 ]     [  :  ]
    [ Y_n ]   [ 1  X_n ]             [ ε_n ]

or

    Y = X β + ε.
Here

    X^T X = [ n               Σ_{i=1}^n X_i   ]
            [ Σ_{i=1}^n X_i   Σ_{i=1}^n X_i^2 ]

and

    X^T Y = [ Σ_{i=1}^n Y_i     ]
            [ Σ_{i=1}^n X_i Y_i ]
Then

    b = (X^T X)^{-1} X^T Y

      = ( 1 / [ n Σ_{i=1}^n X_i^2 - ( Σ_{i=1}^n X_i )^2 ] )
        [  Σ_{i=1}^n X_i^2   -Σ_{i=1}^n X_i ] [ Σ_{i=1}^n Y_i     ]
        [ -Σ_{i=1}^n X_i      n             ] [ Σ_{i=1}^n X_i Y_i ]

      = ( 1 / [ n Σ_{i=1}^n (X_i - X̄)^2 ] )
        [ Σ_{i=1}^n X_i^2 Σ_{i=1}^n Y_i - n X̄ Σ_{i=1}^n X_i Y_i ]
        [ -n X̄ Σ_{i=1}^n Y_i + n Σ_{i=1}^n X_i Y_i              ]
and

    b = [ b_0 ] = [ Ȳ - b_1 X̄                                        ]
        [ b_1 ]   [ Σ_{i=1}^n (X_i - X̄) Y_i / Σ_{i=1}^n (X_i - X̄)^2 ]
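As a numerical check, the closed-form expressions for b_0 and b_1 agree with solving the normal equations directly. This is a numpy sketch with made-up data, not part of the notes:

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)

# Design matrix with an intercept column
Xmat = np.column_stack([np.ones(n), X])

# b = (X^T X)^{-1} X^T Y, via the normal equations
b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)

# Closed-form expressions derived above
xbar, ybar = X.mean(), Y.mean()
b1 = np.sum((X - xbar) * Y) / np.sum((X - xbar) ** 2)
b0 = ybar - b1 * xbar
```

Both routes give the same (b_0, b_1), since the second line of the derivation is just the explicit inverse of the 2 × 2 matrix X^T X.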
Covariance matrix:

    Var(b) = Var( (X^T X)^{-1} X^T Y )
           = (X^T X)^{-1} X^T (σ^2 I) X (X^T X)^{-1}
           = σ^2 (X^T X)^{-1}

           = σ^2 [ 1/n + X̄^2 / Σ_{i=1}^n (X_i - X̄)^2    -X̄ / Σ_{i=1}^n (X_i - X̄)^2 ]
                 [ -X̄ / Σ_{i=1}^n (X_i - X̄)^2            1 / Σ_{i=1}^n (X_i - X̄)^2 ]
where

    MSE = SSE/(n - 2) = ( 1/(n - 2) ) Y^T (I - P_X) Y.

Estimate the covariance matrix for b as

    S_b = MSE (X^T X)^{-1}
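The estimated covariance matrix can be checked numerically; in particular, its (2,2) entry must equal MSE / Σ(X_i - X̄)^2, matching the Var(b_1) entry above. A numpy sketch with made-up data:

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)
Xmat = np.column_stack([np.ones(n), X])

XtX_inv = np.linalg.inv(Xmat.T @ Xmat)
b = XtX_inv @ Xmat.T @ Y
e = Y - Xmat @ b                 # residuals

MSE = (e @ e) / (n - 2)          # SSE/(n-2)
S_b = MSE * XtX_inv              # estimated covariance matrix of b

# Slope variance estimate in closed form
Sxx = np.sum((X - X.mean()) ** 2)
```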
Analysis of Variance:

Notation:

    Σ_{i=1}^n Y_i^2 = Y^T Y
                    = Y^T (I - P_X + P_X - P_1 + P_1) Y
                    = Y^T (I - P_X) Y + Y^T (P_X - P_1) Y + Y^T P_1 Y
                           ↑                   ↑               ↑
                          SSE          "corrected model"   correction for
                                       sum of squares;     the "mean";
                                       call this           call this
                                       R(β_1 | β_0)        R(β_0)

(i) By Cochran's Theorem, these three sums of squares are multiples of independent chi-squared random variables.

(ii) By Result 4.7, (1/σ^2) SSE ~ χ^2(n - 2) if the model is correctly specified.
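The three-way decomposition above can be verified directly by forming the projection matrices. A numpy sketch with made-up data (not from the notes):

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)
Xmat = np.column_stack([np.ones(n), X])
ones = np.ones((n, 1))

def proj(A):
    # Orthogonal projection onto the column space of A
    return A @ np.linalg.pinv(A.T @ A) @ A.T

PX, P1, I = proj(Xmat), proj(ones), np.eye(n)

SSE  = Y @ (I - PX) @ Y     # residual sum of squares
R1g0 = Y @ (PX - P1) @ Y    # R(beta_1 | beta_0)
R0   = Y @ P1 @ Y           # R(beta_0) = n * Ybar^2
```

The three pieces sum to Y^T Y, and R(β_0) reproduces n Ȳ^2.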
Reduction in residual sum of squares:

    R(β_{k+1}, ..., β_{k+q} | β_0, β_1, ..., β_k)
        = Y^T (I - P_{X_1}) Y  -  Y^T (I - P_X) Y
                 ↑                       ↑
          sum of squared          sum of squared
          residuals for the       residuals for the
          smaller model           larger model

Here

    X = [ X_1 | X_2 ]
           ↑      ↑
        columns            columns
        corresponding to   corresponding to
        β_0, ..., β_k      β_{k+1}, ..., β_{k+q}
Correction for the overall mean:

    R(β_0) = Y^T P_1 Y
           = Y^T (I - I + P_1) Y
           = Y^T I Y - Y^T (I - P_1) Y
           = Σ_{i=1}^n (Y_i - 0)^2 - Σ_{i=1}^n (Y_i - Ȳ)^2
                                             ↑
                                      sum of squared residuals
                                      from fitting the model
                                      Y_i = μ + ε_i.

The OLS estimator for μ = E(Y_i) is

    μ̂ = (1^T 1)^{-1} 1^T Y = (n)^{-1} Σ_{i=1}^n Y_i = Ȳ.

An alternative formula is

    R(β_0) = Y^T P_1 Y = Y^T 1 (1^T 1)^{-1} 1^T Y
           = (n)^{-1} ( Σ_{i=1}^n Y_i )^2 = n Ȳ^2,

with df = rank(P_1) = rank(1) = 1.

Reduction in the residual sum of squares for regression on X:

    R(β_1 | β_0) = Y^T (P_X - P_1) Y
                 = Y^T (P_X - I + I - P_1) Y
                 = Y^T ( (I - P_1) - (I - P_X) ) Y
                 = Y^T (I - P_1) Y - Y^T (I - P_X) Y
                          ↑                 ↑
                   sum of squared     sum of squared
                   residuals for      residuals for
                   fitting the model  fitting the model
                   Y_i = μ + ε_i      Y_i = β_0 + β_1 X_i + ε_i

ANOVA table:

    Source of variation       d.f.    Sum of squares
    -----------------------   -----   --------------------------------
    Regression on X           1       R(β_1 | β_0) = Y^T (P_X - P_1) Y
    Residuals                 n - 2   Y^T (I - P_X) Y
    Corrected total           n - 1   Y^T (I - P_1) Y
    Correction for the mean   1       R(β_0) = Y^T P_1 Y = n Ȳ^2
F-tests

From Result 4.7 we have

    (1/σ^2) R(β_0) = (1/σ^2) Y^T P_1 Y ~ χ^2_1(δ^2),

where

    δ^2 = (1/σ^2) β^T X^T P_1 X β
        = (1/σ^2) β^T X^T P_1^T P_1 X β
        = (1/σ^2) (P_1 X β)^T (P_1 X β)
        = n (β_0 + β_1 X̄)^2 / σ^2.

Also use Result 4.7 to show that

    (1/σ^2) SSE = (1/σ^2) Y^T (I - P_X) Y ~ χ^2(n - 2).

Use Result 4.8 to show that

    SSE = Y^T (I - P_X) Y   is distributed independently of   R(β_0) = Y^T P_1 Y.

This follows from

    (I - P_X) P_1 = 0.

Consequently,

    F = R(β_0) / MSE ~ F_(1, n-2)(δ^2),

and this becomes a central F-distribution when the null hypothesis is true.

Hypothesis test: Reject H_0: β_0 + β_1 X̄ = 0 if

    F = R(β_0) / MSE > F_(1, n-2), α.

Test the null hypothesis H_0: β_1 = 0. Use

    F = ( R(β_1 | β_0) / 1 ) / MSE
      = ( [Y^T (P_X - P_1) Y] / [1 σ^2] ) / ( [Y^T (I - P_X) Y] / [(n - 2) σ^2] )
      ~ F_(1, n-2)(δ^2),

where

    δ^2 = (1/σ^2) β^T X^T (P_X - P_1) X β
        = (1/σ^2) β^T X^T (P_X - P_1)^T (P_X - P_1) X β.

The null hypothesis is

    H_0: (P_X - P_1) X β = 0.
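For simple linear regression, the F statistic for H_0: β_1 = 0 equals the square of the usual t statistic for the slope, which gives a quick numerical check. A numpy sketch with made-up data (not from the notes):

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)
Xmat = np.column_stack([np.ones(n), X])
ones = np.ones((n, 1))

def proj(A):
    return A @ np.linalg.pinv(A.T @ A) @ A.T

PX, P1, I = proj(Xmat), proj(ones), np.eye(n)

MSE = (Y @ (I - PX) @ Y) / (n - 2)
F = (Y @ (PX - P1) @ Y) / MSE       # [R(beta_1|beta_0)/1] / MSE

# t statistic for the slope; F should equal t^2
b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)
se_b1 = np.sqrt(MSE / np.sum((X - X.mean()) ** 2))
t = b[1] / se_b1
```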
Here

    (P_X - P_1) X β = (P_X - P_1) [ 1 | X ] β
                    = [ (P_X - P_1) 1 | (P_X - P_1) X ] β
                    = [ P_X 1 - P_1 1 | P_X X - P_1 X ] β
                    = [ 1 - 1 | X - X̄ 1 ] β

                      [ 0 | X_1 - X̄ ]
                    = [ 0 | X_2 - X̄ ] [ β_0 ]
                      [ : |    :    ] [ β_1 ]
                      [ 0 | X_n - X̄ ]

If any X_i ≠ X_j, then we cannot have both X_j - X̄ = 0 and X_i - X̄ = 0. Consequently, if any X_i ≠ X_j, then

    (P_X - P_1) X β = 0   if and only if   β_1 = 0.

Hence, the null hypothesis is H_0: β_1 = 0.

Note that

    δ^2 = (1/σ^2) β_1^2 Σ_{i=1}^n (X_i - X̄)^2.

Maximize the power of the F-test for H_0: β_1 = 0 vs. H_A: β_1 ≠ 0 by maximizing

    Σ_{i=1}^n (X_i - X̄)^2.

Reparameterize the model:

    Y_i = γ + β_1 (X_i - X̄) + ε_i,

with ε_i ~ NID(0, σ^2), i = 1, ..., n.

Interpretation of parameters:

    γ = E(Y) when X = X̄;
    β_1 is the change in E(Y) when X is increased by one unit.
Matrix formulation:

    [ Y_1 ]   [ 1  X_1 - X̄ ]            [ ε_1 ]
    [  :  ] = [ :     :    ] [  γ  ]  + [  :  ]
    [ Y_n ]   [ 1  X_n - X̄ ] [ β_1 ]    [ ε_n ]

or

    Y = W θ + ε,   where θ = (γ, β_1)^T.

Clearly,

    W = X [ 1  -X̄ ] = X F      and      X = W [ 1  X̄ ] = W G.
          [ 0   1 ]                           [ 0  1 ]

For this reparameterization, the columns of W are orthogonal and

    W^T W = [ n   0                      ]
            [ 0   Σ_{i=1}^n (X_i - X̄)^2 ]

    (W^T W)^{-1} = [ 1/n   0                          ]
                   [ 0     1 / Σ_{i=1}^n (X_i - X̄)^2 ]

    W^T Y = [ Σ_{i=1}^n Y_i            ]
            [ Σ_{i=1}^n (X_i - X̄) Y_i ]

Then,

    θ̂ = [ γ̂   ] = (W^T W)^{-1} W^T Y = [ Ȳ                                                ]
        [ β̂_1 ]                        [ Σ_{i=1}^n (X_i - X̄) Y_i / Σ_{i=1}^n (X_i - X̄)^2 ]

and

    Var(θ̂) = σ^2 (W^T W)^{-1} = [ σ^2/n   0                            ]
                                [ 0       σ^2 / Σ_{i=1}^n (X_i - X̄)^2 ]

Hence, Ȳ and β̂_1 = Σ (X_i - X̄) Y_i / Σ (X_i - X̄)^2 are uncorrelated (independent for the normal-theory Gauss-Markov model).

Analysis of Variance

The reparameterization does not change the ANOVA table. Since

    P_X = X (X^T X)^{-1} X^T = W (W^T W)^{-1} W^T = P_W,

we have

    R(β_0) + R(β_1 | β_0) + SSE
        = Y^T P_1 Y + Y^T (P_X - P_1) Y + Y^T (I - P_X) Y
        = Y^T P_1 Y + Y^T (P_W - P_1) Y + Y^T (I - P_W) Y
        = R(γ) + R(β_1 | γ) + SSE.
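Because X and W span the same column space, the two projection matrices coincide, and the orthogonality of the centered design can be seen in W^T W. A numpy sketch with made-up data (not from the notes):

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)
Xmat = np.column_stack([np.ones(n), X])
W = np.column_stack([np.ones(n), X - X.mean()])  # centered design

def proj(A):
    return A @ np.linalg.pinv(A.T @ A) @ A.T

# Same column space, hence the same projection matrix (and the same ANOVA)
same_proj = np.allclose(proj(Xmat), proj(W))

# W^T W is diagonal: the columns of W are orthogonal
WtW = W.T @ W
```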
7.2 Multiple regression analysis for the normal-theory Gauss-Markov model

    Y_i = β_0 + β_1 X_{1i} + ... + β_r X_{ri} + ε_i,

where ε_i ~ NID(0, σ^2) for i = 1, ..., n.

Matrix formulation:

    Y = X β + ε,   where ε ~ N(0, σ^2 I),

with

        [ 1  X_11  X_21  ...  X_r1 ]             [ β_0 ]
    X = [ 1  X_12  X_22  ...  X_r2 ]   and   β = [ β_1 ]
        [ :   :     :          :   ]             [  :  ]
        [ 1  X_1n  X_2n  ...  X_rn ]             [ β_r ]
          ↑   ↑     ↑           ↑
          1  X_1   X_2         X_r

Suppose rank(X) = r + 1. Then

(i) the OLS estimator (b.l.u.e.) for β is

        b = (X^T X)^{-1} X^T Y;

(ii) Var(b) = σ^2 (X^T X)^{-1};

(iii) Ŷ = X b = X (X^T X)^{-1} X^T Y = P_X Y;

(iv) e = Y - Ŷ = (I - P_X) Y;

(v) by Result 4.7,

        (1/σ^2) SSE = (1/σ^2) e^T e = (1/σ^2) Y^T (I - P_X) Y ~ χ^2(n - r - 1);

(vi) MSE = SSE/(n - r - 1) is an unbiased estimator of σ^2.
ANOVA

    Source of variation        d.f.        Sum of squares
    ------------------------   ---------   ------------------------------------------
    Model (regression          r           R(β_1, ..., β_r | β_0) = Y^T (P_X - P_1) Y
      on X_1, ..., X_r)
    Error (or residuals)       n - r - 1   Y^T (I - P_X) Y
    Corrected total            n - 1       Y^T (I - P_1) Y
    Correction for the mean    1           R(β_0) = Y^T P_1 Y = n Ȳ^2

The reduction in the residual sum of squares obtained by regression on X_1, X_2, ..., X_r is denoted by

    R(β_1, β_2, ..., β_r | β_0) = Y^T (I - P_1) Y - Y^T (I - P_X) Y
                                = Y^T (P_X - P_1) Y.
Use Cochran's theorem or Results 4.7 and 4.8 to show that SSE is distributed independently of

    R(β_1, β_2, ..., β_r | β_0) = SS_model,

that

    (1/σ^2) SSE ~ χ^2(n - r - 1),

and that

    (1/σ^2) R(β_1, ..., β_r | β_0) ~ χ^2_(r)(δ^2).

Then

    F = ( R(β_1, ..., β_r | β_0) / r ) / MSE ~ F_(r, n-r-1)(δ^2),

where

    δ^2 = (1/σ^2) β^T X^T (P_X - P_1) X β
        = (1/σ^2) β^T X^T (P_X - I + I - P_1) X β
        = (1/σ^2) β^T X^T (I - P_1) X β - (1/σ^2) β^T X^T (I - P_X) X β
                                                    ↑
                                            this is a matrix of zeros
        = (1/σ^2) β^T X^T (I - P_1) X β
        = (1/σ^2) β^T X^T (I - P_1)(I - P_1) X β
        = (1/σ^2) [ (I - P_1) X β ]^T (I - P_1) X β.

Note that (I - P_1) X is

    [ (I - P_1) 1 | (I - P_1) X_1 | ... | (I - P_1) X_r ]
        = [ 0 | X_1 - X̄_1 1 | ... | X_r - X̄_r 1 ],

so

    (I - P_1) X β = Σ_{j=1}^r β_j (X_j - X̄_j 1)

and

    δ^2 = (1/σ^2) [ Σ_{j=1}^r β_j^2 (X_j - X̄_j 1)^T (X_j - X̄_j 1)
                    + Σ_{j≠k} β_j β_k (X_j - X̄_j 1)^T (X_k - X̄_k 1) ].

With

    x_i = [ X_1i ]              [ X̄_1 ]
          [  :   ]   and   x̄ = [  :  ]
          [ X_ri ]              [ X̄_r ]

the noncentrality parameter can be written as

    δ^2 = (1/σ^2) β_*^T [ Σ_{i=1}^n (x_i - x̄)(x_i - x̄)^T ] β_*,

where β_* = (β_1, ..., β_r)^T. If Σ_{i=1}^n (x_i - x̄)(x_i - x̄)^T is positive definite, then the null hypothesis corresponding to δ^2 = 0 is

    H_0: β_* = 0   (or β_1 = β_2 = ... = β_r = 0).

Reject H_0 if

    F = ( Y^T (P_X - P_1) Y / r ) / ( Y^T (I - P_X) Y / (n - r - 1) ) > F_(r, n-r-1), α.
Sequential sums of squares (Type I sums of squares in PROC GLM or PROC REG in SAS).

Define

    X_0 = 1                        P_0 = X_0 (X_0^T X_0)^{-1} X_0^T
    X_1 = [1 | X_1]                P_1 = X_1 (X_1^T X_1)^{-1} X_1^T
    X_2 = [1 | X_1 | X_2]          P_2 = X_2 (X_2^T X_2)^{-1} X_2^T
     :                              :
    X_r = [1 | X_1 | ... | X_r]    P_r = X_r (X_r^T X_r)^{-1} X_r^T

Then

    Y^T Y = Y^T P_0 Y + Y^T (P_1 - P_0) Y + Y^T (P_2 - P_1) Y
            + ... + Y^T (P_r - P_{r-1}) Y + Y^T (I - P_r) Y

          = R(β_0) + R(β_1 | β_0) + R(β_2 | β_0, β_1)
            + ... + R(β_r | β_0, β_1, ..., β_{r-1}) + SSE.
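The telescoping decomposition above says the sequential sums of squares plus SSE reproduce Y^T Y exactly. A numpy sketch with simulated data (the seed and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 15, 3
Xcols = rng.normal(size=(n, r))   # hypothetical predictors
Y = rng.normal(size=n)

def proj(A):
    return A @ np.linalg.pinv(A.T @ A) @ A.T

ones = np.ones((n, 1))
# Nested design matrices X_0 = 1, X_1 = [1|X1], ..., X_r = [1|X1|...|Xr]
designs = [ones] + [np.column_stack([ones, Xcols[:, :j]]) for j in range(1, r + 1)]
P = [proj(A) for A in designs]

seq_SS = [Y @ P[0] @ Y]                                  # R(beta_0)
seq_SS += [Y @ (P[j] - P[j - 1]) @ Y for j in range(1, r + 1)]
SSE = Y @ (np.eye(n) - P[r]) @ Y
total = sum(seq_SS) + SSE
```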
Use Cochran's theorem to show

    - these sums of squares are distributed independently of each other;
    - each (1/σ^2) R(β_i | β_0, ..., β_{i-1}) has a chi-squared distribution
      with one degree of freedom.

Use Result 4.7 to show

    (1/σ^2) SSE ~ χ^2(n - r - 1).

Then

    F = ( R(β_j | β_0, ..., β_{j-1}) / 1 ) / MSE ~ F_(1, n-r-1)(δ^2),

where

    δ^2 = (1/σ^2) β^T X^T (P_j - P_{j-1}) X β
        = (1/σ^2) β^T X^T (P_j - P_{j-1})^T (P_j - P_{j-1}) X β
        = (1/σ^2) [ (P_j - P_{j-1}) X β ]^T (P_j - P_{j-1}) X β.

Hence, this is a test of

    H_0: (P_j - P_{j-1}) X β = 0   vs   H_A: (P_j - P_{j-1}) X β ≠ 0.

Note that

    (P_j - P_{j-1}) X
        = (P_j - P_{j-1}) [ 1 | X_1 | ... | X_{j-1} | X_j | ... | X_r ]
        = [ (P_j - P_{j-1}) 1 | (P_j - P_{j-1}) X_1 | ... | (P_j - P_{j-1}) X_{j-1}
            | (P_j - P_{j-1}) X_j | ... | (P_j - P_{j-1}) X_r ]
        = [ 0 | 0 | ... | 0 | (P_j - P_{j-1}) X_j | ... | (P_j - P_{j-1}) X_r ].

Then

    (P_j - P_{j-1}) X β = Σ_{k=j}^r β_k (P_j - P_{j-1}) X_k
        = β_j (P_j - P_{j-1}) X_j + Σ_{k=j+1}^r β_k (P_j - P_{j-1}) X_k,

and the null hypothesis is

    H_0: 0 = β_j (P_j - P_{j-1}) X_j + Σ_{k=j+1}^r β_k (P_j - P_{j-1}) X_k.
Type II sums of squares in SAS (these are also Type III and Type IV sums of squares for regression problems).

    R(β_j | β_0 and all other β_k's) = Y^T (P_X - P_{-j}) Y,

where

    P_{-j} = X_{-j} (X_{-j}^T X_{-j})^{-1} X_{-j}^T

and X_{-j} is obtained by deleting the (j + 1)-th column of X.

From the previous discussion:

    F = ( Y^T (P_X - P_{-j}) Y / 1 ) / MSE ~ F_(1, n-r-1)(δ^2),

where

    δ^2 = (1/σ^2) β^T X^T (P_X - P_{-j}) X β
        = (1/σ^2) β_j^2 X_j^T (P_X - P_{-j}) X_j.

This F-test provides a test of

    H_0: β_j = 0   vs   H_A: β_j ≠ 0

if (P_X - P_{-j}) X_j ≠ 0.
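A Type II sum of squares is the extra sum of squares from adding X_j last, i.e., the drop in SSE when the deleted column is restored. A numpy sketch with simulated data (the choice j = 2 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 15, 3
Xcols = rng.normal(size=(n, r))   # hypothetical predictors
Y = rng.normal(size=n)
Xmat = np.column_stack([np.ones(n), Xcols])

def proj(A):
    return A @ np.linalg.pinv(A.T @ A) @ A.T

PX = proj(Xmat)

j = 2                                  # variable X_2
X_minus = np.delete(Xmat, j, axis=1)   # delete the (j+1)-th column of X
P_minus = proj(X_minus)

R_typeII = Y @ (PX - P_minus) @ Y      # R(beta_j | beta_0 and all others)

# Equivalent: SSE(reduced model) - SSE(full model)
SSE_small = Y @ (np.eye(n) - P_minus) @ Y
SSE_full  = Y @ (np.eye(n) - PX) @ Y
```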
When X_1, X_2, ..., X_r are all uncorrelated, then

    Variable    Type I sums of squares             Type II sums of squares
    ---------   --------------------------------   -------------------------------
    X_1         R(β_1 | β_0)                       R(β_1 | other β's)
                  = Y^T (P_1 - P_0) Y                = Y^T (P_X - P_{-1}) Y
    X_2         R(β_2 | β_0, β_1)                  R(β_2 | other β's)
                  = Y^T (P_2 - P_1) Y                = Y^T (P_X - P_{-2}) Y
     :               :                                  :
    X_r         R(β_r | β_0, β_1, ..., β_{r-1})    R(β_r | β_0, ..., β_{r-1})
                  = Y^T (P_r - P_{r-1}) Y            = Y^T (P_X - P_{-r}) Y
    Residuals   SSE = Y^T (I - P_X) Y
    Corrected
      total     Y^T (I - P_1) Y

(i) R(β_j | β_0 and any other β's) = R(β_j | β_0). There is only one ANOVA table.

(ii) R(β_j | β_0) = β̂_j^2 Σ_{i=1}^n (X_{ji} - X̄_{j·})^2

(iii) F = R(β_j | β_0) / MSE ~ F_(1, n-r-1)(δ^2), where

        δ^2 = (1/σ^2) β_j^2 Σ_{i=1}^n (X_{ji} - X̄_{j·})^2,

      provides a test of H_0: β_j = 0 versus H_A: β_j ≠ 0.
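With mutually orthogonal, mean-centered predictors the Type I and Type II sums of squares coincide, as claimed in (i). A numpy sketch using contrast-like columns (the data are made up):

```python
import numpy as np

n = 8
# Orthogonal, mean-zero predictors (contrast-like columns)
X1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], float)
X2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], float)
rng = np.random.default_rng(5)
Y = 1 + 2 * X1 - X2 + rng.normal(size=n)

ones = np.ones((n, 1))
def proj(A):
    return A @ np.linalg.pinv(A.T @ A) @ A.T

P0 = proj(ones)
P1 = proj(np.column_stack([ones, X1]))
PX = proj(np.column_stack([ones, X1, X2]))
P_minus1 = proj(np.column_stack([ones, X2]))   # X with the X1 column deleted

typeI_X1  = Y @ (P1 - P0) @ Y          # R(beta_1 | beta_0)
typeII_X1 = Y @ (PX - P_minus1) @ Y    # R(beta_1 | beta_0, beta_2)
```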
Testable Hypotheses

For any testable hypothesis, reject H_0: C β = d in favor of the general alternative H_A: C β ≠ d if

    F = ( (C b - d)^T [ C (X^T X)^- C^T ]^{-1} (C b - d) / m )
        / ( Y^T (I - P_X) Y / (n - rank(X)) )
      > F_(m, n-rank(X)), α,

where

    m = number of rows in C = rank(C)

and

    b = (X^T X)^- X^T Y.

Confidence interval for an estimable function c^T β:

    c^T b ± t_(n-rank(X)), α/2 sqrt( MSE c^T (X^T X)^- c ).

Use

    c^T = (0 0 ... 0 1 0 ... 0)
                    ↑
              j-th position

to construct a confidence interval for β_{j-1}.

Use c^T = (1, x_1, x_2, ..., x_r) to construct a confidence interval for

    E(Y | X_1 = x_1, ..., X_r = x_r) = β_0 + β_1 x_1 + ... + β_r x_r.
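When C selects all the slopes (so H_0: C β = 0 is β_1 = ... = β_r = 0), the general F statistic above reduces to the overall regression F from the ANOVA table, which makes a convenient check. A numpy sketch with simulated full-rank data (seed and sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 20, 2
Xcols = rng.normal(size=(n, r))
Xmat = np.column_stack([np.ones(n), Xcols])
Y = Xmat @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)
e = Y - Xmat @ b
rankX = np.linalg.matrix_rank(Xmat)
MSE = (e @ e) / (n - rankX)

# H0: C beta = d with C picking out beta_1 and beta_2, d = 0
C = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
d = np.zeros(2)
m = np.linalg.matrix_rank(C)
diff = C @ b - d
middle = np.linalg.inv(C @ np.linalg.inv(Xmat.T @ Xmat) @ C.T)
F = (diff @ middle @ diff) / m / MSE

# Overall regression F statistic via projections
def proj(A):
    return A @ np.linalg.pinv(A.T @ A) @ A.T
P1, PX = proj(np.ones((n, 1))), proj(Xmat)
F_overall = (Y @ (PX - P1) @ Y / r) / MSE
```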
Prediction Intervals

Predict a future observation at X_1 = x_1, ..., X_r = x_r; i.e., predict

    Y = β_0 + β_1 x_1 + ... + β_r x_r + ε,

estimating the conditional mean β_0 + β_1 x_1 + ... + β_r x_r with b_0 + b_1 x_1 + ... + b_r x_r, and estimating ε with its mean E(ε) = 0.

A (1 - α)100% prediction interval is

    (c^T b + 0) ± t_(n-rank(X)), α/2 sqrt( MSE [ 1 + c^T (X^T X)^- c ] ),

where

    c^T = (1, x_1, ..., x_r).
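The prediction-interval variance exceeds the conditional-mean variance by exactly MSE, the extra "1 +" term accounting for the new ε. A numpy sketch with simulated data (the point c is a hypothetical (1, x_1, x_2)):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 20, 2
Xcols = rng.normal(size=(n, r))
Xmat = np.column_stack([np.ones(n), Xcols])
Y = Xmat @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(Xmat.T @ Xmat)
b = XtX_inv @ Xmat.T @ Y
e = Y - Xmat @ b
MSE = (e @ e) / (n - r - 1)

c = np.array([1.0, 0.5, -0.3])            # hypothetical (1, x1, x2)
var_mean = MSE * (c @ XtX_inv @ c)        # half-width basis for the mean CI
var_pred = MSE * (1 + c @ XtX_inv @ c)    # half-width basis for the prediction interval
```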
/* A SAS program to perform a regression
analysis of the effects of the
composition of Portland cement on the
amount of heat given off as the cement
hardens. Posted as cement.sas */
data set1;
input run x1 x2 x3 x4 y;
/* label y = evolved heat (calories)
x1 = tricalcium aluminate
x2 = tricalcium silicate
x3 = tetracalcium aluminate ferrate
x4 = dicalcium silicate; */
cards;
1 7 26 6 60 78.5
2 1 29 15 52 74.3
3 11 56 8 20 104.3
4 11 31 8 47 87.6
5 7 52 6 33 95.9
6 11 55 9 22 109.2
7 3 71 17 6 102.7
8 1 31 22 44 72.5
9 2 54 18 22 93.1
10 21 47 4 26 115.9
11 1 40 23 34 83.8
12 11 66 9 12 113.2
13 10 68 8 12 109.4
run;
proc print data=set1 uniform split='*';
var y x1 x2 x3 x4;
label y = 'Evolved*heat*(calories)'
x1 = 'Percent*tricalcium*aluminate'
x2 = 'Percent*tricalcium*silicate'
x3 = 'Percent*tetracalcium*aluminate*ferrate'
x4 = 'Percent*dicalcium*silicate';
run;
/* Regress y on all four explanatory
   variables and check residual plots
   and collinearity diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 x3 x4 / p r ss1 ss2
                          covb collin;
  output out=set2 residual=r
                  predicted=yhat;
run;

/* Regress y on two of the explanatory
   variables and check residual plots
   and collinearity diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 / p r ss1 ss2
                    covb collin;
  output out=set2 residual=r
                  predicted=yhat;
run;

/* Examine smaller regression models
   corresponding to subsets of the
   explanatory variables */

proc reg data=set1;
  model y = x1 x2 x3 x4 /
        selection=rsquare cp aic
        sbc mse stop=4 best=6;
run;

/* Use the GLM procedure to identify
   all estimable functions */

proc glm data=set1;
  model y = x1 x2 x3 x4 / ss1 ss2 e1 e2 e p;
run;
                                                Percent
         Evolved     Percent      Percent       tetracalcium   Percent
         heat        tricalcium   tricalcium    aluminate      dicalcium
Obs      (calories)  aluminate    silicate      ferrate        silicate

  1        78.5          7           26             6             60
  2        74.3          1           29            15             52
  3       104.3         11           56             8             20
  4        87.6         11           31             8             47
  5        95.9          7           52             6             33
  6       109.2         11           55             9             22
  7       102.7          3           71            17              6
  8        72.5          1           31            22             44
  9        93.1          2           54            18             22
 10       115.9         21           47             4             26
 11        83.8          1           40            23             34
 12       113.2         11           66             9             12
 13       109.4         10           68             8             12
The REG Procedure
Model: MODEL1
Dependent Variable: y

                 Analysis of Variance

                           Sum of         Mean
Source            DF       Squares        Square
Model              4       2664.52051     666.13013
Error              8         47.67641       5.95955
Corrected Total   12       2712.19692

Root MSE          2.44122    R-Square   0.9824
Dependent Mean   95.41538    Adj R-Sq   0.9736
Coeff Var         2.55852

                     Correlation

Variable      x1        x2        x3        x4         y
x1        1.0000    0.2286   -0.8241   -0.2454    0.7309
x2        0.2286    1.0000   -0.1392   -0.9730    0.8162
x3       -0.8241   -0.1392    1.0000    0.0295   -0.5348
x4       -0.2454   -0.9730    0.0295    1.0000   -0.8212
y         0.7309    0.8162   -0.5348   -0.8212    1.0000
                 Parameter Estimates

                   Parameter    Standard
Variable     DF    Estimate     Error       t Value   Pr > |t|
Intercept     1    63.16602     69.93378      0.90     0.3928
x1            1     1.54305      0.74331      2.08     0.0716
x2            1     0.50200      0.72237      0.69     0.5068
x3            1     0.09419      0.75323      0.13     0.9036
x4            1    -0.15152      0.70766     -0.21     0.8358
Variable     DF    Type I SS     Type II SS
Intercept     1    118353         4.86191
x1            1    1448.75413    25.68225
x2            1    1205.70283     2.87801
x3            1       9.79033     0.09319
x4            1       0.27323     0.27323

              Collinearity Diagnostics

                           Condition
Number     Eigenvalue      Index
1          4.11970           1.00000
2          0.55389           2.72721
3          0.28870           3.77753
4          0.03764          10.46207
5          0.00006614      249.57825

              Collinearity Diagnostics

          ---------------Proportion of Variation---------------
Number    Intercept     x1          x2         x3         x4
1         0.000005      0.00037     0.00002    0.00021    0.00036
2         8.812E-8      0.01004     0.00001    0.00266    0.00010
3         3.060E-7      0.000581    0.00032    0.00159    0.00168
4         0.000127      0.05745     0.00278    0.04569    0.00088
5         0.99987       0.93157     0.99687    0.94985    0.99730
         Dep Var     Predicted    Std Error       Student
Obs      y           Value        Mean Predict    Residual
  1       78.5000     78.4929      1.8109          0.00432
  2       74.3000     72.8005      1.4092          0.752
  3      104.3000    105.9744      1.8543         -1.054
  4       87.6000     89.3333      1.3265         -0.846
  5       95.9000     95.6360      1.4598          0.135
  6      109.2000    105.2635      0.8602          1.723
  7      102.7000    104.1289      1.4791         -0.736
  8       72.5000     75.6760      1.5604         -1.692
  9       93.1000     91.7218      1.3244          0.672
 10      115.9000    115.6010      2.0431          0.224
 11       83.8000     81.8034      1.5924          1.079
 12      113.2000    112.3007      1.2519          0.429
 13      109.4000    111.6675      1.3454         -1.113

Obs      Cook's D
  1       0.000
  2       0.057
  3       0.303
  4       0.060
  5       0.002
  6       0.084
  7       0.063
  8       0.395
  9       0.038
 10       0.023
 11       0.172
 12       0.013
 13       0.108

[The listing also contains a line-printer plot of each studentized residual on
a -2 to 2 scale between the two columns above.]
The REG Procedure
Model: MODEL1
R-Square Selection Method

Regression Models for Dependent Variable: y

Number in                                         Variables
Model       R-Square    AIC        SBC            in Model
1           0.6744      58.8383    59.96815       x4
1           0.6661      59.1672    60.29712       x2
1           0.5342      63.4964    64.62630       x1
1           0.2860      69.0481    70.17804       x3
------------------------------------------------------
2           0.9787      25.3830    27.07785       x1 x2
2           0.9726      28.6828    30.37766       x1 x4
2           0.9353      39.8308    41.52565       x3 x4
2           0.8470      51.0247    52.71951       x2 x3
2           0.6799      60.6172    62.31201       x2 x4
2           0.5484      65.0933    66.78816       x1 x3
------------------------------------------------------
3           0.9824      24.9187    27.17852       x1 x2 x4
3           0.9823      24.9676    27.22742       x1 x2 x3
3           0.9814      25.6553    27.91511       x1 x3 x4
3           0.9730      30.4953    32.75514       x2 x3 x4
------------------------------------------------------
4           0.9824      26.8933    29.71808       x1 x2 x3 x4
This output was produced by the e option
in the model statement of the GLM procedure.
It indicates that all five regression
parameters are estimable.

The GLM Procedure
General Form of Estimable Functions

Effect        Coefficients
Intercept     L1
x1            L2
x2            L3
x3            L4
x4            L5

This output was produced by the e1 option
in the model statement of the GLM procedure.
It describes the null hypotheses that are
tested with the sequential Type I sums of
squares.

Type I Estimable Functions

              ----------------Coefficients---------------
Effect        x1            x2            x3           x4
Intercept     0             0             0            0
x1            L2            0             0            0
x2            0.6047*L2     L3            0            0
x3            -0.8974*L2    0.0213*L3     L4           0
x4            -0.6984*L2    -1.0406*L3    -1.0281*L4   L5

Type II Estimable Functions

              ----Coefficients----
Effect        x1    x2    x3    x4
Intercept     0     0     0     0
x1            L2    0     0     0
x2            0     L3    0     0
x3            0     0     L4    0
x4            0     0     0     L5
# The commands are posted as:  cement.spl

The data file is stored under the name
cement.txt. It has variable names on the
first line. We will enter the data into
a data frame.

> cement<-read.table("c:\\cement.txt",header=T)
> cement
   run X1 X2 X3 X4     Y
1    1  7 26  6 60  78.5
2    2  1 29 15 52  74.3
3    3 11 56  8 20 104.3
4    4 11 31  8 47  87.6
5    5  7 52  6 33  95.9
6    6 11 55  9 22 109.2
7    7  3 71 17  6 102.7
8    8  1 31 22 44  72.5
9    9  2 54 18 22  93.1
10  10 21 47  4 26 115.9
11  11  1 40 23 34  83.8
12  12 11 66  9 12 113.2
13  13 10 68  8 12 109.4
> # Create a scatterplot matrix with smooth
> # curves. Unix users should first use
> # motif( ) to open a graphics window

> # Compute correlations and round the results
> # to four significant digits
> round(cor(cement[-1]),4)
        X1      X2      X3      X4       Y
X1  1.0000  0.2286 -0.8241 -0.2454  0.7309
X2  0.2286  1.0000 -0.1392 -0.9730  0.8162
X3 -0.8241 -0.1392  1.0000  0.0295 -0.5348
X4 -0.2454 -0.9730  0.0295  1.0000 -0.8212
Y   0.7309  0.8162 -0.5348 -0.8212  1.0000
> points.lines <- function(x, y)
+ {
+ points(x, y)
+ lines(loess.smooth(x, y, 0.90))
+ }
> par(din=c(7,7),pch=18,mkh=.15,cex=1.2,lwd=3)
> pairs(cement[ ,-1], panel=points.lines)
[Scatterplot matrix of X1, X2, X3, X4, and Y with loess smooth curves.]

> # Use the lm( ) function to
> # fit a linear regression model
> cement.out <- lm(Y~X1+X2+X3+X4, cement)
> summary(cement.out)

Call: lm(formula = Y ~ X1+X2+X3+X4, data=cement)
Residuals:
    Min     1Q Median    3Q   Max
 -3.176 -1.674  0.264 1.378 3.936

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept) 63.1660    69.9338  0.9032   0.3928
         X1  1.5431     0.7433  2.0759   0.0716
         X2  0.5020     0.7224  0.6949   0.5068
         X3  0.0942     0.7532  0.1250   0.9036
         X4 -0.1515     0.7077 -0.2141   0.8358

Residual standard error: 2.441 on 8 d.f.
Multiple R-Squared: 0.9824
F-statistic: 111.8 on 4 and 8 degrees of freedom,
the p-value is 4.707e-007

Correlation of Coefficients:
   (Intercept)     X1     X2     X3
X1     -0.9678
X2     -0.9978 0.9510
X3     -0.9769 0.9861 0.9624
X4     -0.9983 0.9568 0.9979 0.9659
> # Create a function to evaluate an orthogonal
> # projection matrix. Then create a function
> # to compute type II sums of squares.
> # This uses the ginverse( ) function
>
> #=======================================
> # project( )
> #---------------------------------------
> # calculate orthogonal projection matrix
> #=======================================
> project <- function(X)
+ {X%*%ginverse(crossprod(X))%*%t(X)}
> #=======================================
> anova(cement.out)
Analysis of Variance Table
Response: Y
Terms added sequentially (first to last)
      Df Sum of Sq  Mean Sq  F Value  Pr(F)
X1     1  1448.754 1448.754 243.0978 0.0000
X2     1  1205.703 1205.703 202.3144 0.0000
X3     1     9.790    9.790   1.6428 0.2358
X4     1     0.273    0.273   0.0458 0.8358
Resid  8    47.676    5.960
> #========================================
> # typeII.SS( )
> #----------------------------------------
> # calculate Type II sum of squares
> #
> #  input  lmout = object made by the
> #                 lm( ) function
> #         y     = dependent variable
> #========================================
> typeII.SS <- function(lmout,y)
+ {
+ # generate the model matrix
+   model <- model.matrix(lmout)
+ # create list of parameter names
+   par.name <- dimnames(model)[[2]]
+ # compute number of parameters
+   n.par <- dim(model)[2]
+ # Compute residual mean square
+   SS.res <- deviance(lmout)
+   df2    <- lmout$df.resid
+   MS.res <- SS.res/df2
+ # store results
+   result <- NULL
+ # Compute Type II SS
+   for (i in 1:n.par) {
+     A <- project(model)-project(model[,-i])
+     SS.II <- t(y) %*% A %*% y
+     df1   <- qr(project(model))$rank -
+              qr(project(model[ ,-i]))$rank
+     MS.II <- SS.II/df1
+     F.stat <- MS.II/MS.res
+     p.val <- 1-pf(F.stat,df1,df2)
+     temp <- cbind(df1,SS.II,MS.II,F.stat,p.val)
+     result <- rbind(result,temp)
+   }
+   result<-rbind(result,c(df2,SS.res,MS.res,NA,NA))
+   dimnames(result)<-list(c(par.name,"Residual"),
+     c("Df","Sum of Sq","Mean Sq","F Value","Pr(F)"))
+   cat("Analysis of Variance
+       (TypeII Sum of Squares) \n")
+   round(result,6)
+ }
> #==========================================
> typeII.SS(cement.out, cement$Y)
Analysis of Variance
      (TypeII Sum of Squares)
        Df  Sum of Sq    Mean Sq  F Value    Pr(F)
(Inter)  1   4.861907   4.861907 0.815818 0.392790
X1       1  25.682254  25.682254 4.309427 0.071568
X2       1   2.878010   2.878010 0.482924 0.506779
X3       1   0.093191   0.093191 0.015637 0.903570
X4       1   0.273229   0.273229 0.045847 0.835810
Resid    8  47.676412   5.959551       NA       NA
# Venables and Ripley have supplied functions
# studres( ) and stdres( ) to compute
# studentized and standardized residuals.
# Use the library( ) function to attach the
# MASS library before using these functions.

> library(MASS)
> cement.res <- cbind(cement$Y,
+                     cement.out$fitted,
+                     cement.out$resid,
+                     studres(cement.out),
+                     stdres(cement.out))
> dimnames(cement.res) <- list(cement$run,
+        c("Response","Predicted","Residual",
+          "Stud. Res.","Std. Res."))
> round(cement.res,4)
   Response Predicted Residual Stud. Res. Std. Res.
 1     78.5   78.4929   0.0071     0.0040    0.0043
 2     74.3   72.8005   1.4995     0.7299    0.7522
 3    104.3  105.9744  -1.6744    -1.0630   -1.0545
 4     87.6   89.3333  -1.7333    -0.8291   -0.8458
 5     95.9   95.6360   0.2640     0.1264    0.1349
 6    109.2  105.2635   3.9365     2.0324    1.7230
 7    102.7  104.1289  -1.4289    -0.7128   -0.7358
 8     72.5   75.6760  -3.1760    -1.9745   -1.6917
 9     93.1   91.7218   1.3782     0.6472    0.6721
10    115.9  115.6010   0.2990     0.2100    0.2237
11     83.8   81.8034   1.9966     1.0919    1.0790
12    113.2  112.3007   0.8993     0.4061    0.4291
13    109.4  111.6675  -2.2675    -1.1326   -1.1131
> # Produce plots for model diagnostics
> # including Cook's D.
> par(mfrow=c(3,2))
> plot(cement.out)

[Panel of six diagnostic plots: residuals vs. fitted values,
sqrt(abs(residuals)) vs. fitted values, response vs. fitted values, a normal
Q-Q plot of the residuals, an r-f spread plot, and Cook's distance against the
observation index; cases 6, 8, and 13 are flagged.]

> # Search for a simpler model
> cement.stp <- step(cement.out,
+     scope=list(upper = ~X1 + X2 + X3 + X4,
+                lower = ~ 1), trace=F)
> cement.stp$anova
Stepwise Model Path
Analysis of Deviance Table

Initial Model:
Y ~ X1 + X2 + X3 + X4

Final Model:
Y ~ X1 + X2

    Step Df Deviance Resid. Df Resid. Dev      AIC
1                            8   47.67641 107.2719
2   - X3  1 0.093191         9   47.76960  95.4460
3   - X4  1 9.970363        10   57.73997  93.4973