Topic 15: General Linear Tests and Extra Sum of Squares
Outline
• Extra Sums of Squares with
applications
• Partial correlations
• Standardized regression
coefficients
General Linear Tests
• Recall: A different way to look at the
comparison of models
• Look at the difference
– in SSE (reduce unexplained SS)
– in SSM (increase explained SS)
• Because SSM+SSE=SST, these two
comparisons are equivalent
General Linear Tests
• Models we compare are hierarchical in the
sense that one (the full model) includes all
of the explanatory variables of the other
(the reduced model)
• We can compare models with different
explanatory variables such as
1. X1, X2 vs X1
2. X1, X2, X3, X4, X5 vs X1, X2, X3
(Note first model includes all Xs of second)
General Linear Tests
• We will get an F test that compares
the two models
• We are testing a null hypothesis that
the regression coefficients for the
extra variables are all zero
• For X1, X2, X3, X4, X5 vs X1, X2, X3
– H0: β4 = β5 = 0
– H1: β4 and β5 are not both 0
General Linear Tests
$$F^* = \frac{\left(SSE(R) - SSE(F)\right) / \left(df_E(R) - df_E(F)\right)}{SSE(F) / df_E(F)}$$
• Degrees of freedom for the F statistic are the
number of extra variables and the dfE for the
larger model
• Suppose n=100 and we compare models with
X1, X2, X3, X4, X5 vs X1, X2, X3
• Numerator df is 2 (the two extra variables, X4 and X5)
• Denominator df is n-6 = 94 (the full model has
6 parameters, counting the intercept)
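As a quick check on the degrees of freedom above, the following data step is a minimal sketch that computes them along with the 5% critical value of the F statistic. The variable names (p_full, p_reduced, f_crit) are illustrative and are not part of the course program; FINV is the SAS quantile function for the F distribution.

```sas
data _null_;
  n         = 100;
  p_full    = 6;   /* intercept + X1..X5 */
  p_reduced = 4;   /* intercept + X1..X3 */
  num_df = p_full - p_reduced;   /* number of extra variables */
  den_df = n - p_full;           /* dfE for the full model    */
  f_crit = finv(0.95, num_df, den_df);   /* reject H0 if F* exceeds this */
  put num_df= den_df= f_crit=;
run;
```

The values are written to the SAS log; comparing F* with f_crit is equivalent to comparing the P-value with 0.05.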
Notation for Extra SS
• SSE(X1,X2,X3,X4,X5) is the SSE for the full
model
• SSE(X1,X2,X3) is the SSE for the reduced
model
• SSE(X4,X5 | X1,X2,X3) is the difference in
the SSEs (reduced minus full):
SSE(X1,X2,X3) - SSE(X1,X2,X3,X4,X5)
• Recall that we can replace SSE with SSM; the same
difference is SSM(X1,X2,X3,X4,X5) - SSM(X1,X2,X3)
F test
• Numerator: SSE(X4,X5 | X1,X2,X3) / 2
• Denominator: MSE(X1,X2,X3,X4,X5)
• Under H0, F ~ F(2, n-6)
• Reject if the P-value ≤ 0.05 and
conclude that either X4 or X5 or both
contain additional information useful
for predicting Y in a linear model that
also includes X1, X2, and X3
Examples
• Predict bone density using age, weight
and height; does diet add any useful
information?
• Predict GPA using 3 HS grade variables;
do SAT scores add any useful
information?
• Predict yield of an industrial process
using temperature and pH; does the
supplier of the raw material (categorical)
add any useful information?
Extra SS Special Cases
• Compare models that differ by one
explanatory variable, F(1, n-p) = t²(n-p)
• SAS’s individual parameter t-tests
are equivalent to the general linear
test based on
SSM(Xi|X1,…, Xi-1, Xi+1 ,…, Xp-1)
Add one variable at a time
• Consider 4 explanatory variables and the
extra sum of squares
– SSM (X1)
– SSM (X2 | X1)
– SSM (X3 |X1, X2)
– SSM (X4 |X1, X2, X3)
• SSM (X1) +SSM (X2 | X1) + SSM (X3 | X1, X2)
+ SSM (X4 | X1, X2, X3) =SSM(X1, X2, X3, X4)
One Variable added
• Numerator df is 1 for each of these tests
• F = (SSM / 1) / MSE(full) ~ F(1, n-p)
• This is the SAS Type I SS
• We typically use Type II SS
KNNL Example p 257
• 20 healthy female subjects
• Y is body fat
• X1 is triceps skin fold thickness
• X2 is thigh circumference
• X3 is midarm circumference
• Underwater weighing is the “gold
standard” used to obtain Y
Input and data check
options nocenter;
data a1;
infile '../data/ch07ta01.dat';
input skinfold thigh midarm fat;
proc print data=a1;
run;
Proc reg
proc reg data=a1;
model fat=skinfold thigh midarm;
run;
Output
Analysis of Variance
                          Sum of         Mean
Source            DF     Squares       Square   F Value   Pr > F
Model              3   396.98461    132.32820     21.52   <.0001
Error             16    98.40489      6.15031
Corrected Total   19   495.38950

Group of predictors helpful in
predicting percent body fat
Output
Parameter Estimates
                Parameter    Standard
Variable   DF    Estimate       Error   t Value   Pr > |t|
Intercept   1   117.08469    99.78240      1.17     0.2578
skinfold    1     4.33409     3.01551      1.44     0.1699
thigh       1    -2.85685     2.58202     -1.11     0.2849
midarm      1    -2.18606     1.59550     -1.37     0.1896

None of the individual t-tests are
significant.
Summary
• The P-value for F test is <.0001
• But the P-values for the individual
regression coefficients are 0.1699,
0.2849, and 0.1896
• None of these are below our standard
significance level of 0.05
• What is the reason?
Look at this using extra SS
proc reg data=a1;
model fat=skinfold thigh midarm / ss1 ss2;
run;
Output
Parameter Estimates
                Parameter    Standard
Variable   DF    Estimate       Error   t Value   Pr > |t|   Type I SS   Type II SS
skinfold    1     4.33409     3.01551      1.44     0.1699   352.26980     12.70489
thigh       1    -2.85685     2.58202     -1.11     0.2849    33.16891      7.52928
midarm      1    -2.18606     1.59550     -1.37     0.1896    11.54590     11.54590

Notice how different these
SS are for skinfold and
thigh
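A small arithmetic check, using only numbers already shown in the output: the Type I SS are sequential extra sums of squares, so they should add up to the model SS in the ANOVA table. The last variable listed (midarm) is adjusted for all the others in both cases, which is why its Type I and Type II SS coincide.

```latex
\begin{align*}
SSM(X_1) + SSM(X_2 \mid X_1) + SSM(X_3 \mid X_1, X_2)
  &= 352.26980 + 33.16891 + 11.54590 \\
  &= 396.98461 = SSM(X_1, X_2, X_3) \\
\text{Type I SS for } X_3 = SSM(X_3 \mid X_1, X_2)
  &= 11.54590 = \text{Type II SS for } X_3
\end{align*}
```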
Interpretation
• Fact: the Type I and Type II SS are
very different
• If we reorder the variables in the
model statement we will get
– Different Type I SS
– The same Type II SS
• Could the variables be explaining the same
SS and canceling each other out?
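To see the effect of reordering, one could rerun the model with the predictors listed in a different order. This is a sketch using the same data set a1; the reversed order is just one illustrative choice.

```sas
proc reg data=a1;
  /* same three predictors, listed in reverse order */
  model fat = midarm thigh skinfold / ss1 ss2;
run;
```

The Type I SS column should change, because each entry is now adjusted only for the variables listed before it in the new order, while the Type II SS column should be unchanged.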
Run additional models
• Rerun with skinfold as the
explanatory variable
proc reg data=a1;
model fat=skinfold;
run;
Output
Parameter Estimates
                Parameter    Standard
Variable   DF    Estimate       Error   t Value   Pr > |t|
Intercept   1    -1.49610     3.31923     -0.45     0.6576
skinfold    1     0.85719     0.12878      6.66     <.0001

Skinfold by itself is a highly
significant linear predictor
Use general linear test to
see if other predictors
contribute beyond skinfold
proc reg data=a1;
model fat=skinfold thigh midarm;
thimid: test thigh, midarm;
run;
Output
Test thimid Results for Dependent Variable fat
                        Mean
Source         DF     Square   F Value   Pr > F
Numerator       2   22.35741      3.64   0.0500
Denominator    16    6.15031

Yes, they help after
skinfold is in the model
(P = 0.0500, right at the cutoff).
Perhaps the best model includes
only two predictors
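The test result above can be reproduced by hand from sums of squares already shown. The data step below is a minimal sketch: the SS values are copied from the earlier output, the variable names are illustrative, and PROBF is the SAS cumulative distribution function for F.

```sas
data _null_;
  sst   = 495.38950;          /* corrected total SS                 */
  sse_r = sst - 352.26980;    /* SSE(skinfold) = SST - SSM(skinfold) */
  sse_f = 98.40489;           /* SSE(skinfold, thigh, midarm)       */
  f = ((sse_r - sse_f) / 2) / (sse_f / 16);   /* general linear test */
  p = 1 - probf(f, 2, 16);                    /* upper-tail P-value  */
  put f= p=;
run;
```

The numerator mean square is (33.16891 + 11.54590)/2 = 22.35741, and the F value written to the log should agree with the 3.64 reported by the test statement.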
Use general linear test to
assess midarm
proc reg data=a1;
model fat=skinfold thigh midarm;
midarm: test midarm;
run;
Output
Test midarm Results for Dependent Variable fat
                        Mean
Source         DF     Square   F Value   Pr > F
Numerator       1   11.54590      1.88   0.1896
Denominator    16    6.15031

With skinfold and thigh in
the model, midarm is not a
significant predictor.
This is just the t-test for
this coef in full model
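The equivalence with the t-test can be checked numerically. This sketch uses the midarm estimate, standard error, extra SS, and MSE from the output above; the variable names are illustrative.

```sas
data _null_;
  t = -2.18606 / 1.59550;      /* estimate / standard error for midarm    */
  t_squared = t**2;
  f = 11.54590 / 6.15031;      /* extra SS for midarm (1 df) / MSE(full)  */
  p = 1 - probf(f, 1, 16);     /* upper-tail P-value on (1, 16) df        */
  put t= t_squared= f= p=;
run;
```

Both t_squared and f come out near 1.88, and the P-value matches the 0.1896 shown for the midarm t-test.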
Other uses of general
linear test
• The test statement can be used to
perform a significance test for any
hypothesis involving a linear
combination of the regression
coefficients
• Examples
– H0: β4 = β5
– H0: β4 - 3β5 = 12
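In proc reg, hypotheses of this kind are written as equations in the variable names. The sketch below applies the same two forms to the body fat model purely to show the syntax; the hypotheses themselves (and the labels eqcoef and lincomb) are illustrative assumptions, not tests suggested by the data.

```sas
proc reg data=a1;
  model fat = skinfold thigh midarm;
  /* H0: the thigh and midarm coefficients are equal */
  eqcoef:  test thigh = midarm;
  /* H0: (thigh coefficient) - 3*(midarm coefficient) = 12 */
  lincomb: test thigh - 3*midarm = 12;
run;
```

Each test statement produces its own numerator/denominator table with an F value on 1 numerator df.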
Partial correlations
• Measures the strength of a linear relation
between two variables taking into account
other variables
• Procedure to find the partial correlation between Xi and Y
– Predict Y using the other X’s
– Predict Xi using the other X’s
– Find the correlation between the two sets
of residuals
• KNNL use the term coefficient of partial
determination for the squared partial
correlation
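The residual procedure above can be sketched directly in SAS for midarm, adjusting for skinfold and thigh. The data set names r1 and r2 and the residual names res_fat and res_midarm are hypothetical; the OUT= data set keeps the original variables, so the second regression can run on r1.

```sas
/* Step 1: residuals of Y after regressing on the other X's */
proc reg data=a1 noprint;
  model fat = skinfold thigh;
  output out=r1 r=res_fat;
run;

/* Step 2: residuals of Xi after regressing on the other X's */
proc reg data=r1 noprint;
  model midarm = skinfold thigh;
  output out=r2 r=res_midarm;
run;

/* Step 3: correlation between the two sets of residuals */
proc corr data=r2;
  var res_fat res_midarm;
run;
```

The correlation reported by proc corr is the partial correlation; its square should agree with the squared partial correlation that proc reg reports for midarm with the pcorr2 option.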
Pcorr2 option
proc reg data=a1;
model fat=skinfold thigh midarm
/ pcorr2;
run;
Output
Parameter Estimates
                                                                 Squared
                Parameter    Standard                            Partial
Variable   DF    Estimate       Error   t Value   Pr > |t|   Corr Type II
Intercept   1   117.08469    99.78240      1.17     0.2578              .
skinfold    1     4.33409     3.01551      1.44     0.1699        0.11435
thigh       1    -2.85685     2.58202     -1.11     0.2849        0.07108
midarm      1    -2.18606     1.59550     -1.37     0.1896        0.10501

Skinfold and midarm explain
the most remaining variation
when added last
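These values can be recovered from the Type II SS and the full-model SSE shown earlier: the squared partial correlation is the extra SS for the variable divided by the SSE of the model without it. A worked check for midarm (X3):

```latex
\begin{align*}
r^2_{Y3 \mid 12}
  &= \frac{SSM(X_3 \mid X_1, X_2)}{SSE(X_1, X_2)}
   = \frac{SSM(X_3 \mid X_1, X_2)}{SSE(X_1, X_2, X_3) + SSM(X_3 \mid X_1, X_2)} \\
  &= \frac{11.54590}{98.40489 + 11.54590}
   = \frac{11.54590}{109.95079} \approx 0.10501
\end{align*}
```

The same calculation with 12.70489 (skinfold) and 7.52928 (thigh) gives about 0.11435 and 0.07108.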
Standardized Regression
Model
• Can help reduce round-off errors in
calculations
• Puts regression coefficients in
common units
• Units for the usual coefficients are
units for Y divided by units for X
Standardized Regression
Model
• Standardized coefs can be obtained
from the usual ones by multiplying
by the ratio of the standard deviation
of X to the standard deviation of Y
• Interpretation is that a one sd
increase in X corresponds to a change
of ‘standardized beta’ sds in Y
Standardized Regression
Model
• Y = … + βX + …
• = … + β(sX/sY)(sY/sX)X + …
• = … + β(sX/sY)((sY/sX)X) + …
• = … + β(sX/sY)(sY)(X/sX) + …
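Dividing both sides of the last line by the standard deviation of Y finishes the argument; the symbol b* for the standardized coefficient is notation introduced here for illustration.

```latex
\begin{align*}
Y &= \dots + \beta \left(\frac{s_X}{s_Y}\right) s_Y \left(\frac{X}{s_X}\right) + \dots \\
\frac{Y}{s_Y} &= \dots + \underbrace{\beta \, \frac{s_X}{s_Y}}_{b^*} \left(\frac{X}{s_X}\right) + \dots
\end{align*}
```

So when Y and X are both measured in standard deviation units, the coefficient on X is the usual coefficient times s_X / s_Y.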
Standardized Regression
Model
• Standardize Y and all X’s (subtract mean
and divide by standard deviation)
• Then divide by √(n-1) (the correlation
transformation in KNNL)
• The regression coefficients for variables
transformed in this way are the
standardized regression coefficients
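One way to carry out the standardization by hand is with proc standard followed by an ordinary regression. This is a sketch; the output data set name a1std is hypothetical. The further division by a common constant is skipped here, since it rescales every variable by the same factor and leaves the slopes unchanged.

```sas
/* center each variable at 0 and scale to standard deviation 1 */
proc standard data=a1 mean=0 std=1 out=a1std;
  var fat skinfold thigh midarm;
run;

/* regression on the standardized variables */
proc reg data=a1std;
  model fat = skinfold thigh midarm;
run;
```

The slopes from this fit should match the standardized estimates that proc reg prints with the stb option, and the intercept should be zero up to rounding.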
STB option
proc reg data=a1;
model fat=skinfold thigh midarm
/ stb;
run;
Output
Parameter Estimates
                Parameter    Standard                        Standardized
Variable   DF    Estimate       Error   t Value   Pr > |t|       Estimate
Intercept   1   117.08469    99.78240      1.17     0.2578              0
skinfold    1     4.33409     3.01551      1.44     0.1699        4.26370
thigh       1    -2.85685     2.58202     -1.11     0.2849       -2.92870
midarm      1    -2.18606     1.59550     -1.37     0.1896       -1.56142

Skinfold and thigh
suggest largest
standardized change
Reading
• We went over 7.1 – 7.5
• We used program topic15.sas to
generate the output