Topic 13: Multiple Linear Regression Example

Outline
• Description of example
• Descriptive summaries
• Investigation of various models
• Conclusions
Study of CS students
• Too many computer science majors at Purdue were dropping out of the program
• Wanted to find predictors of success to be used in the admissions process
• Predictors must be available at the time of entry into the program
Data available
• GPA after three semesters
• Overall high school math grade
• Overall high school science grade
• Overall high school English grade
• SAT Math
• SAT Verbal
• Gender (of interest for other reasons)
Data for CS Example
• Y is the student’s grade point average (GPA) after 3 semesters
• 3 HS grades and 2 SAT scores are the explanatory variables (p = 6 regression parameters, counting the intercept)
• Have n = 224 students
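Written out (a standard multiple regression setup; the subscripted form is added here for clarity), the full model under consideration is

GPA_i = β0 + β1 hsm_i + β2 hss_i + β3 hse_i + β4 satm_i + β5 satv_i + e_i,  i = 1, …, 224

with the errors e_i assumed independent N(0, σ2).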
Descriptive Statistics
* Read in the raw data and compute summary statistics;
data a1;
  infile 'C:\...\csdata.dat';
  input id gpa hsm hss hse satm satv genderm1;
proc means data=a1 maxdec=2;
  var gpa hsm hss hse satm satv;
run;
Output from Proc Means

Variable    N     Mean   Std Dev   Minimum   Maximum
gpa       224     2.64      0.78      0.12      4.00
hsm       224     8.32      1.64      2.00     10.00
hss       224     8.09      1.70      3.00     10.00
hse       224     8.09      1.51      3.00     10.00
satm      224   595.29     86.40    300.00    800.00
satv      224   504.55     92.61    285.00    760.00
Descriptive Statistics
proc univariate data=a1;
  var gpa hsm hss hse satm satv;
  * /normal overlays a fitted normal curve on each histogram;
  histogram gpa hsm hss hse satm satv / normal;
run;
Correlations
proc corr data=a1;
  var hsm hss hse satm satv;
* The WITH statement restricts output to correlations of gpa with each predictor;
proc corr data=a1;
  var hsm hss hse satm satv;
  with gpa;
run;
Output from Proc Corr

Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

         gpa      hsm      hss      hse      satm     satv
gpa   1.00000  0.43650  0.32943  0.28900  0.25171  0.11449
               <.0001   <.0001   <.0001   0.0001   0.0873
hsm   0.43650  1.00000  0.57569  0.44689  0.45351  0.22112
      <.0001            <.0001   <.0001   <.0001   0.0009
hss   0.32943  0.57569  1.00000  0.57937  0.24048  0.26170
      <.0001   <.0001            <.0001   0.0003   <.0001
hse   0.28900  0.44689  0.57937  1.00000  0.10828  0.24371
      <.0001   <.0001   <.0001            0.1060   0.0002
satm  0.25171  0.45351  0.24048  0.10828  1.00000  0.46394
      0.0001   <.0001   0.0003   0.1060            <.0001
satv  0.11449  0.22112  0.26170  0.24371  0.46394  1.00000
      0.0873   0.0009   <.0001   0.0002   <.0001
Output from Proc Corr

Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

         hsm      hss      hse      satm     satv
gpa   0.43650  0.32943  0.28900  0.25171  0.11449
      <.0001   <.0001   <.0001   0.0001   0.0873

All but SATV significantly correlated with GPA
Scatter Plot Matrix
proc corr data=a1 plots=matrix;
  var gpa hsm hss hse satm satv;
run;
Allows visual check of pairwise relationships:
• No “strong” linear relationships
• Can see discreteness of high school scores
Use high school grades to predict GPA (Model #1)
proc reg data=a1;
  model gpa=hsm hss hse;
run;
Results Model #1

Root MSE          0.69984    R-Square    0.2046
Dependent Mean    2.63522    Adj R-Sq    0.1937
Coeff Var        26.55711
Meaningful??

Parameter Estimates
Variable    DF   Parameter   Standard   t Value   Pr > |t|
                  Estimate      Error
Intercept    1    0.58988     0.29424     2.00     0.0462
hsm          1    0.16857     0.03549     4.75     <.0001
hss          1    0.03432     0.03756     0.91     0.3619
hse          1    0.04510     0.03870     1.17     0.2451
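From these estimates, the fitted Model #1 regression equation (coefficients as reported above) is

GPA-hat = 0.58988 + 0.16857 hsm + 0.03432 hss + 0.04510 hse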
ANOVA Table #1

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         27.71233       9.23744     18.86    <.0001
Error            220        107.75046       0.48977
Corrected Total  223        135.46279

Significant F test but not all variable t tests significant
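Note (a worked check using standard identities and the numbers above): F = MSM/MSE = 9.23744/0.48977 ≈ 18.86, and R-Square = SSM/SST = 27.71233/135.46279 ≈ 0.2046, matching the Results Model #1 output.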
Remove HSS (Model #2)
proc reg data=a1;
  model gpa=hsm hse;
run;
Results Model #2

Root MSE          0.69958    R-Square    0.2016
Dependent Mean    2.63522    Adj R-Sq    0.1943
Coeff Var        26.54718

Slightly better MSE and adjusted R-Sq

Parameter Estimates
Variable    DF   Parameter   Standard   t Value   Pr > |t|
                  Estimate      Error
Intercept    1    0.62423     0.29172     2.14     0.0335
hsm          1    0.18265     0.03196     5.72     <.0001
hse          1    0.06067     0.03473     1.75     0.0820
ANOVA Table #2

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         27.30349      13.65175     27.89    <.0001
Error            221        108.15930       0.48941
Corrected Total  223        135.46279

Significant F test but not all variable t tests significant
Rerun with HSM only (Model #3)
proc reg data=a1;
  model gpa=hsm;
run;
Results Model #3

Root MSE          0.70280    R-Square    0.1905
Dependent Mean    2.63522    Adj R-Sq    0.1869
Coeff Var        26.66958

Slightly worse MSE and adjusted R-Sq

Parameter Estimates
Variable    DF   Parameter   Standard   t Value   Pr > |t|
                  Estimate      Error
Intercept    1    0.90768     0.24355     3.73     0.0002
hsm          1    0.20760     0.02872     7.23     <.0001
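The fitted Model #3 equation is GPA-hat = 0.90768 + 0.20760 hsm, so each additional point of HS math grade corresponds to about a 0.21-point increase in predicted GPA.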
ANOVA Table #3

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         25.80989      25.80989     52.25    <.0001
Error            222        109.65290       0.49393
Corrected Total  223        135.46279

Significant F test and all variable t tests significant
SATs (Model #4)
proc reg data=a1;
  model gpa=satm satv;
run;
Results Model #4

Root MSE          0.75770    R-Square    0.0634
Dependent Mean    2.63522    Adj R-Sq    0.0549
Coeff Var        28.75287

Much worse MSE and adjusted R-Sq

Parameter Estimates
Variable    DF   Parameter      Standard     t Value   Pr > |t|
                  Estimate         Error
Intercept    1    1.28868       0.37604        3.43     0.0007
satm         1    0.00228       0.00066291     3.44     0.0007
satv         1   -0.00002456    0.00061847    -0.04     0.9684
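Note on scale (an observation added here, not part of the SAS output): SAT scores run into the hundreds, so the satm coefficient of 0.00228 per SAT point amounts to roughly a 0.23-point increase in predicted GPA per 100 points of SAT Math.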
ANOVA Table #4

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2          8.58384       4.29192      7.48    0.0007
Error            221        126.87895       0.57411
Corrected Total  223        135.46279

Significant F test but not all variable t tests significant
HS and SATs (Model #5)
proc reg data=a1;
  model gpa=satm satv hsm hss hse;
  *Does general linear test;
  sat: test satm, satv;
  hs: test hsm, hss, hse;
run;
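Each labeled TEST statement carries out a general linear (partial F) test within the full model: sat tests H0: β(satm) = β(satv) = 0, and hs tests H0: β(hsm) = β(hss) = β(hse) = 0, i.e., whether that group of variables can be dropped.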
Results Model #5

Root MSE          0.70000    R-Square    0.2115
Dependent Mean    2.63522    Adj R-Sq    0.1934
Coeff Var        26.56311

Parameter Estimates
Variable    DF   Parameter      Standard     t Value   Pr > |t|
                  Estimate         Error
Intercept    1    0.32672       0.40000        0.82     0.4149
hsm          1    0.14596       0.03926        3.72     0.0003
hss          1    0.03591       0.03780        0.95     0.3432
hse          1    0.05529       0.03957        1.40     0.1637
satm         1    0.00094359    0.00068566     1.38     0.1702
satv         1   -0.00040785    0.00059189    -0.69     0.4915
Test sat

Test sat Results for Dependent Variable gpa
Source         DF   Mean Square   F Value   Pr > F
Numerator       2       0.46566      0.95   0.3882
Denominator   218       0.49000

Cannot reject the reduced model…No significant information lost…We don’t need SAT variables
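For reference, the general linear test statistic is F = [(SSE(R) - SSE(F)) / (dfE(R) - dfE(F))] / MSE(F). Worked here for the sat test: the reduced model is Model #1 (HS grades only), so the numerator df is 220 - 218 = 2, the numerator mean square is 0.46566, and F = 0.46566/0.49000 ≈ 0.95.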
Test hs

Test hs Results for Dependent Variable gpa
Source         DF   Mean Square   F Value   Pr > F
Numerator       3       6.68660     13.65   <.0001
Denominator   218       0.49000

Reject the reduced model…There is significant information lost…We can’t remove HS variables from the model
Best Model?
• Likely the one with just HSM or the one with HSM and HSE (see the sketch below)
• We’ll discuss comparison methods in Chapters 7 and 8
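As a preview of those methods, PROC REG can rank candidate models directly; the SELECTION= option below is a standard PROC REG feature, though this particular call is only a sketch and is not part of the original cs2.sas code:

* Rank all subsets of the five predictors by adjusted R-square;
proc reg data=a1;
  model gpa=hsm hss hse satm satv / selection=adjrsq;
run;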
Key ideas from case study
• First, look at graphical and numerical summaries one variable at a time
• Then, look at relationships between pairs of variables with graphical and numerical summaries
• Use plots and correlations to understand relationships
Key ideas from case study
• The relationship between a response variable and an explanatory variable depends on what other explanatory variables are in the model
• A variable can be a significant (P<.05) predictor alone and not significant (P>.05) when other X’s are in the model
Key ideas from case study
• Regression coefficients, standard errors, and the results of significance tests depend on what other explanatory variables are in the model
Key ideas from case study
• Significance tests (P values) do not tell the whole story
• Squared multiple correlations (which give the proportion of variation in the response variable explained by the explanatory variables) can give a different view
• We often express R2 as a percent
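Concretely, using the Model #1 ANOVA table from this topic: R2 = SSM/SST = 27.71233/135.46279 ≈ 0.2046, i.e., the three HS grades explain about 20.5% of the variation in GPA.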
Key ideas from case study
• You can fully understand the theory in terms of Y = Xβ + e
• However, to effectively use this methodology in practice, you need to understand how the data were collected, the nature of the variables, and how they relate to each other
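Spelled out in standard matrix notation (added here for reference): Y is the n×1 response vector, X is the n×p design matrix with a leading column of 1s, β is the p×1 parameter vector, and e is the n×1 error vector; when X has full column rank, the least-squares estimate is b = (X'X)^(-1)X'Y. In this case study, n = 224 and p = 6.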
Background Reading
• Cs2.sas contains the SAS commands used in this topic