Topic 14: Inference in Multiple Regression

Outline
• Review multiple linear regression
• Inference of regression coefficients
– Application to book example
• Inference of mean
– Application to book example
• Inference of future observation
• Diagnostics and remedies
Data for Multiple Regression
• Yi is the response variable
• Xi1, Xi2, … , Xi,p-1 are the p-1
explanatory variables
• Yi, Xi1, Xi2, … , Xi,p-1 are the data for
case i, where i = 1 to n
Multiple Regression Model
• Yi = β0 + β1Xi1 + β2Xi2 +…+ βp-1Xi,p-1 + ei
• Yi is the value of the response variable
for the ith case
• β0 is the intercept
• β1, β2, … , βp-1 are the regression
coefficients for the explanatory variables
• ei are independent Normally distributed
random errors with mean 0 and variance σ2
Least Squares Solutions
b  ( XX ) XY
1
2
s
= MSE= Y(I  H )Y /( n  p )
s = Root MSE
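As a check on these formulas, here is a minimal PROC IML sketch (assuming SAS/IML is available; it uses the Dwaine Studios data set a1 that is read in later in these slides):
proc iml;
use a1;                             /* Dwaine Studios data, read in below */
read all var {young income} into X0;
read all var {sales} into Y;
close a1;
n = nrow(X0);                       /* n = 21 cases */
X = j(n, 1, 1) || X0;               /* prepend a column of 1s for the intercept */
p = ncol(X);                        /* p = 3 parameters */
b = inv(X`*X)*X`*Y;                 /* b = (X'X)^(-1) X'Y */
H = X*inv(X`*X)*X`;                 /* hat matrix */
MSE = Y`*(i(n) - H)*Y/(n - p);      /* s^2 = Y'(I - H)Y/(n - p) */
s = sqrt(MSE);                      /* Root MSE */
print b MSE s;
quit;
The results should match the PROC REG output shown later.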
ANOVA F-test
• H0: β1 = β2 = … = βp-1 = 0
• Ha: βk ≠ 0 for at least one k = 1, 2, …, p-1
• Test statistic: F* = MSR/MSE
• Under H0, F* ~ F(p-1, n-p)
• Reject H0 if F* is large; using the P-value, we reject if the P-value ≤ 0.05
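As a sketch of the P-value computation in a DATA step (the F statistic 99.10 and the df (2, 18) are taken from the Dwaine Studios output later in these slides):
data _null_;
f = 99.10;                          /* F* from the Dwaine Studios output */
p_value = 1 - probf(f, 2, 18);      /* upper tail of F(p-1, n-p) = F(2, 18) */
put p_value=;
run;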
Inference for individual regression coefficients
• We can show b ~ N(β, σ²(X′X)⁻¹)
• Define s²{b} = MSE·(X′X)⁻¹, a p×p matrix
• s²{bk} = [s²{b}]k,k, the kth diagonal element, and s{bk} = √(s²{bk})
Significance Test for βk
• H0: βk = 0
• Same test statistic: t* = bk/s{bk}
• Still use dfE, which now equals n-p
• P-value computed from the t(n-p) distribution
• This tests the significance of a variable given the other variables are already in the model (i.e., fitted last)
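A sketch of this P-value computation, using the t statistic for young from the Dwaine Studios output later in these slides:
data _null_;
tstar = 6.87;                             /* t* for young from the output later */
df = 18;                                  /* dfE = n - p = 21 - 3 */
p_value = 2*(1 - probt(abs(tstar), df));  /* two-sided P-value from t(n-p) */
put p_value=;
run;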
Confidence interval for βk
• CI: bk ± tc·s{bk}, where tc = t(.975, n-p)
• Same form as before, but dfE now equals n-p
• This interval gives the plausible values of βk given that the other variables are in the model
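A hand computation of this interval for young (estimate and standard error taken from the CLB output later in these slides):
data _null_;
bk = 1.45456;               /* estimate for young */
se = 0.21178;               /* its standard error */
tc = tinv(.975, 18);        /* t(.975, n-p), with n-p = 18 */
lower = bk - tc*se;
upper = bk + tc*se;
put lower= upper=;          /* should match the CLB limits 1.00962 and 1.89950 */
run;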
Example II
(KNNL p 236)
• Dwaine Studios, Inc. operates portrait
studios in 21 cities of medium size
• Yi is sales in city i
• X1 : population aged 16 and under
• X2 : per capita disposable income
Yi = β0 + β1Xi1 + β2Xi2 + εi
Read in the data
data a1;
infile '../data/ch06fi05.txt';
input young income sales;
proc print data=a1;
run;
Partial Proc Print Results
Obs    young    income    sales
  1     68.5      16.7    174.4
  2     45.2      16.8    164.4
  3     91.3      18.2    244.2
  4     47.8      16.3    154.6
  5     46.9      17.3    181.6
Proc Reg
proc reg data=a1;
model sales=young income;
run;
Output
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        24015             12008        99.10    <.0001
Error              18     2180.9274          121.1626
Corrected Total    20        26196

Root MSE   11.00739    R-Square   0.917

At least one variable is helpful in predicting sales
Output
Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1         -68.85707            60.01695        -1.15      0.2663
young         1           1.45456             0.21178         6.87      <.0001
income        1           9.36550             4.06396         2.30      0.0333

Both variables are helpful in explaining sales after the other is already in the model
CLB option
• Used to get confidence
intervals for each coefficient
proc reg data=a1;
model sales=young income/clb;
run;
Output
Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error      95% Confidence Limits
Intercept     1         -68.85707            60.01695      -194.94801     57.23387
young         1           1.45456             0.21178         1.00962      1.89950
income        1           9.36550             4.06396         0.82744     17.90356
What if just young is fit?

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error      95% Confidence Limits
Intercept     1          68.04536             9.46224        48.24066     87.85006
young         1           1.83588             0.14641         1.52943      2.14233

The CIs for both the intercept and young change dramatically when young is the only explanatory variable
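This output can be reproduced by dropping income from the model:
proc reg data=a1;
model sales=young/clb;
run;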
Estimation of E(Yh)
• Xh is now a vector that looks like
(1, Xh1, Xh2, … , Xh,p-1)′
• We want a point estimate and a
confidence interval for the
subpopulation mean corresponding
to the set of explanatory variables Xh
Theory for E(Yh)
E(Yh) = μh = Xh′β
μ̂h = Xh′b
s²(μ̂h) = Xh′ s²{b} Xh = s² Xh′(X′X)⁻¹ Xh
CI: μ̂h ± t(.975, n-p) · s(μ̂h)
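A PROC IML sketch of these formulas for the first case, Xh = (1, 68.5, 16.7)′ (again assuming SAS/IML; it should reproduce the first row of the CLM output below):
proc iml;
use a1;
read all var {young income} into X0;
read all var {sales} into Y;
close a1;
n = nrow(X0);
X = j(n, 1, 1) || X0;
p = ncol(X);
XtXi = inv(X`*X);
b = XtXi*X`*Y;
MSE = ssq(Y - X*b)/(n - p);
Xh = {1, 68.5, 16.7};               /* first case: young = 68.5, income = 16.7 */
muhat = Xh`*b;                      /* point estimate of E(Yh) */
se = sqrt(MSE*Xh`*XtXi*Xh);         /* s(muhat) */
tc = tinv(.975, n - p);
lower = muhat - tc*se;
upper = muhat + tc*se;
print muhat lower upper;            /* should match 187.1841, 179.1146, 195.2536 */
quit;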
Using CLM option
proc reg data=a1;
model sales=young income/clm;
id young income;
run;
The id statement adds young and income to the output table
CLM Output
Output Statistics

Obs    young    income    Dependent    Predicted      Std Error          95% CL Mean
                           Variable        Value    Mean Predict
  1     68.5      16.7     174.4000     187.1841        3.8409      179.1146   195.2536
  2     45.2      16.8     164.4000     154.2294        3.5558      146.7591   161.6998
  3     91.3      18.2     244.2000     234.3963        4.5882      224.7569   244.0358
  4     47.8      16.3     154.6000     153.3285        3.2331      146.5361   160.1210
  5     46.9      17.3     181.6000     161.3849        4.4300      152.0778   170.6921
 ...
 21     52.3      16.0     166.5000     157.0644        4.0792      148.4944   165.6344
Prediction of Yh
• Xh is still a vector of form
(1, Xh1, Xh2, … , Xh,p-1)′
• We want a prediction of Yh based on
a set of predictor values with an
interval that expresses the
uncertainty in our prediction
Theory for Yh
Yh  Xh     Ŷh  ˆ h  Xh b
s (pred )  Var (Ŷh   )
2
 Var (Ŷh )  Var ( )
 s (1  Xh (XX) X h )
2
-1
CI : ˆ h  s (pred ) t(.975, n - p )
Using the CLI option
proc reg data=a1;
model sales=young income/cli;
id young income;
run;
The id statement adds young and income to the output table
CLI Output
Output Statistics

Obs    young    income    Dependent    Predicted      Std Error        95% CL Predict
                           Variable        Value    Mean Predict
  1     68.5      16.7     174.4000     187.1841        3.8409      162.6910   211.6772
  2     45.2      16.8     164.4000     154.2294        3.5558      129.9271   178.5317
  3     91.3      18.2     244.2000     234.3963        4.5882      209.3421   259.4506
 ...
 21     52.3      16.0     166.5000     157.0644        4.0792      132.4018   181.7270
Diagnostics
• Look at the distribution of each
variable
• Look at the relationship between pairs
of variables
• Plot the residuals versus
– the predicted/fitted values
– each explanatory variable
– time (if available)
Diagnostics
• Are the residuals approximately Normal?
– Look at a histogram
– Normal quantile plot
• Is the variance constant?
– Plot the residuals vs anything that might be related to the variance (e.g., residuals vs predicted values and residuals vs each X); see the SAS sketch below
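A minimal SAS sketch of these checks (the output data set name diag and the variable names resid and fitted are illustrative choices, not from the original program):
proc reg data=a1;
model sales=young income;
output out=diag r=resid p=fitted;   /* save residuals and fitted values */
run;

proc sgplot data=diag;              /* residuals vs fitted values */
scatter y=resid x=fitted;
refline 0 / axis=y;
run;

proc univariate data=diag;          /* histogram and Normal quantile plot */
var resid;
histogram resid / normal;
qqplot resid / normal(mu=est sigma=est);
run;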
Remedies
• Similar remedies as simple
regression
• Transformations such as Box-Cox (see the sketch below)
• Analyze with/without outliers
• More detail in KNNL Ch 9 and 10
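As one concrete remedy, a Box-Cox transformation of the response can be explored with PROC TRANSREG (a sketch; this step is not part of the original topic14.sas program):
proc transreg data=a1;
model boxcox(sales) = identity(young income);   /* estimates the Box-Cox lambda */
run;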
Background Reading
• We finished Chapter 6.
• Program used to generate output for
confidence intervals for means and
prediction intervals is topic14.sas