Regression and Factorial Designs
Meeting 23
Course: I0174 – Regression Analysis
Year: Odd semester 2007/2008
Topics:
• Orthogonal coding in factorial designs.
• Regression equations in factorial designs.
Bina Nusantara
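The Multiple Regression Model
The relationship between 1 dependent and 2 or more independent variables is a linear function:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
where $Y_i$ is the dependent (response) variable, $X_{1i}, \ldots, X_{ki}$ are the independent (explanatory) variables, $\beta_0$ is the population Y-intercept, $\beta_1, \ldots, \beta_k$ are the population slopes, and $\varepsilon_i$ is the random error.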
Multiple Regression Model
Bivariate model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
[Figure: response plane over the $X_1$ and $X_2$ axes, showing an observed Y at the point $(X_{1i}, X_{2i})$, its error $\varepsilon_i$, the intercept $\beta_0$, and the plane $\mu_{Y|X} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$.]
Multiple Regression Equation
Bivariate model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$
[Figure: fitted response plane over the $X_1$ and $X_2$ axes, showing an observed Y at the point $(X_{1i}, X_{2i})$, its residual $e_i$, the intercept $b_0$, and the plane $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$.]
Multiple Regression Equation
Interpretation of Estimated Coefficients
• Slope (bj )
– The average value of Y is estimated to change by bj for each 1-unit
increase in Xj, holding all other variables constant (ceteris paribus)
– Example: If b1 = -2, then fuel oil usage (Y) is expected to decrease by
an estimated 2 gallons for each 1 degree increase in temperature
(X1), given the inches of insulation (X2)
• Y-Intercept (b0)
– The estimated average value of Y when all Xj = 0
Multiple Regression Model: Example
Develop a model for estimating
heating oil used for a single
family home in the month of
January, based on average
temperature and amount of
insulation in inches.
Oil (Gal)   Temp (°F)   Insulation (inches)
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10
Multiple Regression Equation: Example
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$
Excel Output
               Coefficients
Intercept      562.1510092
X Variable 1   -5.436580588
X Variable 2   -20.01232067
$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}$
For each degree increase in temperature, the estimated average amount of heating oil used decreases by 5.437 gallons, holding insulation constant.
For each one-inch increase in insulation, the estimated average use of heating oil decreases by 20.012 gallons, holding temperature constant.
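The coefficients in the Excel output can be reproduced outside Excel. The following is a minimal sketch, assuming NumPy is available; the helper name `fit_ols` is introduced here for illustration and is not part of the original slides. It fits the heating oil data by ordinary least squares.

```python
import numpy as np

# Heating oil data from the example (15 single-family homes).
oil = np.array([275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
                237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58],
                dtype=float)
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10],
                      dtype=float)


def fit_ols(y, *predictors):
    """Return the least-squares coefficients [b0, b1, ..., bk]."""
    # Design matrix: a column of ones for the intercept, then one column per X.
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b0, b1, b2 = fit_ols(oil, temp, insulation)
print(f"Oil_hat = {b0:.3f} + ({b1:.3f}) Temp + ({b2:.3f}) Insulation")
```

The printed coefficients should agree with the Excel output to the displayed precision.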
Simple and Multiple Regression Compared
• The slope coefficient in a simple regression picks up the impact of the
independent variable plus the impacts of other variables that are
excluded from the model, but are correlated with the included
independent variable and the dependent variable
• Coefficients in a multiple regression net out the impacts of other
variables in the equation
– Hence, they are called the net regression coefficients
– They still pick up the effects of other variables that are excluded
from the model, but are correlated with the included independent
variables and the dependent variable
Simple and Multiple Regression Compared: Example
• Two Simple Regressions:
– $\text{Oil} = \beta_0 + \beta_1 \text{Temp} + \varepsilon$
– $\text{Oil} = \beta_0 + \beta_2 \text{Insulation} + \varepsilon$
• Multiple Regression:
– $\text{Oil} = \beta_0 + \beta_1 \text{Temp} + \beta_2 \text{Insulation} + \varepsilon$
Simple and Multiple Regression Compared: Slope Coefficients
$\text{Oil} = b_0 + b_1 \text{Temp} + b_2 \text{Insulation} + e$
             Coefficients
Intercept    562.1510092
Temp         -5.436580588
Insulation   -20.01232067
$\text{Oil} = b_0 + b_1 \text{Temp} + e$
             Coefficients
Intercept    436.4382299
Temp         -5.462207697
$\text{Oil} = b_0 + b_2 \text{Insulation} + e$
             Coefficients
Intercept    345.3783784
Insulation   -20.35027027
The slopes differ between the multiple and simple regressions: $-5.4366 \neq -5.4622$ for Temp and $-20.0123 \neq -20.3503$ for Insulation.
Simple and Multiple Regression Compared: $r^2$
$\text{Oil} = \beta_0 + \beta_1 \text{Temp} + \varepsilon$
Regression Statistics
Multiple R          0.86974117
R Square            0.756449704
Adjusted R Square   0.737715065
Standard Error      66.51246564
Observations        15
$\text{Oil} = \beta_0 + \beta_1 \text{Insulation} + \varepsilon$
Regression Statistics
Multiple R          0.465082527
R Square            0.216301757
Adjusted R Square   0.156017277
Standard Error      119.3117327
Observations        15
$\text{Oil} = \beta_0 + \beta_1 \text{Temp} + \beta_2 \text{Insulation} + \varepsilon$
Regression Statistics
Multiple R          0.982654757
R Square            0.965610371
Adjusted R Square   0.959878766
Standard Error      26.01378323
Observations        15
The $r^2$ of the multiple regression is not the sum of the two simple-regression values: $0.96561 \neq 0.75645 + 0.21630 = 0.97275$
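The three R Square values above can be checked with a short script. This is a sketch under the assumption that NumPy is available; the function name `r_squared` is illustrative only. It fits each model by least squares and computes $r^2 = 1 - SSE/SST$.

```python
import numpy as np

# Heating oil data from the example (15 single-family homes).
oil = np.array([275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
                237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58],
                dtype=float)
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10],
                      dtype=float)


def r_squared(y, *predictors):
    """Coefficient of determination of the least-squares fit of y on the predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    sse = float(residuals @ residuals)            # unexplained variation
    sst = float(((y - y.mean()) ** 2).sum())      # total variation
    return 1.0 - sse / sst


r2_temp = r_squared(oil, temp)
r2_ins = r_squared(oil, insulation)
r2_both = r_squared(oil, temp, insulation)
print(f"Temp only:       {r2_temp:.5f}")
print(f"Insulation only: {r2_ins:.5f}")
print(f"Both:            {r2_both:.5f}")
print(f"Sum of simple r2 values: {r2_temp + r2_ins:.5f}")
```

Because Temp and Insulation are correlated in this sample, the two simple-regression values overlap and their sum overstates the joint explanatory power.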
Example: Adjusted $r^2$ Can Decrease
$\text{Oil} = \beta_0 + \beta_1 \text{Temp} + \beta_2 \text{Insulation} + \varepsilon$
Regression Statistics
Multiple R          0.982654757
R Square            0.965610371
Adjusted R Square   0.959878766
Standard Error      26.01378323
Observations        15
$\text{Oil} = \beta_0 + \beta_1 \text{Temp} + \beta_2 \text{Insulation} + \beta_3 \text{Color} + \varepsilon$
Regression Statistics
Multiple R          0.983482856
R Square            0.967238528
Adjusted R Square   0.958303581
Standard Error      25.72417272
Observations        15
Adjusted $r^2$ decreases when k increases from 2 to 3, even though $r^2$ itself rises slightly.
Color is not useful in explaining the variation in oil consumption.
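The slides do not state the adjustment formula. A minimal sketch, assuming the usual definition $r^2_{adj} = 1 - (1 - r^2)\frac{n-1}{n-k-1}$, reproduces both Adjusted R Square values from the reported R Square values. The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted r^2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# R Square values taken from the two Excel outputs (n = 15 observations).
adj_two = adjusted_r2(0.965610371, n=15, k=2)    # Temp, Insulation
adj_three = adjusted_r2(0.967238528, n=15, k=3)  # Temp, Insulation, Color
print(f"k = 2: {adj_two:.6f}")
print(f"k = 3: {adj_three:.6f}")
```

The penalty factor grows from 14/12 to 14/11 when Color is added, and the small gain in $r^2$ does not offset it.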
Using the Regression Equation to Make Predictions
Predict the amount of heating oil used for a home if the average temperature is 30 °F and the insulation is 6 inches.
$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}$
$= 562.151 - 5.437(30) - 20.012(6)$
$= 278.969$
The predicted heating oil used is 278.97 gallons.
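The same arithmetic as a small function, offered as a sketch; the name `predict_oil` and its parameter names are illustrative, and the default coefficients are the rounded values from the fitted equation above.

```python
def predict_oil(temp_f, insulation_in, b0=562.151, b1=-5.437, b2=-20.012):
    """Predicted January heating oil use (gallons) from the fitted equation."""
    return b0 + b1 * temp_f + b2 * insulation_in


prediction = predict_oil(temp_f=30, insulation_in=6)
print(f"{prediction:.2f} gallons")
```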
Predictions in PHStat
• PHStat | Regression | Multiple Regression …
– Check the “Confidence and Prediction Interval Estimate”
box
• Excel spreadsheet for the heating oil example
Residual Plots
• Residuals vs. $\hat{Y}$
– May need to transform Y variable
• Residuals vs. $X_1$
– May need to transform $X_1$ variable
• Residuals vs. $X_2$
– May need to transform $X_2$ variable
• Residuals vs. Time
– May have autocorrelation
Residual Plots: Example
[Figure: Temperature Residual Plot (residuals against temperature). Maybe some nonlinear relationship.]
[Figure: Insulation Residual Plot (residuals against insulation). No discernible pattern.]
Testing for Overall Significance
• Shows if Y Depends Linearly on All of the X Variables Together as a
Group
• Use F Test Statistic
• Hypotheses:
– H0: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ (No linear relationship)
– H1: At least one $\beta_j \neq 0$ (At least one independent variable affects Y)
• The Null Hypothesis is a Very Strong Statement
• The Null Hypothesis is Almost Always Rejected
Testing for Overall Significance
(continued)
• Test Statistic:
– $F = \dfrac{MSR}{MSE} = \dfrac{SSR(\text{all})/k}{MSE(\text{all})}$
• Where F has k numerator and (n-k-1) denominator degrees of freedom
Test for Overall Significance
Excel Output: Example
ANOVA
             df   SS         MS         F          Significance F
Regression   2    228014.6   114007.3   168.4712   1.65411E-09
Residual     12   8120.603   676.7169
Total        14   236135.2
Regression df = k = 2, the number of explanatory variables. Total df = n - 1 = 14.
F Test Statistic = MSR / MSE. Significance F is the p-value.
Test for Overall Significance:
Example Solution
H0: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$
H1: At least one $\beta_j \neq 0$
$\alpha = .05$
df = 2 and 12
Critical Value: 3.89
Test Statistic: F = 168.47 (Excel Output)
Decision: Reject H0 at $\alpha = 0.05$.
Conclusion: There is evidence that at least one independent variable affects Y.
[Figure: F distribution with the rejection region of area $\alpha = 0.05$ to the right of the critical value 3.89.]
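The F statistic can be recomputed from the sums of squares in the ANOVA table. The sketch below uses the unrounded SSR and SSE values that appear later in the deck and the critical value 3.89 quoted on the slide; the variable names are illustrative.

```python
# Sums of squares from the ANOVA output of the two-variable model.
ssr = 228014.6263   # regression sum of squares
sse = 8120.603016   # residual (error) sum of squares
n, k = 15, 2        # observations, explanatory variables

msr = ssr / k               # mean square regression, df = k
mse = sse / (n - k - 1)     # mean square error, df = n - k - 1
f_stat = msr / mse

f_critical = 3.89           # upper 5% point of F(2, 12), as quoted on the slide
reject_h0 = f_stat > f_critical
print(f"F = {f_stat:.2f}, reject H0: {reject_h0}")
```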
Test for Significance:
Individual Variables
• Show If Y Depends Linearly on a Single Xj Individually While Holding the
Effects of Other X’s Fixed
• Use t Test Statistic
• Hypotheses:
– H0: $\beta_j = 0$ (No linear relationship)
– H1: $\beta_j \neq 0$ (Linear relationship between Xj and Y)
t Test Statistic
Excel Output: Example
             Coefficients    Standard Error   t Stat      P-value
Intercept    562.1510092     21.09310433      26.65094    4.77868E-12
Temp         -5.436580588    0.336216167      -16.1699    1.64178E-09
Insulation   -20.01232067    2.342505227      -8.543127   1.90731E-06
$t = \dfrac{b_j}{S_{b_j}}$
The Temp row gives the t Test Statistic for X1 (Temperature); the Insulation row gives the t Test Statistic for X2 (Insulation).
t Test: Example Solution
Does temperature have a significant effect on monthly consumption of heating oil? Test at $\alpha = 0.05$.
H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$
df = 12
Critical Values: -2.1788 and 2.1788 (area .025 in each tail)
Test Statistic: t = -16.1699
Decision: Reject H0 at $\alpha = 0.05$.
Conclusion: There is evidence of a significant effect of temperature on oil consumption, holding constant the effect of insulation.
[Figure: t distribution with rejection regions of area .025 below -2.1788 and above 2.1788.]
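Each t statistic is the coefficient divided by its standard error. The following sketch recomputes both from the Excel output and applies the two-tailed decision rule with the critical value 2.1788 quoted on the slide. The dictionary layout and names are illustrative.

```python
# (coefficient, standard error) pairs copied from the Excel output.
estimates = {
    "Temp": (-5.436580588, 0.336216167),
    "Insulation": (-20.01232067, 2.342505227),
}
t_critical = 2.1788  # two-tailed, alpha = 0.05, df = 12 (from the slide)

t_stats = {name: b / se for name, (b, se) in estimates.items()}
for name, t in t_stats.items():
    decision = "reject H0" if abs(t) > t_critical else "do not reject H0"
    print(f"{name}: t = {t:.4f} -> {decision}")
```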
Venn Diagrams and
Estimation of Regression Model
[Figure: Venn diagram with three overlapping circles labelled Oil, Temp and Insulation. Callouts on the diagram read: "Only this information is used in the estimation of $\beta_1$"; "Only this information is used in the estimation of $\beta_2$"; "This information is NOT used in the estimation of $\beta_1$ nor $\beta_2$".]
Confidence Interval Estimate for the Slope
Provide the 95% confidence interval for the population slope $\beta_1$ (the effect of temperature on oil consumption).
$b_1 \pm t_{n-p-1} S_{b_1}$
             Coefficients   Lower 95%       Upper 95%
Intercept    562.151009     516.1930837     608.108935
Temp         -5.4365806     -6.169132673    -4.7040285
Insulation   -20.012321     -25.11620102    -14.90844
$-6.169 \leq \beta_1 \leq -4.704$
We are 95% confident that the estimated average consumption of oil is reduced by between 4.7 gallons and 6.17 gallons for each increase of 1 °F, holding insulation constant.
We can also perform the test for the significance of individual variables, H0: $\beta_1 = 0$ vs. H1: $\beta_1 \neq 0$, using this confidence interval.
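The interval limits follow directly from the coefficient, its standard error and the critical t value. A minimal sketch, assuming the critical value 2.1788 for 12 degrees of freedom quoted in the t test example; the function name `slope_interval` is illustrative.

```python
def slope_interval(b, se, t_critical):
    """Confidence interval b +/- t * S_b for a population slope."""
    margin = t_critical * se
    return b - margin, b + margin


t_critical = 2.1788  # alpha = 0.05 two-tailed, df = n - p - 1 = 12
lower, upper = slope_interval(-5.436580588, 0.336216167, t_critical)
print(f"{lower:.3f} <= beta_1 <= {upper:.3f}")

# The interval excludes 0, which matches rejecting H0: beta_1 = 0.
excludes_zero = not (lower <= 0.0 <= upper)
print(f"Interval excludes zero: {excludes_zero}")
```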
Contribution of a Single Independent Variable $X_j$
• Let Xj Be the Independent Variable of Interest
• $SSR(X_j \mid \text{all others except } X_j) = SSR(\text{all}) - SSR(\text{all others except } X_j)$
– Measures the additional contribution of Xj in explaining the total variation in Y with the inclusion of all the remaining independent variables
Contribution of a Single Independent Variable $X_k$
$SSR(X_1 \mid X_2 \text{ and } X_3) = SSR(X_1, X_2 \text{ and } X_3) - SSR(X_2 \text{ and } X_3)$
$SSR(X_1, X_2 \text{ and } X_3)$: from ANOVA section of regression for $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$
$SSR(X_2 \text{ and } X_3)$: from ANOVA section of regression for $\hat{Y}_i = b_0 + b_2 X_{2i} + b_3 X_{3i}$
Measures the additional contribution of X1 in explaining Y with the inclusion of X2 and X3.
Coefficient of Partial Determination of $X_j$
• $r^2_{Yj.\text{all others}} = \dfrac{SSR(X_j \mid \text{all others})}{SST - SSR(\text{all}) + SSR(X_j \mid \text{all others})}$
• Measures the proportion of variation in the dependent variable that is explained by Xj while controlling for (holding constant) the other independent variables
Coefficient of Partial Determination for $X_j$
(continued)
Example: Model with two independent variables
$r^2_{Y1.2} = \dfrac{SSR(X_1 \mid X_2)}{SST - SSR(X_1, X_2) + SSR(X_1 \mid X_2)}$
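The slides leave the numerical value to PHStat. As a sketch, the formula for two independent variables can be evaluated for temperature in the heating oil example using the sums of squares printed in the ANOVA tables of this deck; the function name `partial_r2` is illustrative.

```python
def partial_r2(ssr_all, ssr_without_xj, sst):
    """Coefficient of partial determination of Xj given the other variables."""
    ssr_xj_given_others = ssr_all - ssr_without_xj
    return ssr_xj_given_others / (sst - ssr_all + ssr_xj_given_others)


# Sums of squares from the heating oil ANOVA tables.
ssr_temp_and_insulation = 228014.6263   # SSR(X1, X2)
ssr_insulation_only = 51076.47          # SSR(X2)
sst = 236135.2293

r2_y1_2 = partial_r2(ssr_temp_and_insulation, ssr_insulation_only, sst)
print(f"r^2 of Oil with Temp, holding Insulation constant: {r2_y1_2:.4f}")
```

With these inputs the denominator equals the residual sum of squares of the insulation-only model, so the result is the share of the variation left unexplained by insulation that temperature accounts for.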
Venn Diagrams and Coefficient of Partial Determination for $X_j$
$r^2_{Y1.2} = \dfrac{SSR(X_1 \mid X_2)}{SST - SSR(X_1, X_2) + SSR(X_1 \mid X_2)}$
[Figure: Venn diagram of Oil, Temp and Insulation with the region corresponding to $SSR(X_1 \mid X_2)$ marked.]
Coefficient of Partial Determination in PHStat
• PHStat | Regression | Multiple Regression …
– Check the “Coefficient of Partial Determination” box
• Excel spreadsheet for the heating oil example
Contribution of a Subset of Independent Variables
• Let Xs Be the Subset of Independent Variables of Interest
– $SSR(X_s \mid \text{all others except } X_s) = SSR(\text{all}) - SSR(\text{all others except } X_s)$
– Measures the contribution of the subset Xs in explaining SST with the inclusion of the remaining independent variables
Contribution of a Subset of Independent Variables:
Example
Let Xs be X1 and X3
$SSR(X_1 \text{ and } X_3 \mid X_2) = SSR(X_1, X_2 \text{ and } X_3) - SSR(X_2)$
$SSR(X_1, X_2 \text{ and } X_3)$: from ANOVA section of regression for $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$
$SSR(X_2)$: from ANOVA section of regression for $\hat{Y}_i = b_0 + b_2 X_{2i}$
Testing Portions of Model
• Examines the Contribution of a Subset Xs of Explanatory
Variables to the Relationship with Y
• Null Hypothesis:
– Variables in the subset do not improve the model
significantly when all other variables are included
• Alternative Hypothesis:
– At least one variable in the subset is significant when all
other variables are included
Testing Portions of Model
(continued)
• One-Tailed Rejection Region
• Requires Comparison of Two Regressions
– One regression includes everything
– Another regression includes everything except the portion to be
tested
Partial F Test for the Contribution of a Subset of X Variables
• Hypotheses:
– H0 : Variables Xs do not significantly improve the model given all other variables included
– H1 : Variables Xs significantly improve the model given all others included
• Test Statistic:
– $F = \dfrac{SSR(X_s \mid \text{all others})/m}{MSE(\text{all})}$
– with df = m and (n-k-1)
– m = # of variables in the subset Xs
Partial F Test for the Contribution of a Single X j
• Hypotheses:
– H0 : Variable Xj does not significantly improve the model given all
others included
– H1 : Variable Xj significantly improves the model given all others
included
• Test Statistic:
– $F = \dfrac{SSR(X_j \mid \text{all others})}{MSE(\text{all})}$
– with df = 1 and (n-k-1)
– m = 1 here
Testing Portions of Model: Example
Test at the $\alpha = .05$ level to determine if the variable of average temperature significantly improves the model, given that insulation is included.
Testing Portions of Model: Example
H0: X1 (temperature) does not improve model with X2 (insulation) included
H1: X1 does improve model
$\alpha = .05$, df = 1 and 12
Critical Value = 4.75
ANOVA (For X1 and X2)
             SS            MS
Regression   228014.6263   114007.313
Residual     8120.603016   676.716918
Total        236135.2293
ANOVA (For X2)
             SS
Regression   51076.47
Residual     185058.8
Total        236135.2
$F = \dfrac{SSR(X_1 \mid X_2)}{MSE(X_1, X_2)} = \dfrac{228{,}015 - 51{,}076}{676.717} = 261.47$
Conclusion: Reject H0; X1 does improve model.
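The partial F statistic compares the regression sum of squares of the full model with that of the model that omits the tested variables. The following sketch wraps that calculation in a small function and applies it to the two ANOVA tables above; the name `partial_f` is illustrative, and the critical value 4.75 is the one quoted on the slide.

```python
def partial_f(ssr_full, ssr_reduced, mse_full, m):
    """Partial F statistic for m variables added to the reduced model."""
    return ((ssr_full - ssr_reduced) / m) / mse_full


f_temp = partial_f(
    ssr_full=228014.6263,    # SSR(X1, X2)
    ssr_reduced=51076.47,    # SSR(X2)
    mse_full=676.716918,     # MSE(X1, X2)
    m=1,                     # testing one variable: temperature
)
f_critical = 4.75            # upper 5% point of F(1, 12), from the slide
print(f"F = {f_temp:.2f}, reject H0: {f_temp > f_critical}")

# For a single variable, the partial F equals the square of its t statistic.
t_temp = -16.1699
print(f"t^2 = {t_temp ** 2:.2f}")
```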
Dummy-Variable Models
• Categorical Explanatory Variable with 2 or More Levels
• Yes or No, On or Off, Male or Female
• Use Dummy-Variables (Coded as 0 or 1)
• Only Intercepts are Different
• Assumes Equal Slopes Across Categories
• The Number of Dummy-Variables Needed is (# of Levels - 1)
• Regression Model Has Same Form:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
Dummy-Variable Models
(with 2 Levels)
Given: $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$
Y = Assessed Value of House
X1 = Square Footage of House
X2 = Desirability of Neighborhood = 0 if undesirable, 1 if desirable
Desirable (X2 = 1): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(1) = (b_0 + b_2) + b_1 X_{1i}$
Undesirable (X2 = 0): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(0) = b_0 + b_1 X_{1i}$
Both equations have the same slope, $b_1$.
Dummy-Variable Models
(with 2 Levels)
(continued)
[Figure: Y (Assessed Value) against X1 (Square footage). Two parallel lines with the same slope $b_1$ and different intercepts, $b_0 + b_2$ and $b_0$.]
Interpretation of the Dummy-Variable Coefficient (with 2 Levels)
Example:
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} = 20 + 5 X_{1i} + 6 X_{2i}$
Y : Annual salary of college graduate in thousands of dollars
X1 : GPA
X2 : 0 for a non-business degree, 1 for a business degree
With the same GPA, college graduates with a business degree are making an estimated 6 thousand dollars more than graduates with a non-business degree, on average.
Dummy-Variable Models
(with 3 Levels)
Given:
Y = Assessed Value of the House (in thousands of dollars)
X1 = Square Footage of the House
Style of the House = Split-level, Ranch, Condo
(3 Levels; Need 2 Dummy Variables)
X2 = 1 if Split-level, 0 if not
X3 = 1 if Ranch, 0 if not
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$
Interpretation of the Dummy-Variable Coefficients (with 3 Levels)
Given the Estimated Model:
$\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84 X_{2i} + 23.53 X_{3i}$
For Split-level ($X_2 = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84$
For Ranch ($X_3 = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 23.53$
For Condo: $\hat{Y}_i = 20.43 + 0.045 X_{1i}$
With the same footage, a Split-level will have an estimated average assessed value of 18.84 thousand dollars more than a Condo.
With the same footage, a Ranch will have an estimated average assessed value of 23.53 thousand dollars more than a Condo.
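The dummy coding can be made concrete with a short function. This is a sketch only: the function name `assessed_value`, the style strings, and the 2,000 square foot house used in the usage lines are hypothetical; the coefficients are those of the estimated model above.

```python
def assessed_value(square_feet, style):
    """Estimated assessed value (thousand $) from the 3-level dummy model."""
    styles = ("split-level", "ranch", "condo")
    if style not in styles:
        raise ValueError(f"style must be one of {styles}")
    x2 = 1 if style == "split-level" else 0   # dummy for Split-level
    x3 = 1 if style == "ranch" else 0         # dummy for Ranch
    # Condo is the baseline level: both dummies are 0.
    return 20.43 + 0.045 * square_feet + 18.84 * x2 + 23.53 * x3


# Hypothetical house size, chosen only to illustrate the intercept shifts.
size = 2000
condo = assessed_value(size, "condo")
split_level = assessed_value(size, "split-level")
ranch = assessed_value(size, "ranch")
print(f"Split-level minus Condo: {split_level - condo:.2f}")
print(f"Ranch minus Condo:       {ranch - condo:.2f}")
```

The differences do not depend on the square footage, which is the equal-slopes assumption of the dummy-variable model.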
Regression Model Containing
an Interaction Term
• Hypothesizes Interaction between a Pair of X Variables
– Response to one X variable varies at different levels of another X
variable
• Contains a Cross-Product Term
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Can Be Combined with Other Models
– E.g., Dummy-Variable Model
Effect of Interaction
• Given:
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Without Interaction Term, Effect of X1 on Y is Measured by $\beta_1$
• With Interaction Term, Effect of X1 on Y is Measured by $\beta_1 + \beta_3 X_2$
• Effect Changes as X2 Changes
Interaction Example
Y = 1 + 2X1 + 3X2 + 4X1X2
With X2 = 1: Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
With X2 = 0: Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
[Figure: Y against X1 for X1 from 0 to 1.5, showing the two lines above.]
Effect (slope) of X1 on Y depends on X2 value
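A short sketch of the same example in code. The function names `response` and `slope_of_x1` are illustrative; the coefficients are those of the equation Y = 1 + 2X1 + 3X2 + 4X1X2.

```python
def response(x1, x2):
    """Y = 1 + 2*X1 + 3*X2 + 4*X1*X2 from the interaction example."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def slope_of_x1(x2):
    """Effect of a 1-unit increase in X1 on Y at a given X2: beta1 + beta3 * X2."""
    return 2 + 4 * x2


for x2 in (0, 1):
    intercept = response(0, x2)
    print(f"X2 = {x2}: Y = {intercept} + {slope_of_x1(x2)} X1")
```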
Interaction Regression Model Worksheet
Case, i   Yi   X1i   X2i   X1i X2i
1         1    1     3     3
2         4    8     5     40
3         1    3     2     6
4         3    5     6     30
:         :    :     :     :
Multiply X1 by X2 to get X1X2
Run regression with Y, X1, X2, X1X2
Interpretation When There Are 3+ Levels
$Y = \beta_0 + \beta_1 \text{MALE} + \beta_2 \text{MARRIED} + \beta_3 \text{DIVORCED} + \beta_4 \text{MALE} \cdot \text{MARRIED} + \beta_5 \text{MALE} \cdot \text{DIVORCED}$
MALE = 0 if female and 1 if male
MARRIED = 1 if married; 0 if not
DIVORCED = 1 if divorced; 0 if not
MALE•MARRIED = 1 if male married; 0 otherwise
= (MALE times MARRIED)
MALE•DIVORCED = 1 if male divorced; 0 otherwise
= (MALE times DIVORCED)
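Interpretation When There Are 3+ Levels (continued)
$Y = \beta_0 + \beta_1 \text{MALE} + \beta_2 \text{MARRIED} + \beta_3 \text{DIVORCED} + \beta_4 \text{MALE} \cdot \text{MARRIED} + \beta_5 \text{MALE} \cdot \text{DIVORCED}$
         SINGLE                MARRIED                                    DIVORCED
FEMALE   $\beta_0$             $\beta_0 + \beta_2$                        $\beta_0 + \beta_3$
MALE     $\beta_0 + \beta_1$   $\beta_0 + \beta_1 + \beta_2 + \beta_4$    $\beta_0 + \beta_1 + \beta_3 + \beta_5$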
Interpreting Results
            FEMALE                MALE                                       Difference
Single:     $\beta_0$             $\beta_0 + \beta_1$                        $\beta_1$
Married:    $\beta_0 + \beta_2$   $\beta_0 + \beta_1 + \beta_2 + \beta_4$    $\beta_1 + \beta_4$
Divorced:   $\beta_0 + \beta_3$   $\beta_0 + \beta_1 + \beta_3 + \beta_5$    $\beta_1 + \beta_5$
Main Effects : MALE, MARRIED and DIVORCED
Interaction Effects : MALE•MARRIED and MALE•DIVORCED
Evaluating the Presence of Interaction with Dummy-Variable
• Suppose X1 and X2 are Numerical Variables and X3 is a Dummy-Variable
• To Test if the Slope of Y with X1 and/or X2 is the Same for the Two Levels of X3
• Model:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{1i} X_{3i} + \beta_5 X_{2i} X_{3i} + \varepsilon_i$
• Hypotheses:
– H0: $\beta_4 = \beta_5 = 0$ (No Interaction between X1 and X3 or X2 and X3)
– H1: $\beta_4$ and/or $\beta_5 \neq 0$ (X1 and/or X2 Interacts with X3)
• Perform a Partial F Test
$F = \dfrac{[SSR(X_1, X_2, X_3, X_4, X_5) - SSR(X_1, X_2, X_3)]/2}{MSE(X_1, X_2, X_3, X_4, X_5)}$
where $X_4 = X_1 X_3$ and $X_5 = X_2 X_3$ denote the cross-product terms.
Evaluating the Presence of Interaction with Numerical Variables
• Suppose X1, X2 and X3 are Numerical Variables
• To Test If the Independent Variables Interact with Each Other
• Model:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{1i} X_{2i} + \beta_5 X_{1i} X_{3i} + \beta_6 X_{2i} X_{3i} + \varepsilon_i$
• Hypotheses:
– H0: $\beta_4 = \beta_5 = \beta_6 = 0$ (no interaction among X1, X2 and X3)
– H1: at least one of $\beta_4, \beta_5, \beta_6 \neq 0$ (at least one pair of X1, X2, X3 interact with each other)
• Perform a Partial F Test
$F = \dfrac{[SSR(X_1, X_2, X_3, X_4, X_5, X_6) - SSR(X_1, X_2, X_3)]/3}{MSE(X_1, X_2, X_3, X_4, X_5, X_6)}$
where $X_4 = X_1 X_2$, $X_5 = X_1 X_3$ and $X_6 = X_2 X_3$ denote the cross-product terms.