Chapter 13
Multiple Regression Analysis
Learning Objectives
• Develop a multiple regression model.
• Understand and apply techniques that can be
used to determine how well a regression
model fits data.
• Analyze and interpret nonlinear variables and
learn how to use them in multiple regression
analysis.
• Understand the role of qualitative variables
and how to use them in multiple regression
analysis.
• Learn how to build and evaluate multiple
regression models.
The Multiple Regression Model
• Multiple regression is regression
analysis with one dependent variable
and two or more independent variables,
or at least one nonlinear independent
variable.
• The response variable is the
dependent variable, the variable that the
business analyst is trying to predict.
Regression Models
Probabilistic Multiple Regression Model
Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + ε
where: Y = the value of the dependent (response) variable
β0 = the regression constant
β1 = the partial regression coefficient of independent variable 1
β2 = the partial regression coefficient of independent variable 2
βk = the partial regression coefficient of independent variable k
k = the number of independent variables
ε = the error of prediction
Estimated Regression Model
Ŷ = b0 + b1X1 + b2X2 + b3X3 + . . . + bkXk
where: Ŷ = the predicted value of Y
b0 = estimate of regression constant
b1 = estimate of regression coefficient 1
b2 = estimate of regression coefficient 2
b3 = estimate of regression coefficient 3
bk = estimate of regression coefficient k
k = number of independent variables
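As a concrete illustration of the population model versus its estimate (not from the slides; all numbers below are invented), this Python sketch simulates data from a known population model and recovers the b estimates by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 10, n)
eps = rng.normal(0, 1, n)              # error of prediction, epsilon
Y = 5.0 + 2.0 * X1 - 3.0 * X2 + eps   # population model with known betas

# Design matrix with a column of ones for the regression constant b0
X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)   # estimates (b0, b1, b2) should land near (5, 2, -3)
```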
Multiple Regression Model with Two
Independent Variables (First-Order)
Population Model:
Y = β0 + β1X1 + β2X2 + ε
where:
β0 = the regression constant
β1 = the partial regression coefficient for independent variable 1
β2 = the partial regression coefficient for independent variable 2
ε = the error of prediction

Estimated Model:
Ŷ = b0 + b1X1 + b2X2
where: Ŷ = the predicted value of Y
b0 = estimate of regression constant
b1 = estimate of regression coefficient 1
b2 = estimate of regression coefficient 2
Response Plane for First-Order Two-Predictor Multiple Regression Model

[Figure: the fitted first-order model with two predictors forms a response plane over the X1–X2 plane; the regression constant is the vertical intercept on the Y axis.]
Least Squares Equations for k = 2
Least squares analysis is the process by which a
regression model is developed, using calculus
techniques to minimize the sum of the squared
error values.
b0·n    + b1·ΣX1     + b2·ΣX2     = ΣY
b0·ΣX1  + b1·ΣX1²    + b2·ΣX1X2   = ΣX1Y
b0·ΣX2  + b1·ΣX1X2   + b2·ΣX2²    = ΣX2Y
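A hedged sketch of these normal equations in Python: build the 3×3 system from the raw sums and solve it, then confirm against a direct least-squares fit. The small data arrays are invented for illustration:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1])
n = len(y)

# Left- and right-hand sides of the three normal equations above
A = np.array([
    [n,         x1.sum(),       x2.sum()],
    [x1.sum(),  (x1**2).sum(),  (x1*x2).sum()],
    [x2.sum(),  (x1*x2).sum(),  (x2**2).sum()],
])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])

b = np.linalg.solve(A, rhs)            # (b0, b1, b2)
X = np.column_stack([np.ones(n), x1, x2])
b_check, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b, b_check)                      # the two solutions should agree
```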
Real Estate Data

Observation   Market Price ($1,000) Y   Square Feet X1   Age (Years) X2
 1                  63.0                   1,605              35
 2                  65.1                   2,489              45
 3                  69.9                   1,553              20
 4                  76.8                   2,404              32
 5                  73.9                   1,884              25
 6                  77.9                   1,558              14
 7                  74.9                   1,748               8
 8                  78.0                   3,105              10
 9                  79.0                   1,682              28
10                  63.4                   2,470              30
11                  79.5                   1,820               2
12                  83.9                   2,143               6
13                  79.7                   2,121              14
14                  84.5                   2,485               9
15                  96.0                   2,300              19
16                 109.5                   2,714               4
17                 102.5                   2,463               5
18                 121.0                   3,076               7
19                 104.9                   3,048               3
20                 128.0                   3,267               6
21                 129.0                   3,069              10
22                 117.9                   4,765              11
23                 140.0                   4,540               8
Predicting the Price of a Home

Ŷ = 57.351 + 0.0177X1 − 0.663X2

For X1 = 2,500 and X2 = 12:
Ŷ = 57.351 + 0.0177(2,500) − 0.663(12) ≈ 93.605 thousand dollars
Evaluating the Multiple
Regression Model
Testing the Overall Model:
H0: β1 = β2 = β3 = . . . = βk = 0
Ha: At least one of the regression coefficients is ≠ 0

Significance Tests for Individual Regression Coefficients:
H0: β1 = 0    Ha: β1 ≠ 0
H0: β2 = 0    Ha: β2 ≠ 0
H0: β3 = 0    Ha: β3 ≠ 0
. . .
H0: βk = 0    Ha: βk ≠ 0
Testing the Overall Model for the
Real Estate Example
H0: β1 = β2 = 0
Ha: At least one of the regression coefficients is ≠ 0

MSR = SSR / k        MSE = SSE / (n − k − 1)        F = MSR / MSE

F.01,2,20 = 5.85
FCal = 28.63 > 5.85, reject H0.

ANOVA              df        SS          MS         F        p
Regression          2     8189.723    4094.862    28.63    .0000014
Residual (Error)   20     2861.017     143.051
Total              22    11050.740
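A sketch of this F test in Python, using the sums of squares from the ANOVA table above; scipy supplies the critical value and the p-value:

```python
from scipy import stats

SSR, SSE = 8189.723, 2861.017
n, k = 23, 2

MSR = SSR / k                    # 4094.862
MSE = SSE / (n - k - 1)          # 143.051
F = MSR / MSE                    # about 28.63

F_crit = stats.f.ppf(0.99, k, n - k - 1)   # F(.01, 2, 20), about 5.85
p_value = stats.f.sf(F, k, n - k - 1)      # about .0000014
print(F, F_crit, p_value)                  # F > F_crit, so reject H0
```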
Significance Test of the Regression Coefficients for the Real Estate Example

H0: β1 = 0    Ha: β1 ≠ 0
H0: β2 = 0    Ha: β2 ≠ 0

t.025,20 = 2.086
tCal = 5.63 > 2.086, reject H0.

                 Coefficients    Std Dev     t Stat        p
x1 (Sq. Feet)        0.0177     0.003146       5.63    .000016
x2 (Age)            -0.666      0.2280        -2.92    .008418
Residuals
• The residual is the difference between
the actual Y value and the Y value predicted
by the regression model.
• It is the error of the regression model in
predicting each value of the dependent
variable.
SSE and Standard Error
of the Estimate
ANOVA              df        SS        MS        F       P
Regression          2     8189.7    4094.9    28.63    .000
Residual (Error)   20     2861.0     143.1
Total              22    11050.7

Se = sqrt( SSE / (n − k − 1) ) = sqrt( 2861 / (23 − 2 − 1) ) = 11.96

where: n = number of observations
k = number of independent variables
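A minimal sketch of this computation, with the values taken from the table above:

```python
import math

SSE, n, k = 2861.0, 23, 2
se = math.sqrt(SSE / (n - k - 1))   # sqrt(2861 / 20)
print(se)                           # about 11.96
```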
Coefficient of Multiple Determination (R²)

ANOVA              df        SS        MS         F       p
Regression          2     8189.7    4094.89    28.63    .000
Residual (Error)   20     2861.0     143.1
Total              22    11050.7

R² = SSR / SSyy = 8189.723 / 11050.74 = .741

R² = 1 − SSE / SSyy = 1 − 2861.017 / 11050.74 = .741
Adjusted R²

ANOVA              df        SS          MS         F        p
Regression          2     8189.723    4094.862    28.63    .0000014
Residual (Error)   20     2861.017     143.051
Total              22    11050.740

adj. R² = 1 − [ SSE / (n − k − 1) ] / [ SSyy / (n − 1) ]
        = 1 − ( 2861.017 / (23 − 2 − 1) ) / ( 11050.74 / (23 − 1) )
        = 1 − .285
        = .715
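A short sketch computing both R² and adjusted R² from the same ANOVA quantities used on the last two slides:

```python
SSR, SSE, SSyy = 8189.723, 2861.017, 11050.740
n, k = 23, 2

r_sq = SSR / SSyy                                       # about .741
r_sq_alt = 1 - SSE / SSyy                               # same value
adj_r_sq = 1 - (SSE / (n - k - 1)) / (SSyy / (n - 1))   # about .715
print(r_sq, r_sq_alt, adj_r_sq)
```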
Indicator (Dummy) Variables
• Indicator (dummy) variables represent qualitative (categorical) data in a regression model.
• The number of dummy variables needed for a qualitative variable is the number of categories less one.
• For dichotomous variables, such as gender, only one dummy variable is needed. There are two categories (female, male), so c = 2 and c − 1 = 1.
• Example: Your office is located in which region of the country? ___ Northeast ___ Midwest ___ South ___ West. Number of dummy variables = c − 1 = 4 − 1 = 3.
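A minimal sketch of c − 1 dummy coding for the four-category region question above; the sample responses are invented:

```python
import numpy as np

categories = ["Northeast", "Midwest", "South", "West"]
region = ["South", "West", "Northeast", "South", "Midwest"]

# Drop the first category as the baseline: c = 4 categories -> 3 dummies
dummies = np.array([[1 if r == c else 0 for c in categories[1:]]
                    for r in region])
print(dummies)   # columns: Midwest, South, West; Northeast rows are all zeros
```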
Data for the Monthly Salary Example
Observation   Monthly Salary ($1,000)   Age (10 Years)   Gender (1 = Male, 0 = Female)
 1                  1.548                    3.2                 1
 2                  1.629                    3.8                 1
 3                  1.011                    2.7                 0
 4                  1.229                    3.4                 0
 5                  1.746                    3.6                 1
 6                  1.528                    4.1                 1
 7                  1.018                    3.8                 0
 8                  1.190                    3.4                 0
 9                  1.551                    3.3                 1
10                  0.985                    3.2                 0
11                  1.610                    3.5                 1
12                  1.432                    2.9                 1
13                  1.215                    3.3                 0
14                  0.990                    2.8                 0
15                  1.585                    3.5                 1
Regression Output
for the Monthly Salary Example
The regression equation is
Salary = 0.732 + 0.111 Age + 0.459 Gender
Predictor   Coef      StDev     T       P
Constant    0.7321    0.2356    3.11    0.009
Age         0.11122   0.07208   1.54    0.149
Gender      0.45868   0.05346   8.58    0.000

S = 0.09679   R-Sq = 89.0%   R-Sq(adj) = 87.2%

Analysis of Variance
Source        DF    SS        MS        F       P
Regression     2    0.90949   0.45474   48.54   0.000
Error         12    0.11242   0.00937
Total         14    1.02191
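Using the fitted equation above (Gender coded 1 = male, 0 = female), the model describes two parallel lines 0.459 apart. A tiny sketch, with the age value chosen arbitrarily for illustration:

```python
def salary(age_tens, male):
    """Monthly salary in $1,000; age is measured in tens of years."""
    return 0.732 + 0.111 * age_tens + 0.459 * male

print(salary(3.5, male=1))   # about 1.580 for a 35-year-old male
print(salary(3.5, male=0))   # about 1.121 for a 35-year-old female
```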
Regression Model Depicted
with Males and Females Separated
[Figure: fitted salary lines plotted against age, with the male line parallel to and above the female line; salaries run from about 0.8 to 1.8 ($1,000/month) over ages of roughly 2 to 4 (tens of years).]
More Complex Regression Models
Y = β0 + β1X1 + β2X2 + ε                              First-order with two independent variables

Y = β0 + β1X1 + β2X1² + ε                             Second-order with one independent variable

Y = β0 + β1X1 + β2X2 + β3X1X2 + ε                     Second-order with an interaction term

Y = β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + ε     Second-order with two independent variables
Example: Sales Data and Scatter
Plot for 13 Manufacturing Companies
Manufacturer   Sales ($1,000,000)   Number of Manufacturing Representatives
 1                   2.1                  2
 2                   3.6                  1
 3                   6.2                  2
 4                  10.4                  3
 5                  22.8                  4
 6                  35.6                  4
 7                  57.1                  5
 8                  83.5                  5
 9                 109.4                  6
10                 128.6                  7
11                 196.8                  8
12                 280.0                 10
13                 462.3                 11

[Figure: scatter plot of sales (0–500) against number of representatives (0–12); the points curve sharply upward.]
Excel Simple Linear Regression Output for the Manufacturing Example

Regression Statistics
Multiple R           0.933
R Square             0.870
Adjusted R Square    0.858
Standard Error      51.10
Observations        13

            Coefficients   Standard Error   t Stat   P-value
Intercept     -107.03         28.737         -3.72    0.003
numreps         41.026         4.779          8.58    0.000

ANOVA         df    SS        MS        F       Significance F
Regression     1    192395    192395    73.69   0.000
Residual      11     28721      2611
Total         12    221117
Manufacturing Data
with Newly Created Variable
Manufacturer   Sales ($1,000,000)   Number of Mgfr Reps X1   (No. Mgfr Reps)² X2 = (X1)²
 1                   2.1                   2                         4
 2                   3.6                   1                         1
 3                   6.2                   2                         4
 4                  10.4                   3                         9
 5                  22.8                   4                        16
 6                  35.6                   4                        16
 7                  57.1                   5                        25
 8                  83.5                   5                        25
 9                 109.4                   6                        36
10                 128.6                   7                        49
11                 196.8                   8                        64
12                 280.0                  10                       100
13                 462.3                  11                       121
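A sketch of this transformation in Python: create the squared term and fit by least squares on [1, X1, X1²]. Assuming the table was transcribed correctly, the coefficients should land near the Excel output two slides below (18.067, −15.723, 4.750):

```python
import numpy as np

reps = np.array([2, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 10, 11], dtype=float)
sales = np.array([2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1, 83.5, 109.4,
                  128.6, 196.8, 280.0, 462.3])

# The "new variable" is just the square of the original predictor
X = np.column_stack([np.ones(len(reps)), reps, reps**2])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)
```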
Scatter Plots Using Original
and Transformed Data
[Figure: two scatter plots of sales (0–500). Against the number of representatives (0–12) the points curve upward; against the number of representatives squared (0–150) the relationship is approximately linear.]
Excel Output for Quadratic Model to Predict Sales

Regression Statistics
Multiple R           0.986
R Square             0.973
Adjusted R Square    0.967
Standard Error      24.593
Observations        13

            Coefficients   Standard Error   t Stat   P-value
Intercept      18.067         24.673          0.73    0.481
MfgrRp        -15.723          9.5450        -1.65    0.131
MfgrRpSq        4.750          0.776          6.12    0.000

ANOVA         df    SS        MS        F        Significance F
Regression     2    215069    107534    177.79   0.000
Residual      10      6048       605
Total         12    221117
Regression Models with Interaction
Example: Prices of Three Stocks over a 15-Month Period

Month   Stock 1   Stock 2   Stock 3
 1        41        36        35
 2        39        36        35
 3        38        38        32
 4        45        51        41
 5        41        52        39
 6        43        55        55
 7        47        57        52
 8        49        58        54
 9        41        62        65
10        35        70        77
11        36        72        75
12        39        74        74
13        33        83        81
14        28       101        92
15        31       107        91
Regression Models for the Three Stocks
Y = β0 + β1X1 + β2X2 + ε               First-order with two independent variables
where: Y = price of stock 1
X1 = price of stock 2
X2 = price of stock 3

Y = β0 + β1X1 + β2X2 + β3X3 + ε        Second-order with an interaction term
where: Y = price of stock 1
X1 = price of stock 2
X2 = price of stock 3
X3 = X1·X2 (the interaction term)
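A sketch of the interaction model: the new predictor X3 is simply the elementwise product of the stock 2 and stock 3 prices. Assuming the data table above is correct, the fit should land near the output on the next slides:

```python
import numpy as np

stock1 = np.array([41, 39, 38, 45, 41, 43, 47, 49, 41, 35, 36, 39, 33, 28, 31],
                  dtype=float)
stock2 = np.array([36, 36, 38, 51, 52, 55, 57, 58, 62, 70, 72, 74, 83, 101, 107],
                  dtype=float)
stock3 = np.array([35, 35, 32, 41, 39, 55, 52, 54, 65, 77, 75, 74, 81, 92, 91],
                  dtype=float)

inter = stock2 * stock3                      # X3 = X1 * X2 on the slide
X = np.column_stack([np.ones(len(stock1)), stock2, stock3, inter])
b, *_ = np.linalg.lstsq(X, stock1, rcond=None)
print(b)   # should land near 12.046, 0.8788, 0.2205, -0.009985
```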
Regression for Three Stocks:
Two Predictors, No Interaction
The regression equation is
Stock 1 = 50.9 - 0.119 Stock 2 - 0.071 Stock 3

Predictor   Coef      StDev     T        P
Constant    50.855    3.791     13.41    0.000
Stock 2     -0.1190   0.1931    -0.62    0.549
Stock 3     -0.0708   0.1990    -0.36    0.728

S = 4.570   R-Sq = 47.2%   R-Sq(adj) = 38.4%

Analysis of Variance
Source        DF    SS        MS        F      Sig. F
Regression     2    224.29    112.15    5.37   0.022
Error         12    250.64     20.89
Total         14    474.93
Regression for Three Stocks:
with Interaction Term
The regression equation is
Stock 1 = 12.0 + 0.879 Stock 2 + 0.220 Stock 3 - 0.00998 Inter

Predictor   Coef        StDev      T        P
Constant    12.046      9.312      1.29     0.222
Stock 2     0.8788      0.2619     3.36     0.006
Stock 3     0.2205      0.1435     1.54     0.153
Inter       -0.009985   0.002314   -4.31    0.001

S = 2.909   R-Sq = 80.4%   R-Sq(adj) = 75.1%

Analysis of Variance
Source        DF    SS        MS        F       Sig. F
Regression     3    381.85    127.28    15.04   0.000
Error         11     93.09      8.46
Total         14    474.93
Nonlinear Regression Models:
Model Transformation
Y = β0 β1^X ε

Ŷ = b0 b1^X
log Ŷ = log b0 + X log b1
Ŷ′ = b0′ + b1′X
where: Ŷ′ = log Ŷ
b0′ = log b0
b1′ = log b1
Data Set for Model
Transformation Example
ORIGINAL DATA                 TRANSFORMED DATA
Company    Y        X         Company   LOG Y      X
1           2580    1.2       1         3.41162    1.2
2          11942    2.6       2         4.077077   2.6
3           9845    2.2       3         3.993216   2.2
4          27800    3.2       4         4.444045   3.2
5          18926    2.9       5         4.277059   2.9
6           4800    1.5       6         3.681241   1.5
7          14550    2.7       7         4.162863   2.7

Y = Sales ($ million/year)    X = Advertising ($ million/year)
Regression Output for Model Transformation Example

Regression Statistics
Multiple R           0.990
R Square             0.980
Adjusted R Square    0.977
Standard Error       0.054
Observations         7

            Coefficients   Standard Error   t Stat   P-value
Intercept      2.9003         0.0729         39.80    0.000
X              0.4751         0.0300         15.82    0.000

ANOVA         df    SS        MS        F        Significance F
Regression     1    0.7392    0.7392    250.36   0.000
Residual       5    0.0148    0.0030
Total          6    0.7540
Prediction with the
Transformed Model
Ŷ = b0 b1^X
log Ŷ = log b0 + X log b1 = 2.900364 + X(0.475127)

For X = 2:
log Ŷ = 2.900364 + 2(0.475127) = 3.850618
Ŷ = antilog(log Ŷ) = antilog(3.850618) = 7087.61
Prediction with the Transformed Model (continued)
Ŷ = b0 b1^X
log Ŷ = log b0 + X log b1 = 2.900364 + X(0.475127)

log b0 = 2.900364, so b0 = antilog(2.900364) = 794.99427
log b1 = 0.475127, so b1 = antilog(0.475127) = 2.986256

For X = 2:
Ŷ = (794.99427)(2.986256)² = 7089.5
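A sketch of the whole transformation workflow from the last few slides: regress log Y on X, then take the antilog to predict on the original scale. The data are taken from the transformation example above:

```python
import numpy as np

x = np.array([1.2, 2.6, 2.2, 3.2, 2.9, 1.5, 2.7])
y = np.array([2580, 11942, 9845, 27800, 18926, 4800, 14550], dtype=float)

# Fit log10(Y) = log b0 + X log b1 by simple least squares
slope, intercept = np.polyfit(x, np.log10(y), 1)
print(intercept, slope)             # about 2.9004 and 0.4751

# Predict at X = 2 and undo the log with the antilog (10**)
log_yhat = intercept + slope * 2.0
print(10 ** log_yhat)               # about 7088, matching the slides
```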
Model-Building: Search Procedures
• All Possible Regressions
• Forward Selection
• Backward Elimination
• Stepwise Regression
Data for Multiple Regression to Predict Crude Oil Production

Y  = World Crude Oil Production
X1 = U.S. Energy Consumption
X2 = U.S. Nuclear Generation
X3 = U.S. Coal Production
X4 = U.S. Fuel Rate for Autos

   Y      X1      X2        X3       X4
55.7    74.3     83.5     598.6    13.30
55.7    72.5    114.0     610.0    13.42
52.8    70.5    172.5     654.6    13.52
57.3    74.4    191.1     684.9    13.53
59.7    76.3    250.9     697.2    13.80
60.2    78.1    276.4     670.2    14.04
62.7    78.9    255.2     781.1    14.41
59.6    76.0    251.1     829.7    15.46
56.1    74.0    272.7     823.8    15.94
53.5    70.8    282.8     838.1    16.65
53.3    70.5    293.7     782.1    17.14
54.5    74.1    327.6     895.9    17.83
54.0    74.0    383.7     883.6    18.20
56.2    74.3    414.0     890.3    18.27
56.7    76.9    455.3     918.8    19.20
58.7    80.2    527.0     950.3    19.87
59.9    81.3    529.4     980.7    20.31
60.6    81.3    576.9    1029.1    21.02
60.2    81.1    612.6     996.0    21.69
60.2    82.1    618.8     997.5    21.68
60.6    83.9    610.3     945.4    21.04
60.9    85.6    640.4    1033.5    21.48
Example: All Possible Regressions
with Four Independent Variables
Single Predictor:   X1;  X2;  X3;  X4
Two Predictors:     X1, X2;  X1, X3;  X1, X4;  X2, X3;  X2, X4;  X3, X4
Three Predictors:   X1, X2, X3;  X1, X2, X4;  X1, X3, X4;  X2, X3, X4
Four Predictors:    X1, X2, X3, X4
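A sketch of all possible regressions with four predictors: fit every nonempty subset and report its R². The data below are invented stand-ins; the crude oil columns from the table above could be substituted:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
n = 22
X_all = rng.normal(size=(n, 4))                   # stand-ins for X1..X4
y = 3 + X_all @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.5, n)

def r_squared(cols):
    """R-squared of the least squares fit using the given predictor columns."""
    Xd = np.column_stack([np.ones(n), X_all[:, cols]])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# 2^4 - 1 = 15 candidate models: 4 single, 6 pairs, 4 triples, 1 full model
for r in range(1, 5):
    for cols in combinations(range(4), r):
        print([f"X{c + 1}" for c in cols], round(r_squared(list(cols)), 3))
```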
Forward Selection
Like stepwise regression, except that variables
are not reevaluated after entering
the model
Backward Elimination
• Start with the “full model” (all k
predictors)
• If all predictors are significant, stop
• Otherwise, eliminate the most
nonsignificant predictor; return to
previous step
Stepwise Regression
• Perform k simple regressions and
select the best as the initial model
• Evaluate each variable not in the
model
– If none meet the criterion, stop
– Add the best variable to the model;
reevaluate previous variables, and drop
any which are not significant
• Return to previous step
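A hedged sketch of this idea in Python. For simplicity it implements forward selection (stepwise without the re-evaluation step) and uses an ad hoc R² improvement threshold as the entry criterion, not the textbook's exact rule; a full stepwise routine would also re-test entered variables for removal:

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least squares fit with an intercept prepended."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_selection(X_all, y, min_gain=0.01):
    """Greedily add the predictor that most improves R-squared."""
    remaining = list(range(X_all.shape[1]))
    chosen, best_r2 = [], 0.0
    while remaining:
        r2, j = max((r_squared(X_all[:, chosen + [j]], y), j) for j in remaining)
        if r2 - best_r2 < min_gain:     # no candidate meets the entry criterion
            break
        chosen.append(j)
        remaining.remove(j)
        best_r2 = r2
    return chosen, best_r2

# Tiny invented demo: only the first predictor truly matters
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 4))
y_demo = 2 + 1.5 * X_demo[:, 0] + rng.normal(0, 0.3, 30)
print(forward_selection(X_demo, y_demo))   # should pick column 0 first
```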
Multicollinearity
Condition that occurs when two or more
of the independent variables of a
multiple regression model are highly
correlated
– Difficult to interpret the estimates of the
regression coefficients
– Inordinately small t values for the
regression coefficients may result
– Standard deviations of regression
coefficients are overestimated
– The sign of a predictor variable’s coefficient
may be the opposite of what is expected
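One common diagnostic for this condition is the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. The slide itself does not name VIF; the sketch below shows it as a standard companion check:

```python
import numpy as np

def vif(X_all):
    """Variance inflation factor for each column of X_all."""
    n, k = X_all.shape
    out = []
    for j in range(k):
        others = [c for c in range(k) if c != j]
        Xd = np.column_stack([np.ones(n), X_all[:, others]])
        b, *_ = np.linalg.lstsq(Xd, X_all[:, j], rcond=None)
        resid = X_all[:, j] - Xd @ b
        r2 = 1 - resid @ resid / np.sum((X_all[:, j] - X_all[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out                  # values far above ~10 are a common red flag

# Invented demo: the first two columns are nearly collinear
rng = np.random.default_rng(3)
a = rng.normal(size=50)
X = np.column_stack([a, a + rng.normal(0, 0.05, 50), rng.normal(size=50)])
print(vif(X))                   # large VIFs for columns 0 and 1
```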