Chapter 15
Multiple Regression Model Building
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-1
In this chapter, you learn:
To use quadratic terms in a regression model
To measure the correlation among independent variables
To build a regression model, using either the stepwise or best-subsets approach
To avoid the pitfalls involved in developing a multiple regression model
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-2
The relationship between the dependent variable and an independent variable may not be linear
Review the scatter plot to check for non-linear relationships
Example: Quadratic model
i
0
1
1i
2
2
1i
i
The second independent variable is the square of the first variable
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-3
Y
i
0
1
1i
2
2
1i
i
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:
Y Y Y
β
β
1
2
< 0
> 0
X
1 β
1
β
2
> 0
> 0
X
1 β
1
β
2
< 0
< 0
β
1
β
2
= the coefficient of the linear term
= the coefficient of the squared term
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
X
1 β
1
β
2
> 0
< 0
X
1
Chap 15-4
Multiple Regression with X
1 and X
2
A non-linear relationship
X2 Residual Plot
b
0
b
1
X
1
b
2
X
2
X
1
Residual plot was randomly distributed
0 2 4 6 8 10 12
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-5
Item, i
1
2
3
4
…
4
1
Y i
1
3
…
X
1i
1
8
3
5
…
X
2i
3
5
2
6
…
Multiply X
2 1 i by X
2 to get X
Run regression with Y i
, X
1
* X
2
, , X
.
2
2
, X
2i
1
X
2
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
X 2
2i
9
25
4
36
…
Chap 15-6
Collect data and use Excel to calculate the regression equation:
ˆ i
b
0
b
1
X
1i
b
2
X 2
1i
Test for Overall Relationship
H
H
0
1
: β
: β
1
1
= β
2
= 0 (no overall relationship between X and Y) and/or β
2
≠ 0 (there is a relationship between X and Y)
F-test statistic =
MSR
MSE
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-7
Testing the Quadratic Effect
Consider the quadratic regression equation
^
Y i
b
0
b
1
X
1i
b
2
2
X
1i
Hypotheses
H
0
: β
2
= 0
H
1
: β
2
0
(The quadratic term does not improve the model)
(The quadratic term improves the model)
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-8
Testing the Quadratic Effect
Hypotheses
H
0
: β
2
= 0
H
1
: β
2
0
(The quadratic term does not improve the model)
(The quadratic term improves the model)
The test statistic is t
b
2
β
2
S b
2 where: p -value=TDIST(t n-3
) b
2
= estimated slope
β
2
= hypothesized slope (zero) d.f.
n
3
S b2
= standard error of the slope
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-9
(continued)
A second test of the quadratic effect
Compare the adjusted r 2 from the simple regression to the adjusted r 2 from the quadratic model
If the adjusted r 2 of the quadratic model is greater than the adjusted r 2 of the simple linear regression then the quadratic model has explained a larger percentage of the variability in Y
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-10
70
78
85
87
99
40
54
67
Purity
3
7
8
15
22
33
14
15
15
16
17
2
3
5
Filter
Time
1
Purity increases as filter time increases:
Purity vs. Time
100
80
7
8
60
10
12
13
40
20
0
0
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
5 10
Time
15 20
Chap 15-11
Intercept
Time
Simple (linear) regression results: t statistic, F statistic, and adjusted r 2 are all high, but the residuals are not random:
Coefficients
-11.28267
5.98520
Standard
Error
3.46805
0.30966
t Stat P-value
-3.25332
0.00691
19.32819
2.078E-10
Time Residual Plot
R Square
Regression Statistics
0.96888
Adjusted R Square 0.96628
Standard Error 6.15997
F
373.57904
Significance F
2.0778E-10
10
5
0
-5 0
-10
5 10 15
Time
20
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-12
Quadratic regression results:
Y = 1.539 + 1.565 Time + 0.245 (Time) 2
Intercept
Time
Time-squared
Coefficients
1.53870
1.56496
0.24516
Standard
Error
2.24465
0.60179
0.03258
t Stat P-value
0.68550
2.60052
0.50722
0.02467
7.52406
1.165E-05
Regression Statistics
F Significance F
R Square 0.99494
1080.7330
2.368E-13
Adjusted R Square
Standard Error
0.99402
2.59513
The quadratic term is significant and improves the model: adjusted r 2 is higher and S
YX lower, residuals are now random.
is
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
10
5
0
-5
0
Time Residual Plot
5 10
Time
15
10
5
0
-5
0
Time-squared Residual Plot
100 200
Time-squared
300
Chap 15-13
400
20
Collinearity: High correlation exists among two or more independent variables
The correlated variables contribute redundant information to the multiple regression model
Including two highly correlated independent variables can adversely affect the regression results
No new information provided
Can lead to unstable coefficients (large standard error and low t-values)
Coefficient signs may not match prior expectations
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-14
Incorrect signs on the coefficients
Large change in the value of a previous coefficient when a new variable is added to the model
A previously significant variable becomes insignificant when a new independent variable is added
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-15
VIF j is used to measure collinearity:
VIF j
1
1
R
2 j where R 2 j is the coefficient of determination from a regression model that uses X j as the dependent variable and all other X variables as the independent variables
If VIF j
> 10 , X j is highly correlated with the other independent variables
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-16
12
13
14
10
11
8
9
6
7
4
5
Week
1
2
3
340
300
440
450
430
470
450
490
Pie
Sales
350
460
350
430
350
380
7.20
7.90
5.90
5.00
4.50
6.40
7.00
5.00
Price
($)
5.50
7.50
8.00
8.00
6.80
7.50
3.5
3.2
4.0
3.5
3.0
3.7
3.5
4.0
Advertising
($100s)
3.3
3.3
3.0
4.5
3.0
4.0
Recall the multiple regression model of chapter 13:
Sales = b
0
+ b
1
(Price)
+ b
2
Advertising)
Chap 15-17
PHStat / regression / multiple regression …
Check the “variance inflationary factor (VIF)” box
Regression Analysis
Price and all other X
Regression Statistics
Output for the pie sales example:
Since there are only two explanatory variables, only one
VIF is reported Multiple R
R Square
Adjusted R
Square
Standard Error
Observations
VIF
0.030438
0.000926
-0.075925
1.21527
15
1.000927
VIF is < 10
There is no evidence of collinearity between Price and
Advertising
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-18
Goal is to develop a model with the best set of independent variables
Easier to interpret if unimportant variables are removed
Lower probability of collinearity
Stepwise regression procedure
Provide evaluation of alternative models as variables are added
Best-subset approach
Try all combinations and select the model with the highest adjusted r 2
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-19
Idea: develop the least squares regression equation in steps, adding one independent variable at a time and evaluating whether existing variables should remain or be removed
The coefficient of partial determination is the measure of the marginal contribution of each independent variable, given that other independent variables are in the model
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-20
Idea: estimate all possible regression equations using all possible combinations of independent variables
Choose the best model by looking for the highest adjusted r 2
The model with the largest adjusted r 2 will also have the smallest S
YX
Stepwise regression and best subsets regression can be performed using Excel with PHStat add in
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-21
?
The C p
Statistic
C p
( 1
R k
2
1
)( n
R 2
T
T )
( n
2 ( k
1 ))
Where k = number of independent variables included in a particular regression model
T = total number of parameters to be estimated in the full regression model
R 2 k
= coefficient of multiple determination for model with k independent variables
R
2
T
= coefficient of multiple determination for full model with all T estimated parameters
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-22
1. Choose independent variables to include in the model
2. Estimate full model and check VIFs and check if any VIFs > 10
If no VIF > 10 , go to step 3
If one VIF > 10 , remove this variable
If more than one, eliminate the variable with the highest VIF and repeat step 2
3. Perform best subsets regression with remaining variables.
4. Sort all models by r 2
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-23
5. Choose the best model.
Consider parsimony.
Do extra variables make a significant contribution?
6. Perform complete analysis with chosen model, including residual analysis.
7. Transform the model if necessary to deal with violations of linearity or other model assumptions.
8. Use the model for prediction and inference.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-24
The final step in the model-building process is to validate the selected regression model.
Collect new data and compare the results.
Compare the results of the regression model to previous results.
If the data set is large, split the data into two parts and cross-validate the results.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-25
Choose X
1
,X
2
,…,X k
Run regression to find VIFs
Any
VIF>5?
No
Yes
Remove variable with highest
VIF
Yes
More than one?
No
Remove this X
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Run best subsets regression to obtain
“best” models in terms of C p
Do complete analysis
Add quadratic term and/or transform variables as indicated
Perform predictions
Chap 15-26
To avoid pitfalls and address ethical considerations:
Understand that interpretation of the estimated regression coefficients are performed holding all other independent variables constant
Evaluate residual plots for each independent variable
Evaluate interaction terms
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-27
To avoid pitfalls and address ethical considerations:
Obtain VIFs for each independent variable before determining which variables should be included in the model.
Examine several alternative models using best-subsets regression or step-wise regression.
Use other methods when the assumptions necessary for least-squares regression have been seriously violated.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-28
In this chapter, we have
Developed the quadratic regression model
Described collinearity
Discussed model building
Stepwise regression
Best subsets
Addressed pitfalls in multiple regression and ethical considerations
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Chap 15-29