Multiple Regression Analysis
Chapter 14
McGraw-Hill/Irwin
Copyright © 2012 by The McGraw-Hill Companies, Inc. All rights reserved.
Topics

1. Multiple Regression
   - Estimation
   - Global Test
   - Individual Coefficient Test
2. Regression Assumptions and Regression Diagnostics
   - Error Term Distribution
   - Multicollinearity
   - Heteroscedasticity
   - Autocorrelation
3. Dummy Variable
4. Stepwise Regression
Multiple Regression Analysis

Multiple Linear Regression Model: Y = α + β1X1 + β2X2 + ··· + βkXk + ε
• Y is the dependent variable and X1, X2, …, Xk are the independent variables.
• α, β1, β2, …, βk are population coefficients that need to be estimated using sample data.
• ε is the error term.
• The model represents the linear relationship between the dependent variable and the independent variables in the population.

Estimated Regression Equation: Ŷ = a + b1X1 + b2X2 + ··· + bkXk
• a and b1, b2, …, bk are estimated coefficients from the sample.
• bi is the net change in Y for each unit change in Xi, holding the other X's constant.
• The least squares criterion is used to develop this equation (see the sketch below).
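To make the least squares step concrete, here is a minimal Python sketch (an illustration, not the textbook's procedure): it fits a model of this form with numpy. The data values are made-up placeholders, not the Salsberry sample.

    import numpy as np

    # Placeholder sample: columns are X1 (temperature), X2 (insulation), X3 (furnace age)
    X = np.array([[35.0,  3.0,  6.0],
                  [29.0,  5.0, 10.0],
                  [36.0,  7.0,  3.0],
                  [60.0,  6.0,  9.0],
                  [65.0,  5.0,  6.0],
                  [30.0, 10.0,  2.0]])
    y = np.array([250.0, 360.0, 165.0, 43.0, 92.0, 251.0])  # heating cost ($)

    # Prepend a column of 1s so the first estimate is the intercept a
    X_design = np.column_stack([np.ones(len(y)), X])

    # Least squares estimates (a, b1, b2, b3)
    coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    print(coefs)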
Multiple Linear Regression – Example

Salsberry Realty sells homes along the east coast of the United States. One of the questions most frequently asked by prospective buyers is: If we purchase this home, how much can we expect to pay to heat it during the winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes.

Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace. Here Y is the heating cost and X1, X2, X3 denote the three variables, respectively.

To investigate, Salsberry's research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January, as well as the January outside temperature in the region, the number of inches of insulation in the attic, and the age of the furnace.

Data: Salsberry
Multiple Linear Regression – Excel Output

SUMMARY OUTPUT (see Excel instructions in the textbook, p. 566, #2)

Regression Statistics
Multiple R          0.896755
R Square            0.80417
Adjusted R Square   0.767452
Standard Error      51.04855
Observations        20

ANOVA
              df    SS          MS          F          Significance F
Regression     3    171220.5    57073.49    21.90118   6.56E-06
Residual      16    41695.28    2605.955
Total         19    212915.8

              Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept     427.1938       59.60143         7.167509    2.24E-06    300.8444    553.5432
Temp          -4.58266       0.772319         -5.93364    2.1E-05     -6.21991    -2.94542
Insul         -14.8309       4.754412         -3.11939    0.006606    -24.9098    -4.75196
Age           6.101032       4.01212          1.52065     0.147862    -2.40428    14.60635

(The intercept a and the slopes b1, b2, b3 are read from the Coefficients column.)
Estimating the Multiple Regression Equation

Interpreting the Regression Coefficients

The regression coefficient for mean outside temperature, X1, is -4.583. For every unit increase in temperature, holding the other two independent variables constant, monthly heating cost is expected to decrease by $4.583.

The attic insulation variable, X2, also shows a negative relationship. For each additional inch of insulation, the cost to heat the home is expected to decline by $14.83 per month.

The age of the furnace variable shows a positive relationship. For each additional year older the furnace is, the cost is expected to increase by $6.10 per month.
Using the Multiple Regression Equation

Applying the Model for Estimation

What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?

Ŷ = 427.194 - 4.583X1 - 14.831X2 + 6.101X3
Ŷ = 427.194 - 4.583(30) - 14.831(5) + 6.101(10)
Ŷ = 276.56
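The same arithmetic as a quick Python check (coefficients taken from the slide above):

    # Plug the example values into the estimated regression equation
    a, b1, b2, b3 = 427.194, -4.583, -14.831, 6.101
    y_hat = a + b1 * 30 + b2 * 5 + b3 * 10   # temp = 30, insul = 5, age = 10
    print(round(y_hat, 2))                   # 276.56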
Fitness of the Model – Adjusted R²

1. R² is inflated by the number of independent variables.
2. In multiple regression analysis, the adjusted R² is a better measure of the fitness of the model.
3. It ranges from 0 to 1.
4. The adjusted R² is adjusted for the number of independent variables and the sample size.
5. It measures the percentage of total variation in Y that is explained by all independent variables, that is, explained by the regression model.

Regression Statistics
Multiple R          0.896755
R Square            0.80417
Adjusted R Square   0.767452
Standard Error      51.04855
Observations        20

About 76.7% of the variation in heating cost is explained by the mean outside temperature, attic insulation, and the age of the furnace.
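For reference, the adjusted R² reported above can be reproduced from R², the sample size n = 20, and the number of independent variables k = 3:

Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1)
            = 1 - (1 - 0.80417)(19/16)
            ≈ 0.7675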
Global Test: Testing the Multiple Regression Model

The global test is used to investigate whether any of the independent variables have coefficients that are significantly different from zero. This is also a test of the validity of the model.

The hypotheses are:
H0: β1 = β2 = … = βk = 0 (the model is invalid)
H1: Not all βs equal 0 (the model is valid)

Decision rules:
(1) Reject H0 if F > F(α, k, n-k-1), or
(2) Reject H0 if p-value < α
F-distribution

• The distribution takes nonnegative values only.
• It is asymmetric, skewed to the right.
• The shape of the distribution is controlled by two degrees of freedom, denoted v1 and v2.
• The degrees of freedom are usually reported in the ANOVA table of the output (here, Regression df = 3 and Residual df = 16, with F = 21.90118 and Significance F = 6.56E-06, as in the ANOVA table above).

Excel function: =FINV(α, k, n-k-1)
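A small sketch reproducing the FINV critical value and the Significance F with scipy (assuming scipy is available; the numbers are the ones in the ANOVA table above):

    from scipy.stats import f

    alpha, k, n = 0.05, 3, 20
    crit = f.ppf(1 - alpha, k, n - k - 1)    # upper-tail critical value = FINV(0.05, 3, 16) ≈ 3.24
    p_value = f.sf(21.90118, k, n - k - 1)   # Significance F ≈ 6.6E-06
    print(crit, p_value)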
Global Test – Example

From the ANOVA table above: F = 21.90118, Significance F = 6.56E-06.

1. H0: β1 = β2 = β3 = 0
   H1: Not all βs equal 0
2. Significance level: α = 0.05
3. Test statistic: F = 21.90
4. Decision rules:
   Rule (1) Rejection region: Reject H0 if F > 3.24 (=FINV(.05, 3, 16) = 3.24). According to step 3, F = 21.90, which falls in the rejection region.
   Rule (2) Reject H0 if p-value < α: p-value ≈ 0.00, less than 0.05.
5. Decision: reject the null hypothesis.
Interpretation

• The null hypothesis that all the multiple regression coefficients are zero is rejected.
• Interpretation:
  - Some of the independent variables are useful in predicting the dependent variable (heating cost).
  - Some of the independent variables are linearly related to the dependent variable.
  - The model is valid.
• Logical question – which ones?
Evaluating Individual Regression Coefficients (βi)

• This test is used to determine which independent variables have nonzero regression coefficients.
• The variables that have nonzero regression coefficients are said to have significant coefficients (significantly different from zero).
• The variables that have zero regression coefficients can be dropped from the analysis.
• The test statistic follows the t distribution.
• The hypotheses tested for each variable are:
  H0: βi = 0
  H1: βi ≠ 0
• Instead of comparing the test statistic with a rejection region for each independent variable (which is tedious), we rely on the p-values. If p-value < α, we reject the null hypothesis.
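As a sanity check, the t statistic and two-sided p-value for any slope can be recomputed from the Coefficients and Standard Error columns. A sketch for the temperature slope (values from the output above; scipy assumed available):

    from scipy.stats import t

    b1, se1 = -4.58266, 0.772319
    df = 20 - 3 - 1                        # n - k - 1 = 16
    t_stat = b1 / se1                      # ≈ -5.93, matching the t Stat column
    p_value = 2 * t.sf(abs(t_stat), df)    # ≈ 2.1E-05, matching the P-value column
    print(t_stat, p_value)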
P-values for the Slopes

For temperature: H0: β1 = 0, H1: β1 ≠ 0; p-value = .00 < .05
For insulation: H0: β2 = 0, H1: β2 ≠ 0; p-value = .007 < .05
For furnace age: H0: β3 = 0, H1: β3 ≠ 0; p-value = .148 > .05

(The p-values are read from the coefficient table in the output above: Temp 2.1E-05, Insul 0.006606, Age 0.147862.)

Conclusions:
For temperature and insulation, reject the null hypothesis.
(1) The coefficients are significant (significantly different from zero);
(2) The variables are linearly related to heating cost;
(3) The variables are useful in predicting heating cost.
For furnace age, do not reject the null hypothesis.
(1) The coefficient is insignificant, and the variable can thus be dropped from the model;
(2) The variable is not linearly related to heating cost;
(3) The variable is not useful in predicting heating cost.
New Regression without Variable "Age"

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.880834
R Square            0.775868
Adjusted R Square   0.7495
Standard Error      52.98237
Observations        20

ANOVA
              df    SS          MS          F          Significance F
Regression     2    165194.5    82597.26    29.42408   3.01E-06
Residual      17    47721.23    2807.131
Total         19    212915.8

              Coefficients   Standard Error   t Stat     P-value     Lower 95%   Upper 95%
Intercept     490.2859       44.40984         11.04003   3.56E-09    396.5893    583.9825
Temp          -5.14988       0.701887         -7.3372    1.16E-06    -6.63074    -3.66903
Insul         -14.7181       4.933918         -2.98305   0.008351    -25.1278    -4.30849
New Regression Model without Variable "Age" – Global Test

From the ANOVA table above: F = 29.42408, Significance F = 3.01E-06.

1. H0: β1 = β2 = 0
   H1: Not all βs equal 0
2. Significance level: α = 0.05
3. Test statistic: F = 29.42
4. Rejection region: Reject H0 if F > 3.59 (=FINV(.05, 2, 17) = 3.59, with d.f. (2, 17)). The test statistic falls in the rejection region; also, p-value ≈ 0.00, less than 0.05.
5. Decision: reject the null hypothesis.
Individual t-test on the New Coefficients

For temperature: H0: β1 = 0, H1: β1 ≠ 0; p-value = .00 < .05
For insulation: H0: β2 = 0, H1: β2 ≠ 0; p-value = .008 < .05

(The p-values are from the coefficient table in the output above: Temp 1.16E-06, Insul 0.008351.)

Conclusions:
For temperature and insulation, reject the null hypothesis.
(1) The coefficients are significant (significantly different from zero);
(2) The variables are linearly related to heating cost;
(3) The variables are useful in predicting heating cost.
Multiple Regression Assumptions

I. Each of the independent variables and the dependent variable have a linear relationship.
II. The independent variables are not correlated. When this assumption is violated, we call the condition multicollinearity.
III. The probability distribution of ε is normal.
IV. The variance of ε is constant regardless of the value of Ŷ. This condition is called homoscedasticity. When the requirement is violated, we say heteroscedasticity is observed in the regression.
V. The error terms are independent of each other. This assumption is often violated when time is involved, and we call this condition autocorrelation.
Evaluating the Assumptions of Multiple Regression

I. There is a linear relationship. We use a scatter plot to examine this assumption.
II. The independent variables are not correlated. We examine the correlation coefficients among the independent variables.
III. The error terms follow the normal probability distribution. We use a histogram of the residuals or a normal probability plot to examine normality.
IV. The variance of ε is constant regardless of the value of Ŷ.
V. The error terms are independent of each other.

We plot the residuals against the predicted Y to examine the last two assumptions.
Assumption I: Linear Relationship

A scatter plot of each independent variable against the dependent variable is used.

[Scatter plots: Temp vs. Cost, Insul vs. Cost, and Age vs. Cost]

In practice, we can skip this check, since the test on the individual coefficients serves the same purpose.
Assumption II: Multicollinearity

• Multicollinearity exists when independent variables (X's) are correlated.
• Effects of multicollinearity on the model:
  1. An independent variable known to be an important predictor ends up having an insignificant coefficient.
  2. A regression coefficient that should have a positive sign turns out to be negative, or vice versa.
  3. Multicollinearity adds difficulty to the interpretation of the coefficients. When one variable changes by 1 unit, other correlated variables change as well (but we require them to be held constant in order to correctly interpret the coefficient).
• However, correlated independent variables do not affect a multiple regression equation's ability to predict the dependent variable (Y).
• Minimizing the effect of multicollinearity is often easier than correcting it:
  1. Try to include explanatory variables that are independent of each other.
  2. Remove variables that cause multicollinearity in the model.
Multicollinearity: Detection

• A general rule is that if the correlation between two independent variables is between -0.70 and 0.70, there likely is not a problem in using both of the independent variables.
• A more precise test is to use the variance inflation factor (VIF).
• A VIF > 10 is unsatisfactory. Remove that independent variable from the analysis.
• The value of the VIF is found as follows:

  VIF = 1 / (1 - R²_j)

  The term R²_j refers to the coefficient of determination from the regression where the selected independent variable is used as the dependent variable and the remaining independent variables are used as the independent variables.
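A minimal sketch of this VIF definition in Python (statsmodels assumed available; the data here are random placeholders standing in for the Temp, Insul, and Age columns):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))   # placeholder columns: Temp, Insul, Age

    def vif(X, j):
        """VIF = 1 / (1 - R^2_j): regress column j on the other columns."""
        others = np.delete(X, j, axis=1)
        r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        return 1.0 / (1.0 - r2)

    for j, name in enumerate(["Temp", "Insul", "Age"]):
        print(name, vif(X, j))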
Multicollinearity – Example

Refer to the heating cost example, where cost is related to the independent variables outside temperature, amount of insulation, and age of furnace. Develop a correlation matrix for all the independent variables. Does it appear there is a problem with multicollinearity?

Correlation Matrix
        Temp    Insul   Age
Temp    1.00
Insul   -0.10   1.00
Age     -0.49   0.06    1.00

None of the correlations among the independent variables exceed -.70 or .70, so we do not suspect problems with multicollinearity.

Excel: Data -> Data Analysis -> Correlation
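The same correlation-matrix step can be done in pandas (assumed available), with placeholder values in place of the real columns:

    import pandas as pd

    df = pd.DataFrame({"Temp":  [35, 29, 36, 60, 65, 30],
                       "Insul": [ 3,  5,  7,  6,  5, 10],
                       "Age":   [ 6, 10,  3,  9,  6,  2]})  # placeholder data
    print(df.corr())  # pairwise correlation matrix, like Data Analysis -> Correlation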
VIF – Example

Find and interpret the variance inflation factor for each of the independent variables.

We consider the variable temperature first. We run a multiple regression with temperature as the dependent variable and the other two as the independent variables.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.491328
R Square            0.241403   (this is the coefficient of determination R²_j)
Adjusted R Square   0.152157
Standard Error      16.03105
Observations        20

              Coefficients   Standard Error   t Stat     P-value
Intercept     57.99449       12.34827         4.696567   0.000208
Insul         -0.50888       1.487944         -0.342     0.736541
Age           -2.50902       1.103252         -2.2742    0.036201

VIF = 1 / (1 - 0.241403) = 1.32. The VIF value of 1.32 is less than the upper limit of 10. This indicates that the independent variable temperature is not strongly correlated with the other independent variables.
VIF – Example

Calculating the VIF for each variable using Excel can be tedious. Minitab generates the VIF values for each independent variable in its output, which is shown below.

[Minitab regression output with a VIF column for each independent variable]

None of the VIFs is higher than 10; hence, we conclude there is not a problem with multicollinearity in this example.

Note: for your project, first obtain the correlation matrix. For variables associated with correlation coefficients exceeding -.70 or .70, calculate the corresponding VIFs to further determine whether multicollinearity is an issue.
Assumption III: Normality of Error Term

A histogram (discussed in review) of the residuals is used to visually determine whether the assumption of normality is satisfied.

Excel offers another graph, the normal probability plot, that helps to evaluate this assumption. Basically, if the plotted points are fairly close to a straight line drawn from the lower left to the upper right, the normality assumption is satisfied.

[Normal probability plot: Cost vs. sample percentile; the points lie close to a straight line]
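A normal probability plot like the one above can be produced with scipy and matplotlib (both assumed available); `resid` stands in for the residual column of the regression output:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(1)
    resid = rng.normal(scale=51.0, size=20)       # placeholder residuals

    stats.probplot(resid, dist="norm", plot=plt)  # points near the line => normality OK
    plt.title("Normal Probability Plot of Residuals")
    plt.show()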
Assumptions IV & V

[Residual plot: residuals vs. predicted cost, scattered between -100 and 100 with no pattern]

As we can see from the scatter plot, the residuals are randomly distributed across the horizontal axis and there is no obvious pattern. Therefore, there is no sign of heteroscedasticity or autocorrelation.
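A sketch of the residuals-versus-fitted plot used for these two checks (matplotlib assumed available; `fitted` and `resid` are placeholders for the predicted costs and residuals):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    fitted = np.linspace(100, 400, 20)         # placeholder predicted costs
    resid = rng.normal(scale=51.0, size=20)    # placeholder residuals

    plt.scatter(fitted, resid)
    plt.axhline(0, linewidth=1)                # random scatter around 0 is the hoped-for picture
    plt.xlabel("Predicted Cost")
    plt.ylabel("Residual")
    plt.show()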
Residual Plot versus Fitted Values: Testing the Heteroscedasticity Assumption

• When the variance of the error term changes across different values of Y, we refer to this condition as heteroscedasticity.
• In the plot of the residuals against the predicted value of Y, we look for a change in the spread of the plotted points.

[Residual plot: the spread of the residuals widens as the predicted Y increases]

• The spread of the points increases as the predicted value of Y increases. A scatter plot such as this would indicate possible heteroscedasticity.
Residual Plot versus Fitted Values: Testing the Independence Assumption

• When successive residuals are correlated, we refer to this condition as autocorrelation, which frequently occurs when the data are collected over a period of time.
• Note the run of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot such as this would indicate possible autocorrelation.
Dummy Variable

Usually categorical or nominal data cannot be included in the analysis directly. Instead, we need to use dummy variables to denote the categories.

Dummy variable: a variable that can assume either one of only two values (usually 1 and 0), where 1 represents the existence of a certain condition and 0 indicates that the condition does not hold.

Notation:
I = 1 if the condition holds; 0 otherwise
Dummy Variable – Example

Suppose in the Salsberry Realty example that the independent variable "garage" is added, which indicates whether a house comes with an attached garage or not. To include this variable in our analysis, we define a dummy variable as follows: for homes without an attached garage, 0 is used; for homes with an attached garage, 1 is used.

I = 1 if attached garage; 0 otherwise
Dummy Variable – Example

New estimated regression equation:

Ŷ = a + b1X1 + b2X2 + b3I
Ŷ = 393.67 - 3.96X1 - 11.33X2 + 77.43I

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.932651
R Square            0.869838
Adjusted R Square   0.845433
Standard Error      41.61842
Observations        20

ANOVA
              df    SS          MS          F          Significance F
Regression     3    185202.3    61734.09    35.64133   2.59E-07
Residual      16    27713.48    1732.093
Total         19    212915.8

              Coefficients   Standard Error   t Stat     P-value     Lower 95%   Upper 95%
Intercept     393.6657       45.00128         8.747876   1.71E-07    298.2672    489.0641
Temp          -3.96285       0.652657         -6.07186   1.62E-05    -5.34642    -2.57928
Insul         -11.334        4.001531         -2.8324    0.01201     -19.8168    -2.85109
Garage        77.4321        22.78282         3.398706   0.00367     29.13468    125.7295
Dummy Variable – Example

Ŷ = a + b1X1 + b2X2 + b3I
Ŷ = 393.67 - 3.96X1 - 11.33X2 + 77.43I

Interpretation:
b3 = 77.43: the heating cost for homes with an attached garage is on average $77.43 higher than for homes without an attached garage, other conditions being the same.
Dummy Variable – Another Example

What determines the value of a used car? To examine this issue, a used-car dealer randomly selected 100 three-year-old Toyota Camrys that were sold at auction during the past month. Each car was in top condition and equipped with all the features that come standard with this car. The dealer recorded the price (in $1,000s), the number of miles on the odometer (in thousands), and the color of the car.

When recording the color, the dealer uses 1 to denote white, 2 to denote silver, and 3 to denote other colors.
Dummy Variable – Another Example

• Although the variable color contains the numbers 1, 2, and 3, it cannot be included in the analysis directly. Instead, we need to generate dummy variables to denote the different categories.
• Rule for assigning dummy variables: if there are m different categories in the data, generate m-1 dummy variables. The last category is represented by I1 = I2 = … = I(m-1) = 0 and is called the omitted category.
• Since there are three categories in the variable color, we generate two dummy variables, defined as follows:
  I1 = 1 if white; 0 otherwise
  I2 = 1 if silver; 0 otherwise
• "Other colors" is the omitted category and is represented by I1 = I2 = 0.
Dummy Variable – Excel

• Open the data file Toyota Camry.
• In the column next to "color," type "I1" to generate the dummy variable for "white." In the cell below it, type =IF(C2=1, 1, 0) and hit Enter. (Excel function: IF(logical_test, [value_if_true], [value_if_false]).)
• Copy the cell and paste it into the rest of the cells in the column, until the cell in the previous column is empty.
• Similarly, generate the dummy variable for "silver" in the next column by typing =IF(C2=2, 1, 0) and following the same procedure.
• To run the regression, we need to put the explanatory variables together. Copy the Odometer column and paste it into the column next to the second dummy variable.
• Run the multiple regression using the two dummy variables and Odometer.
Dummy Variable – Excel

[Worksheet screenshot: the I1 column is filled with =IF(C2=1, 1, 0) and the I2 column with =IF(C2=2, 1, 0)]
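The two IF() columns can be generated the same way in pandas (assumed available); the layout below mirrors the worksheet, with Color coded 1 = white, 2 = silver, 3 = other, and placeholder values:

    import pandas as pd

    df = pd.DataFrame({"Color":    [1, 2, 3, 1, 2],
                       "Odometer": [37.4, 31.3, 34.0, 35.9, 33.9]})  # placeholders
    df["I1"] = (df["Color"] == 1).astype(int)   # white,  like =IF(C2=1, 1, 0)
    df["I2"] = (df["Color"] == 2).astype(int)   # silver, like =IF(C2=2, 1, 0)
    print(df)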
Dummy Variable – Excel

Estimated regression equation (coefficients rounded from the table below):

Ŷ = a + b1I1 + b2I2 + b3X
Ŷ = 16.84 + 0.09I1 + 0.33I2 - 0.06X

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.837135
R Square            0.700794
Adjusted R Square   0.691444
Standard Error      0.304258
Observations        100

ANOVA
              df    SS          MS          F         Significance F
Regression     3    20.81492    6.938306    74.9498   4.65E-25
Residual      96    8.886981    0.092573
Total         99    29.7019

              Coefficients   Standard Error   t Stat     P-value     Lower 95%   Upper 95%
Intercept     16.83725       0.197105         85.42255   2.28E-92    16.446      17.2285
I1            0.091131       0.072892         1.250224   0.214257    -0.05356    0.235819
I2            0.330368       0.08165          4.046157   0.000105    0.168294    0.492442
Odometer      -0.05912       0.005065         -11.6722   4.04E-20    -0.06918    -0.04907
Dummy Variable – Interpretation

• The coefficient of I1: b1 = 0.0911. A white Camry sells on average for 0.0911 thousand, or about $91.10, more than other colors (nonwhite, nonsilver) with the same odometer reading.
• The coefficient of I2: b2 = 0.3304. A silver Camry sells on average for 0.3304 thousand, or about $330.40, more than other colors (nonwhite, nonsilver) with the same odometer reading.
Stepwise Regression

The advantages of the stepwise method are:
1. Only independent variables with significant regression coefficients are entered into the equation.
2. The steps involved in building the regression equation are clear.
3. It is efficient in finding the regression equation with only significant regression coefficients.
4. The changes in the multiple standard error of estimate and the coefficient of determination are shown.
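For illustration, here is a sketch of one simple forward-selection variant in Python (statsmodels assumed available). Minitab's stepwise procedure is more elaborate (it can also drop variables that lose significance), so this conveys only the flavor of the idea:

    import numpy as np
    import statsmodels.api as sm

    def forward_select(X, y, names, alpha=0.05):
        """Repeatedly add the candidate variable with the smallest p-value below alpha."""
        chosen, remaining = [], list(range(X.shape[1]))
        while remaining:
            pvals = {}
            for j in remaining:
                fit = sm.OLS(y, sm.add_constant(X[:, chosen + [j]])).fit()
                pvals[j] = fit.pvalues[-1]        # p-value of the candidate variable
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break                              # no remaining variable is significant
            chosen.append(best)
            remaining.remove(best)
        return [names[j] for j in chosen]

    # Placeholder data with two real effects and one noise variable
    rng = np.random.default_rng(3)
    X = rng.normal(size=(20, 3))
    y = 400 - 5 * X[:, 0] - 15 * X[:, 1] + rng.normal(scale=30, size=20)
    print(forward_select(X, y, ["Temp", "Insul", "Age"]))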
Stepwise Regression – Minitab Example

The stepwise Minitab output for the heating cost problem follows.

[Minitab stepwise regression output]

Temperature is selected first. This variable explains more of the variation in heating cost than any of the other proposed independent variables. Garage is selected next, followed by Insulation. The variable Age is not selected.