Statistics Refresher - Multiple Regression

Chapter 18
Multiple Regression
18.1 Introduction
• In this chapter we extend the simple linear
regression model to allow for any number of
independent variables.
• We expect to build a model that fits the data
better than the simple linear regression model.
• We will use computer printout to
– Assess the model
• How well it fits the data
• Is it useful
• Are any required conditions violated?
– Employ the model
• Interpreting the coefficients
• Predictions using the prediction equation
• Estimating the expected value of the dependent variable
18.2 Model and Required Conditions
• We allow for k independent variables to
potentially be related to the dependent variable:

  y = b0 + b1x1 + b2x2 + … + bkxk + e

where y is the dependent variable, x1, …, xk are the
independent variables, b0, …, bk are the coefficients,
and e is the random error variable.
• The simple linear regression model allows for one
independent variable x:
  y = b0 + b1x + e
  (a straight line in the x–y plane).
• The multiple linear regression model allows for more
than one independent variable. With two variables,
  y = b0 + b1x1 + b2x2 + e,
  the straight line becomes a plane over the (x1, x2) plane.
• Similarly, in a model such as
  y = b0 + b1x1² + b2x2,
  a parabola becomes a parabolic surface.
• Required conditions for the error variable e
– The error e is normally distributed with mean equal
to zero and a constant standard deviation se
(independent of the value of y). se is unknown.
– The errors are independent.
• These conditions are required in order to
– estimate the model coefficients,
– assess the resulting model.
18.3 Estimating the Coefficients
and Assessing the Model
• The procedure
– Obtain the model coefficients and statistics using
statistical software.
– Diagnose violations of required conditions. Try to
remedy problems when identified.
– Assess the model fit and usefulness using the model
statistics.
– If the model passes the assessment tests, use it to
interpret the coefficients and generate predictions.
Example 18.1 Where to locate a new motor inn?
– La Quinta Motor Inns is planning an expansion.
– Management wishes to predict which sites are likely to
be profitable.
– Several areas where predictors of profitability can be
identified are:
  • Competition
  • Market awareness
  • Demand generators
  • Demographics
  • Physical quality
– The proposed predictors of profitability (Margin):
  • Competition – ROOMS: number of hotel/motel rooms
    within 3 miles of the site.
  • Market awareness – NEAREST: distance to the
    nearest La Quinta inn.
  • Customers – OFFICE: office space; COLLEGE:
    college enrollment.
  • Community – INCOME: median household income.
  • Physical – DISTTWN: distance to downtown.
– Data were collected from 100 randomly selected
La Quinta inns, and the following suggested model
was run:

Margin = b0 + b1Rooms + b2Nearest + b3Office + b4College
         + b5Income + b6Disttwn + e

  INN  MARGIN  ROOMS  NEAREST  OFFICE  COLLEGE  INCOME  DISTTWN
   1    55.5    3203    0.1     549      8        37     12.1
   2    33.8    2810    1.5     496     17.5      39      0.4
   3    49      2890    1.9     254     20        39     12.2
   4    31.9    3422    1       434     15.5      36      2.7
   5    57.4    2687    3.4     678     15.5      32      7.9
   6    49      3759    1.4     635     19        41      4
• Excel output

SUMMARY OUTPUT

  Regression Statistics
  Multiple R          0.724611
  R Square            0.525062
  Adjusted R Square   0.49442
  Standard Error      5.512084
  Observations        100

This is the sample regression equation
(sometimes called the prediction equation):

MARGIN = 72.455 - 0.008ROOMS - 1.646NEAREST
         + 0.02OFFICE + 0.212COLLEGE
         - 0.413INCOME + 0.225DISTTWN

Let us assess this equation.
ANOVA
               df       SS        MS         F        Significance F
  Regression    6   3123.832  520.6387   17.13581    3.03E-13
  Residual     93   2825.626   30.38307
  Total        99   5949.458

             Coefficients  Standard Error   t Stat     P-value   Lower 95%  Upper 95%
  Intercept    72.45461      7.893104      9.179483   1.11E-14   56.78049   88.12874
  ROOMS        -0.00762      0.001255     -6.06871    2.77E-08   -0.01011   -0.00513
  NEAREST      -1.64624      0.632837     -2.60136    0.010803   -2.90292   -0.38955
  OFFICE        0.019766     0.00341       5.795594   9.24E-08    0.012993   0.026538
  COLLEGE       0.211783     0.133428      1.587246   0.115851   -0.05318    0.476744
  INCOME       -0.41312      0.139552     -2.96034    0.003899   -0.69025   -0.136
  DISTTWN       0.225258     0.178709      1.260475   0.210651   -0.12962    0.580138
• Standard error of estimate
– We need to estimate the standard error of estimate:

    se = sqrt( SSE / (n - k - 1) )

– Compare se to the mean value of y:
  • From the printout, Standard Error = 5.5121
  • Calculating the mean value of y we have ȳ = 45.739
– It seems se is not particularly small.
– Can we conclude the model does not fit the data
well?
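The printout's Standard Error can be reproduced directly from the ANOVA figures (SSE = 2825.626, n = 100, k = 6); a quick check:

```python
import math

# From the Excel ANOVA table for Example 18.1
SSE, n, k = 2825.626, 100, 6

se = math.sqrt(SSE / (n - k - 1))
print(round(se, 4))  # 5.5121, matching the printout
```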
• Coefficient of determination
– The definition is

    R² = 1 - SSE / Σ(yi - ȳ)²

– From the printout, R² = 0.5251
– 52.51% of the variation in the measure of profitability is
explained by the linear regression model formulated
above.
– When adjusted for degrees of freedom,
  Adjusted R² = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)]
              = 49.44%
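Both R² and adjusted R² follow from the sums of squares in the ANOVA table; a quick check:

```python
# From the Excel ANOVA table for Example 18.1
SSE, SST = 2825.626, 5949.458
n, k = 100, 6

r2 = 1 - SSE / SST
adj_r2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))
print(round(r2, 4), round(adj_r2, 4))  # 0.5251 0.4944
```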
• Testing the validity of the model
– We pose the question:
Is there at least one independent variable linearly
related to the dependent variable?
– To answer the question we test the hypothesis
H0: b1 = b2 = … = bk = 0
H1: At least one bi is not equal to zero.
– If at least one bi is not equal to zero, the model is
valid.
• To test these hypotheses we perform an analysis
of variance procedure.
• The F test
– Construct the F statistic

    F = MSR / MSE

  where MSR = SSR/k and MSE = SSE/(n-k-1).
– [Variation in y] = SSR + SSE. A large F results from a
large SSR. Then much of the variation in y is explained
by the regression model, the null hypothesis should be
rejected, and the model is valid.
– Rejection region: F > Fα,k,n-k-1
– Required conditions must be satisfied.
Example 18.1 - continued
• Excel provides the following ANOVA results:

ANOVA
               df       SS        MS         F        Significance F
  Regression    6   3123.832  520.6387   17.13581    3.03382E-13     (MS = MSR; F = MSR/MSE)
  Residual     93   2825.626   30.38307                              (MS = MSE; SS = SSE)
  Total        99   5949.458

• Conclusion: There is sufficient evidence to reject the
null hypothesis in favor of the alternative hypothesis.
At least one of the bi is not equal to zero. Thus, at least
one independent variable is linearly related to y.
This linear regression model is valid.

  Fα,k,n-k-1 = F0.05,6,100-6-1 = 2.17
  F = 17.14 > 2.17
  Also, the p-value (Significance F) = 3.03382(10)^-13.
  Clearly, α = 0.05 > 3.03382(10)^-13, and the null hypothesis
  is rejected.
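The F statistic and its p-value can be reproduced from the sums of squares; a sketch using SciPy's F distribution:

```python
from scipy.stats import f

# From the Excel ANOVA table for Example 18.1
SSR, SSE = 3123.832, 2825.626
n, k = 100, 6

MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE
p_value = f.sf(F, k, n - k - 1)   # upper-tail area of the F distribution
print(round(F, 2), p_value)      # 17.14 and a p-value near 3.03e-13
```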
• Let us interpret the coefficients
– b0 = 72.5. This is the intercept, the value of y when all
the variables take the value zero. Since the data range of
the independent variables does not cover the value zero,
do not interpret the intercept.
– b1 = -0.0076. In this model, for each additional 1000
rooms within 3 miles of the La Quinta inn, the operating
margin decreases on average by 7.6% (assuming the
other variables are held constant).
– b2 = -1.65. In this model, for each additional mile that
the nearest competitor is from the La Quinta inn, the
average operating margin decreases by 1.65%.
– b3 = 0.02. For each additional 1000 sq-ft of office space,
the average increase in operating margin will be 0.02%.
– b4 = 0.21. For each additional thousand students,
MARGIN increases by 0.21%.
– b5 = -0.41. For each additional $1000 increase in median
household income, MARGIN decreases by 0.41%.
– b6 = 0.23. For each additional mile to the downtown
center, MARGIN increases by 0.23% on average.
• Testing the coefficients
– The hypothesis for each bi
  H0: bi = 0
  H1: bi ≠ 0
– Test statistic:

    t = (bi - βi) / sbi
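For example, the ROOMS row of the printout can be reproduced from its coefficient and standard error; a sketch using SciPy's t distribution:

```python
from scipy.stats import t as t_dist

# ROOMS coefficient and its standard error, from the Excel printout
b1, sb1 = -0.00762, 0.001255
df = 100 - 6 - 1                  # d.f. = n - k - 1

t_stat = b1 / sb1                 # tests H0: beta1 = 0
p_value = 2 * t_dist.sf(abs(t_stat), df)   # two-tailed p-value
print(round(t_stat, 2), p_value)  # about -6.07; p-value far below 0.05
```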
– Excel printout (d.f. = n - k - 1):

             Coefficients  Standard Error   t Stat     P-value   Lower 95%  Upper 95%
  Intercept    72.45461      7.893104      9.179483   1.11E-14   56.78049   88.12874
  ROOMS        -0.00762      0.001255     -6.06871    2.77E-08   -0.01011   -0.00513
  NEAREST      -1.64624      0.632837     -2.60136    0.010803   -2.90292   -0.38955
  OFFICE        0.019766     0.00341       5.795594   9.24E-08    0.012993   0.026538
  COLLEGE       0.211783     0.133428      1.587246   0.115851   -0.05318    0.476744
  INCOME       -0.41312      0.139552     -2.96034    0.003899   -0.69025   -0.136
  DISTTWN       0.225258     0.178709      1.260475   0.210651   -0.12962    0.580138
• Using the linear regression equation
– The model can be used by
• Producing a prediction interval for the particular value of y,
for a given set of values of xi.
• Producing an interval estimate for the expected value of y,
for a given set of values of xi.
– The model can be used to learn about relationships
between the independent variables xi, and the
dependent variable y, by interpreting the coefficients bi
• Example 18.1 - continued. Produce predictions
– Predict the MARGIN of an inn at a site with the
following characteristics:
  • 3815 rooms within 3 miles,
  • Closest competitor 3.4 miles away,
  • 476,000 sq-ft of office space,
  • 24,500 college students,
  • $39,000 median household income,
  • 3.6 miles distance to downtown center.
MARGIN = 72.455 - 0.008(3815) -1.646(3.4) + 0.02(476)
+0.212(24.5) - 0.413(39) + 0.225(3.6) = 37.1%
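The prediction can be reproduced in code. Note that the full-precision coefficients from the printout are needed to match 37.1%; the rounded equation gives a slightly different value.

```python
# Full-precision coefficients from the Excel printout
coeffs = {
    "rooms": -0.00762, "nearest": -1.64624, "office": 0.019766,
    "college": 0.211783, "income": -0.41312, "disttwn": 0.225258,
}
intercept = 72.45461

# Site characteristics, in the units used by the model
site = {"rooms": 3815, "nearest": 3.4, "office": 476,
        "college": 24.5, "income": 39, "disttwn": 3.6}

margin = intercept + sum(coeffs[v] * site[v] for v in site)
print(round(margin, 1))  # 37.1 (percent), as in the text
```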
18.4 Regression Diagnostics - II
• The required conditions for the model
assessment to apply must be checked.
– Is the error variable normally distributed?
  Draw a histogram of the residuals.
– Is the error variance constant?
  Plot the residuals versus ŷ.
– Are the errors independent?
  Plot the residuals versus the time periods.
– Can we identify outliers?
– Is multicollinearity a problem?
• Example 18.2 House price and multicollinearity
– A real estate agent believes that a house selling
price can be predicted using the house size, number
of bedrooms, and lot size.
– A random sample of 100 houses was drawn and
data recorded:

  Price    Bedrooms  H Size  Lot Size
  124100      3       1290     3900
  218300      4       2080     6600
  117800      3       1250     3750
    .         .         .        .
    .         .         .        .
– Analyze the relationship among the four variables
• Solution
– The proposed model is
  PRICE = b0 + b1BEDROOMS + b2H-SIZE + b3LOTSIZE + e
– Excel solution:

  Regression Statistics
  Multiple R          0.74833
  R Square            0.559998
  Adjusted R Square   0.546248
  Standard Error      25022.71
  Observations        100

ANOVA
               df       SS        MS         F       Significance F
  Regression    3   7.65E+10  2.55E+10    40.7269    4.57E-17
  Residual     96   6.01E+10  6.26E+08
  Total        99   1.37E+11

             Coefficients  Standard Error   t Stat     P-value   Lower 95%  Upper 95%
  Intercept    37717.59      14176.74      2.660526   0.009145   9576.963   65858.23
  Bedrooms      2306.081      6994.192     0.329714   0.742335  -11577.3    16189.45
  H Size          74.29681      52.97858   1.402393   0.164023  -30.8649    179.4585
  Lot Size        -4.36378      17.024    -0.25633    0.798244  -38.1562     29.42862

The model is valid, but no variable is significantly
related to the selling price!
• However,
– when regressing the price on each independent
variable alone, it is found that each variable is
strongly related to the selling price.
– Multicollinearity is the source of this problem.

             Price     Bedrooms  H Size    Lot Size
  Price     1
  Bedrooms  0.645411  1
  H Size    0.747762  0.846454  1
  Lot Size  0.740874  0.83743   0.993615  1
• Multicollinearity causes two kinds of difficulties:
– The t statistics appear to be too small.
– The b coefficients cannot be interpreted as “slopes”.
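A common diagnostic not shown in the text is the variance inflation factor (VIF): for standardized predictors, the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. A sketch using the correlations above:

```python
import numpy as np

# Correlation matrix of the predictors (Bedrooms, H Size, Lot Size),
# taken from the correlation output above
R = np.array([
    [1.0,      0.846454, 0.83743],
    [0.846454, 1.0,      0.993615],
    [0.83743,  0.993615, 1.0],
])

vif = np.diag(np.linalg.inv(R))
print(np.round(vif, 1))
# H Size and Lot Size have VIFs far above the common cutoff of 10:
# their near-perfect correlation (0.9936) is the multicollinearity problem.
```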
• Remedying violations of the required conditions
– Nonnormality or heteroscedasticity can be
remedied using transformations on the y variable.
– The transformations can improve the linear
relationship between the dependent variable and the
independent variables.
– Many computer software systems allow us to make
the transformations easily.
• A brief list of transformations
» y' = log y (for y > 0)
  • Use when se increases with y, or
  • Use when the error distribution is positively skewed
» y' = y²
  • Use when se² is proportional to E(y), or
  • Use when the error distribution is negatively skewed
» y' = y^(1/2) (for y > 0)
  • Use when se² is proportional to E(y)
» y' = 1/y
  • Use when se² increases significantly when y increases
  beyond some value.
• Example 18.3: Analysis, diagnostics, transformations.
– A statistics professor wanted to know whether the
time limit affects the marks on a quiz.
– A random sample of 100 students was split into 5 groups.
– Each student wrote a quiz, but each group was given a
different time limit. See data below.
  Time    40   45   50   55   60
  Marks   20   24   26   30   32
          23   26   25   32   31
           .    .    .    .    .
           .    .    .    .    .

Analyze these results, and include diagnostics.

The model tested:
  MARK = b0 + b1TIME + e
SUMMARY OUTPUT

  Regression Statistics
  Multiple R          0.86254
  R Square            0.743974
  Adjusted R Square   0.741362
  Standard Error      2.304609
  Observations        100

This model is useful and provides a good fit.

ANOVA
               df      SS       MS         F        Significance F
  Regression    1   1512.5   1512.5    284.7743     9.42E-31
  Residual     98    520.5      5.311224
  Total        99   2033

             Coefficients  Standard Error   t Stat     P-value   Lower 95%  Upper 95%
  Intercept     -2.2         1.64582     -1.33672    0.184409  -5.46608    1.066077
  Time           0.55        0.032592    16.87526    9.42E-31   0.485322   0.614678
[Histogram of residuals] The errors seem to be
normally distributed.

[Standardized residuals vs. predicted mark] The standard
error of estimate seems to increase with the predicted
value of y.

Two transformations are used to remedy this problem:
1. y' = loge y
2. y' = 1/y
Let us see what happens when a transformation is applied.

[Scatter plots] The original data, where "Mark" is a
function of "Time", and the modified data, where LogMark
is a function of "Time". For example, at Time = 40 the
marks 23 and 18 become loge23 = 3.135 and loge18 = 2.89.
The new regression analysis and the diagnostics are:

The model tested:
  LOGMARK = b'0 + b'1TIME + e'

Predicted LogMark = 2.1295 + .0217Time

SUMMARY OUTPUT

  Regression Statistics
  Multiple R          0.8783
  R Square            0.771412
  Adjusted R Square   0.769079
  Standard Error      0.084437
  Observations        100

This model is useful and provides a good fit.

ANOVA
               df       SS        MS         F        Significance F
  Regression    1   2.357901  2.357901   330.7181     3.58E-33
  Residual     98   0.698705  0.00713
  Total        99   3.056606

             Coefficients  Standard Error   t Stat     P-value   Lower 95%  Upper 95%
  Intercept    2.129582      0.0603      35.31632    1.51E-57   2.009918   2.249246
  Time         0.021716      0.001194    18.18566    3.58E-33   0.019346   0.024086
[Histogram of residuals] The errors seem to be
normally distributed.

[Standardized residuals vs. predicted y] The standard
error still changes with the predicted y, but the change
is smaller than before.
How do we use the modified model to predict?
Let TIME = 55 minutes:
  LogMark = 2.1295 + .0217Time = 2.1295 + .0217(55) = 3.323
To find the predicted mark, take the antilog:
  antiloge 3.323 = e^3.323 = 27.743
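The back-transformation is a one-liner; a quick check of the computation above:

```python
import math

# Transformed-model prediction for TIME = 55, then back-transform with exp
log_mark = 2.1295 + 0.0217 * 55
mark = math.exp(log_mark)
print(round(log_mark, 3), round(mark, 3))  # 3.323 and about 27.743
```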
18.5 Regression Diagnostics - III
• The Durbin-Watson Test
– This test detects first-order autocorrelation between
consecutive residuals in a time series.
– If autocorrelation exists, the error variables are not
independent.
– The statistic (where ri is the residual at time i):

    d = [ Σ(i=2 to n) (ri - ri-1)² ] / [ Σ(i=1 to n) ri² ]
The range of d is 0  d  4
Positive first order autocorrelation occurs when
consecutive residuals tend to be similar. Then,
Residuals the value of d is small (less than 2).
Positive first order autocorrelation
+
+
+
+
0
+
+
Time
+ +
Negative first order autocorrelation
Residuals
+
+
Negative first order autocorrelation occurs when
consecutive residuals tend to markedly differ.
Then, the value of d is large (greater than 2).
+
+
+
+
+
0
Time
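The statistic is straightforward to compute from a residual series; a sketch with hypothetical residuals illustrating both cases:

```python
def durbin_watson(residuals):
    """d = sum((r_i - r_{i-1})^2, i=2..n) / sum(r_i^2, i=1..n); 0 <= d <= 4."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den

# Hypothetical residual series (not from the text):
# similar consecutive residuals -> small d; alternating residuals -> large d
positive_ac = [1.0, 1.1, 0.9, 1.0, -1.0, -0.9, -1.1, -1.0]
negative_ac = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_pos = durbin_watson(positive_ac)
d_neg = durbin_watson(negative_ac)
print(d_pos, d_neg)  # d_pos well below 2, d_neg well above 2
```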
• One-tail test for positive first-order autocorrelation
– If d < dL, there is enough evidence to show that positive
first-order correlation exists.
– If d > dU, there is not enough evidence to show that
positive first-order correlation exists.
– If d is between dL and dU, the test is inconclusive.
• One-tail test for negative first-order autocorrelation
– If d > 4-dL, negative first-order correlation exists.
– If d < 4-dU, negative first-order correlation does not exist.
– If d falls between 4-dU and 4-dL, the test is inconclusive.
• Two-tail test for first-order autocorrelation
– If d < dL or d > 4-dL, first-order autocorrelation exists.
– If d falls between dL and dU or between 4-dU and 4-dL,
the test is inconclusive.
– If d falls between dU and 4-dU, there is no evidence
for first-order autocorrelation.
  0 ——— dL ——— dU ——— 2 ——— 4-dU ——— 4-dL ——— 4

  [0, dL):      first-order correlation exists
  [dL, dU):     inconclusive test
  [dU, 4-dU]:   first-order correlation does not exist
  (4-dU, 4-dL]: inconclusive test
  (4-dL, 4]:    first-order correlation exists
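The one-tail decision rule for positive autocorrelation can be sketched as a small helper function:

```python
def dw_positive_test(d, dL, dU):
    """One-tail Durbin-Watson test for positive first-order autocorrelation."""
    if d < dL:
        return "positive autocorrelation exists"
    if d > dU:
        return "no evidence of positive autocorrelation"
    return "inconclusive"

# Critical values for n = 20, k = 2 (as used in Example 18.4)
print(dw_positive_test(0.59, dL=1.10, dU=1.54))  # positive autocorrelation exists
```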
• Example 18.4
– How does the weather affect the sales of lift tickets in a
ski resort?
– Data on the past 20 years' ticket sales, along with the
total snowfall and the average temperature during
Christmas week in each year, were collected.
– The model hypothesized was
TICKETS=b0+b1SNOWFALL+b2TEMPERATURE+e
– Regression analysis yielded the following results:
SUMMARY OUTPUT

The model seems to be very poor:
• The fit is very low (R-square = 0.12),
• It is not valid (Signif. F = 0.33),
• No variable is linearly related to Sales.

  Regression Statistics
  Multiple R          0.3464529
  R Square            0.1200296
  Adjusted R Square   0.0165037
  Standard Error      1711.6764
  Observations        20

ANOVA
               df       SS          MS          F       Signif. F
  Regression    2   6793798.2   3396899.1    1.1594    0.3372706
  Residual     17   49807214    2929836.1
  Total        19   56601012

              Coefficients  Standard Error   t Stat      P-value   Lower 95%  Upper 95%
  Intercept    8308.0114      903.7285      9.1930391   5E-08      6401.3083  10214.715
  Snowfall       74.593249     51.574829    1.4463111   0.1663    -34.22028   183.40678
  Temperature    -8.753738     19.704359   -0.444254    0.6625    -50.32636    32.818884

Diagnosis of the required conditions resulted in the
following findings:
[Residuals vs. predicted y] The error variance is constant.

[Histogram of residuals] The errors may be normally
distributed.

[Residuals over time] The errors are not independent.
Test for positive first order auto-correlation:
n=20, k=2. From the Durbin-Watson table we have:
dL=1.10, dU=1.54. The statistic d=0.59
Conclusion: Because d<dL , there is sufficient evidence
to infer that positive first order auto-correlation exists.
Using the computer - Excel:
  Tools > Data Analysis > Regression (check the residual
  option and then OK)
  Tools > Data Analysis Plus > Durbin-Watson Statistic >
  Highlight the range of the residuals from the regression
  run > OK

Durbin-Watson Statistic: d = 0.5931

The residuals: -2793.99, -1723.23, -2342.03, -956.955,
-1963.73, …
The modified regression model: the autocorrelation has
occurred over time. Therefore, a time-dependent variable
added to the model may correct the problem:

  TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + b3YEARS + e

• All the required conditions are met for this model.
• The fit of this model is high: R² = 0.74.
• The model is useful. Significance F = 5.93E-5.
• SNOWFALL and YEARS are linearly related to ticket sales.
• TEMPERATURE is not linearly related to ticket sales.