Uploaded by Jon Gomez

render qam13e ism ch04

advertisement
CHAPTER
4
Regression Models
TEACHING SUGGESTIONS
Teaching Suggestion 4.1: Which Is the Independent Variable?
We find that students are often confused about which variable is independent and which is
dependent in a regression model. For example, in Triple A’s problem, clarify which variable is X
and which is Y. Emphasize that the dependent variable (Y) is what we are trying to predict based
on the value of the independent (X) variable. Use examples such as the time required to drive to a
store and the distance traveled, the totals number of units sold and the selling price of a product,
and the cost of a computer and the processor speed.
Teaching Suggestion 4.2: Statistical Correlation Does Not Always Mean Causality.
Students should understand that a high r2 doesn’t always mean one variable will be a good
predictor of the other. Explain that skirt lengths and stock market prices may be correlated, but
raising one doesn’t necessarily mean the other will go up or down. An interesting study indicated
that, over a 10-year period, the salaries of college professors were highly correlated to the dollar
sales volume of alcoholic beverages (both were actually correlated with inflation).
Teaching Suggestion 4.3: Give students a set of data and have them plot the data and manually
draw a line through the data. A discussion of which line is “best” can help them appreciate the
least squares criterion.
Teaching Suggestion 4.4: Select some randomly generated values for X and Y (you can use
random numbers from the random number table in Chapter 13 or use the RAND function in
Excel). Develop a regression line using Excel and discuss the coefficient of determination and
the F-test. Students will see that a regression line can always be developed, but it may not
necessarily be useful.
Teaching Suggestion 4.5: A discussion of the long formulas and short-cut formulas that are
provided in the appendix is helpful. The long formulas provide students with a better
understanding of the meaning of the SSE and SST. Since many people use computers for
regression problems, it helps to see the original formulas. The short-cut formulas are helpful if
students are performing the computations on a calculator.
4-1
Copyright © 2018 Pearson Education, Inc.
ALTERNATIVE EXAMPLES
Alternative Example 4.1: The sales manager of a large apartment rental complex feels the
demand for apartments may be related to the number of newspaper ads placed during the
previous month. She has collected the data shown in the accompanying table.
Ads purchased, (X)
Apartments leased, (Y)
15
6
9
4
40
16
20
6
25
13
25
9
15
10
35
16
We can find a mathematical equation by using the least squares regression approach.
(Note: Round-off error may cause this to be slightly different from a calculator solution.)
Ads, X
(X – X )2
(X – X )(Y – Y )
6
15
64
32
4
9
196
84
16
40
289
102
6
20
9
12
13
25
4
6
9
25
4
–2
10
15
64
0
16
35
144
72
Leases, Y
ΣY = 80
Y=
ΣX = 184
Σ(X – X )2 = 774
Σ(X – X )(Y – Y ) = 306
80
184
= 10; X =
= 23
8
8
b1 = 306/774 = 0.395
b0 = 10 – 0.395(23) = 0.915
4-2
Copyright © 2018 Pearson Education, Inc.
The estimated regression equation is
Ŷ = 0.915 + 0.395X
or
Apartments leased = 0.915 + 0.395 ads placed
If the number of ads is 30, we can estimate the number of apartments leased with the regression
equation
0.915 + 0.395(30) = 12.76 or 13 apartments
Alternative Example 4.2: Given the data on ads and apartment rentals in Alternative Example
4.1, find the coefficient of determination. The following have been computed in the table that
follows:
SST = 150; SSE = 29.02; SSR = 120.76
(Note: Round-off error may cause this to be slightly different from a computer solution.)
X
(Y – Ῡ)2
Ŷ = 0.915 + 0.395X
(Y – Ŷ )2
( Ŷ –Ῡ )2
6.00
15.00
16
6.84
0.706
9.986
4.00
9.00
36
4.47
0.221
30.581
16.00
40.00
36
16.715
0.511
45.091
6.00
20.00
16
8.815
7.924
1.404
13.00
25.00
9
10.79
4.884
0.624
9.00
25.00
1
10.79
3.204
0.624
10.00
15.00
0
6.84
9.986
9.986
16.00
35.00
36
14.74
1.588
22.468
80.00
184.00
SST=150.00
80.00
SSE=29.02
SSR=120.76
Y
From this the coefficient of determination is
r2 = SSR/SST = 120.76/150 = 0.81
Alternative Example 4.3: For Alternative Examples 4.1 and 4.2, dealing with ads, X, and
apartments leased, Y, compute the correlation coefficient.
Since r2 = 0.81 and the slope is positive (+0.395), the positive square root of 0.81 is the
correlation coefficient. r = 0.90.
4-3
Copyright © 2018 Pearson Education, Inc.
SOLUTIONS TO DISCUSSION QUESTIONS AND PROBLEMS
4-1. The term least-squares means that the regression line will minimize the sum of the squared
errors (SSE). No other line will give a lower SSE.
4-2. Dummy variables are used when a qualitative factor such as the gender of an individual
(male or female) is to be included in the model. Usually this is given a value of 1 when the
condition is met (e.g. person is male) and 0 otherwise. When there are more than two levels or
values for the qualitative factor, more than one dummy variable must be used. The number of
dummy variables is one less than the number of possible values or categories. For example, if
students are classified as freshmen, sophomores, juniors and seniors, three dummy variables
would be necessary.
4-3. The coefficient of determination (r2) is the square of the coefficient of correlation (r). Both
of these give an indication of how well a regression model fits a particular set of data. An r2
value of 1 would indicate a perfect fit of the regression model to the points. This would also
mean that r would equal –1 or +1.
4-4. A scatter diagram is a plot of the data. This graphical image helps to determine if a linear
relationship is present, or if another type of relationship would be more appropriate.
4-5. The adjusted r2 value is used to help determine if a new variable should be added to a
regression model. Generally, if the adjusted r2 value increases when a new variable is added to a
model, this new variable should be included in the model. If the adjusted r2 value declines or
does not increase when a new variable is added, then the variable should not be added to the
model.
4-6. The F-test is used to determine if the overall regression model is helpful in predicting the
value of the independent variable (Y). If the F-value is large and the p-value or significance level
is low, then we can conclude that there is a linear relationship and the model is useful, as these
results would probably not occur by chance. If the significance level is high, then the model is
not useful and the results in the sample could be due to random variations.
4-7. The SSE is the sum of the squared errors in a regression model. SST = SSE + SSR.
4-8. When the residuals (errors) are plotted after a regression line is found, the errors should be
random and should not show any significant pattern. If a pattern does exist, then the assumptions
may not be met or another model (perhaps nonlinear) would be more appropriate.
4-9. a. Ŷ = 36 + 4.3(70) = 337
b. Ŷ = 36 + 4.3(80) = 380
c. Ŷ = 36 + 4.3(90) = 423
4-4
Copyright © 2018 Pearson Education, Inc.
4-10. a.
Y,, 80, 15
Y, 50, 13
Y, 70, 12
Y, 40, 11
1
Guitar sales
Y, 60
0, 10
Y 30, 8
Y,
Yo
ouTube viewss
4-10. b.
Guitar sales = Y
YouTube views = X
X
(X – X )2
8
30
6255
12.255
11
40
2255
12
70
10
Y
(Y – Y )2
(X – X )(Y – Y )
Ŷ
Y )2
(Y – Ŷ
( Ŷ – Y )2
87.5
9
1
6.225
0.255
7.5
10
1
2.225
2255
0.255
7.5
13
1
2.225
60
25
2.255
–7.5
12
4
0.225
15
80
6255
12.255
87.5
14
1
6.225
13
50
25
2.255
–
–7.5
11
4
0.225
ΣY = 69.0
ΣX = 330
17500
122
177.5
Y = 11.5
X = 55
SSE
E
SS
SR
29.5
175
SST
SST = 29.5; SSE = 12; SSR = 17.5
b1 = 175/1750
1
= 0.1
0
b0 = 11.5
1 – 0.1(55
5) = 6
The regreession equattion is Ŷ = 6 + 0.1X.
c. Ŷ = 6 + 0.1X = 6 + 0.1(40)) = 10.
4-5
Copyyright © 2018 Pearson Education, Inc.
4-11. See the table for the solution to problem 4-10 to obtain some of these numbers.
MSE = SSE/(n – k – 1) = 12/(6 – 1 – 1) = 3
MSR = SSR/k = 17.5/1 = 17.5
F = MSR/MSE = 17.5/3 = 5.83
df1 = k = 1
df2 = n – k – 1 = 6 – 1 – 1 = 4
F0.05, 1, 4 = 7.71
Do not reject H0 since 5.83 < 7.71. Therefore, we cannot conclude there is a statistically
significant relationship at the 0.05 level.
4-12. Using Excel, the regression equation is Ŷ = 6 + 0.1X. F = 5.83, the significance level is
0.073. This is significant at the 0.10 level (0.073 < 0.10), but it is not significant at the 0.05 level.
There is marginal evidence that there is a relationship between demand for sales and YouTube
views.
4-13.
(Y)
93
78
84
73
84
64
64
95
76
(X)
98
77
88
80
96
61
66
95
69
(X – X )2
285.235
16.901
47.457
1.235
221.679
404.457
228.346
192.901
146.679
711
730
1544.9
(Y – Y )2
196
1
25
36
25
225
225
256
9
998
(X – X )(Y – Y )
236.444
4.111
34.444
6.667
74.444
301.667
226.667
222.222
36.333
Y
91.5
76
84.1
78.2
90
64.1
67.8
89.3
70
1143
(Y – Ŷ )2
2.264
4.168
0.009
26.811
36.188
0.015
14.592
32.766
35.528
152.341
( Ŷ – Y )2
156.135
9.252
25.977
0.676
121.345
221.396
124.994
105.592
80.291
845.659
b1 = 1143/1544.9 = 0.74
b0 = (711/9) – 0.74 (730/9) = 18.99
a. Ŷ = 18.99 + 0.74X
b. Ŷ = 18.99 + 0.74(83) = 80.41
c. r2 = SSR/SST = 845.629/998 = 0.85; r = 0.92; this means that 85% of the variability in
the final average can be explained by the variability in the first test score.
4-6
Copyright © 2018 Pearson Education, Inc.
4-14. See the table for the solution to problem 4-13 to obtain some of these numbers.
MSE = SSE/(n – k – 1) = 152.341/(9 – 1 – 1) = 21.76
MSR = SSR/k = 845.659/1 = 845.659
F = MSR/MSE = 845.659/21.76 = 38.9
df1 = k = 1
df2 = n – k – 1 = 9 – 1 – 1 = 7
F0.05, 1, 7 = 5.59
Because 38.9 > 5.59, we can conclude (at the 0.05 level) that there is a statistically significant
relationship between the first test grade and the final average.
4-15. F = 38.86; the significance level = 0.0004 (which is extremely small) so there is definitely
a statistically significant relationship.
4-16. a. Ŷ = 33,478 + 62.4(1,860) = 149,542.
b. The predicted average selling price for a house this size would be $149,542. Some will
sell for more and some will sell for less. There are other factors besides size that influence
the price of the house.
c. Some other variables that might be included are age of the house, number of bedrooms,
and size of the lot. There are other factors in addition to these that one can identify.
d. The coefficient of determination (r2) = (0.63)2 = 0.3969.
4-17. The multiple regression equation is Ŷ = $90.00 + $48.50X1 + $0.40X2
a. Number of days on the road: X1 = 5; Distance traveled: X2 = 300 miles
The amount he may be expected to claim is
Ŷ = 90.00 + 48.50(5) + $0.40(300) = $452.50
b. The reimbursement request, according to the model, appears to be too high. However, this
does not mean that it is not justified. The accountant should question Thomas Williams about
his expenses to see if there are other explanations for the high cost.
c. A number of other variables should be included, such as the type of travel (air or car),
conference fees if any, and expenses for entertainment of customers, and other transportation
(cab and limousine) expenses. In addition, the coefficient of correlation is only 0.68 and r2 =
(0.68)2 = 0.46. Thus, about 46% of the variability in the cost of the trip is explained by this
model; the other 54% is due to other factors.
4-7
Copyright © 2018 Pearson Education, Inc.
4-18. Using computer software to get the regression equation, we get
Ŷ = 1.03 + 0.0011X
where Ŷ = predicted GPA and X = SAT score.
If a student scores 1200 on the SAT, we get
Ŷ = 1.03 + 0.0011(1200) = 2.35.
If a student scores 2400 on the SAT, we get
Ŷ = 1.03 + 0.0011(2400) = 3.67, but this is extrapolating outside the range of X values
4-19. a. A linear model is reasonable from the graph below.
b. Ŷ = 5.060 + 1.593X
c. Ŷ = 5.060 + 1.593(10) = 20.99, or 2,099,000 people.
d. If there are no tourists, the predicted ridership would be 5.06 (100,000s) or 506,000.
Because X = 0 is outside the range of values that were used to construct the regression model,
this number may be questionable.
4-20. The F-value for the F-test is 52.6 and the significance level is extremely small (0.00002)
which indicates that there is a statistically significant relationship between number of tourists
and ridership. The coefficient of determination is 0.84 indicating that 84% of the variability in
ridership from one year to the next could be explained by the variation in the number of tourists.
4-21. a. Ŷ = 24,328 + 3026.67X1 + 6684X2
where Ŷ predicted starting salary; X1 = GPA; X2 = 1 if business major, 0 otherwise.
b. Ŷ = 24,328 + 3026.67(3.0) + 6684(1) = $40,092.01.
c. The starting salary for business majors tends to be about $6,684 higher than
non-business majors in this sample, even after adjusting for variations in GPA.
d. The overall significance level is 0.099 and r2 = 0.69. Thus, the model is significant at
the 0.10 level and 69% of the variability in starting salary is explained by GPA and major.
The model is useful in predicting starting salary.
4-8
Copyright © 2018 Pearson Education, Inc.
4-22. a. Let
Ŷ = predicted selling price
X1 = square footage
X2 = number of bedrooms
X3 = age
The model with square footage: Ŷ = 26,532.24 + 51.03X1; r2 = 0.700
The model with number of bedrooms: Ŷ = 20,331.63 + 41,403.06X2; r2 = 0.4332
The model with age: Ŷ = 182,504.70 – 2,424.91X3; r2 = 0.703
All of these models are significant at the 0.01 level or less. The best model uses age as the
independent variable. The coefficient of determination is highest for this, and it is significant.
4-23. The calculations are made using two decimal places. Slight differences may occur if more
decimals are used in the calculations.
Ŷ = 24,202.38 + 49.70X1 + 1,775.16X2 and r2 = 0.7003.
Ŷ = 24,202.38 + 49.70(2000) + 1,775.16(3) = 128,927.90.
Notice the r2 value is virtually the same as it was in the previous problem with just square
footage as the independent variable. With a t-test p-value of 0.904, adding the number of
bedrooms did not add statistically significant information that was not already captured by the
square footage. It should not be included in the model. The r2 for this is lower than for age alone
in the previous problem.
4-24. Ŷ = 91,446.49 + 29.86X1 + 2,116.86X2 – 1,504.77X3 = and r2 = 0.87.
Ŷ = 91,446.49 + 29.86(2000) + 2,116.86(3) – 1,504.77(10) = 142,469.40.
4-25. With one independent variable, beds, in the model, r2 = 0.88. With just admissions in the
model, r2 = 0.974. When both variables are in the model, r2 = 0.975. Thus, the model with only
admissions as the independent variable is the best. Adding the number of beds had virtually no
impact on r2, and the adjusted r2 decreased slightly. Thus, the best model is Ŷ = 1.518 + 0.6686X
where Y = expense and X = admissions.
4-26. Using Excel with Y = MPG; X1 = horsepower; X2 = weight the models are:
Ŷ = 53.87 – 0.269X1; r2 = 0.77
Ŷ = 57.53 – 0.01X2; r2 = 0.73.
Thus, the model with horsepower as the independent variable is better since r2 is higher.
4-27. Ŷ = 57,69 – 0.17X1 – 0.005X2 where
Y = MPG
X1 = horsepower
X2 = weight
r2 = 0.82.
This model is better because the coefficient of determination is much higher with both variables
than it is with either one individually.
4-9
Copyright © 2018 Pearson Education, Inc.
4-28. Let Y = MPG; X1 = horsepower; X2 = weight
The model Ŷ = b0 + b1X1 + b2X12 is Ŷ = 69.93 –0.620X1 + 0.001747X12 and has r2 = 0.798.
The model Ŷ = b0 + b3X2 + b4X22 is Ŷ = 89.09 – 0.0337X2 + 0.0000039X22 and has r2 = 0.800.
The model Ŷ = b0 + b1X1 + b2X12 + b3X2 + b4X22 is Ŷ = 89.2 – 0.51X1 + 0.001889X12 – 0.01615X2
+ 0.00000162X22 and has r2 = 0.883. This model has a higher r2 value than the model in 4-28. A
graph of the data would show a nonlinear relationship.
4-29. If SAT median score alone is used to predict the cost, we get
Ŷ = –11364.7 + 21.6X1 with r2 = 0.22 and a significance level of 0.049.
If both SAT and a dummy variable (X2 = 1 for private, 0 otherwise) are used to predict the cost,
we get r2 = 0.79. The model is
Ŷ = 10988.5 + 4.97X1 + 14065.8X2.
This says that a private school tends to be about $14,065 more expensive than a public school
when the median SAT score is used to adjust for the quality of the school. The coefficient of
determination indicates that about 79% of the variability in cost can be explained by these
factors. The model is significant at the 0.001 level.
4-30. Y = 77.19 + 0.047X
There is no significant relationship between the number of victories (Y) and the payroll (X). The
F-test significance level is 0.51 and r2 = 0.04. This model fails to prove that the payroll impacts
the number of victories. Using the equation to predict number of victories for a team with a
payroll of $79 million would yield 80.90, but is an unreliable prediction due to the weakness of
the model.
4-31. Let Y = number of wins; X1 = ERA; X2 = runs; X3 = batting average; X4 = on base
percentage. A 0.05 level of significance is assumed for the tests of significance. Other levels
such as 0.01 or 0.10 could be used.
a. The model with ERA is Ŷ = 155.09 – 17.87X1. The p-value for the F test is 0.0005 so the
model is significant, and r2 = 0.65.
b. The model with runs is Ŷ = -6.88 + 0.12X3. The p-value for the F test is 0.031 so the model is
significant, and r2 = 0.33.
c. The model with batting average is Ŷ = 62.25 + 77.90X2.The p-value for the F test is 0.751 so
the model is not significant, and r2 = 0.009.
d. The model with on-base percentage is Ŷ = -10.34 + 289.01X4.
0.0.278 so the model is not significant and has r2 = 0.10.
The p-value for the F test is
e). A good model must be statistically significant, so only the model with ERA and the model
with runs are helpful. Of these two, the model with ERA has a higher r2 so it is the best of these
models. With both ERA and Runs, the p-value for the F test is less than 0.0001 so the model is
significant, and r2 = 0.91. If other variables are added to this, r2 is still 0.91, so they do not
improve the model. Also, the adjusted r2 value decreases slightly when other variables are
added. The equation of this best model with ERA and Runs is Ŷ = 72.32 - 16.88X1 + 0.11X2.
4-10
Copyright © 2018 Pearson Education, Inc.
4-32. a. Ŷ = 42.43 + 0.0004X
b. Ŷ = -31.54 + 0.0058X
c. The correlation coefficient for the first stock is only 0.19 while the correlation
coefficient for the second is 0.96. Thus, there is a much stronger correlation between stock
2 and the DJI than there is for stock 1 and the DJI.
4-33. Ŷ = 201.81 + 49.85 (6.45)
The adjusted R square for the associated linear regression model is 0.886308 which is very
good. However, the model is derived from only five data points which is very small.
4-11
Copyright © 2018 Pearson Education, Inc.
CASE STUDIES
SOLUTION TO NORTH–SOUTH AIRLINE CASE
Northern Airline Data
Airframe Cost
Year
per Aircraft
2001
51.80
2002
54.92
2003
69.70
2004
68.90
2005
63.72
2006
84.73
2007
78.74
Engine Cost
per Aircraft
43.49
38.58
51.48
58.72
45.47
50.26
79.60
Average Age
(Hours)
6,512
8,404
11,077
11,717
13,275
15,215
18,390
Southeast Airline Data
Airframe Cost
Year
Per Aircraft
2001
13.29
2002
25.15
2003
32.18
2004
31.78
2005
25.34
2006
32.78
2007
35.56
Engine Cost
per Aircraft
18.86
31.55
40.43
22.10
19.69
32.58
38.07
Average Age
(Hours)
5,107
8,145
7,360
5,773
7,150
9,364
8,259
Utilizing QM for Windows, we can develop the following regression equations for the variables
of interest.
Northern Airline—airframe maintenance cost:
Cost = 36.10 + 0.0025 (airframe age)
Coefficient of determination = 0.7694
Coefficient of correlation = 0.8771
Northern Airline—engine maintenance cost:
Cost = 20.57 + 0.0026 (airframe age)
Coefficient of determination = 0.6124
Coefficient of correlation = 0.7825
Southeast Airline—airframe maintenance cost:
Cost = 4.60 + 0.0032 (airframe age)
Coefficient of determination = 0.3904
Coefficient of correlation = 0.6248
Southeast Airline—engine maintenance cost:
Cost = −0.671 + 0.0041 (airframe age)
Coefficient of determination = 0.4599
Coefficient of correlation = 0.6782
4-12
Copyright © 2018 Pearson Education, Inc.
The graphs below portray both the actual data and the regression lines for airframe and
engine maintenance costs for both airlines. Note that the two graphs have been drawn to the
same scale to facilitate comparisons between the two airlines.
Northern Airline: There seem to be modest correlations between maintenance costs and
airframe age for Northern Airline. There is certainly reason to conclude, however, that airframe
age is not the only important factor.
Southeast Airline: The relationships between maintenance costs and airframe age for
Southeast Airline are much less well defined. It is even more obvious that airframe age is not the
only important factor—perhaps not even the most important factor.
4-13
Copyright © 2018 Pearson Education, Inc.
Overall, it would seem that:
1. Northern Airline has the smallest variance in maintenance costs, indicating that the dayto-day management of maintenance is working pretty well.
2. Maintenance costs seem to be more a function of airline than of airframe age.
3. The airframe and engine maintenance costs for Southeast Airline are not only lower but
more nearly similar than those for Northern Airline, but, from the graphs at least, appear to
be rising more sharply with age.
4. From an overall perspective, it appears that Southeast Airline may perform more
efficiently on sporadic or emergency repairs, and Northern Airline may place more emphasis
on preventive maintenance.
Ms. Jones’s report should conclude that:
1. There is evidence to suggest that maintenance costs could be made to be a function of
airframe age by implementing more effective management practices.
2. The difference between maintenance procedures of the two airlines should be
investigated.
3. The data with which she is presently working do not provide conclusive results.
4-14
Copyright © 2018 Pearson Education, Inc.
Download