Homework 6 Key

Stat 462 Homework 6 Key
1. This problem reviews indicator variables. Use the dataset marketshare.txt, available at
www.stat.psu.edu/~rho/462data/. The data are from Appendix dataset C3 on page 679 of the book, and the variables
are described there. I changed the dataset structure in two ways: I eliminated the ID column and converted the
two time columns to a single column, x5_time, that goes from 1 to 36.
A. Use Minitab’s Best Subsets Regression to identify the best subset of the five x-variables. What model is “best?”
Why?
The model containing x1 = price, x3 = discount, and x4 = promo is best. It has the lowest Cp (2.5) and the highest adjusted R² (67.9%) of the models listed.
Vars   R-Sq   R-Sq(adj)    C-p         S
   1   62.5        61.4    6.9   0.16420
   1    9.3         6.6   62.2   0.25545
   2   66.0        64.0    5.3   0.15865
   2   65.8        63.8    5.5   0.15915
   3   70.7        67.9    2.5   0.14979
   3   69.1        66.2    4.1   0.15365
   4   71.1        67.4    4.0   0.15101
   4   70.7        66.9    4.5   0.15215
   5   71.1        66.3    6.0   0.15348
Candidate predictors (the columns marked X in the Minitab output): x1_price, x2_niels, x3_Disco, x4_promo, x5_time.
The three-variable model with C-p = 2.5 contains x1_price, x3_Disco, and x4_promo; the five-variable model contains all five.
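As a cross-check on the Best Subsets output, the R-Sq(adj) and C-p columns can be recomputed from R-Sq and S. The sketch below assumes n = 36 observations (x5_time runs from 1 to 36) and uses the standard definitions adj R² = 1 - (1 - R²)(n - 1)/(n - p) and Cp = SSE_p/MSE_full - (n - 2p), where p counts the intercept and SSE_p = S²(n - p). The function and variable names are illustrative, and because the inputs are rounded the results agree with Minitab only to about one unit in the last printed digit.

```python
# Rows of the Best Subsets output: (number of x-variables, R-Sq %, R-Sq(adj) %, C-p, S)
rows = [
    (1, 62.5, 61.4, 6.9, 0.16420),
    (1, 9.3, 6.6, 62.2, 0.25545),
    (2, 66.0, 64.0, 5.3, 0.15865),
    (2, 65.8, 63.8, 5.5, 0.15915),
    (3, 70.7, 67.9, 2.5, 0.14979),
    (3, 69.1, 66.2, 4.1, 0.15365),
    (4, 71.1, 67.4, 4.0, 0.15101),
    (4, 70.7, 66.9, 4.5, 0.15215),
    (5, 71.1, 66.3, 6.0, 0.15348),
]

N = 36                # assumption: one observation per value of x5_time (1 to 36)
S_FULL = rows[-1][4]  # S for the full five-variable model


def adjusted_r_sq(r_sq_pct, n_vars, n=N):
    """Adjusted R-Sq (in percent) from R-Sq (in percent)."""
    p = n_vars + 1  # parameters, including the intercept
    return 100 * (1 - (1 - r_sq_pct / 100) * (n - 1) / (n - p))


def mallows_cp(s, n_vars, n=N, s_full=S_FULL):
    """Mallows' Cp from a model's S and the full model's S."""
    p = n_vars + 1
    sse = s ** 2 * (n - p)  # S^2 is SSE / (n - p)
    return sse / s_full ** 2 - (n - 2 * p)


for n_vars, r_sq, r_sq_adj, cp, s in rows:
    print(n_vars, round(adjusted_r_sq(r_sq, n_vars), 1), round(mallows_cp(s, n_vars), 1))
```

A model with little bias has Cp close to p; for the chosen three-variable model p = 4 and Cp is about 2.5, so it satisfies that guideline with the smallest Cp in the list.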
B. Fit the regression model identified in part A. What is the estimated equation? What is the value of R²?
The regression equation is
Y_Share = 3.19 - 0.353 x1_price + 0.399 x3_Discount + 0.118 x4_promo
R-Sq = 70.7%
C. For the model in part B, plot residuals versus fits. Write a brief interpretation.
Generally the plot looks okay. Possibly one mild outlier and possibly a mild nonconstant variance, but
nothing serious.
D. Based on the regression results, which is more effective for increasing market share – discount pricing or a
package promotion? Explain.
Discount pricing is more effective. Its coefficient (0.399) is larger than the promotion coefficient (0.118), and so is its t-statistic (7.79 versus 2.29).
Predictor       Coef   SE Coef       T       P
Constant      3.1853    0.3651    8.73   0.000
x1_price     -0.3527    0.1574   -2.24   0.032
x3_Disco     0.39914   0.05125    7.79   0.000
x4_promo     0.11803   0.05149    2.29   0.029
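The T column is each coefficient divided by its standard error, which is the comparison the answer to part D relies on. A minimal sketch that recomputes the ratios from the printed values (the dictionary name is illustrative; small differences from the printed T values come from rounding in Coef and SE Coef):

```python
# (coefficient, standard error, reported T) copied from the Minitab output
estimates = {
    "Constant": (3.1853, 0.3651, 8.73),
    "x1_price": (-0.3527, 0.1574, -2.24),
    "x3_Disco": (0.39914, 0.05125, 7.79),
    "x4_promo": (0.11803, 0.05149, 2.29),
}

# t-ratio = coefficient / standard error
t_ratios = {name: coef / se for name, (coef, se, _) in estimates.items()}

for name, t in t_ratios.items():
    print(f"{name:10s} t = {t:6.2f}")
```

Because discount and promotion are both 0/1 indicators, their coefficients are on the same scale and can be compared directly.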
E. In general, what is the most effective strategy for getting a high market share? Explain based on the regression
results.
Low price, and use discount pricing and a package promotion.
F. There are four possible combinations of values for discount pricing and package promotion. Use the regression
results to determine the regression equation relating y = market share to the other x-variable(s) present in the model
identified in part A.
Discount = 0, Promo = 0: y-hat = 3.1853 - 0.3527*Price
Discount = 1, Promo = 0: y-hat = (3.1853 + 0.39914) - 0.3527*Price
Discount = 0, Promo = 1: y-hat = (3.1853 + 0.11803) - 0.3527*Price
Discount = 1, Promo = 1: y-hat = (3.1853 + 0.39914 + 0.11803) - 0.3527*Price
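The four equations share the price slope and differ only in the intercept, because each indicator adds its coefficient when it equals 1. A short sketch of that bookkeeping, using the fitted coefficients from part D (the function names are illustrative):

```python
# Fitted coefficients from the regression output in part D
B0, B_PRICE, B_DISCOUNT, B_PROMO = 3.1853, -0.3527, 0.39914, 0.11803


def intercept(discount, promo):
    """Intercept of the price line for a given 0/1 discount and promo setting."""
    return B0 + B_DISCOUNT * discount + B_PROMO * promo


def predicted_share(price, discount, promo):
    """Predicted market share from the fitted three-predictor model."""
    return intercept(discount, promo) + B_PRICE * price


for discount in (0, 1):
    for promo in (0, 1):
        print(f"Discount = {discount}, Promo = {promo}: "
              f"y-hat = {intercept(discount, promo):.5f} - {abs(B_PRICE)}*Price")
```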
2. This problem is mainly concerned with Chapter 10. Use the ch09pr13.txt dataset that can be linked from the data
set web site. The data are described in exercise 9.13 on pages 377-378 of the text.
A. Do a regression in which Y is predicted from the two variables x2 and x3. Store the residuals. Then, do a
regression in which the predictor x1 is the response and is predicted from x2 and x3. Store the residuals. Plot the first set of
residuals versus the second set of residuals. Briefly interpret the plot. Note: This is an added-variable plot (as
described in Section 10.1). It shows the relationship between y and x1 after “controlling for” x2 and x3.
Graph shows a negative association, with possibly either curvature or a few outliers toward the end.
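The construction behind this kind of plot can be sketched in plain Python. The example assumes both y and x1 are regressed on the other two predictors (x2 and x3), and it uses made-up numbers rather than the ch09pr13.txt data; the helper names solve, ols, and residuals are hypothetical. The pairs (e_x1, e_y) are the points that would be plotted, and the slope of a line through them equals the x1 coefficient from the full three-predictor regression.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


def residuals(columns, y):
    """Residuals from regressing y on the given predictor columns."""
    beta = ols(columns, y)
    return [yi - beta[0] - sum(b * col[i] for b, col in zip(beta[1:], columns))
            for i, yi in enumerate(y)]


# Made-up illustration data, NOT the ch09pr13.txt values
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [1, 3, 2, 5, 4, 4, 6, 8]
y = [20, 18, 15, 16, 11, 12, 7, 6]

e_y = residuals([x2, x3], y)    # part of y not explained by x2 and x3
e_x1 = residuals([x2, x3], x1)  # part of x1 not explained by x2 and x3

# Slope of e_y on e_x1 (both sets of residuals have mean zero)
av_slope = sum(a * b for a, b in zip(e_x1, e_y)) / sum(a * a for a in e_x1)
full_beta = ols([x1, x2, x3], y)
print(av_slope, full_beta[1])
```

Curvature or isolated points in such a plot suggest that x1 may need a transformation or that a few cases are driving its coefficient.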
B. Calculate columns that contain the three possible interactions between pairs of x-variables. Then use the Best
Subsets procedure to identify the best subset of the six predictors x1, x2, x3, x1*x2, x1*x3, x2*x3. Using the Cp
criterion, what is the best model? What is the Cp value for that model?
Best model has x1, x2, and x1x2 as predictors. Cp = 1.6.
C. Fit the regression model identified in part B. Using the Storage button, store DFITS and Cook’s D. Also, using
the Options button, ask Minitab for the Variance Inflation Factors. What are the VIF values? What do these values
indicate about this situation?
The VIFs for x2 (11.6) and x1x2 (22.5) indicate a high degree of multicollinearity among the x-variables. This is
understandable because x1x2 is a function of both x1 and x2.
Predictor        Coef    SE Coef       T       P    VIF
Constant       134.40      15.98    8.41   0.000
x1            -2.1330     0.5222   -4.09   0.001    5.4
x2            -1.6993     0.3637   -4.67   0.000   11.6
x1x2         0.033347   0.009283    3.59   0.003   22.5
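A VIF is defined as 1/(1 - R²_j), where R²_j is the R² from regressing predictor j on the other predictors, so each VIF can be turned back into that R². A minimal sketch with the three values from the output (the dictionary names are illustrative):

```python
# Variance inflation factors from the Minitab output
vif = {"x1": 5.4, "x2": 11.6, "x1x2": 22.5}

# VIF_j = 1 / (1 - R2_j)  =>  R2_j = 1 - 1 / VIF_j
r_sq_on_others = {name: 1 - 1 / v for name, v in vif.items()}

for name, r_sq in r_sq_on_others.items():
    print(f"{name:5s} VIF = {vif[name]:5.1f}  R-Sq on other predictors = {100 * r_sq:.1f}%")
```

A common rule of thumb treats a VIF above 10 as a sign of serious multicollinearity; here that corresponds to more than 90% of a predictor's variation being explained by the other predictors.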
D. In the Minitab output for part C, what observations are listed as unusual? In each case, what is the reason?
Case 7 has a large residual. Case 8 has extreme x-values, so it has great potential influence.
Obs     x1       Y     Fit   SE Fit   Residual   St Resid
  7   17.0   95.00   70.96     4.94      24.04     2.57R
  8   63.0   26.00   32.14     9.92      -6.14    -1.66 X
E. Inspect the column of the worksheet that contains the DFITS. What’s the largest value? Which observation has
that value? What does an excessively large DFITS value indicate? (See page 401 of the text. In the text it’s DFFITS
rather than DFITS as in Minitab.)
Case 8, value = -4.77978. This indicates a large difference between the predicted values for the 8th observation when
all the data are used versus when the 8th point is removed before estimating the equation.
F. Inspect the column of the worksheet that contains the Cook’s D values. What’s the largest value? Which
observation has that value? What does an excessively large Cook’s distance value indicate? (See page 402 of the
text.)
Case 8, value = 4.99081. This indicates a large overall difference between predicted values when all data is
used versus when the 8th point is removed before estimating the equation.
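Cook's distance can be approximated from the unusual-observations output in part D alone. The sketch below relies on the standard identities SE Fit = s·sqrt(h), St Resid = e/(s·sqrt(1 - h)), and D = (r²/p)·h/(1 - h), where h is the leverage and r the standardized residual. It assumes p = 4 parameters (intercept, x1, x2, x1x2) and backs out s, which is not shown in the key; the function name is hypothetical. Since the printed inputs are rounded, the result matches the stored Cook's D value only approximately.

```python
import math

P = 4  # assumed number of parameters: intercept, x1, x2, x1x2

# (Residual, SE Fit, St Resid) as printed for the two unusual observations
cases = {7: (24.04, 4.94, 2.57), 8: (-6.14, 9.92, -1.66)}


def leverage_and_cooks_d(resid, se_fit, st_resid, p=P):
    """Return (s, leverage h, Cook's D) implied by one row of the output."""
    # St Resid = e / (s*sqrt(1-h)) and SE Fit = s*sqrt(h), so
    # s^2 = (e / St Resid)^2 + (SE Fit)^2
    s_sq = (resid / st_resid) ** 2 + se_fit ** 2
    h = se_fit ** 2 / s_sq
    cooks_d = st_resid ** 2 / p * h / (1 - h)
    return math.sqrt(s_sq), h, cooks_d


for obs, values in cases.items():
    s, h, d = leverage_and_cooks_d(*values)
    print(f"Case {obs}: s = {s:.2f}, leverage = {h:.3f}, Cook's D = {d:.2f}")
```

Case 8 combines a moderate standardized residual with a leverage close to 1, which is why its Cook's distance is so much larger than that of case 7 despite the smaller residual.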
G. Turn the y-value for the most influential case into a missing value by replacing the y-value with an asterisk in the
worksheet. Fit the model identified in part B. Write the estimated equation. Then, write the estimated equation found
in part C in which all data were used. Use each equation to predict y for the case that you deleted. For each situation
(point deleted or included), determine the predicted value for the observation that was deleted. Note: The difference
between the two values is an unstandardized version of the DFIT.
With the point deleted, Y = 158 - 2.71 x1 - 2.77 x2 + 0.0619 x1x2 and y-hat for the 8th point = 76.46
With all data, Y = 134 - 2.13 x1 - 1.70 x2 + 0.0333 x1x2 and y-hat for the 8th point = 32.14
Difference = 76.46 - 32.14 = 44.32
3. Use this data set:
X 1 2 3 4 5 6
Y 2 4 5 7 10 20
a. Find an unstandardized deleted residual for the 6th observation. Show how you found this.
Delete the 6th point. The resulting regression is Yhat = -0.100 + 1.90 X, so at X = 6, Y-hat = 11.3 and deleted
residual = 20 - 11.3 = 8.7.
b. Find an unstandardized deleted residual for the 3rd observation. Show how you found this.
Delete the 3rd point. The resulting regression is Yhat = -2.535 + 3.093 X, so at X = 3, Y-hat = 6.744 and
deleted residual = 5 - 6.744 = -1.744.
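Both deleted residuals can be reproduced with a few lines of Python. The sketch refits the simple regression without the chosen point, then confirms the result with the shortcut e_i/(1 - h_ii) computed from the full fit. The function names are illustrative and indices are 0-based, so the 6th observation is index 5 and the 3rd is index 2.

```python
X = [1, 2, 3, 4, 5, 6]
Y = [2, 4, 5, 7, 10, 20]


def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def deleted_residual(i, x=X, y=Y):
    """Refit without observation i, then return y_i minus its prediction."""
    b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return y[i] - (b0 + b1 * x[i])


def deleted_residual_shortcut(i, x=X, y=Y):
    """Same quantity from the full fit: ordinary residual divided by (1 - leverage)."""
    b0, b1 = fit_line(x, y)
    n = len(x)
    x_bar = sum(x) / n
    h = 1 / n + (x[i] - x_bar) ** 2 / sum((xi - x_bar) ** 2 for xi in x)
    return (y[i] - (b0 + b1 * x[i])) / (1 - h)


for i in (5, 2):
    print(i + 1, round(deleted_residual(i), 3), round(deleted_residual_shortcut(i), 3))
```

The shortcut avoids refitting the model once per observation, which matters for larger datasets.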