Stat 462 Homework 6 Key

1. This problem reviews indicator variables. Use the dataset marketshare.txt that can be linked at www.stat.psu.edu/~rho/462data/. The data are from Appendix dataset C3 on page 679 of the book, and the variables are described there. I changed the dataset structure in two ways: I eliminated the ID column and converted the two time columns to a single column, x5_time, that goes from 1 to 36.

A. Use Minitab’s Best Subsets Regression to identify the best subset of the five x-variables. What model is “best?” Why?

The model containing x1 = price, x3 = discount, and x4 = promo is best. Its Cp is the lowest (2.5) and its adjusted R-Sq is the highest (67.9%).

Candidate predictors: x1_price, x2_niels, x3_Disco, x4_promo, x5_time

Vars   R-Sq   R-Sq(adj)   C-p    S
1      62.5   61.4         6.9   0.16420
1       9.3    6.6        62.2   0.25545
2      66.0   64.0         5.3   0.15865
2      65.8   63.8         5.5   0.15915
3      70.7   67.9         2.5   0.14979
3      69.1   66.2         4.1   0.15365
4      71.1   67.4         4.0   0.15101
4      70.7   66.9         4.5   0.15215
5      71.1   66.3         6.0   0.15348

B. Fit the regression model identified in part A. What is the estimated equation? What is the value of R^2?

The regression equation is
Y_Share = 3.19 - 0.353 x1_price + 0.399 x3_Discount + 0.118 x4_promo
R-Sq = 70.7%

C. For the model in part B, plot residuals versus fits. Write a brief interpretation.

Generally the plot looks okay. There is possibly one mild outlier and possibly mild nonconstant variance, but nothing serious.

D. Based on the regression results, which is more effective for increasing market share: discount pricing or a package promotion? Explain.

Discount pricing is more effective. Both predictors are 0/1 indicators, so their coefficients are directly comparable: the discount coefficient is larger (0.399 versus 0.118), as is its t-statistic (7.79 versus 2.29).

Predictor   Coef      SE Coef   T       P
Constant    3.1853    0.3651     8.73   0.000
x1_price   -0.3527    0.1574    -2.24   0.032
x3_Disco    0.39914   0.05125    7.79   0.000
x4_promo    0.11803   0.05149    2.29   0.029

E. In general, what is the most effective strategy for getting a high market share? Explain based on the regression results.
Use a low price together with discount pricing and a package promotion. The price coefficient is negative, and both indicator coefficients are positive.

F. There are four possible combinations of values for discount pricing and package promotion. Use the regression results to determine the regression equation relating y = market share to the other x-variable(s) present in the model identified in part A.

Discount = 0, Promo = 0:  y-hat = 3.1853 - 0.3527*Price
Discount = 1, Promo = 0:  y-hat = (3.1853 + 0.39914) - 0.3527*Price
Discount = 0, Promo = 1:  y-hat = (3.1853 + 0.11803) - 0.3527*Price
Discount = 1, Promo = 1:  y-hat = (3.1853 + 0.39914 + 0.11803) - 0.3527*Price

2. This problem is mainly concerned with Chapter 10. Use the ch09pr13.txt dataset that can be linked from the data set web site. The data are described in exercise 9.13 on pages 377-378 of the text.

A. Do a regression in which Y is predicted from the two variables x2 and x3. Store the residuals. Then, do a regression in which the predictor x1 is the response and is predicted from x2 and x3. Store the residuals. Plot the first set of residuals versus the second set of residuals. Briefly interpret the plot.

Note: This is an added variable plot (as described in Section 10.1). It shows the relationship between y and x1 after “controlling for” x2 and x3.

The graph shows a negative association, with possibly either curvature or a few outliers toward the end.

B. Calculate columns that contain the three possible interactions between pairs of x-variables. Then use the Best Subsets procedure to identify the best subset of the six predictors x1, x2, x3, x1*x2, x1*x3, x2*x3. Using the Cp criterion, what is the best model? What is the Cp value for that model?

The best model has x1, x2, and x1x2 as predictors. Cp = 1.6.

C. Fit the regression model identified in part B. Using the Storage button, store DFITS and Cook’s D. Also, using the Options button, ask Minitab for the Variance Inflation Factors. What are the VIF values? What do these values indicate about this situation?
The VIFs for x2 and x1x2 (11.6 and 22.5) indicate a high degree of multicollinearity among the x-variables. This is understandable because x1x2 is a function of both x1 and x2.

Predictor   Coef       SE Coef    T       P       VIF
Constant    134.40     15.98       8.41   0.000
x1           -2.1330    0.5222    -4.09   0.001    5.4
x2           -1.6993    0.3637    -4.67   0.000   11.6
x1x2          0.033347  0.009283   3.59   0.003   22.5

D. In the Minitab output for part C, what observations are listed as unusual? In each case, what is the reason?

Case 7 has a large residual. Case 8 has extreme x-values, so it has great potential influence.

Obs   x1     Y       Fit     SE Fit   Residual   St Resid
7     17.0   95.00   70.96   4.94     24.04       2.57R
8     63.0   26.00   32.14   9.92     -6.14      -1.66 X

E. Inspect the column of the worksheet that contains the DFITS. What’s the largest value? Which observation has that value? What does an excessively large DFITS value indicate? (See page 401 of the text. In the text it’s DFFITS rather than DFITS as in Minitab.)

Case 8, value = -4.77978 (the largest in absolute value). This indicates a large difference between the predicted values for the 8th observation when all data are used versus when the 8th point is removed before estimating the equation.

F. Inspect the column of the worksheet that contains the Cook’s D values. What’s the largest value? Which observation has that value? What does an excessively large Cook’s distance value indicate? (See page 402 of the text.)

Case 8, value = 4.99081. This indicates a large overall difference in the predicted values, taken across all cases, when all data are used versus when the 8th point is removed before estimating the equation.

G. Turn the y-value for the most influential case into a missing value by replacing the y-value with an asterisk in the worksheet. Fit the model identified in part B. Write the estimated equation. Then, write the estimated equation found in part C in which all data were used. Use each equation to predict y for the case that you deleted. For each situation (point deleted or included), determine the predicted value for the observation that was deleted.
Note: The difference between the two values is an unstandardized version of the DFIT.

With point deleted: Y = 158 - 2.71 x1 - 2.77 x2 + 0.0619 x1x2, and y-hat for the 8th point = 76.46
With all data:      Y = 134 - 2.13 x1 - 1.70 x2 + 0.0333 x1x2, and y-hat for the 8th point = 32.14
Difference = 76.46 - 32.14 = 44.32

3. Use this data set:

X   1   2   3   4   5    6
Y   2   4   5   7   10   20

a. Find an unstandardized deleted residual for the 6th observation. Show how you found this.

Delete the 6th point. The resulting regression is Yhat = -0.100 + 1.90 X, so at X = 6, Y-hat = 11.3 and the deleted residual = 20 - 11.3 = 8.7.

b. Find an unstandardized deleted residual for the 3rd observation. Show how you found this.

Delete the 3rd point. The resulting regression is Yhat = -2.535 + 3.093 X, so at X = 3, Y-hat = 6.744 and the deleted residual = 5 - 6.744 = -1.744.
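
The deleted residuals in problem 3 can be checked outside Minitab. The following is a minimal Python sketch, not part of the original key: it refits the straight line with one case left out and subtracts the resulting prediction from the observed y. The function names fit_line and deleted_residual are illustrative choices, and only the six data points from problem 3 are taken from the key.

```python
# Data from problem 3 of the key.
X = [1, 2, 3, 4, 5, 6]
Y = [2, 4, 5, 7, 10, 20]


def fit_line(x, y):
    """Ordinary least squares for a straight line; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def deleted_residual(x, y, i):
    """Observed y[i] minus the prediction from a fit that omits case i (0-based)."""
    x_rest = x[:i] + x[i + 1:]
    y_rest = y[:i] + y[i + 1:]
    b0, b1 = fit_line(x_rest, y_rest)
    return y[i] - (b0 + b1 * x[i])


d6 = deleted_residual(X, Y, 5)  # 6th observation (index 5)
d3 = deleted_residual(X, Y, 2)  # 3rd observation (index 2)
print(round(d6, 3), round(d3, 3))
```

Run as written, this should reproduce the values in parts a and b (about 8.7 for the 6th observation and about -1.744 for the 3rd). The large positive value for the 6th case reflects that the point at X = 6 sits well above the line fitted to the other five.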