Key for Nov. 10 Lab

Stat 501
Nov. 10 LAB KEY
For parts 1-5, use the dataset marketshare.mtw at www.stat.psu.edu/~rho/501data/. The dataset gives information about
the market share (of sales) for a food product over n = 36 months. Y = market share, x1 = price of the product, x2 =
Nielsen rating of advertising exposure for product, x3 = 1 if discount price promotion was in effect and 0 otherwise, x4
= 1 if package promotion was in effect and 0 otherwise. Also the dataset includes the six possible multiplicative
interactions between pairs of x variables.
1. Use Stat>Regression>Best Subsets to identify potentially good models for predicting Y = market share. All x-variables and interactions are possible predictors. Based on the results, what variables are in the model you think might
be the best? Why do you think this might be the best model?
The model with the lowest Cp includes x1, x3 and the interaction x3*x4.
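As a rough illustration of what a best subsets search by Cp does, here is a minimal Python sketch. It is not Minitab's procedure and does not use marketshare.mtw: the data are synthetic, and the names sse, best_subsets_cp, X, y and names are hypothetical. Cp is computed as SSE_p / MSE_full - (n - 2p), where p counts the intercept.

```python
import itertools
import numpy as np

def sse(X, cols, y):
    """Residual sum of squares for an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subsets_cp(X, y, names):
    """Return (Cp, predictor names) for every non-empty subset, lowest Cp first."""
    n, k = X.shape
    mse_full = sse(X, range(k), y) / (n - k - 1)  # MSE of the model with all predictors
    results = []
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            p = size + 1  # number of parameters, counting the intercept
            cp = sse(X, cols, y) / mse_full - (n - 2 * p)
            results.append((cp, [names[j] for j in cols]))
    return sorted(results, key=lambda r: r[0])

# Synthetic stand-in data (assumption: NOT the lab dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 4))
y = 3.0 - 1.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=36)
names = ["x1", "x2", "x3", "x4"]

ranked = best_subsets_cp(X, y, names)
for cp, subset in ranked[:3]:
    print(f"Cp = {cp:6.2f}  predictors = {subset}")
```

A model with little bias has Cp close to p, and the model containing every predictor always has Cp exactly equal to its p, so a low Cp near p with few predictors is what we look for.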
2. A forward stepwise procedure identifies a model by first picking the strongest predictor, then adding the next
strongest predictor given the first x-variable is in the model, and so on until variables can’t be added due to lack of
statistical significance. There’s no guarantee the procedure stops at the best model, so these days stepwise procedures
aren’t used as often as the best subsets procedures. Use Stat>Regression>Stepwise to carry out a stepwise procedure.
What variables are in the final model (the last column of output)? Does this agree with what you found in part 1?
The results agree with the best subsets result in the previous part.
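The forward procedure described in this part can be sketched as follows. This is a simplified illustration and not Minitab's implementation: it enters variables using a fixed F-to-enter threshold (the value 4.0 is an assumption chosen for the example), never removes a variable, and runs on synthetic data. The names sse and forward_select are hypothetical.

```python
import numpy as np

def sse(X, cols, y):
    """Residual sum of squares for an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, names, f_to_enter=4.0):
    """Add one predictor at a time, stopping when no candidate reaches f_to_enter."""
    n, k = X.shape
    chosen = []
    current_sse = sse(X, chosen, y)  # intercept-only model
    while len(chosen) < k:
        best = None
        for j in range(k):
            if j in chosen:
                continue
            trial = chosen + [j]
            trial_sse = sse(X, trial, y)
            df_error = n - len(trial) - 1
            # Partial F statistic for adding predictor j to the current model.
            f_stat = (current_sse - trial_sse) / (trial_sse / df_error)
            if best is None or f_stat > best[0]:
                best = (f_stat, j, trial_sse)
        if best[0] < f_to_enter:
            break  # no remaining predictor is strong enough to enter
        chosen.append(best[1])
        current_sse = best[2]
    return [names[j] for j in chosen]

# Synthetic stand-in data (assumption: NOT the lab dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 4))
y = 3.0 - 1.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=36)

selected = forward_select(X, y, ["x1", "x2", "x3", "x4"])
print("Order of entry:", selected)
```

Because each step conditions on the variables already entered, the path depends on the order of entry, which is why the procedure is not guaranteed to stop at the best model.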
3. Do a multiple regression using the predictors you think are in the best model. Store the residuals and fits. Write the
estimated model. The regression equation is
Y_Share = 3.20 - 0.334 x1_price + 0.308 x3_Discount + 0.176 x3x4
4. Using Graph>Scatterplot, With Groups, plot Fits versus x1 = Price using x3 and x4 as grouping variables. What is
indicated about how x1= price, x3 = discount pricing, and x4 = package promotion affect estimated market share (Y)?
Plot shows:
1. Overall market share goes down as x1 = price goes up.
2. Market share is higher when x3_Discount = 1 than when x3 = 0.
3. x4 = package promotion has no effect when x3_Discount = 0, but does have an effect when x3_Discount = 1.
4. The best market share occurs when both x3 and x4 = 1.
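These four observations can also be read directly off the estimated equation from part 3. The short Python sketch below evaluates the fitted model at each combination of x3 and x4; the function name fitted_share and the price value 2.5 are hypothetical choices made only for illustration.

```python
def fitted_share(price, discount, package):
    """Estimated market share from the part 3 equation (x3x4 is the product of x3 and x4)."""
    return 3.20 - 0.334 * price + 0.308 * discount + 0.176 * discount * package

price = 2.5  # hypothetical price, used only to compare the four groups
for discount in (0, 1):
    for package in (0, 1):
        share = fitted_share(price, discount, package)
        print(f"x3 = {discount}, x4 = {package}: fitted share = {share:.3f}")
```

Since x4 enters the model only through the product x3x4, the two x3 = 0 groups share one fitted line, while with x3 = 1 the package promotion adds 0.176 to the fitted share at every price.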
5. Using Graph>Scatterplot, With Groups, plot Residuals versus Fits and use x3 and x4 as grouping variables. Briefly
interpret the result.
Plot shows:
There is nonconstant variance. Variance is smaller for points with x3_Discount = 0.
For parts 6-11, use the dataset bodyfat.mtw at www.stat.psu.edu/~rho/501data/. Y = measure of body fat, x1 = triceps
skinfold measurement, x2 = thigh circumference, x3 = midarm circumference.
6. Do a simple regression using x2 = thigh to predict y. (a) What is the estimated slope? (b) What is the standard error of
this estimated slope? (c) Is the linear relationship statistically significant?
(a) slope = 0.8565 (b) s.e. = 0.1100 (c) p-value = 0.000, so result is significant
7. Do a multiple regression using all three x-variables to predict y. (a) What is the estimated coefficient
multiplying x2=thigh? (b) What is the standard error of this estimated coefficient? (c) Compare the standard
error here to the standard error found in part 6. (d) Is x2= thigh statistically significant within this multiple
regression?
(a) coeff = −2.857, (b) s.e. = 2.582, (c) s.e. much larger than in previous part, (d) p= 0.285 so
result is not significant.
8. Do a multiple regression in which you predict x2 = thigh (response variable for this) using the other two x-variables
as predictors. Store the residuals. What is the value of R2 for this regression? What is indicated about the x-variables?
R-Sq = 99.8% indicating nearly perfect collinearity among the variables.
9. Do a multiple regression in which you predict y = bodyfat using x1 = triceps and x3= midarm as predictors. Store the
residuals. Then, do a simple regression using the residuals from this part as the response variable and the residuals from
the previous part (part 8) as the predictor variable. (a) What is the slope? (b) Compare this value to the estimated
coefficient found in part 7a.
(a) Slope = -2.857, (b) same as in part 7a.
Note: This is a demonstration of the fact that in a multiple regression, a coefficient describes the
relationship between the parts of y and an x that are not explained by the other x-variables in the
equation.
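The residual-on-residual result in part 9 holds for any least-squares fit, not just this dataset. Here is a minimal Python sketch that checks it on synthetic, nearly collinear data; it does not use bodyfat.mtw, and the helper fit and all generated values are assumptions made for illustration.

```python
import numpy as np

def fit(predictors, response):
    """Least-squares fit with an intercept; returns (coefficients, residuals)."""
    A = np.column_stack([np.ones(len(response))] + list(predictors))
    beta, *_ = np.linalg.lstsq(A, response, rcond=None)
    return beta, response - A @ beta

# Synthetic stand-in data (assumption: NOT the lab dataset).
rng = np.random.default_rng(1)
n = 20
x1 = rng.normal(25, 5, n)
x3 = rng.normal(28, 3, n)
x2 = 10 + 0.9 * x1 + 0.8 * x3 + rng.normal(0, 0.5, n)  # nearly a linear function of x1 and x3
y = 5 + 0.4 * x1 + 0.3 * x2 - 0.2 * x3 + rng.normal(0, 2, n)

beta_full, _ = fit([x1, x2, x3], y)       # as in part 7: y on all three x-variables
_, resid_x2 = fit([x1, x3], x2)           # as in part 8: x2 on the other two
_, resid_y = fit([x1, x3], y)             # as in part 9: y on the other two
beta_partial, _ = fit([resid_x2], resid_y)

print("coefficient of x2 in the full model:", beta_full[2])
print("slope of residuals on residuals:    ", beta_partial[1])
```

The two printed values agree up to floating-point error, which mirrors the match between the slope in part 9 and the coefficient in part 7a.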
10. Again use all three x-variables to predict y in a multiple regression. Use Options and select Display: Variance
Inflation Factors. What is the VIF (variance inflation factor) reported on the output for x2 = thigh?
VIF for thigh = 564.3
11. Refer back to part 8 in which we found the R2 for the relationship between x2 = thigh and the other two x-variables.
Using the R2 found there, calculate $\frac{1}{1 - R^2}$ and compare this to the value found in part 10.
$\frac{1}{1 - 0.998} = 500$. This is not the same as in the previous part, but in theory it is. We got victimized by
round-off error. I computed the R2 in part 8 to more decimal places, and found R2 = 0.998228. With
this value, we get $\frac{1}{1 - 0.998228} \approx 564.3$.
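The arithmetic in part 11 is easy to reproduce. The short Python sketch below computes 1/(1 - R2) for the rounded and the more precise R2 values quoted in this key; the function name vif is a hypothetical helper.

```python
def vif(r_squared):
    """Variance inflation factor from the R2 of one x-variable regressed on the others."""
    return 1.0 / (1.0 - r_squared)

print(round(vif(0.998), 1))     # rounded R2 from part 8      -> 500.0
print(round(vif(0.998228), 1))  # R2 carried to more decimals -> 564.3
```

When R2 is this close to 1, the denominator 1 - R2 is tiny, so a change in the fourth decimal place of R2 moves the result from 500 to 564.3.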
-------------------------
Interpretation of VIF: A variance inflation factor measures how correlation among the x-variables affects
the standard error (or variance = squared standard error) of an estimated coefficient in a multiple regression.
In the presence of a high VIF (the book suggests that a VIF above 10 is high), we have imprecise estimates of the coefficients.