Lab Activity 8 (08/01/2013)
Part 1: Use the “Marketshare” dataset (adapted from dataset C3 in the text). The data are 36 monthly
observations on variables affecting sales of a product. The objective is to determine an efficient model
for predicting and explaining market share sales. The predictors are average monthly price (X1),
amount of advertising exposure based on the gross Nielsen rating (X2), whether a discount price was in
effect (X3 = 1 if discount, 0 otherwise), whether a package promotion was in effect (X4 = 1 if promotion,
0 otherwise), and time in months (X5).
(a) (10pts) Do a “best subsets” procedure to identify a possible model for predicting Y = Share. In
Minitab, use Stat > Regression > Best Subsets. Enter all predictors as “Free predictors.” What are
the best three predictors? What are the best four predictors?
Response is Y_Share

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S
   1   62.5        61.4          6.9   0.16420
   1    9.3         6.6         62.2   0.25545
   2   66.0        64.0          5.3   0.15865
   2   65.8        63.8          5.5   0.15915
   3   70.7        67.9          2.5   0.14979
   3   69.1        66.2          4.1   0.15365
   4   71.1        67.4          4.0   0.15101
   4   70.7        66.9          4.5   0.15215
   5   71.1        66.3          6.0   0.15348
Best three predictors: x1_price, x3_Discount, x4_promo.
Best four predictors: x1_price, x3_Discount, x4_promo, x5_time.
These correspond to the best 3-variable subset (Cp = 2.5) and the best 4-variable subset (Cp = 4.0) in the table above.
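For reference, the Best Subsets table can also be reproduced outside Minitab. Below is a minimal Python/statsmodels sketch; it assumes the data have been loaded into a pandas DataFrame named `marketshare` (an assumed name) with the column names used in this lab.

```python
# Best-subsets sketch: enumerate every predictor subset, fit OLS, and report
# R-Sq, adjusted R-Sq, Mallows' Cp, and S, keeping the best two of each size.
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.api as sm

predictors = ["x1_price", "x2_nielsen", "x3_Discount", "x4_promo", "x5_time"]
y = marketshare["Y_Share"]
n = len(y)

# MSE of the full model, needed for Mallows' Cp
full = sm.OLS(y, sm.add_constant(marketshare[predictors])).fit()
mse_full = full.mse_resid

rows = []
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = sm.OLS(y, sm.add_constant(marketshare[list(subset)])).fit()
        p = k + 1                          # parameters, intercept included
        cp = fit.ssr / mse_full - (n - 2 * p)
        rows.append({"Vars": k,
                     "R-Sq": 100 * fit.rsquared,
                     "R-Sq(adj)": 100 * fit.rsquared_adj,
                     "Cp": cp,
                     "S": np.sqrt(fit.mse_resid),
                     "Predictors": " ".join(subset)})

# Best two subsets of each size, as in Minitab's Best Subsets output
table = (pd.DataFrame(rows)
         .sort_values(["Vars", "R-Sq"], ascending=[True, False])
         .groupby("Vars").head(2))
print(table.to_string(index=False))
```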
(b) (10pts) For the best 3-predictor model, use Stat > Regression > Regression to estimate the
regression equation (write it below). Also fit the best 4-predictor model using Stat > Regression >
Regression. Based on the information in the ANOVA table for each of these two models, calculate
the AIC and BIC values, respectively (10pts/each statistic).
Y_Share = 3.19 - 0.353 x1_price + 0.399 x3_Discount + 0.118 x4_promo
Analysis of Variance

Source           DF       SS       MS      F      P
Regression        3  1.72830  0.57610  25.68  0.000
Residual Error   32  0.71795  0.02244
Total            35  2.44626
AIC = 36*ln(0.71795/36) + 2*4 = -140.94 + 8 = -132.94
BIC = 36*ln(0.71795/36) + 4*ln(36) = -140.94 + 14.33 = -126.61
Y_Share = 3.05 - 0.279 x1_price + 0.405 x3_Discount + 0.119 x4_promo - 0.00206 x5_time
Analysis of Variance

Source           DF       SS       MS      F      P
Regression        4  1.73932  0.43483  19.07  0.000
Residual Error   31  0.70693  0.02280
Total            35  2.44626
AIC = 36*ln(0.70693/36) + 2*5 = -141.49 + 10 = -131.49
BIC = 36*ln(0.70693/36) + 5*ln(36) = -141.49 + 17.92 = -123.57
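As a quick check of the AIC/BIC arithmetic above, here is a small Python sketch of the same SSE-based formulas (the helper name `aic_bic` is just for illustration).

```python
# AIC = n*ln(SSE/n) + 2p and BIC = n*ln(SSE/n) + p*ln(n),
# where p is the number of estimated coefficients, intercept included.
import math

def aic_bic(sse, n, p):
    base = n * math.log(sse / n)
    return base + 2 * p, base + p * math.log(n)

print(aic_bic(0.71795, 36, 4))   # 3-predictor model -> about (-132.94, -126.61)
print(aic_bic(0.70693, 36, 5))   # 4-predictor model -> about (-131.49, -123.57)
```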
(3pts) Based on the AIC values of the two models you considered in part (b), which model is considered
better?
The three-predictor model is better because its AIC (-132.94) is lower than that of the four-predictor model (-131.49).
(3pts) Based on the BIC values of the two models you considered in part (b), which model is considered
better?
The three-predictor model is also better here: its BIC (-126.61) is lower than that of the four-predictor model (-123.57).
(c) (10pts) Still considering the five predictor variables, use the stepwise procedure to select the best model
via Stat > Regression > Stepwise. In the “Methods” option, choose the stepwise procedure and
set alpha-to-enter = alpha-to-remove = 0.10.
Stepwise Regression: Y_Share versus x1_price, x2_nielsen, ...

Alpha-to-Enter: 0.1   Alpha-to-Remove: 0.1

Response is Y_Share on 5 predictors, with N = 36

Step               1        2        3
Constant       2.420    2.374    3.185

x3_Discount    0.418    0.403    0.399
T-Value         7.53     7.43     7.79
P-Value        0.000    0.000    0.000

x4_promo                0.100    0.118
T-Value                  1.85     2.29
P-Value                 0.073    0.029

x1_price                         -0.35
T-Value                          -2.24
P-Value                          0.032

S              0.164    0.159    0.150
R-Sq           62.53    66.04    70.65
R-Sq(adj)      61.42    63.99    67.90
Mallows Cp       6.9      5.3      2.5
Three variables end up in the final model: x1_price, x3_Discount, and x4_promo.
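The stepwise search can be mimicked outside Minitab as well. The sketch below is a rough p-value-based stepwise routine, not Minitab's exact algorithm, and it again assumes the `marketshare` DataFrame from the earlier sketch.

```python
# Rough stepwise (forward with backward checks) selection driven by p-values,
# mirroring the alpha-to-enter / alpha-to-remove settings used above.
import statsmodels.api as sm

def stepwise(df, response, candidates, alpha_in=0.10, alpha_out=0.10):
    selected = []
    while True:
        changed = False
        # Try to add the remaining candidate with the smallest p-value below alpha_in
        remaining = [c for c in candidates if c not in selected]
        pvals = {}
        for c in remaining:
            fit = sm.OLS(df[response], sm.add_constant(df[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        if pvals and min(pvals.values()) < alpha_in:
            selected.append(min(pvals, key=pvals.get))
            changed = True
        # Drop any selected variable whose p-value has risen above alpha_out
        if selected:
            fit = sm.OLS(df[response], sm.add_constant(df[selected])).fit()
            worst = fit.pvalues.drop("const").idxmax()
            if fit.pvalues[worst] > alpha_out:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected

print(stepwise(marketshare, "Y_Share",
               ["x1_price", "x2_nielsen", "x3_Discount", "x4_promo", "x5_time"]))
```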
(d) (10pts) Continue with part (c) and still use the stepwise procedure, but here set alpha-to-enter = 0.05
and alpha-to-remove = 0.10. Did you get the same result as in part (c)?
Stepwise Regression: Y_Share versus x1_price, x2_nielsen, ...

Alpha-to-Enter: 0.05   Alpha-to-Remove: 0.1

Response is Y_Share on 5 predictors, with N = 36

Step               1
Constant       2.420

x3_Discount    0.418
T-Value         7.53
P-Value        0.000

S              0.164
R-Sq           62.53
R-Sq(adj)      61.42
Mallows Cp       6.9
No, the result is not the same: with alpha-to-enter = 0.05, only one variable (x3_Discount) ends up in the final model.
(e) (10pts) Continue with part (c); now we consider the backward elimination approach (start with the
full model and remove the “weakest” predictor at each step). Here, in the “Methods” option, click
“Backward elimination” and set alpha-to-remove = 0.10. Did you get the same result as in part (c)?
Stepwise Regression: Y_Share versus x1_price, x2_nielsen, ...

Backward elimination.  Alpha-to-Remove: 0.1

Response is Y_Share on 5 predictors, with N = 36

Step                  1          2          3
Constant          3.067      3.047      3.185

x1_price          -0.28      -0.28      -0.35
T-Value           -1.42      -1.46      -2.24
P-Value           0.165      0.155      0.032

x2_nielsen     -0.00002
T-Value           -0.10
P-Value           0.917

x3_Discount       0.405      0.405      0.399
T-Value            7.58       7.73       7.79
P-Value           0.000      0.000      0.000

x4_promo          0.120      0.119      0.118
T-Value            2.20       2.29       2.29
P-Value           0.036      0.029      0.029

x5_time         -0.0022    -0.0021
T-Value           -0.68      -0.70
P-Value           0.500      0.492

S                 0.153      0.151      0.150
R-Sq              71.11      71.10      70.65
R-Sq(adj)         66.30      67.37      67.90
Mallows Cp          6.0        4.0        2.5
Yes. Backward elimination also leaves three variables in the model (x1_price, x3_Discount, x4_promo), the same result as in part (c).
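A corresponding backward-elimination sketch, under the same assumptions as the stepwise sketch above:

```python
# Backward elimination: start from the full model and repeatedly drop the
# predictor with the largest p-value until all remaining p-values are below
# alpha-to-remove.
import statsmodels.api as sm

def backward(df, response, predictors, alpha_out=0.10):
    kept = list(predictors)
    while kept:
        fit = sm.OLS(df[response], sm.add_constant(df[kept])).fit()
        worst = fit.pvalues.drop("const").idxmax()
        if fit.pvalues[worst] <= alpha_out:
            return kept, fit
        kept.remove(worst)
    return kept, None

kept, fit = backward(marketshare, "Y_Share",
                     ["x1_price", "x2_nielsen", "x3_Discount", "x4_promo", "x5_time"])
print(kept)   # expected to end with x1_price, x3_Discount, x4_promo
```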
Part 2. (20pts) Consider the circled observations in the following scatter plots: are they outliers? If yes,
are they outlying with respect to their Y value, their X value, or both?
[Scatter plots (A)-(D), each with one circled observation]
A) Yes, outlying with respect to X.
B) Yes, outlying with respect to both X and Y.
C) Yes, outlying with respect to Y.
D) No.
Part 3. Use the dataset “Prostate” available online, which consists of n = 97 prostate cancer patients.
The response of interest is prostate specific antigen (y = PSA_level), a blood chemistry measurement.
Among the predictors, we’ll consider the cancer volume (x1 = Capsular) and a measure of the
invasiveness of the cancer (x2 = CancerVol) for this activity.
(a). 10pts--Suppose the researcher wanted to use X1 to explain the response but she was not sure
whether X2 would be useful in explaining the variation in the data set given the existence of X1. Use
an added-variable plot to help her make a decision.
Notice that in order to obtain an added-variable plot, you need to:
Step 1: Regress Y on X1 and save the residuals, called RESI1 (click the “Storage” option and check
Residuals while fitting the regression model via Stat > Regression > Regression).
Step 2: Regress X2 on X1 and save the residuals, called RESI2.
Step 3: Draw a scatter plot of RESI1 versus RESI2.
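The same three steps can be carried out in Python; a minimal sketch, assuming the Prostate data are in a pandas DataFrame named `prostate` (an assumed name) with columns PSA_level, Capsular (X1), and CancerVol (X2):

```python
# Added-variable plot: residuals of Y on X1 plotted against residuals of X2 on X1.
import statsmodels.api as sm
import matplotlib.pyplot as plt

x1 = sm.add_constant(prostate["Capsular"])
resi1 = sm.OLS(prostate["PSA_level"], x1).fit().resid   # Step 1: Y on X1
resi2 = sm.OLS(prostate["CancerVol"], x1).fit().resid   # Step 2: X2 on X1

plt.scatter(resi2, resi1)                               # Step 3: RESI1 vs RESI2
plt.xlabel("RESI2")
plt.ylabel("RESI1")
plt.title("Added-variable plot for X2 given X1")
plt.show()
```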
[Scatterplot of RESI1 (vertical axis) versus RESI2 (horizontal axis)]
(b). (5pts) Do you observe any pattern above? What do you conclude from this added-variable plot?
The points follow an upward-sloping pattern. This suggests that X2 would be a useful addition to the
model that already contains X1.
(c). (4pts) Fit both X1 and X2 in the model and look at the last part of your regression output under
“Unusual Observations”. Locate the row for observation 96: what are the residual and the
standardized/studentized residual?
Unusual Observations

Obs  CancerVol  PSA_level     Fit  SE Fit  Residual  St Resid
 47       15.3      13.07   73.41   11.88    -60.35   -2.07RX
 55       23.3      14.88   57.66   11.68    -42.78   -1.46 X
 79       14.2      31.82   67.87   10.98    -36.05   -1.22 X
 91       25.8      56.26   63.58   13.00     -7.32   -0.26 X
 94       45.6     107.77  132.90   17.71    -25.13   -0.97 X
 95       18.4     170.72   74.35    8.74     96.36    3.19R
 96       17.8     239.85   56.00    5.59    183.84    5.93R
 97       32.1     265.07  123.49   13.95    141.58    5.02RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
For observation 96, the residual is 183.84 and the standardized residual is 5.93.
(d) (5pts) There is another way to find the standardized/studentized residuals: in Stat > Regression >
Regression, choose Storage and check both Residuals and Standardized residuals. Minitab will then
generate two new columns containing the residuals and standardized residuals. Can you locate the
row for observation 96, and do you get the same answer as in (c)? Do you notice any other large
standardized residuals (greater than 2.5 or smaller than -2.5), and which observations do they belong to?
Yes, I get the same answer for observation 96.
Two other observations have standardized residuals greater than 2.5: observations 95 (3.19) and 97 (5.02).
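Outside Minitab, the residuals and standardized (internally studentized) residuals can be obtained from the same two-predictor fit; a sketch under the same `prostate` DataFrame assumption:

```python
# Residuals and internally studentized (standardized) residuals from statsmodels.
import statsmodels.api as sm

X = sm.add_constant(prostate[["Capsular", "CancerVol"]])
fit = sm.OLS(prostate["PSA_level"], X).fit()

resid = fit.resid
std_resid = fit.get_influence().resid_studentized_internal

# Observation 96 is row index 95 if the rows are 0-indexed in the same order
print(resid.iloc[95], std_resid[95])

# Any other standardized residuals beyond +/- 2.5?
print([i + 1 for i, r in enumerate(std_resid) if abs(r) > 2.5])
```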