Test #2

Test #2 Answers
STAT 870
Fall 2012
Complete the problems below. Make sure to fully explain all answers and show your work to receive
full credit!
1) (62 total points, p. 300 of KNN) A researcher studied the effects of the charge rate and
temperature on the life of a new type of power cell in a preliminary small-scale experiment. The
charge rate (X1) was controlled at three levels (0.6, 1.0 and 1.4 amperes) and the ambient
temperature (X2) was controlled at three levels (10, 20, 30°C). Factors pertaining to the discharge
of the power cell were held at fixed levels. The life of the power cell (Y) was measured in terms of
the number of discharge-charge cycles that a power cell underwent before it failed. The data is
available at http://www.chrisbilder.com/test.htm. Below is how I read the data set into R.
set1<-read.table(file = "C:\\data\\test2.csv", header=TRUE, sep = ",")
head(set1)
Note that there are n = 11 observations. Answer the following questions.
a) (6 points) Find the second order sample regression model using X1 and X2 to predict Y.
> mod.fit<-lm(formula = y ~ x1 + x2 + x1:x2 + I(x1^2) + I(x2^2), data = set1)
> summary(mod.fit)

Call:
lm(formula = y ~ x1 + x2 + x1:x2 + I(x1^2) + I(x2^2), data = set1)

Residuals:
      1       2       3       4       5       6       7       8       9      10      11
-21.465   9.263  12.202  41.930  -5.842 -31.842  21.158 -25.404 -20.465   7.263  13.202

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  337.7215   149.9616   2.252   0.0741 .
x1          -539.5175   268.8603  -2.007   0.1011
x2             8.9171     9.1825   0.971   0.3761
I(x1^2)      171.2171   127.1255   1.347   0.2359
I(x2^2)       -0.1061     0.2034  -0.521   0.6244
x1:x2          2.8750     4.0468   0.710   0.5092
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.37 on 5 degrees of freedom
Multiple R-squared:  0.9135,    Adjusted R-squared:  0.8271
F-statistic: 10.57 on 5 and 5 DF,  p-value: 0.01086
The sample regression model is
$\hat{Y} = 337.7215 - 539.5175X_1 + 8.9171X_2 + 171.2171X_1^2 - 0.1061X_2^2 + 2.8750X_1X_2$
b) (8 points) Suppose you were using one of these new power cells. Find the appropriate 95%
interval estimate of its life for a charge rate of 1.0 and power cell stored at 20°C. Interpret the
interval. Use the second order model in part a) when calculating the interval.
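The interval formula is
$\hat{Y}_h \pm t(1 - \alpha/2, n - p)\sqrt{Var(Y_{h(new)} - \hat{Y}_h)}$,
where $Var(Y_{h(new)} - \hat{Y}_h) = MSE \times (1 + X_h'(X'X)^{-1}X_h)$ and
$X_h = (1, 1, 20, 1^2, 20^2, 1 \times 20)'$ contains X1 = 1 and X2 = 20 in the column order of the
second order model.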
The prediction interval is: 69.31 < Y < 256.37. I am 95% confident that the lifetime of the
power cell will be between 69.31 and 256.37 discharge-charge cycles.
> predict(object = mod.fit, newdata = data.frame(x1 = 1, x2 = 20), interval =
    "prediction", level = 0.95)
       fit      lwr      upr
1 162.8421 69.31035 256.3739
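The interval formula above can also be checked by hand. The following is a minimal sketch, not the test's actual calculation: the response values are simulated (the design levels are assumed to be the three charge rates crossed with the three temperatures plus two extra center points), and the names sim, fit, xh, and by_hand are made up for illustration. It builds the interval from MSE and (X'X)^{-1} and compares it with predict().

```r
# Hypothetical data: design levels from the problem statement, but the
# responses are SIMULATED and are not the values in test2.csv.
set.seed(870)
sim <- expand.grid(x1 = c(0.6, 1.0, 1.4), x2 = c(10, 20, 30))
sim <- rbind(sim, data.frame(x1 = c(1.0, 1.0), x2 = c(20, 20)))  # n = 11
sim$y <- 150 - 140 * sim$x1 + 7 * sim$x2 + rnorm(nrow(sim), sd = 30)

fit <- lm(y ~ x1 + x2 + x1:x2 + I(x1^2) + I(x2^2), data = sim)

# Pieces of the formula
mse     <- summary(fit)$sigma^2          # MSE
XtX_inv <- summary(fit)$cov.unscaled     # (X'X)^{-1}
df_err  <- df.residual(fit)              # n - p

# X_h in the coefficient order: (Intercept), x1, x2, I(x1^2), I(x2^2), x1:x2
xh   <- c(1, 1, 20, 1^2, 20^2, 1 * 20)
yhat <- sum(xh * coef(fit))

# sqrt( MSE * (1 + X_h' (X'X)^{-1} X_h) )
se_pred <- sqrt(mse * (1 + drop(t(xh) %*% XtX_inv %*% xh)))
t_crit  <- qt(1 - 0.05 / 2, df = df_err)

by_hand <- c(fit = yhat,
             lwr = yhat - t_crit * se_pred,
             upr = yhat + t_crit * se_pred)

via_predict <- predict(fit, newdata = data.frame(x1 = 1, x2 = 20),
                       interval = "prediction", level = 0.95)

by_hand
via_predict
```

The two results should agree up to rounding. Using interval = "confidence" instead would drop the "1 +" inside the square root and give the narrower interval for the mean response.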
c) (6 points) Provide three ways in R in which the first order terms and an interaction can be
coded into a formula statement for the lm() function. Note that you do not need to find the
actual sample regression model.
x1 + x2 + x1:x2, x1*x2, (x1+x2)^2
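One way to convince yourself that the three codings are equivalent is to compare the design matrices they produce. This is only a small sketch; the data frame d holds a few made-up predictor values and is not the test's data set.

```r
# Hypothetical predictor values; only the resulting columns matter here
d <- data.frame(x1 = c(0.6, 1.0, 1.4, 0.6), x2 = c(10, 20, 30, 30))

m1 <- model.matrix(~ x1 + x2 + x1:x2, data = d)
m2 <- model.matrix(~ x1 * x2, data = d)
m3 <- model.matrix(~ (x1 + x2)^2, data = d)

colnames(m1)   # "(Intercept)" "x1" "x2" "x1:x2"

# TRUE if all three formulas expand to the same columns and values
same <- isTRUE(all.equal(m1, m2)) && isTRUE(all.equal(m1, m3))
same
```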
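d) (6 points) Write out what the X matrix would be for the model of
$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$. You only need to write out the first
and last observations in your X matrix to receive credit.
$$
X = \begin{bmatrix}
1 & X_{11} & X_{12} & X_{11}X_{12} \\
\vdots & \vdots & \vdots & \vdots \\
1 & X_{n1} & X_{n2} & X_{n1}X_{n2}
\end{bmatrix}
= \begin{bmatrix}
1 & 0.6 & 10 & 6 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 1.4 & 30 & 42
\end{bmatrix}
$$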
Note that the model.matrix() function in R could be used to help find this as well.
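As a small sketch of that check, the two rows written out above can be reproduced without the full data set. The data frame ends is a made-up name that holds only the first and last design points from the answer.

```r
# First and last observations as given in the answer above
ends <- data.frame(x1 = c(0.6, 1.4), x2 = c(10, 30))

# Columns: intercept, X1, X2, X1*X2
X_ends <- model.matrix(~ x1 + x2 + x1:x2, data = ends)
X_ends
```

With the actual data, model.matrix(mod.fit) applied to a fitted lm object would return the whole X matrix for that model.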
e) (15 points) Perform the appropriate hypothesis test to compare the models of
$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ to
$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \beta_4 X_1^2$. Use $\alpha = 0.05$.
i) Ho: $\beta_3 = \beta_4 = 0$ vs. Ha: at least one $\beta \neq 0$
ii) F = 1.18, p-value of 0.3694
iii) $\alpha = 0.05$
iv) Because 0.3694 > 0.05, do not reject Ho.
v) There is not sufficient evidence to indicate the extra terms in the Ha model are needed.
> mod.fit1<-lm(formula = y ~ x1 + x2, data = set1)
> mod.fit2<-lm(formula = y ~ x1 + x2 + x1:x2 + I(x1^2), data = set1)
> anova(mod.fit1, mod.fit2)
Analysis of Variance Table

Model 1: y ~ x1 + x2
Model 2: y ~ x1 + x2 + x1:x2 + I(x1^2)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      8 7700.3
2      6 5525.4  2    2175.0 1.1809 0.3694
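The F statistic and p-value can be reproduced from the residual sums of squares and degrees of freedom printed in the anova() table. The sketch below only uses those printed (rounded) numbers, so the last digits may differ slightly from R's output; the variable names are my own.

```r
# Values read off the anova() table
sse_reduced <- 7700.3; df_reduced <- 8   # y ~ x1 + x2
sse_full    <- 5525.4; df_full    <- 6   # y ~ x1 + x2 + x1:x2 + I(x1^2)

# [ (SSE(reduced) - SSE(full)) / (extra terms) ] / [ SSE(full) / df(full) ]
F_stat <- ((sse_reduced - sse_full) / (df_reduced - df_full)) /
          (sse_full / df_full)

# Upper-tail probability from the F distribution with 2 and 6 df
p_value <- pf(F_stat, df1 = df_reduced - df_full, df2 = df_full,
              lower.tail = FALSE)

round(c(F = F_stat, p.value = p_value), 4)
```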
f) Consider the models of $E(Y) = \beta_0 + \beta_1 X_1$, $E(Y) = \beta_0 + \beta_1 X_2$, and
$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ and answer the following questions.
i) (7 points) Using the AIC criterion, which model of the three is the best? Explain.
The model with X1 and X2 since it has the smallest AIC value.
Using my fit.stat() function:
> sum.fit<-summary(mod.fit1)
> mod1<-fit.stat(model = y ~ x1 + x2, data = set1, MSE.all = sum.fit$sigma^2)
> mod2<-fit.stat(model = y ~ x1, data = set1, MSE.all = sum.fit$sigma^2)
> mod3<-fit.stat(model = y ~ x2, data = set1, MSE.all = sum.fit$sigma^2)
> rbind(mod1, mod2, mod3)
        Rsq    AdjRsq AIC.stat SBC.stat    PRESS       Cp
1 0.8729444 0.8411805 111.2790 112.8706 14790.59  3.00000
2 0.3086191 0.2317990 127.9137 129.1074 67914.22 36.53249
3 0.5643253 0.5159170 122.8340 124.0277 36136.55 20.43206
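The fit.stat() function is my own and its code is not shown here, so the following is only a sketch of where the first row of the table plausibly comes from. It assumes AIC.stat and SBC.stat match what R's AIC() returns for an lm fit with k = 2 and k = log(n), and it uses the RSS of 7700.3 reported for y ~ x1 + x2 in part e).

```r
n   <- 11       # observations
p   <- 3        # regression parameters in y ~ x1 + x2
sse <- 7700.3   # RSS for y ~ x1 + x2 from the anova() output in part e)

# -2 * (maximized normal log-likelihood) for a linear model
neg2loglik <- n * (log(2 * pi) + 1 + log(sse / n))

# R counts p regression coefficients plus the error variance
aic <- neg2loglik + 2 * (p + 1)
sbc <- neg2loglik + log(n) * (p + 1)

# Cp with MSE.all taken from this same model, so Cp equals p exactly
mse_all <- sse / (n - p)
cp <- sse / mse_all - (n - 2 * p)

round(c(AIC = aic, SBC = sbc, Cp = cp), 4)
```

The values agree with the first row of the table to the printed precision, which supports the assumption but does not prove how fit.stat() is written.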
ii) (7 points) Using the $R^2$ criterion, which model of the three is the best? Explain.
The model with X1 and X2 since it has the largest $R^2$ value (0.8729 versus 0.3086 and 0.5643) and
there is a significant improvement from the one variable models to the two variable model.
iii) (7 points) Calculate the AIC for the second order model. Is this better than the model(s)
you chose previously? Explain.
> AIC(mod.fit, k = 2)
[1] 113.0456
The AIC value is 113.0456, which is larger than the value of 111.2790 for the first-order model
chosen above. Therefore, the second order model is not better.
2) (38 total points) Answer the following questions.
a) (6 points) Why is the adjusted $R^2$ preferred over $R^2$ as a measurement of model fit?
The adjusted $R^2$ adjusts for the number of predictor variables in the model. Unlike $R^2$, which
never decreases when a variable is added, the adjusted $R^2$ increases only when the added variable
lowers the MSE, so it cannot be inflated simply by adding unimportant variables to the model.
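The adjustment can be seen in the numbers already reported for problem 1. The helper adj_r2() below is a made-up name for the usual formula 1 - ((n - 1)/(n - p))(1 - R^2), where p counts the intercept; the inputs are the rounded R^2 values from the output above.

```r
# Adjusted R^2 from R^2, sample size n, and number of parameters p
adj_r2 <- function(r2, n, p) 1 - (n - 1) / (n - p) * (1 - r2)

adj_first  <- adj_r2(0.8729444, n = 11, p = 3)   # y ~ x1 + x2
adj_second <- adj_r2(0.9135,    n = 11, p = 6)   # second order model

c(first.order = adj_first, second.order = adj_second)
```

Going from the first-order to the second order model raises R^2 from 0.8729 to 0.9135, but the adjusted R^2 drops from about 0.841 to about 0.827 because three extra parameters are being paid for.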
b) (10 points) For the regression model of
$E(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_g X_g + \beta_{g+1} X_{g+1} + \cdots + \beta_{p-1} X_{p-1}$,
the partial F-test uses the following statistic
$$
F = \frac{\left[ SSE(X_1, \ldots, X_g) - SSE(X_1, \ldots, X_g, X_{g+1}, \ldots, X_{p-1}) \right] / (p - 1 - g)}
         {SSE(X_1, \ldots, X_g, X_{g+1}, \ldots, X_{p-1}) / (n - p)}
$$
for the test of Ho: $\beta_{g+1} = \cdots = \beta_{p-1} = 0$ vs. Ha: At least one of the $\beta$'s in
Ho is not 0. Why do large values of F indicate evidence against the null hypothesis? Answer the
question without reference to the critical value or a p-value.
The key part to answering this question is to understand what is represented by the numerator.
Notice that if SSE(X1, …, Xg) is very large compared to SSE(X1, …, Xg, Xg+1, …, Xp-1), F will be
very large. This occurs when the additional variables of Xg+1, …, Xp-1 substantially lower the
SSE. Thus, these additional variables are important given the other variables are in the model.
c) (6 points) Why do values of $C_p$ close to p indicate a good model?
$$
C_p = \frac{SSE(X_1, \ldots, X_{p-1})}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p)
$$
Note that
$$
\frac{SSE(X_1, \ldots, X_{p-1})}{MSE(X_1, \ldots, X_{P-1})}
= \frac{(n - p)\,MSE(X_1, \ldots, X_{p-1})}{MSE(X_1, \ldots, X_{P-1})}
\approx n - p
$$
when the p-1 predictor variables are about just as good as all of the P-1 variables. This leads
$C_p$ to be approximately n – p – (n – 2p) = p.
d) (10 points) Explain stepwise forward selection.
This is a combination of forward selection and backward elimination. The process begins with
two steps of forward selection in order to add two predictor variables to a model with no
variables to begin with. Backward elimination is then performed to see if any of these two
predictor variables can be removed. Forward selection and backward elimination are then
alternated between to see if any more variables can be added or removed from the model.
The process stops when there are no more variables that can be added or removed.
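A related procedure is available in R through step() with direction = "both". Note that step() adds and drops variables based on AIC rather than on the t or F tests used in the description above, so this is a sketch of the same alternating idea and not the exact procedure. The data are simulated and all names are made up.

```r
# Simulated data: x1 and x2 matter, x3 is pure noise
set.seed(1)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - 2 * dat$x2 + rnorm(n)

# Start from the model with no predictor variables
null_fit <- lm(y ~ 1, data = dat)

# At each step, consider adding any variable in 'upper' that is not in the
# model and removing any variable that is; stop when AIC no longer improves
sel <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
            direction = "both", trace = 0)

formula(sel)
```

Setting trace = 1 prints each add or drop step, which makes the alternation between forward and backward moves visible.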
e) (6 points) What is the difference between SSR(X2) and SSR(X2 | X1)? Explain.
SSR(X2) is the regular sums of squares for regression without any other predictor variables in
the model. SSR(X2 | X1) measures the amount that SSE is reduced by adding X2 to a
regression model with X1 already in it. Because this measures the reduction in SSE, it is
examining the amount of additional information being explained by X2 (this is why we call it a
SSR).
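The difference is easy to see numerically when X1 and X2 are correlated. The sketch below uses simulated data (all names and values are made up) and computes both quantities directly from error sums of squares.

```r
set.seed(2)
n  <- 40
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)              # correlated with x1 on purpose
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)

sse  <- function(fit) sum(residuals(fit)^2)
ssto <- sum((y - mean(y))^2)

# SSR(X2): regression sum of squares with X2 as the only predictor
ssr_x2 <- ssto - sse(lm(y ~ x2))

# SSR(X2 | X1): reduction in SSE from adding X2 to a model that has X1
ssr_x2_given_x1 <- sse(lm(y ~ x1)) - sse(lm(y ~ x1 + x2))

c(SSR.x2 = ssr_x2, SSR.x2.given.x1 = ssr_x2_given_x1)

# The x2 row of the sequential ANOVA table is SSR(X2 | X1)
seq_tab <- anova(lm(y ~ x1 + x2))
seq_tab
```

Because anova() on a single lm fit reports sequential sums of squares, the order of the terms in the formula determines which extra sum of squares each row shows.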
3) (3 points, extra credit) What year did Reggie Miller get inducted into the Naismith Memorial
basketball hall of fame?
2012 – Note that we discussed during class whether or not Reggie Miller was a member of the
hall of fame. While we were not sure, one could have looked it up after class.