Project #5 Answers
STAT 870
Fall 2012
Complete the problems below. Within each part, include your R code and output, along with any
additional information needed to explain your answer. Note that you will need to edit your output
and code to make it presentable after you copy and paste it into your Word document.
1) (22 total points) The purpose of this problem is to find the best model for the cheese data set.
a) (4 points) For the model containing acetic, H2S, and lactic in a linear form, construct added
variable plots for each predictor variable. What are the proper forms for the predictor
variables?
> library(RODBC)
> z<-odbcConnectExcel(xls.file = "C:\\data\\cheese.xls")
> cheese<-sqlFetch(channel = z, sqtable = "Sheet1")
> close(z)
> head(cheese)
  Case taste Acetic   H2S Lactic
1    1  12.3  4.543 3.135   0.86
2    2  20.9  5.159 5.043   1.53
3    3  39.0  5.366 5.438   1.57
4    4  47.9  5.759 7.496   1.81
5    5   5.6  4.663 3.807   0.99
6    6  25.9  5.697 7.601   1.09
> mod.fit<-lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)
> summary(mod.fit)

Call:
lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.390  -6.612  -1.009   4.908  25.449

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.8768    19.7354  -1.463  0.15540
Acetic        0.3277     4.4598   0.073  0.94198
H2S           3.9118     1.2484   3.133  0.00425 **
Lactic       19.6705     8.6291   2.280  0.03108 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518,     Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF,  p-value: 3.81e-06
> #Added variable plots
> library(car)
Loading required package: MASS
Loading required package: nnet
> avPlots(model = mod.fit)
[Added-Variable Plots: taste | others plotted against Acetic | others, H2S | others, and Lactic | others]
The plot for Acetic has a random scattering of points, so it does not appear that there is a
relationship between Acetic and taste when including Lactic and H2S in the model.
The plots for H2S and Lactic show linear trends suggesting that they are included correctly in
the model as linear terms.
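As a side note, an added-variable plot can be constructed by hand: regress the response and the predictor of interest each on the remaining predictors, then plot the two sets of residuals against each other. A minimal sketch with simulated data (all variables below are hypothetical, not the cheese data):

```r
# Added-variable (partial regression) plot by hand, simulated data
set.seed(870)
x1 <- rnorm(30)
x2 <- rnorm(30)
y  <- 2 + 3*x2 + rnorm(30)            # x1 is truly unrelated to y
e.y  <- residuals(lm(y ~ x2))         # y adjusted for the other predictor
e.x1 <- residuals(lm(x1 ~ x2))        # x1 adjusted for the other predictor
plot(x = e.x1, y = e.y, xlab = "x1 | others", ylab = "y | others")
# The least squares slope in the added-variable plot equals x1's
# estimated coefficient in the full model
coef(lm(e.y ~ e.x1))[2]
coef(lm(y ~ x1 + x2))["x1"]
```

This slope identity is why a linear trend in the plot supports including the predictor as a linear term, while a random scatter (as for Acetic above) suggests the predictor is not needed.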
b) (2 points) Through the results in the previous part and from project #4, it appears that acetic
acid may not be important to include in the model. Estimate the model with H2S and Lactic only,
where the variables are in a linear form.
> mod.fit2<-lm(formula = taste ~ H2S + Lactic, data = cheese)
> summary(mod.fit2)

Call:
lm(formula = taste ~ H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.343  -6.530  -1.164   4.844  25.618

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -27.592      8.982  -3.072  0.00481 **
H2S            3.946      1.136   3.475  0.00174 **
Lactic        19.887      7.959   2.499  0.01885 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.942 on 27 degrees of freedom
Multiple R-squared: 0.6517,     Adjusted R-squared: 0.6259
F-statistic: 25.26 on 2 and 27 DF,  p-value: 6.551e-07
The estimated model is taste-hat = -27.59 + 3.946(H2S) + 19.887(Lactic).
c) (3 points) Continuing with the model from part b), show that there is not sufficient evidence to
indicate an interaction between H2S and Lactic is needed. Use α = 0.05.
> mod.fit2.inter<-lm(formula = taste ~ H2S + Lactic + H2S:Lactic, data = cheese)
> summary(mod.fit2.inter)

Call:
lm(formula = taste ~ H2S + Lactic + H2S:Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.378  -6.296  -1.211   5.018  25.810

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -23.187     27.749  -0.836    0.411
H2S            3.236      4.382   0.738    0.467
Lactic        16.725     20.479   0.817    0.422
H2S:Lactic     0.488      2.902   0.168    0.868

Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6521,     Adjusted R-squared: 0.6119
F-statistic: 16.24 on 3 and 26 DF,  p-value: 3.768e-06
H0: β3 = 0 vs. Ha: β3 ≠ 0
p-value = 0.868
Because 0.868 > 0.05, do not reject H0.
There is not sufficient evidence to indicate an interaction is needed between H2S and Lactic.
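Because the interaction adds only a single term, the t-test above is equivalent to the partial F-test comparing the models with and without the interaction (F = t²). A quick illustration with simulated data (hypothetical variables, not the cheese data):

```r
# For one added term, summary()'s t-test and anova()'s partial F-test agree
set.seed(870)
x1 <- rnorm(30)
x2 <- rnorm(30)
y  <- 1 + x1 + x2 + rnorm(30)
fit.reduced <- lm(y ~ x1 + x2)
fit.full    <- lm(y ~ x1 + x2 + x1:x2)
t.val <- summary(fit.full)$coefficients["x1:x2", "t value"]
F.val <- anova(fit.reduced, fit.full)[2, "F"]
c(t.val^2, F.val)    # the two statistics are equal
```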
d) (8 points) Using my examine.mod.multiple.final() function and the resulting model from b),
comment on 1) linearity of the regression model, 2) constant error variance, 3) outliers, 4)
influential observations, and 5) normality of ε.
> source(file = "C:\\examine.mod.multiple.final.R")
> save.it<-examine.mod.multiple.final(mod.fit.obj = mod.fit2, first.order = 2,
const.var.test = TRUE, boxcox.find = TRUE)
[Diagnostic plots from examine.mod.multiple.final(): dot plots and box plots of the response and each
predictor variable; residuals, studentized residuals (ri), and studentized deleted residuals (ti) vs.
the estimated mean response; residuals vs. each predictor; residuals vs. observation number; histogram
and normal Q-Q plot of the residuals; DFFITS, Cook's D, and DFBETAS (terms 1 and 2) vs. observation
number; and the Box-Cox transformation plot with its 95% interval for λ.]
> save.it$lambda.hat
[1] 0.67
> save.it$bp
Breusch-Pagan test
data: mod.fit.obj
BP = 1.7776, df = 2, p-value = 0.4111
> save.it$levene
     [,1]   [,2]       [,3]
[1,]    1 3.6493 0.06638479
[2,]    2 3.0876 0.08981992
i) Linearity of the regression model
Plots of the residuals versus each of the predictor variables contain a random scattering of
points. Therefore, no transformations are suggested by the plots.
ii) Constant error variance
The plot of ei vs. Ŷi contains a random scattering of points, so there is no indication of
non-constant error variance. The BP test results in a p-value of 0.41, indicating there is not
sufficient evidence of non-constant error variance. The Levene's tests result in marginally
significant p-values, suggesting there may be a problem with the constant error variance
assumption, but the evidence is not strong. The 95% confidence interval for λ from the Box-Cox
transformation appears to have an upper bound of approximately 1. Also, λ̂ = 0.67. Thus, there
again is some evidence of non-constant error variance, but it is not strong.
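For reference, λ̂ and the 95% interval come from maximizing the Box-Cox profile log-likelihood; a minimal sketch of that calculation using MASS::boxcox() on simulated data (not the cheese fit, so the numbers below are illustrative only):

```r
# Box-Cox: lambda.hat is the lambda maximizing the profile log-likelihood;
# the 95% interval is read from the plotted curve. Simulated data where
# sqrt(y) is linear in x, so lambda.hat should land near 0.5.
library(MASS)
set.seed(870)
x <- runif(50, min = 1, max = 10)
y <- (1 + 0.5*x + rnorm(50, sd = 0.3))^2
save.bc <- boxcox(object = lm(y ~ x), lambda = seq(from = -2, to = 2, by = 0.01))
lambda.hat <- save.bc$x[which.max(save.bc$y)]
lambda.hat
```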
iii) Outliers
The plots of ri vs. Ŷi and ti vs. Ŷi show only one observation (#15) with magnitude greater than
t(1 - 0.01/2; n - p - 1), but not greater than the Bonferroni-corrected version
t(1 - 0.01/(2n); n - p - 1). Note that it would not be unusual to have one observation out of 30
outside of t(1 - 0.01/2; n - p - 1).
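These cutoffs can be computed directly with qt(); a sketch assuming n = 30 observations, p = 3 regression parameters (intercept plus two predictors), and α = 0.01, the values used above:

```r
# Cutoffs for flagging studentized deleted residuals
n <- 30; p <- 3; alpha <- 0.01
cutoff      <- qt(p = 1 - alpha/2, df = n - p - 1)       # about 2.78
cutoff.bonf <- qt(p = 1 - alpha/(2*n), df = n - p - 1)   # larger, more stringent
c(cutoff, cutoff.bonf)
```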
iv) Influential observations
The plots for DFFITS, Cook’s D, and DFBETAS were examined to determine if any
observations were influential. No observations are outside of the criteria given for small to
medium sized data sets. Therefore, I do not have concerns about influential observations.
v) Normality of εi
The QQ-plot of ei has most of its points lying on the straight line, with a few deviations in
the right side. Overall, this plot does not provide sufficient evidence against normality to
warrant a change in the model. The histogram has a somewhat mound shape like a normal
distribution, but it may be a little right skewed. However, with a sample size of only 30,
this is not necessarily surprising. Overall, this plot also does not provide sufficient evidence
against normality to warrant a change in the model.
e) (5 points) Part d) will suggest one appropriate change to the model from b). Make the change
and comment on the five items listed in d) again relative to this new model. Note that you do
not need to include plots here, but make sure to include your code.
Given the results from the Box-Cox transformation calculations, a Y^0.67 transformation may
help with the constant variance problem. Because 0.5 is within the corresponding interval for λ,
I will use a square root transformation here instead because it is more interpretable. Below is
the code and some of the output:
> mod.fit3<-lm(formula = taste^0.5 ~ H2S + Lactic, data = cheese)
> summary(mod.fit3)

Call:
lm(formula = taste^0.5 ~ H2S + Lactic, data = cheese)

Residuals:
     Min       1Q   Median       3Q      Max
-2.50090 -0.54959  0.04868  0.81128  2.26337

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.9257     1.0267  -0.902  0.37524
H2S           0.4449     0.1298   3.427  0.00197 **
Lactic        2.0182     0.9098   2.218  0.03514 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.137 on 27 degrees of freedom
Multiple R-squared: 0.6266,     Adjusted R-squared: 0.5989
F-statistic: 22.65 on 2 and 27 DF,  p-value: 1.676e-06
> save.it3<-examine.mod.multiple.final(mod.fit.obj = mod.fit3, first.order = 2,
const.var.test = TRUE, boxcox.find = TRUE)
> save.it$bp
Breusch-Pagan test
data: mod.fit.obj
BP = 1.7776, df = 2, p-value = 0.4111
> save.it3$lambda.hat
[1] 1.33
> save.it3$levene
     [,1]   [,2]      [,3]
[1,]    1 0.0296 0.8646667
[2,]    2 0.0007 0.9786681
> save.it3$bp
Breusch-Pagan test
data: mod.fit.obj
BP = 1.3625, df = 2, p-value = 0.506
The transformation appears to have helped with the constant variance assumption. The
Levene's tests all have large p-values and λ = 1 is within the confidence interval given by the
Box-Cox procedure. Also, the histogram and QQ-plot look closer to normal than they did
before the transformation. There are no outliers or influential observations shown on the
corresponding plots. Also, there is no evidence of a transformation needed for the predictor
variables. Overall, the model looks good!
Similar findings result from using a Y^0.67 transformation.
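One reason the square-root model is more interpretable is that a predicted sqrt(taste) back-transforms to the original taste scale by simply squaring. A sketch with simulated stand-in data (hypothetical values; the cheese spreadsheet itself is not reproduced here):

```r
# Back-transforming a prediction from a square-root response model
set.seed(870)
H2S <- runif(30, min = 3, max = 10)
Lactic <- runif(30, min = 0.8, max = 2.1)
taste <- (-0.9 + 0.45*H2S + 2*Lactic + rnorm(30))^2
fit <- lm(sqrt(taste) ~ H2S + Lactic)
pred.sqrt <- predict(fit, newdata = data.frame(H2S = 6, Lactic = 1.4))
pred.sqrt^2    # estimated taste on the original measurement scale
```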