Project #5 Answers, STAT 870, Fall 2012

Complete the following problems. Within each part, include your R code and output along with any additional information needed to explain your answer. Note that you will need to edit your output and code to make it look nice after you copy and paste it into your Word document.

1) (22 total points) The purpose of this problem is to find the best model for the cheese data set.

a) (4 points) For the model containing Acetic, H2S, and Lactic in a linear form, construct added-variable plots for each predictor variable. What are the proper forms for the predictor variables?

> library(RODBC)
> z<-odbcConnectExcel(xls.file = "C:\\data\\cheese.xls")
> cheese<-sqlFetch(channel = z, sqtable = "Sheet1")
> close(z)
> head(cheese)
  Case taste Acetic   H2S Lactic
1    1  12.3  4.543 3.135   0.86
2    2  20.9  5.159 5.043   1.53
3    3  39.0  5.366 5.438   1.57
4    4  47.9  5.759 7.496   1.81
5    5   5.6  4.663 3.807   0.99
6    6  25.9  5.697 7.601   1.09

> mod.fit<-lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)
> summary(mod.fit)

Call:
lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.390  -6.612  -1.009   4.908  25.449

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.8768    19.7354  -1.463  0.15540
Acetic        0.3277     4.4598   0.073  0.94198
H2S           3.9118     1.2484   3.133  0.00425 **
Lactic       19.6705     8.6291   2.280  0.03108 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518,     Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF,  p-value: 3.81e-06

> # Added-variable plots
> library(car)
Loading required package: MASS
Loading required package: nnet
> avPlots(model = mod.fit)

[Figure: added-variable plots of taste | others versus Acetic | others, H2S | others, and Lactic | others]

The plot for Acetic has a random scattering of points, so there does not appear to be a relationship between Acetic and taste once H2S and Lactic are in the model. The plots for H2S and Lactic show linear trends, suggesting that these variables are included correctly in the model as linear terms.

b) (2 points) Given the results in the previous part and from project #4, it appears that acetic acid may not be important to include in the model. Estimate the model with H2S and Lactic only, where the variables are in a linear form.

> mod.fit2<-lm(formula = taste ~ H2S + Lactic, data = cheese)
> summary(mod.fit2)

Call:
lm(formula = taste ~ H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.343  -6.530  -1.164   4.844  25.618

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -27.592      8.982  -3.072  0.00481 **
H2S            3.946      1.136   3.475  0.00174 **
Lactic        19.887      7.959   2.499  0.01885 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.942 on 27 degrees of freedom
Multiple R-squared: 0.6517,     Adjusted R-squared: 0.6259
F-statistic: 25.26 on 2 and 27 DF,  p-value: 6.551e-07

The estimated model is taste-hat = -27.59 + 3.946*H2S + 19.887*Lactic.

c) (3 points) Continuing with the model from part b), show that there is not sufficient evidence that an interaction between H2S and Lactic is needed. Use α = 0.05.
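As an aside, this kind of test can also be carried out as a partial F test with anova(); when the full model adds a single term, F = t² and the p-value matches the t test from summary() exactly. A minimal sketch, using simulated data (with placeholder names x1, x2, y) since the cheese spreadsheet itself is not reproduced in this document:

```r
# Sketch: the partial F test for one added term is equivalent to the
# t test on that term's coefficient (F = t^2). Simulated data stand in
# for the cheese data set here.
set.seed(870)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + rnorm(n)       # no true interaction

fit.red  <- lm(y ~ x1 + x2, data = d)           # reduced model
fit.full <- lm(y ~ x1 + x2 + x1:x2, data = d)   # adds the interaction

t.stat <- summary(fit.full)$coefficients["x1:x2", "t value"]
F.stat <- anova(fit.red, fit.full)[2, "F"]      # partial F statistic
all.equal(F.stat, t.stat^2)                     # the two tests agree
```

Either route gives the same conclusion about the H2S:Lactic interaction.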
> mod.fit2.inter<-lm(formula = taste ~ H2S + Lactic + H2S:Lactic, data = cheese)
> summary(mod.fit2.inter)

Call:
lm(formula = taste ~ H2S + Lactic + H2S:Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.378  -6.296  -1.211   5.018  25.810

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -23.187     27.749  -0.836    0.411
H2S            3.236      4.382   0.738    0.467
Lactic        16.725     20.479   0.817    0.422
H2S:Lactic     0.488      2.902   0.168    0.868

Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6521,     Adjusted R-squared: 0.6119
F-statistic: 16.24 on 3 and 26 DF,  p-value: 3.768e-06

H0: β3 = 0 vs. Ha: β3 ≠ 0, where β3 is the coefficient on H2S:Lactic; p-value = 0.868. Because 0.868 > 0.05, do not reject H0. There is not sufficient evidence that an interaction between H2S and Lactic is needed.

d) (8 points) Using my examine.mod.multiple.final() function and the resulting model from b), comment on 1) linearity of the regression model, 2) constant error variance, 3) outliers, 4) influential observations, and 5) normality of ε.

> source(file = "C:\\examine.mod.multiple.final.R")
> save.it<-examine.mod.multiple.final(mod.fit.obj = mod.fit2, first.order = 2, const.var.test = TRUE, boxcox.find = TRUE)

[Figure: diagnostic plots — box plots and dot plots of the response and each predictor; residuals, studentized residuals (ri), and studentized deleted residuals (ti) vs. the estimated mean response; residuals vs. each predictor and vs. observation number; histogram and normal Q-Q plot of the residuals; DFFITS, Cook's D, and DFBETAS vs. observation number; and the Box-Cox log-likelihood plot with its 95% interval]

> save.it$lambda.hat
[1] 0.67
> save.it$bp

        Breusch-Pagan test

data:  mod.fit.obj
BP = 1.7776, df = 2, p-value = 0.4111

> save.it$levene
     [,1]   [,2]       [,3]
[1,]    1 3.6493 0.06638479
[2,]    2 3.0876 0.08981992

i) Linearity of the regression model

The plots of the residuals versus each predictor variable contain a random scattering of points. Therefore, no transformations are suggested by the plots.

ii) Constant error variance

The plot of ei vs. Ŷi contains a random scattering of points, so there is no indication of non-constant error variance there. The BP test results in a p-value of 0.41, indicating there is not sufficient evidence of non-constant error variance. Levene's test results in marginally significant p-values (0.066 and 0.090), suggesting there may be a problem with the constant error variance assumption, but the evidence is not strong. The 95% confidence interval for λ from the Box-Cox transformation appears to have an upper bound of approximately 1, and λ̂ = 0.67. Thus, there again is some evidence of non-constant error variance, but it is not strong.

iii) Outliers

The plots of ri vs. Ŷi and ti vs.
Ŷi show only one observation (#15) greater than t[1-0.01/2; n-p-1], but not greater than the Bonferroni-corrected version t[1-0.01/(2n); n-p-1]. Note that it would not be unusual to have one observation out of 30 outside of t[1-0.01/2; n-p-1].

iv) Influential observations

The plots for DFFITS, Cook's D, and DFBETAS were examined to determine whether any observations were influential. No observations exceed the criteria given for small to medium sized data sets. Therefore, I do not have concerns about influential observations.

v) Normality of εi

The QQ-plot of ei has most of its points lying on the straight line, with a few deviations on the right side. Overall, this plot does not provide sufficient evidence against normality to warrant a change in the model. The histogram has a somewhat mound shape like a normal distribution, but it may be a little right skewed. However, with a sample size of only 30, this is not necessarily surprising. Overall, this plot also does not provide sufficient evidence against normality to warrant a change in the model.

e) (5 points) Part d) will suggest one appropriate change to the model from b). Make the change and comment again on the five items listed in d) relative to this new model. Note that you do not need to include plots here, but make sure to include your code.

Given the results from the Box-Cox transformation calculations, a Y^0.67 transformation may help with the constant variance problem. Because 0.5 is within the corresponding interval for λ, I will use a square-root transformation instead because it is more interpretable. Below is the code and some of the output:

> mod.fit3<-lm(formula = taste^0.5 ~ H2S + Lactic, data = cheese)
> summary(mod.fit3)

Call:
lm(formula = taste^0.5 ~ H2S + Lactic, data = cheese)

Residuals:
     Min       1Q   Median       3Q      Max
-2.50090 -0.54959  0.04868  0.81128  2.26337

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.9257     1.0267  -0.902  0.37524
H2S           0.4449     0.1298   3.427  0.00197 **
Lactic        2.0182     0.9098   2.218  0.03514 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.137 on 27 degrees of freedom
Multiple R-squared: 0.6266,     Adjusted R-squared: 0.5989
F-statistic: 22.65 on 2 and 27 DF,  p-value: 1.676e-06

> save.it3<-examine.mod.multiple.final(mod.fit.obj = mod.fit3, first.order = 2, const.var.test = TRUE, boxcox.find = TRUE)
> save.it$bp   # BP test from the untransformed model, for comparison

        Breusch-Pagan test

data:  mod.fit.obj
BP = 1.7776, df = 2, p-value = 0.4111

> save.it3$lambda.hat
[1] 1.33
> save.it3$levene
     [,1]   [,2]      [,3]
[1,]    1 0.0296 0.8646667
[2,]    2 0.0007 0.9786681
> save.it3$bp

        Breusch-Pagan test

data:  mod.fit.obj
BP = 1.3625, df = 2, p-value = 0.506

The transformation appears to have helped with the constant variance assumption: the Levene's tests all have large p-values, and λ = 1 is within the confidence interval given by the Box-Cox procedure. Also, the histogram and QQ-plot look closer to normal than they did before the transformation. There are no outliers or influential observations shown in the corresponding plots, and there is no evidence that a transformation of the predictor variables is needed. Overall, the model looks good! Similar findings result from using a Y^0.67 transformation.
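For reference, examine.mod.multiple.final() is a course-specific function; the λ̂ and 95% interval it reports can be obtained directly with MASS::boxcox(), the standard tool. A sketch on simulated data (the cheese data are not reproduced here), where the response is generated so that a square-root transformation is roughly appropriate:

```r
library(MASS)   # ships with R; provides boxcox()

set.seed(870)
n <- 30
x <- runif(n, 1, 10)
y <- (2 + 0.5 * x + rnorm(n, sd = 0.3))^2   # sqrt(y) is roughly linear in x

fit <- lm(y ~ x)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]   # maximum-likelihood estimate of lambda

# As in part e), one would then refit on the transformed scale, choosing a
# nearby interpretable power (here the square root) inside the interval.
fit.sqrt <- lm(sqrt(y) ~ x)
```

Setting plotit = TRUE instead draws the log-likelihood curve with the 95% interval, matching the Box-Cox plot produced by the course function.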