Project #2 Answers
STAT 870
Fall 2012

Complete the following problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.

1) (27 total points) Continue using taste as the response variable and acetic acid as the predictor variable as in project #1 for the cheese data. Complete the following.

a) (2 points) Give the ANOVA table.

> library(RODBC)
> z<-odbcConnectExcel(xls.file = "C:\\chris\\unl\\Dropbox\\NEW\\STAT870\\projects\\Fall2012\\cheese.xls")
> cheese<-sqlFetch(channel = z, sqtable = "Sheet1")
> close(z)
> head(cheese)
  Case taste Acetic   H2S Lactic
1    1  12.3  4.543 3.135   0.86
2    2  20.9  5.159 5.043   1.53
3    3  39.0  5.366 5.438   1.57
4    4  47.9  5.759 7.496   1.81
5    5   5.6  4.663 3.807   0.99
6    6  25.9  5.697 7.601   1.09

> mod.fit1<-lm(formula = taste ~ Acetic, data = cheese)
> summary(mod.fit1)

Call:
lm(formula = taste ~ Acetic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-29.642  -7.443   2.082   6.597  26.581

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -61.499     24.846  -2.475  0.01964 *
Acetic        15.648      4.496   3.481  0.00166 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.82 on 28 degrees of freedom
Multiple R-squared: 0.302, Adjusted R-squared: 0.2771
F-statistic: 12.11 on 1 and 28 DF, p-value: 0.001658

> anova(mod.fit1)
Analysis of Variance Table

Response: taste
          Df Sum Sq Mean Sq F value   Pr(>F)
Acetic     1 2314.1 2314.14  12.114 0.001658 **
Residuals 28 5348.7  191.03
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> var(cheese$taste)*(nrow(cheese) - 1)
[1] 7662.887
> save.anova<-anova(mod.fit1)
> names(save.anova)
[1] "Df"      "Sum Sq"  "Mean Sq" "F value" "Pr(>F)"
> save.anova$"Sum Sq"
[1] 2314.142 5348.745
> sum(save.anova$"Sum Sq")
[1] 7662.887

Source of variation   df   SS       MS        F
Regression             1   2314.1   2314.14   12.114
Error                 28   5348.7    191.03
Total                 29   7662.8

Note that 2314.1 + 5348.7 = 7662.8 in the ANOVA table, although SSTO is actually 7662.9 when rounded to one decimal place. The difference is due to rounding error.

b) (3 points) Using the relevant information from the ANOVA table, perform an F-test for β1 = 0 vs. β1 ≠ 0. Use α = 0.05.

i) H0: β1 = 0 vs. Ha: β1 ≠ 0
ii) F = 12.114, p-value = 0.0017
iii) α = 0.05
iv) Because 0.0017 < 0.05, reject H0
v) There is sufficient evidence of a linear relationship between taste and acetic acid.

c) (3 points) What is R2 for the sample regression model? Fully interpret its value.

R2 = 0.30 as given in the summary(mod.fit1) output. Approximately 30% of the variation in taste is accounted for by the acetic acid value.

d) (8 points) Using my examine.mod.simple() function, comment on the following items with regards to the model:
i) Linearity of the regression function
ii) Constant error variance
iii) Normality of εi
iv) Outliers
Make sure to specifically refer to plots and numerical values in your comments.

> save.it1<-examine.mod.simple(mod.fit.obj = mod.fit1, const.var.test = TRUE,
    boxcox.find = TRUE)
> save.it1$levene
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)
group  1  4.5572 0.04167 *
      28
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> save.it1$bp

        Breusch-Pagan test

data:  Y ~ X
BP = 5.4974, df = 1, p-value = 0.01905

> save.it1$lambda.hat
[1] 0.6

[Figure: examine.mod.simple() diagnostic plots for mod.fit1. Panels: box plots and dot plots of the response variable and of the predictor variable; response vs. predictor; residuals vs. predictor; residuals vs. estimated mean response; ei* vs. estimated mean response; residuals vs. observation number; histogram of semistud. residuals; normal Q-Q plot; Box-Cox transformation plot with 95% interval.]

i) Linearity of the regression function – The plot of the residuals vs. acetic acid shows no pattern among plotting points. This indicates the linearity assumption is reasonable.

ii) Constant error variance – The plot of the residuals vs. the estimated mean response shows less variability at the smaller values of estimated mean response than at larger values. However, there are many fewer observations at the smaller values, so concluding nonconstant error variance is difficult. The BP test has a p-value of 0.0191 and Levene’s test has a p-value of 0.0417, so there is marginal evidence of non-constant variance from these hypothesis tests. The Box-Cox transformation plot shows that the 95% confidence interval for λ does not contain 1, but its upper bound is quite close to 1. Again, there is marginal evidence then of non-constant variance.

iii) Normality of εi – The normal QQ-plot has some deviation from the straight line toward the tails of the distribution. The histogram of the semi-studentized residuals also has possible deviation from normality. However, the sample size is only 30, so it may be difficult to assess normality with this size of a sample.
iv) Outliers – A plot of the semistudentized residuals vs. the estimated mean response shows no points outside of the ±3 bounds. Therefore, there are no outliers.

e) (6 points) You should detect one or more potential problems with the model through the work in part d). Transform the response variable to be √taste. Why do you think this transformation was chosen? Determine if the transformation helps to solve a problem with the model.

This transformation was chosen because λ̂ was estimated to be 0.6. Rather than choosing 0.6, I chose 0.5 because it results in a more meaningful transformation (square root).

> mod.fit2<-lm(formula = sqrt(taste) ~ Acetic, data = cheese)
> summary(mod.fit2)

Call:
lm(formula = sqrt(taste) ~ Acetic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5072 -0.9050  0.4291  0.8931  2.3050

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -4.5639     2.7805  -1.641  0.11191
Acetic        1.6719     0.5031   3.323  0.00249 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.547 on 28 degrees of freedom
Multiple R-squared: 0.2828, Adjusted R-squared: 0.2572
F-statistic: 11.04 on 1 and 28 DF, p-value: 0.002489

> save.it2<-examine.mod.simple(mod.fit.obj = mod.fit2, const.var.test = TRUE,
    boxcox.find = TRUE, Y = sqrt(cheese$taste))
> save.it2$levene
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.5849 0.4508
      28
> save.it2$bp

        Breusch-Pagan test

data:  Y ~ X
BP = 1.0556, df = 1, p-value = 0.3042

> save.it2$lambda.hat
[1] 1.19

[Figure: examine.mod.simple() diagnostic plots for mod.fit2. Panels: box plots and dot plots of the response variable and of the predictor variable; response vs. predictor; residuals vs. predictor; residuals vs. estimated mean response; ei* vs. estimated mean response; residuals vs. observation number; histogram of semistud. residuals; normal Q-Q plot; Box-Cox transformation plot with 95% interval.]

The transformation appears to help solve the non-constant error variance problem, although there is still a little less variability for smaller values of Ŷ than for larger values. The 95% confidence interval for λ now contains 1. Also, both the BP and Levene’s tests have large p-values.

f) (3 points) Using the new model that was estimated for part e), find the 95% confidence intervals for taste (not √taste) when acetic acid has a value of 4.5 and 6.4. Compare the intervals to those found in project #1. Which intervals (project #1 or #2) are more likely to have 95% confidence? Explain.

The purpose of this problem is to make sure you understand how to find the response variable in its original form. Also, I wanted you to think about what happens if model assumptions are not satisfied.

> pred1<-predict(object = mod.fit1, newdata = data.frame(Acetic = c(4.5, 6.4)),
    level = 0.95, interval = "confidence")
> pred2<-predict(object = mod.fit2, newdata = data.frame(Acetic = c(4.5, 6.4)),
    level = 0.95, interval = "confidence")
> pred1
       fit       lwr      upr
1  8.91634 -1.628502 19.46118
2 38.64710 28.863754 48.43044
> pred2^2
        fit       lwr      upr
1  8.759054  3.166644 17.13656
2 37.652245 25.414686 52.28717

The 95% confidence intervals for taste (using the transformation-based model) are 3.17 < E(taste) < 17.14 when acetic acid is 4.5 and 25.41 < E(taste) < 52.29 when acetic acid is 6.4.
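The back-transformation behind pred2^2 can be sketched without the data set. This is a minimal sketch, assuming only the rounded coefficients printed by summary(mod.fit2); the names b0, b1, x.new, fit.sqrt, and fit.taste are hypothetical and are not part of the original answer key. Squaring preserves the ordering of the interval endpoints only because every value on the √taste scale is non-negative here.

```r
# Rounded estimates copied from the summary(mod.fit2) output (assumed inputs)
b0 <- -4.5639
b1 <- 1.6719

# Acetic acid values of interest
x.new <- c(4.5, 6.4)

# Estimated mean response on the sqrt(taste) scale
fit.sqrt <- b0 + b1*x.new

# Square to return to the taste scale; this should agree with the fit
# column of pred2^2 up to rounding of the coefficients
fit.taste <- fit.sqrt^2
fit.taste
```

The same squaring step applies to the lwr and upr columns, which is what pred2^2 does for the whole matrix at once.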
In project #1, the corresponding intervals were -1.63 < E(taste) < 19.46 and 28.86 < E(taste) < 48.43, so we do see some differences. The new intervals for this project are more likely to have 95% confidence because the model’s assumptions are closer to being satisfied.

g) (2 points) Are there any other problems with the model after what was done in part e)? Justify your answer. Note that you do not need to actually implement any changes to the model.

There still may be problems with the normality of √taste. The normal QQ-plot has some deviation from the straight line toward the tails of the distribution. The histogram of the semistudentized residuals also has possible deviation from normality. However, the sample size is only 30, so it may be difficult to assess normality with this size of a sample.

2) (12 total points) The extra credit of project #1 asked you to simulate data from a sample regression model using taste as the response variable (not transformed) and acetic acid as the predictor variable. Complete the following using my simulated data from the answer key.

a) (1 point) Run my code to simulate the data. Give the simulated Y values to show that you simulated the data correctly.

> sum.fit1<-summary(mod.fit1)
> set.seed(7172)
> Y.star<-mod.fit1$coefficients[1] + mod.fit1$coefficients[2]*cheese$Acetic +
    rnorm(n = 30, mean = 0, sd = sum.fit1$sigma)
> Y.star
 [1] -10.310369  19.442977  46.226673  22.374284   1.770760  39.405165  28.965504  37.815467  14.692547   4.394592
[11]  47.148690  49.616239  21.646524  12.658015  27.213700  15.769941  12.332883  14.699950  27.294941  14.316974
[21]   5.417620   2.693531  24.563771  27.522578  26.537968  52.735022  42.552976  19.317254  30.032733  36.492530

b) (8 points) Estimate the appropriate regression model for the simulated data.
Using my examine.mod.simple() function, comment on the following items with regards to the model:
i) Linearity of the regression function
ii) Constant error variance (do not examine the Box-Cox transformation value)
iii) Normality of εi
iv) Outliers
Make sure to specifically refer to plots and numerical values in your comments.

> mod.fit.sim<-lm(formula = Y.star ~ cheese$Acetic)
> summary(mod.fit.sim)

Call:
lm(formula = Y.star ~ cheese$Acetic)

Residuals:
    Min      1Q  Median      3Q     Max
-24.523  -6.374   0.103   4.641  24.887

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -80.467     20.509  -3.924 0.000516 ***
cheese$Acetic   18.973      3.711   5.113 2.04e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.41 on 28 degrees of freedom
Multiple R-squared: 0.4828, Adjusted R-squared: 0.4643
F-statistic: 26.14 on 1 and 28 DF, p-value: 2.038e-05

> save.it.sim<-examine.mod.simple(mod.fit.obj = mod.fit.sim, const.var.test = TRUE,
    boxcox.find = TRUE)
> save.it.sim$levene
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.1333 0.7178
      28
> save.it.sim$bp

        Breusch-Pagan test

data:  Y ~ X
BP = 0.0608, df = 1, p-value = 0.8052

[Figure: examine.mod.simple() diagnostic plots for mod.fit.sim. Panels: box plots and dot plots of the response variable and of the predictor variable; response vs. predictor; residuals vs. predictor; residuals vs. estimated mean response; ei* vs. estimated mean response; residuals vs. observation number; histogram of semistud. residuals; normal Q-Q plot.]

i) Linearity of the regression function – The plot of the residuals vs. acetic acid shows no pattern among plotting points. This indicates the linearity assumption is reasonable.

ii) Constant error variance – The plot of the residuals vs. the estimated mean response shows similar levels of variability in the residuals. This indicates that there is not sufficient evidence against the constant variance assumption. The BP test has a p-value of 0.81 and Levene’s test has a p-value of 0.72, so there again is not sufficient evidence against the constant variance assumption.

iii) Normality of εi – The normal QQ-plot has very little deviation from the straight line in the tails. The histogram of the semi-studentized residuals roughly follows a normal distribution, especially for a sample of only 30.

iv) Outliers – A plot of the semistudentized residuals vs. the estimated mean response shows no points outside of the ±3 bounds. Therefore, there are no outliers.

c) (3 points) Answer one of the following:

i) If you found potential problems, what could be a reason for them?

Because we simulated the data, we know the model is correct and that there should not be any problems. Why then were some problems perhaps detected with the QQ-plot? One potential reason is the sample size of only 30.

ii) If you did not find potential problems, why is this expected?

Based on what I saw in the plots and the summary measures, I would have been satisfied with the model. This is expected because we simulated the data with all of the correct assumptions!

Out of curiosity, I increased the sample size to 900 using the following code.
> set.seed(7172)
> Y.star2<-mod.fit1$coefficients[1] + mod.fit1$coefficients[2]*rep(cheese$Acetic,
    times = 30) + rnorm(n = 900, mean = 0, sd = sum.fit1$sigma)
> mod.fit.sim2<-lm(formula = Y.star2 ~ rep(cheese$Acetic, times = 30))
> summary(mod.fit.sim2)

Call:
lm(formula = Y.star2 ~ rep(cheese$Acetic, times = 30))

Residuals:
    Min      1Q  Median      3Q     Max
-41.952  -9.361  -0.046   8.797  43.113

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    -67.6876     4.4980  -15.05   <2e-16 ***
rep(cheese$Acetic, times = 30)  16.7935     0.8139   20.63   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.7 on 898 degrees of freedom
Multiple R-squared: 0.3216, Adjusted R-squared: 0.3209
F-statistic: 425.8 on 1 and 898 DF, p-value: < 2.2e-16

> save.it.sim<-examine.mod.simple(mod.fit.obj = mod.fit.sim2, const.var.test = TRUE)

[Figure: examine.mod.simple() diagnostic plots for mod.fit.sim2. Panels: box plots and dot plots of the response variable and of the predictor variable; response vs. predictor; residuals vs. predictor; residuals vs. estimated mean response; ei* vs. estimated mean response; residuals vs. observation number; histogram of semistud. residuals; normal Q-Q plot.]

Notice the histogram has a much closer shape to a normal distribution. Also, the QQ-plot has almost all of its points very close to the straight line. Thus, normality looks to be satisfied.
This occurs because the larger sample size makes it easier to differentiate non-normal from normal. Also, notice there are 4 semistudentized residuals just outside of the -3 and 3 boundaries. For a normal distribution, we would expect about

> 900 - 900*(pnorm(q = 3) - pnorm(q = -3))
[1] 2.429816

to be outside of the boundaries. Thus, even though we have some values just outside of these boundary lines, this is to be expected. In actual application, one should be worried about a model if there were considerably more than 2.4 values outside and if some were much farther outside than what we observe here.
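The expected count above can be taken one step further to ask how unusual it is to see 4 or more values outside of the boundaries. This is only a rough sketch: it assumes the 900 semistudentized residuals behave like independent N(0,1) draws, which is an approximation, and the names n, p.out, expected.out, and prob.4.or.more are hypothetical rather than part of the original answer key.

```r
# Number of simulated observations
n <- 900

# Probability that a single N(0,1) value falls outside of -3 and 3
p.out <- 2*pnorm(q = -3)

# Expected number outside; same quantity as 900 - 900*(pnorm(3) - pnorm(-3))
expected.out <- n*p.out

# Approximate probability of 4 or more outside, treating the count as
# binomial(n, p.out) under the independence assumption stated above
prob.4.or.more <- 1 - pbinom(q = 3, size = n, prob = p.out)

expected.out
prob.4.or.more
```

Under these assumptions the probability comes out in the neighborhood of 0.2, which is consistent with the conclusion that 4 values just outside of the boundaries is not a cause for concern.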