Solution - Homework 2

Use the data set HW_2 to complete this assignment and regress Y on X.

1. Create boxplots for both X and Y. Are there any outliers?

No outliers identified. See boxplot below.

[Figure: Boxplot of Y, X]

2. Make a Scatterplot with Regression. Does there appear to be a linear relationship?

Yes, there does appear to be a linear relationship.

[Figure: Scatterplot of Y vs X; Y axis from 0.00 to 0.16, X axis from 0 to 20]

3. Check for outliers using the semi-studentized method. Are there any outliers? If so, what are the absolute values of the semi-studentized residuals?

No. All semi-studentized residuals have an absolute value less than four, and no points were identified as potential outliers.

4. Do a check of normality by using a probability plot of the residuals. Include: a) the null and alternative hypotheses, b) the p-value of the test, c) your decision based on a 0.05 level of significance, and d) Minitab copy of your plot.

a) Ho: The residuals come from a normal distribution
   Ha: The residuals do not come from a normal distribution
b) The p-value is 0.033.
c) Since the p-value is less than 0.05, we reject Ho and conclude the error terms are NOT normally distributed.
d) [Figure: Probability Plot of RESI1, Normal - 95% CI; Percent vs RESI1]
   Mean -3.59012E-17, StDev 0.01228, N 23, AD 0.797, P-Value 0.033

5. Do a check of equal variances by performing a Modified Levene Test, Breusch-Pagan Test, and White's Test. Include: a) the null and alternative hypotheses, b) the test statistic, c) p-value and DF of the test (the DF for the BP and White tests only), and d) your decision based on a 0.05 level of significance.
Modified Levene Test

a) Ho: The variances are equal
   Ha: The variances are not equal
b) Test statistic = 9.45
c) The p-value is 0.006.
   NOTE: Remember that Levene's test is more robust against violations of normality than the F-test, making the Levene test a better overall test of equal variances. The only condition for the Levene test is that the variable being tested is continuous.
d) Since the p-value is less than 0.05, we conclude that the assumption of equal variances is NOT satisfied.

Breusch-Pagan Test

a) Ho: All slopes are equal to zero (in the regression of the squared residuals on X)
   Ha: At least one slope differs from 0
b) Test statistic F = 16.38
c) The DF = 1, 21 and the p-value is 0.001.
d) Since the p-value is less than 0.05, we conclude that the assumption of equal variances is NOT satisfied.

White's Test

a) Ho: All slopes are equal to zero (in the regression of the squared residuals on X and X-squared)
   Ha: At least one slope differs from 0
b) Test statistic F = 10.90
c) The DF = 2, 20 and the p-value is 0.001.
d) Since the p-value is less than 0.05, we conclude that the assumption of equal variances is NOT satisfied.

6. Perform a Lack of Fit Test using both Pure Error and Data Subsetting to check if the linear regression function is appropriate. Include: a) the null and alternative hypotheses, b) the correct F-statistic, DF and p-value of the Pure Error test, c) the results of the Data Subsetting test, d) your decision based on a 0.05 level of significance, and e) Minitab copy of your ANOVA output and the Data Subsetting results.

a) Ho: The linear regression function is appropriate
   Ha: The linear regression function is not appropriate
b) The F-statistic is 0.51, DF = 2, 19 and the p-value is 0.610.
c) The p-value for data subsetting is 0.000, indicating the linear model is not a good fit.
d) Since the Pure Error p-value is greater than 0.05, we fail to reject Ho and conclude that the error is due more to random variation within each X than to lack of model fit. However, the low p-value for the data subsetting test points to possible curvature in the model.
e) Analysis of Variance

Source            DF        SS        MS       F      P
Regression         1  0.036190  0.036190  229.00  0.000
Residual Error    21  0.003319  0.000158
  Lack of Fit      2  0.000168  0.000084    0.51  0.610
  Pure Error      19  0.003151  0.000166
Total             22  0.039509

R denotes an observation with a large standardized residual.
Possible lack of fit at outer X-values (P-Value = 0.000)
Overall lack of fit test is significant at P = 0.000

7. Perform a Box-Cox analysis on Y to see if any transformation is suggested. Include a) the estimated and rounded lambda values, b) the interpretation of this value, and c) the Box-Cox plot. NOTE: This can only be done using Minitab Version 15 or higher; i.e., student version 14 does not contain the Box-Cox program.

a) The estimated value is 0.21 and the rounded lambda is 0.00.
b) The rounded value implies we should apply a log transformation on Y.
c) [Figure: Box-Cox Plot of Y; StDev vs Lambda]
   Lambda (using 95.0% confidence): Estimate 0.21, Lower CL -0.13, Upper CL 0.58, Rounded Value 0.00

8. Create a transformation of Y using the natural log. Using these transformed Y values, check the assumption for normality and use the BP method to check constant variance. Include the a) hypotheses, b) test statistic, c) p-value and d) decision. Use 0.05 as the level of significance.

Normality

a) Ho: The residuals come from a normal distribution
   Ha: The residuals do not come from a normal distribution
b) AD = 0.355
c) p-value = 0.429
d) Since the p-value is greater than 0.05, we fail to reject the null hypothesis. The assumption of normality is plausible.

Variance

a) Ho: All slopes are equal to zero (in the regression of the squared residuals on X)
   Ha: At least one slope differs from 0
b) F = 0.82
c) p-value = 0.375
d) Since the p-value is greater than 0.05, we fail to reject the null hypothesis. The assumption of constant variance is plausible.

9. Using the transformed Y-values, conduct lack of fit tests using the Pure Error and Data Subsetting options.
What is a) the p-value for both tests and b) the conclusion for both tests? Use alpha of 5%.

a) The p-value for both tests is 0.000.
b) With the p-value being less than 0.05, we reject the null hypothesis and conclude that the model is not a good fit; the deviations from the fitted line are not due to random error alone.

10. Create a new, squared term for X by squaring each X value. Regress the transformed Y values on both the X and X-squared terms (i.e. put both these X terms in the predictor field in Minitab). Perform lack of fit tests using the Pure Error and Data Subsetting options. What is a) the p-value for both tests, b) the conclusion for both tests, and c) your overall general conclusion about model fit? Use alpha of 5%.

a) The p-value for Pure Error is 0.018, and for Data Subsetting the p-value is greater than 0.1.
b) With the Pure Error p-value being less than 0.05, this indicates that adding a squared term to the model does not result in a well-fitted model. Most likely more term(s) need to be added. Conversely, the data subsetting indicates that the squared term corrects for possible curvature in X.
c) In this case, the Pure Error test is comparing two models, one with and one without the squared term, and concludes that the simpler model is a better fit than the multiple model. This conflicts with the data subsetting results, which show that the squared term corrects for curvature. Since we have replicates in X and have satisfied normality and constant variance with the natural log of Y, we will follow the Pure Error results and conclude that the model is still not a good fit for the data. The best solution would be to find additional predictors.
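
As a check on the arithmetic behind the Pure Error test, the Python sketch below recomputes the lack of fit F-statistic and p-value from the SS and DF entries of the Question 6 ANOVA table. The function name lack_of_fit_f is hypothetical. The p-value uses a closed form for the F distribution that holds only when the numerator DF is 2, as it is here; this is an assumption of the sketch, not a description of how Minitab computes it. Because the inputs are the rounded table entries, the result should agree with the reported 0.51 and 0.610 only to about two decimal places.

```python
def lack_of_fit_f(ss_lof, df_lof, ss_pe, df_pe):
    """Return (F, p-value) for a pure-error lack of fit test.

    Hypothetical helper for illustration; inputs are sums of squares and
    degrees of freedom taken from an ANOVA table.
    """
    if df_lof != 2:
        # The closed-form p-value below is valid only for numerator DF = 2.
        raise ValueError("closed-form p-value requires numerator DF = 2")
    ms_lof = ss_lof / df_lof   # mean square for lack of fit
    ms_pe = ss_pe / df_pe      # mean square for pure error
    f_stat = ms_lof / ms_pe
    # For an F(2, d) distribution, P(F > f) = (1 + 2 f / d) ** (-d / 2).
    p_value = (1.0 + 2.0 * f_stat / df_pe) ** (-df_pe / 2.0)
    return f_stat, p_value


# Values from the Question 6 ANOVA table (rounded as printed there).
f_stat, p_value = lack_of_fit_f(ss_lof=0.000168, df_lof=2,
                                ss_pe=0.003151, df_pe=19)
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")  # F = 0.51, p = 0.61
```

The same decomposition (Residual Error SS = Lack of Fit SS + Pure Error SS) underlies the Pure Error results in Questions 9 and 10, but the ANOVA tables for those fits are not shown here, so they are not recomputed.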