Christopher Knapp
University of Akron, Fall 2011
Regression: Take-home Portion of Test 1

Contents
Problem 1: Original Model; Applying Transformations to the Model
Problem 2: Set 1 Analysis; Set 2 Analysis; Set 3 Analysis; Set 4 Analysis

Problem 1

Original Model

I used JMP to run a regression model on the original variables TEMP and FAIL. The fitted line is displayed below with parameter estimates. The equation of the fitted line is Ỹ = -22.842 + 6.93X. Notice that the p-value for the slope parameter (testing H0: β1 = 0) is .0015. At a 5% significance level, the conclusion is that β1 ≠ 0; therefore, there is strong evidence of a direct relationship between the variables. The coefficient of determination (R²) measures how much of the variation is explained by the regression line. For this model, R² is .437, a moderately strong fit. Notice that the p-value for the intercept parameter (testing H0: β0 = 0) is .579, so there is not enough evidence to claim β0 ≠ 0. This result is less interesting than the slope parameter, because a temperature of 0 degrees is outside the scope of the data. A residual analysis will assess the adequacy of the model assumptions. The graph to the right plots the errors against the fitted Y values. It is clear that the variance of the error terms is not constant: as the fitted Y value increases, so does the variance. The histogram to the right of this plot appears normally distributed, and when it is plotted beside a normal curve, there is a strong indication of normality. The boxplot does reveal an outlier, but with a sample size of 20, this is not a serious concern. The quantitative check of normality is the Shapiro-Wilk test below, which tests the null hypothesis H0: the data are normally distributed. With a p-value of .4483, there is no evidence against normality, so it is reasonable to treat the data as coming from a normal distribution.
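The JMP analysis above can be mirrored in a few lines of Python. This is only an illustrative sketch: the actual TEMP/FAIL data are not reproduced in this report, so the placeholder values below will not reproduce the numbers quoted above.

```python
import numpy as np
from scipy import stats

# Placeholder data standing in for the real TEMP/FAIL values (an assumption --
# the report's dataset of n = 20 observations is not reproduced here).
rng = np.random.default_rng(0)
temp = np.linspace(50, 80, 20)
fail = -22.8 + 6.9 * temp + rng.normal(0, 20, size=20)

# Fit Y = b0 + b1*X; linregress reports the two-sided p-value for H0: b1 = 0.
fit = stats.linregress(temp, fail)
print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}, R^2 = {fit.rvalue**2:.3f}")

# Shapiro-Wilk on the residuals: H0 is normality of the errors, so a large
# p-value means no evidence against the normality assumption.
resid = fail - (fit.intercept + fit.slope * temp)
w, p = stats.shapiro(resid)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")
```

The same two checks (slope t-test and Shapiro-Wilk on the residuals) are the ones reported throughout this test.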
Notice (from the plot below) that the errors follow a slight systematic pattern as they vary with row number: errors tend to be negative for early rows (7 of the first 9) and positive for later rows (6 of the last 8). This suggests a "time" variable may be needed in the regression analysis; however, such an addition is not within the scope of this test. Lastly, the lack-of-fit test (testing H0: a linear model is appropriate) returns a p-value of .0678. This supports a linear model at α = .05, but the p-value is not much larger than .05; a transformation will likely improve this result. In conclusion, the non-constant variance is a problem and can be addressed with a transformation of the Y variable. A transformation of the X variable may be applied afterwards if linearity is compromised. The next section uses the Box-Cox and Box-Tidwell transformations to build a better model.

Applying Transformations to the Model

I used the TRANSREG procedure in SAS to find the best lambda value for transforming Y. The output is displayed to the right and the code is displayed below:

    proc transreg data=p1 ss2 details;
       title2 'Defaults';
       model boxcox(y) = identity(x);
    run;

Notice that the optimal lambda value is ¼. Transforming the Y values with Y → Y¼ produces the regression line displayed to the right. It appears that the transformation worked; that is, the errors now seem to have constant variance, and plotting the errors against the fitted Y values (graphed below the regression line) emphasizes this. Normality of the errors is not as obvious as it was before, but the Shapiro-Wilk p-value (.0866) fails to reject normality at α = .05. The last assumption to check is the appropriateness of the linear model: the lack-of-fit test gives no evidence against linearity (p-value of .243).
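The λ search behind this step can be sketched directly: profile over candidate powers, transform Y (scaled by its geometric mean so error sums of squares are comparable across powers), fit OLS, and keep the λ with the smallest SSE. The Python sketch below illustrates the idea and is not the PROC TRANSREG implementation itself:

```python
import numpy as np

def boxcox_lambda(x, y, lambdas=np.linspace(-2, 2, 81)):
    """Profile Box-Cox for simple regression: return the candidate lambda
    whose transformed-Y OLS fit has the smallest error sum of squares."""
    gm = np.exp(np.mean(np.log(y)))            # geometric mean (y must be > 0)
    X = np.column_stack([np.ones_like(x), x])  # design matrix for b0 + b1*x
    best_lam, best_sse = None, np.inf
    for lam in lambdas:
        if abs(lam) < 1e-12:
            z = gm * np.log(y)                 # limiting case lambda = 0
        else:
            z = (y**lam - 1.0) / (lam * gm**(lam - 1.0))
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        sse = np.sum((z - X @ beta)**2)
        if sse < best_sse:
            best_lam, best_sse = lam, sse
    return best_lam

# Sanity check: if Y**0.25 is exactly linear in X, the profile recovers 1/4.
x = np.arange(1.0, 21.0)
y = (2 + 0.1 * x) ** 4
lam = boxcox_lambda(x, y)
print(f"best lambda: {lam:.2f}")
```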
This p-value is much higher than the original model's .0678. The model is also stronger than the original in terms of R² (from .437 to .539) and the p-value for H0: β1 = 0 (from .0015 to .0002). The application of the Box-Tidwell algorithm, shown to the right, confirms that no power transformation of X will improve the model. Therefore the best model is: (Ỹ)¼ = 1.5311048 + .0702863X.

Problem 2

Set 1 Analysis

The fitted regression line is graphed to the right. The equation is Ỹ1 = 3 + (.5)X1. That is, when X1 = 0, Y1 is expected to be 3, and for every unit increase in X1, a .5 unit increase in Y1 is expected. The R² value is .667, which indicates a fairly strong relationship between X1 and Y1. The p-value for testing H0: β1 = 0 is .0022, so there is strong evidence that β1 ≠ 0. The high R² and the conclusion that β1 ≠ 0 imply that this model does a good job predicting Y1 from X1. A similar test for H0: β0 = 0 gives a p-value of .0257, so there is evidence that β0 ≠ 0. The assumptions of the simple linear regression model appear to hold: the residuals are independent of row number, have constant variance over the predicted values of Y1, contain no outliers, and show no evidence against normality (the Shapiro-Wilk test of H0: normality of error terms gives a p-value of .5456). Lastly, the ANOVA lack-of-fit test is not directly applicable because there are no duplicate X values. Therefore I applied the test after binning adjacent X values to their midpoints: 4,5 → 4.5; 6,7 → 6.5; 8,9 → 8.5; 9,10 → 9.5; and 11,12 → 11.5. The resulting p-value of .9717 gives strong support that the linear model is the appropriate one. Therefore, all the assumptions of the simple linear regression model have been met, and the line Ỹ1 = 3 + (.5)X1 is appropriate.

Set 2 Analysis

The fitted line for this data is graphed to the right: Ỹ2 = 3 + (.5)X2, the same line found in set 1.
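Every summary quoted for sets 1 and 2 (n = 11, the fitted line 3 + .5X, R² = .667) matches Anscombe's quartet; assuming that is the data, the identical fits can be verified directly:

```python
import numpy as np
from scipy import stats

# Anscombe's quartet, sets 1 and 2 (an assumption: the report's quoted
# summaries -- n = 11, line 3 + 0.5X, R^2 = .667 -- match it exactly).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Despite the very different scatterplots, both sets give the same fitted line.
for label, y in (("set 1", y1), ("set 2", y2)):
    fit = stats.linregress(x, y)
    print(f"{label}: Y = {fit.intercept:.3f} + {fit.slope:.3f}X, R^2 = {fit.rvalue**2:.3f}")
```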
From the curvature of the plot, it is clear that this model is inappropriate. The line overestimates large and small values of X2 and underestimates the values in between. Another issue is that the plot increases, peaks, and then decreases; this peak, followed by decreasing values of Y2, indicates that a simple power transformation is not sufficient. In fact, the suggestions given by the Box-Cox and Box-Tidwell algorithms are unsatisfactory. Therefore, the correct choice is to abandon the simple linear regression model altogether. Notice that the polynomial Ỹ2 = 4.268 + (.5)X2 - (.127)(X2-9)² is a near-perfect fit, with an R² value of .999999. This is equivalent to a multiple regression model with predictors X2 and (X2-9)²; however, that topic is outside the scope of exam 1.

Set 3 Analysis

The plotted data is displayed to the right. Notice that one of the high-leverage points is an outlier in the Y distribution; all other points follow a very strong linear pattern. One of two tactics can be applied to data like this: (1) eliminate the outlier if other information reveals it to be bad data, or (2) apply a transformation with a low power to lessen the outlier's effect on the model. My analysis applies both.

Method 1: Eliminate the Outlier

The line Ỹ3 = 4.006 + (.345)X3 results in a very high R² value, indicating an extremely strong linear fit. This near-perfect fit does not require further analysis.

Method 2: Apply a Transformation

Because an outlier is present, the Box-Cox algorithm is applied to lessen its effect on the model. The algorithm returned a convenient lambda value of -2 and a best lambda value of -2.5. The difference in R² between these two transformations is .0065, which is small enough to justify using the convenient lambda. The resulting equation is (Ỹ3)^-2 = .042 - (.002)X3, with a large R² value of .923.
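Assuming again that these are the Anscombe data, Method 1 (dropping the Y outlier at X = 13 and refitting) can be verified in a few lines; the remaining ten points are almost exactly collinear:

```python
import numpy as np
from scipy import stats

# Anscombe's quartet, set 3 (an assumption): nearly perfectly linear except
# for one outlier at X = 13 (Y = 12.74), matching the description above.
x3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# Method 1: drop the outlier and refit the simple linear model.
keep = y3 != 12.74
fit = stats.linregress(x3[keep], y3[keep])
print(f"Y3 = {fit.intercept:.3f} + {fit.slope:.3f}X3, R^2 = {fit.rvalue**2:.4f}")
```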
The transformed line does a good job describing the data (outlier included). However, notice the upward concavity of the plot; a Box-Tidwell transformation will produce a better fit. The SAS output is displayed below. The Box-Tidwell algorithm returned a lambda of .3535534. Applying the power .35 to the X values gives the regression line: (Ỹ3)^-2 = .078 - (.027)(X3)^.35. Notice that the tests for β0 = 0 and β1 = 0 both reject the null hypothesis with p-values less than .0001. In particular, rejecting β1 = 0 with such a low p-value indicates that the model is significant and that a relationship really does exist between the two variables. The slope of the regression line is now negative; the sign changed because of the negative power in the Y transformation. Furthermore, the R² value of the final model is very high, indicating that the model explains the variation in the data well. Unfortunately, an issue arises in confirming the model assumptions. From the graph on the right, constant variance is not satisfied: variance tends to decrease for larger predicted values of Y. Further analysis of the distribution of the error terms reveals an outlier and a skewed distribution. The Shapiro-Wilk test of H0: the errors come from a normal distribution gives a p-value of .005, so normality is rejected with high confidence. The appropriateness of the linear model can be tested by making use of the transformation displayed to the right, which produces multiple Y values for several X values and so allows the lack-of-fit test to be applied. From the table below, the p-value of this test (with H0: the linear model is appropriate) is .795; this high p-value suggests that the linear model is appropriate.

Set 4 Analysis

The fitted regression line for set 4 is displayed to the right.
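Assuming the Anscombe data once more, the set 4 fit and its leverage structure can be checked directly. With ten points at X = 8 and a single point at X = 19, the OLS line must pass through the mean of each X group, so the lone point has a residual of exactly zero and single-handedly determines the slope:

```python
import numpy as np
from scipy import stats

# Anscombe's quartet, set 4 (an assumption): ten observations at X = 8 and a
# single high-leverage point at X = 19.
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

fit = stats.linregress(x4, y4)
print(f"Y4 = {fit.intercept:.3f} + {fit.slope:.3f}X4, R^2 = {fit.rvalue**2:.3f}")

# With only two distinct X values, the fitted line passes through the mean of
# each X group, so the residual at the lone X = 19 point is exactly zero.
resid_19 = 12.50 - (fit.intercept + fit.slope * 19)
print(f"residual at X = 19: {resid_19:.2e}")
```

This makes concrete why the assumptions are so hard to check here: the slope estimate rests entirely on a single observation.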
Notice that there are only two distinct values of X, so there is no way to tell whether a linear model is appropriate. Furthermore, constant variance is difficult to assess since there is only one point at the larger X value. Therefore the only assumptions that can be checked are normality and independence of the errors. The first graph to the right displays the error values over the observation number. If the observations are thought of as ordered in time, there is a possible seasonal effect on the errors, because they appear to follow a sinusoidal pattern; however, more information on this dataset would be needed to analyze that effect. The second graph to the right displays the distribution of the error terms next to the normal curve. The Shapiro-Wilk test (H0: the data come from a normal distribution) gives a p-value of .78; therefore, there is no evidence against normality of the error terms. The final model is Ỹ4 = 3 + (.5)X4. Notice that the tests for β0 = 0 and β1 = 0 both reject the null hypothesis with p-values less than .05. In particular, rejecting β1 = 0 with such a low p-value indicates that the model is significant and that a relationship really does exist between the two variables. Furthermore, the R² value is fairly large (.667), implying the model does a good job of explaining the variance in Y.