Project 1

Christopher Knapp
University of Akron, Fall 2011
Regression
Take-home Portion of Test 1
Contents
Problem 1
   Original Model
   Applying Transformations to the Model
Problem 2
   Set 1 Analysis
   Set 2 Analysis
   Set 3 Analysis
   Set 4 Analysis
Problem 1
Original Model
I used JMP to run a regression model on the original variables TEMP and FAIL. The fitted line is displayed below with parameter estimates. The equation of the fitted line is Ỹ = -22.842 + 6.93X. Notice that the p-value for the slope parameter (testing H0: β1 = 0) is .0015. At a 5% significance level, the conclusion is that β1 ≠ 0; therefore, there is strong evidence of a direct relationship between the variables. The coefficient of determination (R²) measures how much of the variation in Y is explained by the regression line. For this model, R² is .437, a moderately strong fit. Notice that the p-value for the intercept parameter (testing H0: β0 = 0) is .579, so there is not enough evidence to claim β0 ≠ 0. This result is less interesting than the slope parameter, because a temperature of 0 degrees lies outside the range of the data.
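Although the fit above was produced in JMP, the same model can be reproduced in SAS. The following is a minimal sketch, assuming the data set p1 used in the transreg code later in this report holds TEMP in the variable x and FAIL in the variable y (that correspondence is an assumption):

proc reg data=p1;
   model y = x;   /* fits Y-hat = b0 + b1*X and prints parameter
                     estimates, t tests, and R-squared */
run;
quit;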
A residual analysis will assess the adequacy of the model assumptions. The graph to the right plots the errors against the fitted Y values. It is clear that the variance of the error terms is not constant: as the fitted Y value increases, so does the variance. The histogram to the right of this plot suggests the errors are approximately normally distributed. When the histogram is plotted beside a normal curve, there is a strong indication of normality. The boxplot does reveal an outlier, but with a sample size of 20, this is not a serious concern. A quantitative check of normality is given by the Shapiro-Wilk test below, which tests the null hypothesis H0: the errors are normally distributed. With a p-value of .4483, there is no evidence against normality.
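JMP reports the Shapiro-Wilk statistic directly; an equivalent check in SAS (a sketch, under the same assumptions about the data set p1) saves the residuals and passes them to proc univariate:

proc reg data=p1;
   model y = x;
   output out=diag r=resid p=pred;   /* resid = errors, pred = fitted Y */
run;
quit;

proc univariate data=diag normal;    /* NORMAL prints the Shapiro-Wilk   */
   var resid;                        /* test of H0: residuals are normal */
run;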
Notice (from the plot below) that the errors follow a slight systematic pattern as they vary with row number: errors tend to be negative for the early rows (7 of the first 9) and positive for the later rows (6 of the last 8). This suggests a "time" variable may be needed in the regression analysis; however, such an addition is not within the scope of this test.

Lastly, the lack-of-fit test (testing the hypothesis H0: a linear model is appropriate) returns a p-value of .0678. This supports a linear model at α = .05, but the p-value is not much larger than .05. A transformation will likely improve this result.
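In SAS, the same lack-of-fit test can be requested with the lackfit option on the model statement (a sketch; the option requires replicate x values, which this data set has):

proc reg data=p1;
   model y = x / lackfit;   /* partitions the error SS into pure error
                               and lack of fit */
run;
quit;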
In conclusion, the non-constant variance is a problem and can be solved with a transformation on
the Y variable. A transformation on the X variable may be applied afterwards if linearity is
compromised. The next section uses the Box-Cox and Box-Tidwell transformations to create a
better model.
Applying Transformations to the Model
I used the transreg procedure in SAS to find the best
lambda value for transforming Y. The output is displayed
to the right and the code is displayed below:
proc transreg data=p1 ss2 details;
   title2 'Defaults';
   model boxcox(y) = identity(x);   /* Box-Cox family on y; x enters untransformed */
run;
Notice that the optimal lambda value is ¼. Transforming the Y values with Y → Y¼ produces the regression line displayed to the right. It appears that the algorithm worked; that is, the errors seem to have constant variance, and plotting the errors over the fitted Y values (graphed below the regression line) emphasizes this. Notice that normality of the errors is not as obvious as it was before. Applying the Shapiro-Wilk test, the p-value (.0866) does not reject normality at α = .05. The last assumption to check is the appropriateness of the linear model. Applying a lack-of-fit test, there is evidence that a linear model is appropriate (p-value of .243). Notice this p-value is much higher than for the original model.
The model is stronger than the original in terms of R² (from .437 to .539) and in terms of the p-value for H0: β1 = 0 (from .0015 to .0002). The application of the Box-Tidwell algorithm to the right confirms that no power transformation of X will improve the model.

Therefore the best model is: (Ỹ)¼ = 1.5311048 + .0702863X.
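To confirm this model outside of transreg, the transformation can be applied in a data step and the line refit; a sketch under the same assumptions about p1:

data p1_bc;
   set p1;
   y4 = y**0.25;    /* Box-Cox transform Y^(1/4) with lambda = 1/4 */
run;

proc reg data=p1_bc;
   model y4 = x;    /* should reproduce (Y-tilde)^(1/4) = 1.531 + .0703X */
run;
quit;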
Problem 2
Set 1 Analysis
The fitted regression line is graphed to the right. The equation is given by Ỹ1 = 3 + (.5)X1. That is, when X1 = 0, Y1 is expected to be 3, and for every unit increase in X1, a .5 unit increase in Y1 is expected. The R² value is .667, which indicates a fairly strong relationship between X1 and Y1. The p-value for testing H0: β1 = 0 is .0022, so there is strong evidence that β1 ≠ 0. The high value of R² and the conclusion that β1 ≠ 0 imply that this model does a good job predicting Y1 from X1. A similar test for H0: β0 = 0 results in a p-value of .0257; therefore, there is evidence that β0 ≠ 0.
The assumptions of the simple linear regression model appear to hold. Notice that the residuals are independent of row number, have constant variance over the predicted values of Y1, contain no outliers, and are consistent with normality (the Shapiro-Wilk test of H0: normality of error terms resulted in a p-value of .5456).
Lastly, note that the ANOVA lack-of-fit test is not applicable as-is because there are no duplicate x values. Therefore I applied the test after the following grouping transformation on X:

4,5 → 4.5,
6,7 → 6.5,
8,9 → 8.5,
9,10 → 9.5, and
11,12 → 11.5.

The resulting p-value of .9717 implies strong evidence that the linear model is the appropriate one to use.
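A sketch of this idea in SAS, assuming the set 1 variables are named x1 and y1 in a data set set1 (the pairing rule below is one convenient choice and does not exactly reproduce the groups listed above):

data set1g;
   set set1;
   xg = floor(x1/2)*2 + 0.5;   /* pairs adjacent integers:
                                  4,5 -> 4.5; 6,7 -> 6.5; ... */
run;

proc reg data=set1g;
   model y1 = xg / lackfit;    /* grouped x values supply the
                                  replicates the test needs */
run;
quit;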
Therefore, all the assumptions of the simple linear
regression model have been met, and the line
Ỹ1 = 3 + (.5)X1 is appropriate.
Set 2 Analysis
The fitted line for this data is graphed to the right, with Ỹ2 = 3 + (.5)X2 (the same line given in set 1). From the curvature of the plot, it is clear that this model is inappropriate. The line overestimates large and small values of X2 and underestimates all other values of X2. Another issue is that the plot increases, peaks, and decreases. This peak, followed by decreasing values of Y2, indicates that a simple power transformation is not sufficient. In fact, the suggestions given by the Box-Cox and Box-Tidwell algorithms are unsatisfactory. Therefore, the correct choice is to abandon the simple linear regression model altogether.

Notice that the polynomial Ỹ2 = 4.268 + (.5)X2 - (.127)(X2 - 9)² is a near-perfect fit, with an R² value of .999999. This is equivalent to a multiple regression model with predictors X2 and (X2 - 9)²; however, this topic is outside the scope of exam 1.
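A sketch of fitting this polynomial in SAS, assuming the set 2 variables are named x2 and y2 in a data set set2:

data set2q;
   set set2;
   x2sq = (x2 - 9)**2;    /* the centered-and-squared term (X2 - 9)^2 */
run;

proc reg data=set2q;
   model y2 = x2 x2sq;    /* Y2-hat = b0 + b1*X2 + b2*(X2 - 9)^2 */
run;
quit;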
Set 3 Analysis
The plotted data is displayed to the right. Notice that one of the high-leverage points is an outlier in the Y distribution. All other points follow a very strong linear pattern. One of two tactics can be applied to data like this: (1) eliminate the outlier if other information reveals it as bad data, or (2) apply a transformation with low power to lessen the effect of the outlier on the model. My analysis will apply both methods.
Method 1: Eliminate the Outlier
The line Ỹ3 = 4.006 + (.345)X3 results in a very high R² value, indicating an extremely strong fit. This near-perfect fit does not require further analysis.
Method 2: Apply a Transformation
Because there is an outlier present, the Box-Cox algorithm will be applied to lessen its effect on the model. The algorithm returned a convenient lambda value of -2 and a best lambda value of -2.5. The difference in R² between these two transformations is .0065, which is small enough to justify using the convenient lambda. The resulting equation is (Ỹ3)^-2 = .042 - (.002)X3. Notice the large R² value of .923. This line does a good job describing the data (which includes the outlier). However, notice the upward concavity of this plot. A Box-Tidwell transformation will result in a better fit; the SAS output is displayed below.
The Box-Tidwell algorithm resulted in a lambda of .3535534. I applied the power .35 to the X values, resulting in the regression line (Ỹ3)^-2 = .078 - (.027)(X3)^.35. Notice that the tests for β0 = 0 and β1 = 0 both rejected the null hypothesis with p-values less than .0001. In particular, the rejection of β1 = 0 with such a low p-value indicates that this model is significant and that a relationship really does exist between the two variables. Notice that the slope of the regression line is negative; the sign changed because of the negative power applied to Y. Furthermore, the R² value of the final model is very high, which indicates that the model did a good job explaining the variation in the data.
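A sketch of applying both transformations and refitting in SAS, assuming the set 3 variables are named x3 and y3 in a data set set3:

data set3t;
   set set3;
   y3t = y3**(-2);      /* Box-Cox power -2 on Y3      */
   x3t = x3**(0.35);    /* Box-Tidwell power .35 on X3 */
run;

proc reg data=set3t;
   model y3t = x3t;     /* refits (Y3-tilde)^-2 = .078 - (.027)(X3)^.35 */
run;
quit;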
Unfortunately, an issue does arise in confirming the assumptions behind the model. From the graph on the right, constant variance is not satisfied: variance tends to decrease for larger values of predicted Y. Further analysis of the distribution of error terms reveals an outlier and a skewed distribution. The Shapiro-Wilk test of H0: the errors came from a normal distribution resulted in a p-value of .005, so normality can be rejected with high confidence. The test for the appropriateness of a linear model can be applied by making use of the grouping transformation displayed to the right. The result is multiple Y values for several X values, which allows the lack-of-fit test to be applied. From the table below, the p-value of this test (with H0: a linear model is appropriate) is .795. This high p-value suggests that the linear model is appropriate.
Set 4 Analysis
The fitted regression line for set 4 is displayed to the
right. Notice that there are only two distinct values for
x, so there is no way of telling if a linear model is
appropriate. Furthermore, constant variance is difficult
to determine since there is only one point for the larger
x-value. Therefore the only assumptions that can be
checked are normality and independence of errors.
The first graph to the right displays the error values over the observation number. If the observations are thought of as ordered in time, there is a possible seasonal impact on the errors, because they appear to follow a sinusoidal pattern. However, more information on this dataset would be needed to analyze this effect. The second graph to the right displays the distribution of the error terms next to the normal curve. The Shapiro-Wilk test (with H0: the data came from a normal distribution) results in a p-value of .78; therefore, there is no evidence against normality of the error terms.
The final model is given by Ỹ4 = 3 + (.5)X4. Notice that the tests for β0 = 0 and β1 = 0 both rejected the null hypothesis with p-values less than .05. In particular, the rejection of β1 = 0 with such a low p-value indicates that this model is significant and that a relationship really does exist between the two variables. Furthermore, the R² value is fairly large (.667), implying the model does a good job of explaining the variance in Y.
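Because the slope estimate here hinges entirely on the single point at the larger x value, that point's influence is worth quantifying; a sketch using proc reg's influence diagnostics, with assumed names set4, x4, and y4:

proc reg data=set4;
   model y4 = x4 / influence;   /* prints leverage (hat) values and
                                   deletion diagnostics; the lone point
                                   at the larger x should show leverage
                                   near 1 */
run;
quit;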