Stat 401G Lab 7: Due October 22 Fall 2012

1. Dugongs are large aquatic mammals similar to manatees but native to the Indian and Pacific
Oceans. Data were collected on the age (years) and length (meters) of 27 dugongs captured near
Townsville in north Queensland, Australia. The data are given below.
Age    Length     Age    Length     Age    Length
1      1.80       8      2.47       13     2.47
1.5    1.85       8.5    2.19       14.5   2.56
1.5    1.87       9      2.26       15.5   2.65
1.5    1.77       9.5    2.40       15.5   2.47
2.5    2.02       9.5    2.39       16.5   2.64
4      2.27       10     2.41       17     2.56
5      2.15       12     2.50       22.5   2.70
5      2.26       12     2.32       29     2.72
7      2.35       13     2.43       31.5   2.57
a) Plot Length versus Age. Describe the general pattern.
As dugongs get older they tend to get longer but the trend is not linear. Their length
increases more quickly when they are younger and slows down as they get to be over 20
years of age.
b) Fit a simple linear model relating Length to Age.
i. Give the equation of the least squares line.
Predicted Length = 2.018 + 0.029*Age
ii. Interpret both the estimated intercept and the estimated slope within the context of the
problem.
The estimated intercept, 2.018 meters, is the predicted length at Age = 0. Strictly, it
does not have an interpretation within the context of the problem because a dugong with
no age does not exist, although one might read it as the predicted length of a newborn
dugong. We should be careful because we do not have any observations for dugongs less
than one year old.
The estimated slope indicates that for each additional year, dugongs grow 0.029
meters on average. This is the average yearly increase in length.
iii. Comment on how well the simple linear model fits the data. Be sure to mention the R^2
value, RMSE, model utility, significance of variables in the model, and the plot of
residuals versus Age.
R^2 = 0.688, so 68.8% of the variation in Length can be explained by the linear
relationship with Age.
RMSE = 0.156
The model is useful. F = 55.2092, P-value < 0.0001. The small P-value indicates
that the model is statistically significant.
Age is statistically significant. t = 7.43, P-value < 0.0001. The small P-value
indicates that Age is a statistically significant variable. The test for model utility
also tells you this because there is only one variable, Age, in the model.
There is a clear curved pattern in the residuals. The simple linear regression
over-predicts, then under-predicts, and over-predicts again. We could do better by
adding a term to the model that would account for the curvature.
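The summary values quoted in part b) can be checked by hand. The following is a minimal sketch, assuming the 27 (Age, Length) pairs are read from the data table in the order given; it uses the closed-form least squares formulas rather than any particular statistics package, and the variable names are illustrative only.

```python
import math

# Dugong data from the table in problem 1 (age in years, length in meters).
age = [1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7,
       8, 8.5, 9, 9.5, 9.5, 10, 12, 12, 13,
       13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5]
length = [1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.35,
          2.47, 2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.43,
          2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(length) / n

# Corrected sums of squares and cross products.
sxx = sum((x - mean_x) ** 2 for x in age)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, length))
syy = sum((y - mean_y) ** 2 for y in length)

# Least squares slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Error sum of squares and the usual summary statistics.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(age, length))
r_squared = 1 - sse / syy
rmse = math.sqrt(sse / (n - 2))          # n - 2 error degrees of freedom
f_ratio = (syy - sse) / (sse / (n - 2))  # model utility F with 1 and n - 2 df
t_slope = math.sqrt(f_ratio)             # one predictor and a positive slope, so t = sqrt(F)

print(f"Predicted Length = {intercept:.3f} + {slope:.3f}*Age")
print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:.3f}, F = {f_ratio:.2f}, t = {t_slope:.2f}")
```

The last line also illustrates why the test for model utility and the t-test for Age agree here: with a single explanatory variable, the F statistic is the square of the t statistic.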
c) Fit a polynomial regression (degree=2) model with Age and Age^2 as the explanatory
variables. Do not center variables.
i. Give the equation of the least squares line.
Predicted Length = 1.802 + 0.074*Age – 0.0015*Age^2
ii. Why is it difficult to interpret the parameter estimates for this model?
You cannot hold Age^2 constant while changing Age or vice versa. Therefore it is
difficult to talk about the average change in length for a one year change in Age,
while holding Age^2 constant.
iii. Comment on how well the model fits the data. Be sure to mention the R^2 value, RMSE,
model utility, significance of variables in the model, and the plot of residuals versus
Age.
R^2 = 0.892, so 89.2% of the variation in Length can be explained by the linear
relationship with Age and Age^2.
RMSE = 0.094
The model is useful. F = 99.4272, P-value < 0.0001. The small P-value indicates
that the model is statistically significant.
Age is statistically significant. t = 10.45, P-value < 0.0001. The small P-value
indicates that Age adds significantly to the model.
Age^2 is statistically significant. t = –6.74, P-value < 0.0001. The small P-value
indicates that Age^2 adds significantly to the model.
There does not seem to be a pattern in the residuals that would suggest that adding
another polynomial term would improve on the fit of the data.
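The quadratic fit in part c) can be reproduced without a statistics package. This is a sketch only: it assumes the data table as given, and it solves the normal equations directly with two hypothetical helpers (`solve_linear_system` and `polyfit_normal_equations`), which is not necessarily how the software used for the lab computes the fit.

```python
import math

age = [1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7,
       8, 8.5, 9, 9.5, 9.5, 10, 12, 12, 13,
       13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5]
length = [1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.35,
          2.47, 2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.43,
          2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57]


def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        partial = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - partial) / m[i][i]
    return coef


def polyfit_normal_equations(xs, ys, degree):
    """Return [b0, b1, ..., b_degree] minimizing the sum of squared residuals."""
    k = degree + 1
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


b0, b1, b2 = polyfit_normal_equations(age, length, 2)
fitted = [b0 + b1 * x + b2 * x ** 2 for x in age]

n = len(age)
mean_y = sum(length) / n
sse = sum((y - f) ** 2 for y, f in zip(length, fitted))
sst = sum((y - mean_y) ** 2 for y in length)
r_squared = 1 - sse / sst
rmse = math.sqrt(sse / (n - 3))  # three estimated parameters

print(f"Predicted Length = {b0:.3f} + {b1:.3f}*Age + ({b2:.4f})*Age^2")
print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:.3f}")
```

Passing `degree=3` to the same helper gives an uncentered cubic fit, although the normal equations become noticeably less well conditioned as the degree grows.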
d) Fit a polynomial regression (degree=3) with Age, Age^2 and Age^3 as the explanatory
variables. Do not center variables.
i. Give the equation of the least squares line.
Predicted Length = 1.757 + 0.094*Age – 0.0033*Age^2 + 0.0000383*Age^3
ii. Comment on how well the model fits the data. Be sure to mention the R^2 value, RMSE,
model utility, and the significance of variables in the model.
R^2 = 0.898, so 89.8% of the variation in Length can be explained by the linear
relationship with Age, Age^2 and Age^3.
RMSE = 0.093
The model is useful. F = 67.3881, P-value < 0.0001. The small P-value indicates
that the model is statistically significant.
Age is statistically significant. t = 4.92, P-value < 0.0001. The small P-value
indicates that Age adds significantly to the model.
Age^2 is statistically significant. t = –2.08, P-value = 0.0487. The small P-value
indicates that Age^2 adds significantly to the model but just barely.
Age^3 is not statistically significant. t = 1.12, P-value = 0.2753. The P-value is not
small. This indicates that Age^3 does not add significantly to the model.
e) For the model with Age, Age^2 and Age^3 as the explanatory variables, look at the distribution
of residuals. Comment on the conditions of identically and normally distributed errors and
the equal standard deviation condition. Be sure to refer to the appropriate plots in your
comments.
The box plot shows no outliers and the histogram is unimodal. This suggests that the
condition of identically distributed errors is met.
The histogram is mounded slightly to the right of zero. The box plot shows a fairly
symmetric shape. The points on the normal quantile plot follow the diagonal normal
model line very closely. Although not perfect, the condition of normally distributed
errors is probably met.
The plot of residuals versus age shows more variation for the younger dugongs and less
variation for the older dugongs. This could be an indication of differing standard
deviations. It could also be an artifact of not having very many older dugongs in the
sample. The condition of equal standard deviation is in some doubt.
f) Which model b), c) or d) does a better job of predicting the lengths of dugongs? To answer
this question you should look at the predictions especially for older dugongs. Note: This
question is not asking which model is the best statistical model.
Although the model with Age and Age^2 fits the definition of “best” (it is a useful
model in which every variable adds significantly, and it has the highest R^2 value among
such models), there are some problems with its predicted lengths. Once dugongs get to be
over 25 years of age, the predicted lengths actually start to decrease. The model with
Age, Age^2 and Age^3 gives predictions that tend to level off at older ages rather than
decrease. In this situation, the cubic model might be thought to produce more realistic
predictions even though the cubic term is not statistically significant. See the plot
that follows.
[Plot: Length (Y) versus Age, with the Predicted Length (Quadratic) and Predicted Length (Cubic) curves overlaid.]
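The behavior described in part f) can be seen by evaluating the two reported equations at older ages. This is a rough sketch that uses the rounded coefficients from parts c) and d), so the predictions are approximate; the function names are illustrative.

```python
def predict_quadratic(age):
    # Rounded coefficients reported in part c).
    return 1.802 + 0.074 * age - 0.0015 * age ** 2


def predict_cubic(age):
    # Rounded coefficients reported in part d).
    return 1.757 + 0.094 * age - 0.0033 * age ** 2 + 0.0000383 * age ** 3


# A downward-opening parabola b0 + b1*x + b2*x^2 peaks at x = -b1 / (2*b2).
peak_age = 0.074 / (2 * 0.0015)
print(f"Quadratic predictions peak near Age = {peak_age:.1f}")

for a in (15, 20, 25, 30, 35):
    print(f"Age {a:>2}: quadratic {predict_quadratic(a):.3f}, cubic {predict_cubic(a):.3f}")
```

With these rounded coefficients the quadratic predictions turn downward just before age 25, while the cubic predictions stay nearly flat between ages 25 and 35.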
g) Report the correlations between Age and Age^2, Age and Age^3, Age^2 and Age^3. Is there
statistically significant multicollinearity?
Correlation between Age and Age^2 is 0.9440 which is statistically significant.
Correlation between Age and Age^3 is 0.8663 which is statistically significant.
Correlation between Age^2 and Age^3 is 0.9809 which is statistically significant.
Because there are statistically significant correlations among the explanatory variables
there is statistically significant multicollinearity.
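The correlations in part g) depend only on the Age values. A small sketch, assuming the ages from the data table and a hypothetical helper `correlation` that computes the ordinary Pearson correlation:

```python
import math

age = [1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7,
       8, 8.5, 9, 9.5, 9.5, 10, 12, 12, 13,
       13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5]


def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


age_sq = [x ** 2 for x in age]  # uncentered quadratic term
age_cu = [x ** 3 for x in age]  # uncentered cubic term

r_age_sq = correlation(age, age_sq)
r_age_cu = correlation(age, age_cu)
r_sq_cu = correlation(age_sq, age_cu)

print(f"corr(Age, Age^2)   = {r_age_sq:.4f}")
print(f"corr(Age, Age^3)   = {r_age_cu:.4f}")
print(f"corr(Age^2, Age^3) = {r_sq_cu:.4f}")
```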
h) Fit a polynomial regression (degree=2) with Age and (Age – Mean Age)^2.
i. Give the equation of the least squares line.
Predicted Length = 1.987 + 0.040*Age – 0.0015*(Age – Mean Age)^2
ii. Comment on how well the model fits the data. Be sure to mention the R^2 value, RMSE,
model utility, significance of variables in the model, and the plot of residuals versus
Age.
R^2 = 0.892, so 89.2% of the variation in Length can be explained by the linear
relationship with Age and (Age – Mean Age)^2.
RMSE = 0.094
The model is useful. F = 99.4272, P-value < 0.0001. The small P-value indicates
that the model is statistically significant.
Age is statistically significant. t = 13.99, P-value < 0.0001. The small P-value
indicates that Age adds significantly to the model.
(Age – 10.9444)^2 is statistically significant. t = –6.74, P-value < 0.0001. The small
P-value indicates that (Age – 10.9444)^2 adds significantly to the model.
There does not seem to be a pattern in the residuals that would suggest that adding
another polynomial term would improve on the fit of the data.
iii. How does this model compare to the model in c)?
This model will give exactly the same predictions as the model in c). The summary
of the fit for both models is exactly the same as is the test for model utility. Note
that the t-Ratio for Age is quite a bit different (t = 10.45 for c and t = 13.99 for h).
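One way to see why the two quadratic models give the same predictions is to expand the centered term algebraically. The sketch below uses the rounded coefficients reported in part h) and Mean Age = 10.9444, so it recovers the part c) equation only up to rounding error; the names are illustrative.

```python
mean_age = 10.9444
c0, c1, c2 = 1.987, 0.040, -0.0015  # rounded centered-model estimates from part h)

# c2*(Age - m)^2 = c2*Age^2 - 2*c2*m*Age + c2*m^2, so collecting terms gives
# the equivalent uncentered coefficients:
b0 = c0 + c2 * mean_age ** 2
b1 = c1 - 2 * c2 * mean_age
b2 = c2


def centered(age):
    return c0 + c1 * age + c2 * (age - mean_age) ** 2


def expanded(age):
    return b0 + b1 * age + b2 * age ** 2


print(f"Expanded form: {b0:.3f} + {b1:.3f}*Age + ({b2:.4f})*Age^2")
for a in (1, 10, 31.5):
    print(f"Age {a}: centered {centered(a):.4f}, expanded {expanded(a):.4f}")
```

The expansion also suggests why the t-Ratio for Age changes: in the centered model the coefficient of Age is the slope of the fitted curve at the mean age, while in the uncentered model it is the slope at Age = 0, so the two tests concern different quantities.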
i) Fit a polynomial regression (degree=3) with Age, (Age – Mean Age)^2 and (Age – Mean
Age)^3 as the explanatory variables.
i. Give the equation of the least squares line.
Predicted Length = 2.051 + 0.036*Age – 0.0020*(Age – 10.9444)^2
+ 0.0000383*(Age – 10.9444)^3
ii. Comment on how well the model fits the data. Be sure to mention the R^2 value, RMSE,
model utility, and the significance of variables in the model.
R^2 = 0.898, so 89.8% of the variation in Length can be explained by the linear
relationship with Age, (Age – Mean Age)^2 and (Age – Mean Age)^3.
RMSE = 0.093
The model is useful. F = 67.3881, P-value < 0.0001. The small P-value indicates
that the model is statistically significant.
Age is statistically significant. t = 6.95, P-value < 0.0001. The small P-value
indicates that Age adds significantly to the model.
(Age – 10.9444)^2 is statistically significant. t = –4.11, P-value = 0.0004. The small
P-value indicates that (Age – 10.9444)^2 adds significantly to the model.
(Age – 10.9444)^3 is not statistically significant. t = 1.12, P-value = 0.2753. The
P-value is not small. This indicates that (Age – 10.9444)^3 does not add significantly
to the model.
iii. How does this model compare to the model in d)?
This model will give exactly the same predictions as the model in d). The summary
of the fit for both models is exactly the same as is the test for model utility. Note
that the t-Ratios for Age (t=4.92 in d and t=6.95 in i) and for the quadratic term
(t=–2.08, P-value=0.0487 in d and t=–4.11, P-value=0.0004 in i) are quite different.
j) Report the correlations between Age and (Age – Mean Age)^2, Age and (Age – Mean Age)^3,
(Age – Mean Age)^2 and (Age – Mean Age)^3. Is there statistically significant
multicollinearity?
Correlation between Age and (Age – 10.9444)^2 is 0.5825 which is statistically significant.
Correlation between Age and (Age – 10.9444)^3 is 0.8274 which is statistically significant.
Correlation between (Age – 10.9444)^2 and (Age – 10.9444)^3 is 0.8873 which is
statistically significant.
Because there are statistically significant correlations among the explanatory
variables there is statistically significant multicollinearity.
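The effect of centering on these correlations can be checked directly from the Age values. As before, this is a sketch under the assumption that the ages are those in the data table; `correlation` is a hypothetical helper for the Pearson correlation.

```python
import math

age = [1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7,
       8, 8.5, 9, 9.5, 9.5, 10, 12, 12, 13,
       13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5]


def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


mean_age = sum(age) / len(age)
centered_sq = [(x - mean_age) ** 2 for x in age]  # (Age - Mean Age)^2
centered_cu = [(x - mean_age) ** 3 for x in age]  # (Age - Mean Age)^3

r_age_sq = correlation(age, centered_sq)
r_age_cu = correlation(age, centered_cu)
r_sq_cu = correlation(centered_sq, centered_cu)

print(f"Mean Age = {mean_age:.4f}")
print(f"corr(Age, (Age - Mean Age)^2)                = {r_age_sq:.4f}")
print(f"corr(Age, (Age - Mean Age)^3)                = {r_age_cu:.4f}")
print(f"corr((Age - Mean Age)^2, (Age - Mean Age)^3) = {r_sq_cu:.4f}")
```

Centering lowers the correlation between Age and the quadratic term considerably, but it does not remove it, because the ages in this sample are skewed toward younger animals.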