Stat 401G Lab 7: Due October 22, Fall 2012

1. Dugongs are large aquatic mammals similar to manatees but native to the Indian and Pacific Oceans. Data were collected on the age (years) and length (meters) of 27 dugongs captured near Townsville in north Queensland, Australia. The data are given below.

Age      1     1.5   1.5   1.5   2.5   4     5     5     7
Length   1.80  1.85  1.87  1.77  2.02  2.27  2.15  2.26  2.35

Age      8     8.5   9     9.5   9.5   10    12    12    13
Length   2.47  2.19  2.26  2.40  2.39  2.41  2.50  2.32  2.43

Age      13    14.5  15.5  15.5  16.5  17    22.5  29    31.5
Length   2.47  2.56  2.65  2.47  2.64  2.56  2.70  2.72  2.57

a) Plot Length versus Age. Describe the general pattern.

As dugongs get older they tend to get longer, but the trend is not linear. Length increases more quickly when dugongs are younger, and the increase slows down once they are over 20 years of age.

b) Fit a simple linear model relating Length to Age.

i. Give the equation of the least squares line.

Predicted Length = 2.018 + 0.029*Age

ii. Interpret both the estimated intercept and the estimated slope within the context of the problem.

Strictly speaking, the estimated intercept does not have an interpretation within the context of the problem because a dugong with no age does not exist. One might interpret the estimated intercept as the predicted length (2.018 meters) of a newborn dugong, but we should be careful because we do not have any observations for dugongs less than one year old.

The estimated slope indicates that for each additional year of age, dugongs are 0.029 meters longer on average. This is the average yearly increase in length.

iii. Comment on how well the simple linear model fits the data. Be sure to mention the R² value, RMSE, model utility, significance of variables in the model, and the plot of residuals versus Age.

R² = 0.688, so 68.8% of the variation in Length can be explained by the linear relationship with Age.
RMSE = 0.156
The model is useful. F = 55.2092, P-value < 0.0001. The small P-value indicates that the model is statistically significant.
Age is statistically significant.
t = 7.43, P-value < 0.0001. The small P-value indicates that Age is a statistically significant variable. The test for model utility also tells you this because there is only one variable, Age, in the model.
There is a clear curved pattern in the residuals. The simple linear regression over-predicts, then under-predicts, and then over-predicts again. We could do better by adding a term to the model that would account for the curvature.

c) Fit a polynomial regression (degree=2) model with Age and Age² as the explanatory variables. Do not center variables.

i. Give the equation of the least squares line.

Predicted Length = 1.802 + 0.074*Age – 0.0015*Age²

ii. Why is it difficult to interpret the parameter estimates for this model?

You cannot hold Age² constant while changing Age, or vice versa. Therefore it is difficult to talk about the average change in length for a one year change in Age while holding Age² constant.

iii. Comment on how well the model fits the data. Be sure to mention the R² value, RMSE, model utility, significance of variables in the model, and the plot of residuals versus Age.

R² = 0.892, so 89.2% of the variation in Length can be explained by the linear relationship with Age and Age².
RMSE = 0.094
The model is useful. F = 99.4272, P-value < 0.0001. The small P-value indicates that the model is statistically significant.
Age is statistically significant. t = 10.45, P-value < 0.0001. The small P-value indicates that Age adds significantly to the model.
Age² is statistically significant. t = –6.74, P-value < 0.0001. The small P-value indicates that Age² adds significantly to the model.
There does not seem to be a pattern in the residuals that would suggest that adding another polynomial term would improve the fit.

d) Fit a polynomial regression (degree=3) with Age, Age² and Age³ as the explanatory variables. Do not center variables.

i. Give the equation of the least squares line.
Predicted Length = 1.757 + 0.094*Age – 0.0033*Age² + 0.0000383*Age³

ii. Comment on how well the model fits the data. Be sure to mention the R² value, RMSE, model utility, and the significance of variables in the model.

R² = 0.898, so 89.8% of the variation in Length can be explained by the linear relationship with Age, Age² and Age³.
RMSE = 0.093
The model is useful. F = 67.3881, P-value < 0.0001. The small P-value indicates that the model is statistically significant.
Age is statistically significant. t = 4.92, P-value < 0.0001. The small P-value indicates that Age adds significantly to the model.
Age² is statistically significant. t = –2.08, P-value = 0.0487. The small P-value indicates that Age² adds significantly to the model, but just barely.
Age³ is not statistically significant. t = 1.12, P-value = 0.2753. The P-value is not small. This indicates that Age³ does not add significantly to the model.

e) For the model with Age, Age² and Age³ as the explanatory variables, look at the distribution of residuals. Comment on the conditions of identically and normally distributed errors and the equal standard deviation condition. Be sure to refer to the appropriate plots in your comments.

The box plot shows no outliers and the histogram is unimodal. This suggests that the condition of identically distributed errors is met.

The histogram is mounded slightly to the right of zero. The box plot shows a fairly symmetric shape. The points on the normal quantile plot follow the diagonal normal model line very closely. Although the fit is not perfect, the condition of normally distributed errors is probably met.

The plot of residuals versus Age shows more variation for the younger dugongs and less variation for the older dugongs. This could be an indication of differing standard deviations. It could also be an artifact of not having very many older dugongs in the sample. The condition of equal standard deviation is in some doubt.
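The fits in parts b) through d) can be reproduced outside of the statistics package used for the lab. The following is a minimal sketch in Python with NumPy, not the method used to produce the reported output. The function name `fit_polynomial` is hypothetical, and the RMSE is assumed to be the square root of SSE divided by the error degrees of freedom (n minus the number of estimated coefficients). The data are typed in from the table in problem 1, so the printed values can be compared against the ones reported above.

```python
import numpy as np

# Age (years) and Length (meters) for the 27 dugongs, from the table above.
age = np.array([1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7,
                8, 8.5, 9, 9.5, 9.5, 10, 12, 12, 13,
                13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5])
length = np.array([1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.35,
                   2.47, 2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.43,
                   2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57])


def fit_polynomial(degree):
    """Least squares fit of Length on Age, Age^2, ..., Age^degree (uncentered).

    Returns (coefficients with the intercept first, R^2, RMSE).
    """
    coefs = np.polyfit(age, length, degree)        # highest power first
    fitted = np.polyval(coefs, age)
    sse = np.sum((length - fitted) ** 2)           # error sum of squares
    sst = np.sum((length - length.mean()) ** 2)    # total sum of squares
    r2 = 1 - sse / sst
    # Assumption: RMSE uses the error degrees of freedom n - (degree + 1).
    rmse = np.sqrt(sse / (len(age) - degree - 1))
    return coefs[::-1], r2, rmse


results = {degree: fit_polynomial(degree) for degree in (1, 2, 3)}
for degree, (coefs, r2, rmse) in results.items():
    print(f"degree {degree}: coefficients {np.round(coefs, 7)}, "
          f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Because the three models are nested, R² cannot decrease as the degree goes up; whether the added term is worth keeping is what the t-tests in parts c) and d) address.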
f) Which model b), c) or d) does a better job of predicting the lengths of dugongs? To answer this question you should look at the predictions, especially for older dugongs. Note: This question is not asking which model is the best statistical model.

Although the model with Age and Age² fits the definition of “best” by being a useful model with the highest R² value among the models in which every variable adds significantly, there are some problems with the predicted lengths. Once dugongs get to be over 25 years of age, the predicted lengths actually start to decrease. The model with Age, Age² and Age³ gives predictions that tend to level off at older ages rather than decrease. In this situation, the cubic model might be thought to produce more realistic predictions even though the cubic term is not statistically significant. See the plot below.

[Figure: Length (Y) versus Age, with Predicted Length (Quadratic) and Predicted Length (Cubic) overlaid.]

g) Report the correlations between Age and Age², Age and Age³, Age² and Age³. Is there statistically significant multicollinearity?

Correlation between Age and Age² is 0.9440, which is statistically significant.
Correlation between Age and Age³ is 0.8663, which is statistically significant.
Correlation between Age² and Age³ is 0.9809, which is statistically significant.
Because there are statistically significant correlations among the explanatory variables, there is statistically significant multicollinearity.

h) Fit a polynomial regression (degree=2) with Age and (Age – Mean Age)².

i. Give the equation of the least squares line.

Predicted Length = 1.987 + 0.040*Age – 0.0015*(Age – Mean Age)²

ii. Comment on how well the model fits the data. Be sure to mention the R² value, RMSE, model utility, significance of variables in the model, and the plot of residuals versus Age.

R² = 0.892, so 89.2% of the variation in Length can be explained by the linear relationship with Age and (Age – Mean Age)².
RMSE = 0.094
The model is useful.
F = 99.4272, P-value < 0.0001. The small P-value indicates that the model is statistically significant.
Age is statistically significant. t = 13.99, P-value < 0.0001. The small P-value indicates that Age adds significantly to the model.
(Age – 10.9444)² is statistically significant. t = –6.74, P-value < 0.0001. The small P-value indicates that (Age – 10.9444)² adds significantly to the model.
There does not seem to be a pattern in the residuals that would suggest that adding another polynomial term would improve the fit.

iii. How does this model compare to the model in c)?

This model will give exactly the same predictions as the model in c). The summary of the fit for both models is exactly the same, as is the test for model utility. Note that the t-Ratio for Age is quite a bit different (t = 10.45 for c and t = 13.99 for h).

i) Fit a polynomial regression (degree=3) with Age, (Age – Mean Age)² and (Age – Mean Age)³ as the explanatory variables.

i. Give the equation of the least squares line.

Predicted Length = 2.051 + 0.036*Age – 0.0020*(Age – 10.9444)² + 0.0000383*(Age – 10.9444)³

ii. Comment on how well the model fits the data. Be sure to mention the R² value, RMSE, model utility, and the significance of variables in the model.

R² = 0.898, so 89.8% of the variation in Length can be explained by the linear relationship with Age, (Age – 10.9444)² and (Age – 10.9444)³.
RMSE = 0.093
The model is useful. F = 67.3881, P-value < 0.0001. The small P-value indicates that the model is statistically significant.
Age is statistically significant. t = 6.95, P-value < 0.0001. The small P-value indicates that Age adds significantly to the model.
(Age – 10.9444)² is statistically significant. t = –4.11, P-value = 0.0004. The small P-value indicates that (Age – 10.9444)² adds significantly to the model.
(Age – 10.9444)³ is not statistically significant. t = 1.12, P-value = 0.2753. The P-value is not small. This indicates that (Age – 10.9444)³ does not add significantly to the model.

iii.
How does this model compare to the model in d)?

This model will give exactly the same predictions as the model in d). The summary of the fit for both models is exactly the same, as is the test for model utility. Note that the t-Ratios for Age (t = 4.92 in d and t = 6.95 in i) and for the quadratic term (t = –2.08, P-value = 0.0487 in d and t = –4.11, P-value = 0.0004 in i) are quite different.

j) Report the correlations between Age and (Age – Mean Age)², Age and (Age – Mean Age)³, (Age – Mean Age)² and (Age – Mean Age)³. Is there statistically significant multicollinearity?

Correlation between Age and (Age – 10.9444)² is 0.5825, which is statistically significant.
Correlation between Age and (Age – 10.9444)³ is 0.8274, which is statistically significant.
Correlation between (Age – 10.9444)² and (Age – 10.9444)³ is 0.8873, which is statistically significant.
Because there are statistically significant correlations among the explanatory variables, there is statistically significant multicollinearity.
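The claim that centering changes the correlations among the explanatory variables but not the predictions can be checked numerically. The sketch below is one way to do so in Python with NumPy, under the assumption that the data are exactly as tabulated in problem 1; the helper names `fitted_values` and `corr` are hypothetical and are not part of the lab's software.

```python
import numpy as np

# Age (years) and Length (meters) for the 27 dugongs, from the table in problem 1.
age = np.array([1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7,
                8, 8.5, 9, 9.5, 9.5, 10, 12, 12, 13,
                13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5])
length = np.array([1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.35,
                   2.47, 2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.43,
                   2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57])

mean_age = age.mean()        # the centering constant (Mean Age)
centered = age - mean_age


def fitted_values(columns):
    """Least squares fit of Length on an intercept plus the given columns."""
    X = np.column_stack([np.ones_like(age)] + columns)
    beta, *_ = np.linalg.lstsq(X, length, rcond=None)
    return X @ beta


def corr(x, y):
    """Pearson correlation between two explanatory variables."""
    return np.corrcoef(x, y)[0, 1]


# Cubic model with raw powers (part d) and with centered powers (part i).
raw_fit = fitted_values([age, age ** 2, age ** 3])
centered_fit = fitted_values([age, centered ** 2, centered ** 3])
max_gap = np.max(np.abs(raw_fit - centered_fit))

corr_raw = {"age, square": corr(age, age ** 2),
            "age, cube": corr(age, age ** 3),
            "square, cube": corr(age ** 2, age ** 3)}
corr_centered = {"age, square": corr(age, centered ** 2),
                 "age, cube": corr(age, centered ** 3),
                 "square, cube": corr(centered ** 2, centered ** 3)}

print(f"Mean Age = {mean_age:.4f}")
print(f"Largest difference between the two sets of predictions = {max_gap:.2e}")
for pair in corr_raw:
    print(f"{pair}: raw r = {corr_raw[pair]:.4f}, "
          f"centered r = {corr_centered[pair]:.4f}")
```

The two design matrices span the same column space, which is why the fitted values agree up to rounding error while the individual coefficients and their t-Ratios differ.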