STAT 252 LAB X7
Assignment No. 4
Thursday, August 4, 2005
Angie Chiu
Lab Instructor: M. Wang

1. The study design used here is an observational study with one dependent variable and seven explanatory variables. From this design, we can infer whether any of the seven variables has a significant association with the amount of nitrogen in the river. Yes, the seven explanatory variables we are considering involve events associated with human activity. We can observe whether or not these variables have the same effect on river nitrogen concentration, and so we can use a linear regression model. Because this is an observational study, we cannot draw cause-and-effect conclusions about the effect of human activities on river nitrogen concentrations.

2. The matrix of scatterplots is symmetric.

[Scatterplot matrix, "River Nitrogen: Original Scale". Variables: NO3, DISCHARG, RUNOFF, AREA, DENSITY, DEP, NPREC, PREC]

The variables exhibiting a strong linear relationship are NO3 and Density. NO3 and Deposition (DEP) exhibit a moderate linear relationship. Deposition (DEP) and Nitrogen Precipitation (NPREC) also exhibit a strong linear relationship.

2. b) The relationship between NO3 and Discharge is not strong, as the points in the scatterplot form a vertical line. The relationship between NO3 and Runoff is also insignificant, as the points appear to be randomly scattered. The relationship between NO3 and Density appears to be significant, as the scatterplot shows a positively sloping line, which indicates a linear relationship. The relationship between NO3 and Deposition also appears to be significant, as the scatterplot shows a linear relationship between the variables. The relationship between NO3 and NPREC appears to be significant as well, as we see a linear relationship in the scatterplot. The relationship between NO3 and PREC appears to be insignificant, as the points in the scatterplot appear to be randomly scattered.
Looking at our correlation table, NO3 and Density have the highest correlation, 0.841. Thus, if we were to choose one explanatory variable to predict NO3, we would choose Density. The variable with the second highest correlation is NPREC, with a correlation of 0.682. When we look at the scatterplots of the variables on the original scale, a linear model does not look appropriate for describing the relationship between NO3 and the seven predictors. The data points in many of the plots appear to be randomly scattered. To see linear relationships in the data, we may need to apply log transformations to the variables.

NEED FOR LOG TRANSFORM

3. a)

[Scatterplot matrix of the log-transformed variables: LNNO3, LNDISCHA, LNRUNOFF, LNAREA, LNDENSIT, LNDEP, LNNPREC, LNPREC]

The log transformation was effective because the scatterplots for many of the weakly related variables now show a linear pattern. In the scatterplots of the original data, the variables with small correlations appeared to be randomly scattered or scattered about a vertical line. For example, LNNO3 and LNDISCHA now form a linear pattern, whereas the original-scale plot showed a vertical line. This matrix of scatterplots is similar to the one in question 2 because the strong correlations in the original data still appear as strong correlations in the log-transformed data, for example the one between NO3 and Density. With the log-transformed data, we can also tell more clearly from the scatterplots that NO3 has a stronger correlation with Density than it does with NPREC.

b) Correlations (Pearson Correlation, with Sig. (2-tailed) beneath; N = 42 for every pair)

            LNNO3     LNDISCHA  LNRUNOFF  LNAREA    LNDENSIT  LNDEP     LNNPREC   LNPREC
LNNO3       1         -.380*    .015      -.349*    .870**    .659**    .686**    -.063
  Sig.      .         .013      .922      .023      .000      .000      .000      .691
LNDISCHA    -.380*    1         .056      .854**    -.317*    -.219     -.354*    .254
  Sig.      .013      .         .726      .000      .041      .163      .021      .105
LNRUNOFF    .015      .056      1         -.453**   .124      .316*     -.083     .715**
  Sig.      .922      .726      .         .003      .433      .041      .602      .000
LNAREA      -.349*    .854**    -.453**   1         -.349*    -.371*    -.291     -.133
  Sig.      .023      .000      .003      .         .024      .016      .061      .400
LNDENSIT    .870**    -.317*    .124      -.349*    1         .664**    .634**    .038
  Sig.      .000      .041      .433      .024      .         .000      .000      .811
LNDEP       .659**    -.219     .316*     -.371*    .664**    1         .841**    .266
  Sig.      .000      .163      .041      .016      .000      .         .000      .089
LNNPREC     .686**    -.354*    -.083     -.291     .634**    .841**    1         -.297
  Sig.      .000      .021      .602      .061      .000      .000      .         .056
LNPREC      -.063     .254      .715**    -.133     .038      .266      -.297     1
  Sig.      .691      .105      .000      .400      .811      .089      .056      .

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

The bivariate correlations show that there are significant correlations between:
LNNO3 and each of LNDISCHA, LNAREA, LNDENSIT, LNDEP, and LNNPREC;
LNDISCHA and each of LNAREA, LNDENSIT, and LNNPREC;
LNRUNOFF and each of LNAREA, LNDEP, and LNPREC;
LNAREA and each of LNDENSIT and LNDEP;
LNDENSIT and each of LNDEP and LNNPREC;
LNDEP and LNNPREC.

The matrix suggests there are many strong correlations among the 7 explanatory variables. There are significant linear relationships between some of the variables, and thus collinearity may be a problem in this case. Because of the high correlation between predictors, we are not sure what effect each variable has on the response variable, since the explanatory variables may influence each other.
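As a rough illustration of the collinearity check described above, the following Python sketch lists the pairs of predictors whose correlation is large in absolute value. The correlations are copied from the SPSS table; the cutoff of 0.7 and the names PREDICTORS, R, and flag_collinear are assumptions made for this example and are not part of the lab output.

```python
from itertools import combinations

# Predictor names, in the order used in the correlation table.
PREDICTORS = ["LNDISCHA", "LNRUNOFF", "LNAREA", "LNDENSIT",
              "LNDEP", "LNNPREC", "LNPREC"]

# Pearson correlations between predictors, copied from the table
# (upper triangle only; N = 42 for every pair).
R = {
    ("LNDISCHA", "LNRUNOFF"): 0.056, ("LNDISCHA", "LNAREA"): 0.854,
    ("LNDISCHA", "LNDENSIT"): -0.317, ("LNDISCHA", "LNDEP"): -0.219,
    ("LNDISCHA", "LNNPREC"): -0.354, ("LNDISCHA", "LNPREC"): 0.254,
    ("LNRUNOFF", "LNAREA"): -0.453, ("LNRUNOFF", "LNDENSIT"): 0.124,
    ("LNRUNOFF", "LNDEP"): 0.316, ("LNRUNOFF", "LNNPREC"): -0.083,
    ("LNRUNOFF", "LNPREC"): 0.715,
    ("LNAREA", "LNDENSIT"): -0.349, ("LNAREA", "LNDEP"): -0.371,
    ("LNAREA", "LNNPREC"): -0.291, ("LNAREA", "LNPREC"): -0.133,
    ("LNDENSIT", "LNDEP"): 0.664, ("LNDENSIT", "LNNPREC"): 0.634,
    ("LNDENSIT", "LNPREC"): 0.038,
    ("LNDEP", "LNNPREC"): 0.841, ("LNDEP", "LNPREC"): 0.266,
    ("LNNPREC", "LNPREC"): -0.297,
}


def flag_collinear(threshold=0.7):
    """Return (a, b, r) for predictor pairs with |r| >= threshold.

    The threshold is an illustrative choice, not a rule from the lab.
    Results are sorted from the strongest correlation to the weakest.
    """
    flagged = [(a, b, R[(a, b)])
               for a, b in combinations(PREDICTORS, 2)
               if abs(R[(a, b)]) >= threshold]
    return sorted(flagged, key=lambda item: abs(item[2]), reverse=True)


if __name__ == "__main__":
    for a, b, r in flag_collinear():
        print(f"{a:9s} {b:9s} r = {r:+.3f}")
```

A lower cutoff flags more pairs; which cutoff is appropriate is a judgment call, so the list this produces is a screening aid rather than a test for collinearity.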
The correlations of approximately 0.85 indicate that collinearity between the variables would cause a problem in our analysis. About five pairs of variables show a correlation of roughly 0.8.

4. log NO3 = β0 + β1 * lnDischarge + β2 * lnRunoff + β3 * lnArea + β4 * lnDensity + β5 * lnDeposition + β6 * lnNPREC + β7 * lnPREC + error

Equivalently, μ (log NO3) is the right-hand side without the error term. We assume that the error term is normally distributed and has a mean of 0. We also assume that our sample data are normally distributed.

5.

Variables Entered/Removed (b)

Model  Variables Entered                   Variables Removed  Method
1      LNPREC, LNDENSIT, LNAREA, LNDEP,    .                  Enter
       LNRUNOFF, LNDISCHA, LNNPREC (a)
2      .                                   LNRUNOFF           Backward (criterion: Probability of F-to-remove >= .100)
3      .                                   LNDEP              Backward (criterion: Probability of F-to-remove >= .100)
4      .                                   LNPREC             Backward (criterion: Probability of F-to-remove >= .100)
5      .                                   LNAREA             Backward (criterion: Probability of F-to-remove >= .100)
6      .                                   LNDISCHA           Backward (criterion: Probability of F-to-remove >= .100)

a. All requested variables entered.
b. Dependent Variable: LNNO3

5. a) Five of the seven variables were eliminated by the backward elimination procedure. LNRUNOFF was deleted first, followed in order by LNDEP, LNPREC, LNAREA, and LNDISCHA.

5. b) The estimated regression equation for the final model determined by the backward elimination procedure is:

μ (log NO3) = 1.065 + 0.471 * lnDensity + 0.293 * lnNPREC

The percentage of the variation in log-NO3 explained by the explanatory variables in the model is 78.7%.

5. c) The p-value of the test for overall significance of the regression model is reported as 0.000 (that is, less than 0.0005), indicating that at least one of the variables has an effect on river nitrogen concentration. For lnDENSIT, the p-value for the t-test is 0.000, and for lnNPREC, the p-value for the t-test is also 0.000. The variable lnDENSIT contributes to the model, as it has a positive linear correlation with LNNO3. The variable lnNPREC is significant.
Given the other variables in the model, both lnDENSIT and lnNPREC are significant at the 0.05 level of significance.

6. a)

[Scatterplot; Dependent Variable: LNNO3; horizontal axis: Regression Standardized Predicted Value]

Yes, there is evidence that the variance of the residuals increases with increasing fitted values. At the high predicted values, the residuals do not fall in a horizontal band but appear to form a positively sloping line. This indicates inequality of variance. There are no outliers in the data. The standardized residuals range from -3 to 2, and the standardized predicted values range from -2 to 2. The spread is smaller for the predicted values than for the residuals.

6. b)

[Normal P-P Plot of Regression Standardized Residual; Dependent Variable: LNNO3; horizontal axis: Observed Cum Prob; vertical axis: Expected Cum Prob]

Yes, there is some evidence that the assumption of normality is violated. At the lowest and highest values, the residuals fall close to the best-fit line and suggest normality. However, in the middle range of values, the residuals show slight deviations from the best-fit line. This is a slight and not serious departure from normality.

7. a) The R squared value of the model is 0.145.

b) The influential case is observation number 3, for the Caraugh River in Ireland. The case statistics are -2.94278 for the studentized residual, 1.00031 for Cook's distance, and 0.16386 for the leverage.

[Scatterplot; Dependent Variable: LNNO3; horizontal axis: Regression Standardized Predicted Value]

The estimated regression equation (forward regression) is:

μ (log NO3) = β0 - 0.363 * lnDischarge

The R squared value for this model is 0.290.

c) μ (log NO3) = β0 + β1 * lnDischarge + β2 * lnDep + β3 * lnPrec

The regression coefficients are -0.210 for lnDischarge, 0.492 for lnDeposition, and 0.249 for lnPrec.

Hypotheses:
Ho: β2 = β3 = 0
Ha: at least one of β2 or β3 is not 0.
The value of the F-statistic is:

F-statistic = [(SSR reduced - SSR full) / (Df reduced - Df full)] / (SSR full / Df full)

F-statistic = [(50.993 - 33.268) / (39 - 37)] / (33.268 / 37) = 9.8566

The corresponding p-value is 0.0003. Thus, at a 0.05 level of significance, we reject the null hypothesis and conclude that at least one of lnDep and lnPrec has an effect on river nitrogen concentration, given that lnDischarge is in the model.

8. μ (log NO3) = β0 + β1 * lnDischarge + β2 * lnDep + β3 * lnPrec + β4 * lnDensity

Ho: β4 = 0
Ha: β4 ≠ 0

F-statistic = [(SSR reduced - SSR full) / (Df reduced - Df full)] / (SSR full / Df full)

F-statistic = [(33.268 - 13.003) / (37 - 36)] / (13.003 / 36) = 56.105

Coefficients (a)

                Unstandardized Coefficients   Standardized Coefficients
Model 1         B            Std. Error       Beta                        t        Sig.
(Constant)      2.519        .822                                         3.063    .004
LNDISCHA        -.139        .058             -.206                       -2.409   .021
LNNPREC         9.863E-02    .189             .075                        .522     .605
LNDEP           2.960E-02    .181             .023                        .163     .871
LNDENSIT        .464         .062             .734                        7.490    .000

a. Dependent Variable: LNNO3

The test statistic for this test is 7.2000, and its corresponding p-value is 0.000. At a 5% level of significance, we reject the null hypothesis. Thus, population density has a significant effect on river nitrogen concentration. Also, the t-test p-value for lnDensity is 0.000.
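The two F-statistics above can be checked with a short Python sketch. It assumes that SSR denotes the residual sum of squares and Df the residual degrees of freedom, which is how the formula above uses them; the function name extra_ss_f and the variable names are made up for this example. For a test of a single coefficient, the F-statistic equals the square of the corresponding t-statistic, so the square root of the second value can be compared with the t column of the coefficient table.

```python
import math


def extra_ss_f(ssr_reduced, df_reduced, ssr_full, df_full):
    """Extra-sum-of-squares F statistic.

    Inputs are residual sums of squares and residual degrees of
    freedom for the reduced and full models (an assumption about
    what SSR and Df mean in the lab write-up).
    """
    numerator = (ssr_reduced - ssr_full) / (df_reduced - df_full)
    denominator = ssr_full / df_full
    return numerator / denominator


# Question 7 c): do lnDep and lnPrec add to a model with lnDischarge?
f_q7c = extra_ss_f(50.993, 39, 33.268, 37)

# Question 8: does lnDensity add to the three-variable model?
f_q8 = extra_ss_f(33.268, 37, 13.003, 36)

if __name__ == "__main__":
    print(f"Q7c F = {f_q7c:.4f} on (2, 37) df")
    print(f"Q8  F = {f_q8:.3f} on (1, 36) df")
    # With one numerator df, sqrt(F) is the absolute t statistic.
    print(f"Q8  sqrt(F) = {math.sqrt(f_q8):.3f}")
```

Writing the calculation as a function keeps the numerator and denominator separate, which makes it harder to misplace the division by the full model's degrees of freedom.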