STAT 252
LAB X7
Assignment No. 4
Thursday, August 4, 2005
Angie Chiu
Lab Instructor: M. Wang
1. The study design used here is an observational study
with one dependent variable and seven explanatory variables. From this
design, we can infer whether or not any of the seven variables are
significantly associated with the amount of nitrogen in the river.
Yes, the seven explanatory variables we are considering involve events
associated with human activity. We can observe whether or not these
variables are associated with river nitrogen concentration, and so we can use a linear
regression model. Because this is an observational study, we cannot draw cause-and-effect conclusions about the effect of human activities on river nitrogen concentrations.
2. The matrix of scatterplots is symmetric.
[Figure: matrix of scatterplots, "River Nitrogen: Original Scale", for NO3, DISCHARG, RUNOFF, AREA, DENSITY, DEP, NPREC, and PREC]
The variables exhibiting a strong linear relationship are NO3 and
Density. NO3 and Deposition exhibit a moderate linear relationship. Deposition (DEP)
and Nitrogen Precipitation (NPREC) also exhibit a strong linear relationship.
2. b) The relationship between NO3 and discharge is not strong, as the
points in the scatterplot fall along a vertical line. The relationship between NO3 and
runoff also appears weak, as the points appear to be randomly
scattered.
The relationship between NO3 and Density appears to be strong,
as the points follow a positively sloping line. The relationships between NO3 and Deposition
and between NO3 and NPREC also appear to be linear, as the scatterplots show a linear
pattern. The relationship between NO3 and PREC appears to
be weak, as the points in the scatterplot appear to be
randomly scattered.
Looking at our correlation table, NO3 and Density have the
highest correlation, 0.841. Thus, if we were to choose one explanatory
variable to predict NO3, we would choose Density. The variable with
the second highest correlation is NPREC, with a correlation of
0.682.
When we look at the scatterplots of variables on the original scale,
it does not look like a linear model is appropriate for describing the
relationship between NO3 and the seven predictors. The data points in many of the plots
appear to be randomly scattered. To bring out linear relationships in the data, we
may need to apply log transformations to the variables.
NEED FOR LOG TRANSFORM
3.a)
[Figure: matrix of scatterplots for the log-transformed variables LNNO3, LNDISCHA, LNRUNOFF, LNAREA, LNDENSIT, LNDEP, LNNPREC, and LNPREC]
The log transformation was effective because the scatterplots for many
of the weakly correlated variables now appear to show
a linear pattern in the data. In the scatterplots for the original
data, the variables with small correlations appeared to be
randomly scattered or scattered about a vertical line. For example, LNNO3 and
LNDISCHA now form a linear pattern, whereas the points fell along
a vertical line before.
This matrix of scatterplots is similar to those in question 1 because
the strong correlations in the original data still appear as strong
correlations in the log-transformed data--including those between NO3
and Density, for example. With the log-transformed data, we can also
tell more clearly by looking at the scatterplots that NO3 has a
stronger correlation with Density than it does with NPREC.
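The effect of the log transformation on a correlation can be illustrated with a small sketch. The values below are hypothetical (they are not the river data) and follow a power law, which is the kind of relationship that looks curved on the original scale and straight on the log scale. The function name `pearson_r` is introduced here only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values (NOT the river data): y follows a power law in x.
x = [1.0, 10.0, 100.0, 1000.0, 10000.0]
y = [2.0 * xi ** 0.5 for xi in x]

# Correlation on the original scale and after taking natural logs of both.
r_raw = pearson_r(x, y)
r_log = pearson_r([math.log(v) for v in x], [math.log(v) for v in y])
print(f"original scale r = {r_raw:.3f}, log scale r = {r_log:.3f}")
```

On data like these the log-scale correlation is essentially 1 because log y is an exact linear function of log x, while the original-scale correlation is lower.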
b)
Correlations (Pearson correlation; Sig. is 2-tailed; N = 42 for every pair)

          LNNO3     LNDISCHA  LNRUNOFF  LNAREA    LNDENSIT  LNDEP     LNNPREC   LNPREC
LNNO3     1         -.380*    .015      -.349*    .870**    .659**    .686**    -.063
  Sig.    .         .013      .922      .023      .000      .000      .000      .691
LNDISCHA  -.380*    1         .056      .854**    -.317*    -.219     -.354*    .254
  Sig.    .013      .         .726      .000      .041      .163      .021      .105
LNRUNOFF  .015      .056      1         -.453**   .124      .316*     -.083     .715**
  Sig.    .922      .726      .         .003      .433      .041      .602      .000
LNAREA    -.349*    .854**    -.453**   1         -.349*    -.371*    -.291     -.133
  Sig.    .023      .000      .003      .         .024      .016      .061      .400
LNDENSIT  .870**    -.317*    .124      -.349*    1         .664**    .634**    .038
  Sig.    .000      .041      .433      .024      .         .000      .000      .811
LNDEP     .659**    -.219     .316*     -.371*    .664**    1         .841**    .266
  Sig.    .000      .163      .041      .016      .000      .         .000      .089
LNNPREC   .686**    -.354*    -.083     -.291     .634**    .841**    1         -.297
  Sig.    .000      .021      .602      .061      .000      .000      .         .056
LNPREC    -.063     .254      .715**    -.133     .038      .266      -.297     1
  Sig.    .691      .105      .000      .400      .811      .089      .056      .
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
The bivariate correlations show that there are significant
correlations between: LNNO3 and LNDISCHA, LNNO3 and LNAREA, LNNO3 and
LNDENSIT, LNNO3 and LNDEP, LNNO3 and LNNPREC, LNDISCHA and LNAREA,
LNDISCHA and LNDENSIT, LNDISCHA and LNNPREC, LNRUNOFF and LNAREA,
LNRUNOFF and LNDEP, LNRUNOFF and LNPREC, LNAREA and LNDENSIT,
LNAREA and LNDEP, LNDENSIT and LNDEP, LNDENSIT and LNNPREC,
and LNDEP and LNNPREC.
The matrix suggests there are several strong correlations among the 7
explanatory variables. There are significant linear relationships
between some of the variables and thus, collinearity may be a problem
in this case. Because of the high correlation between predictors, we
are not sure what effect each variable has on the response
variable, since the explanatory variables may influence each other.
The correlations of approximately 0.85 indicate that
collinearity between the variables could cause a problem in our
analysis: LNDISCHA and LNAREA (0.854) and LNDEP and LNNPREC (0.841)
are the most strongly correlated predictors, followed by LNRUNOFF and
LNPREC (0.715).
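As a rough check on the collinearity discussion, the sketch below lists the predictor pairs whose absolute correlation exceeds a cutoff. The correlations are copied from the log-scale correlation table above. The cutoff of 0.7 and the names `predictor_r` and `flagged` are assumptions made for this illustration, not part of the assignment.

```python
# Pairwise Pearson correlations among the seven log-scale predictors,
# copied from the correlation table in question 3 b).
predictor_r = {
    ("LNDISCHA", "LNRUNOFF"): 0.056, ("LNDISCHA", "LNAREA"): 0.854,
    ("LNDISCHA", "LNDENSIT"): -0.317, ("LNDISCHA", "LNDEP"): -0.219,
    ("LNDISCHA", "LNNPREC"): -0.354, ("LNDISCHA", "LNPREC"): 0.254,
    ("LNRUNOFF", "LNAREA"): -0.453, ("LNRUNOFF", "LNDENSIT"): 0.124,
    ("LNRUNOFF", "LNDEP"): 0.316, ("LNRUNOFF", "LNNPREC"): -0.083,
    ("LNRUNOFF", "LNPREC"): 0.715, ("LNAREA", "LNDENSIT"): -0.349,
    ("LNAREA", "LNDEP"): -0.371, ("LNAREA", "LNNPREC"): -0.291,
    ("LNAREA", "LNPREC"): -0.133, ("LNDENSIT", "LNDEP"): 0.664,
    ("LNDENSIT", "LNNPREC"): 0.634, ("LNDENSIT", "LNPREC"): 0.038,
    ("LNDEP", "LNNPREC"): 0.841, ("LNDEP", "LNPREC"): 0.266,
    ("LNNPREC", "LNPREC"): -0.297,
}

CUTOFF = 0.7  # illustrative threshold, chosen only for this sketch

# Keep the pairs above the cutoff, strongest first.
flagged = sorted(
    ((pair, r) for pair, r in predictor_r.items() if abs(r) >= CUTOFF),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for (a, b), r in flagged:
    print(f"{a} and {b}: r = {r:.3f}")
```

With this cutoff, three predictor pairs are flagged. A lower cutoff would also bring in the LNDENSIT pairs with LNDEP and LNNPREC.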
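4. μ (log NO3) = β0 + β1 * lnDischarge + β2 * lnRunoff + β3 * lnArea
+ β4 * lnDensity + β5 * lnDeposition + β6 * lnNPREC + β7 * lnPREC
Each observed log NO3 value equals this mean plus an error term.
We assume that the error term is normally distributed with a
mean of 0 and constant variance, and that the errors are independent.
Equivalently, we assume that log NO3 is normally distributed about its
mean at each combination of values of the explanatory variables.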
5.
Variables Entered/Removed (b)

Model  Variables Entered                 Variables Removed  Method
1      LNPREC, LNDENSIT, LNAREA, LNDEP,  .                  Enter
       LNRUNOFF, LNDISCHA, LNNPREC (a)
2      .                                 LNRUNOFF           Backward (criterion: Probability of F-to-remove >= .100)
3      .                                 LNDEP              Backward (criterion: Probability of F-to-remove >= .100)
4      .                                 LNPREC             Backward (criterion: Probability of F-to-remove >= .100)
5      .                                 LNAREA             Backward (criterion: Probability of F-to-remove >= .100)
6      .                                 LNDISCHA           Backward (criterion: Probability of F-to-remove >= .100)
a. All requested variables entered.
b. Dependent Variable: LNNO3
5. a) Five of the seven variables were eliminated by the backward
elimination procedure. LNRUNOFF was deleted first, followed in order
by LNDEP, LNPREC, LNAREA, and LNDISCHA.
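The idea behind backward elimination can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SPSS procedure itself. It removes the variable with the smallest F-to-remove statistic while that statistic is below a fixed F cutoff, whereas the output above uses a probability criterion (.100). The data are synthetic, and `solve`, `residual_ss`, `backward_eliminate`, and the cutoff value 2.71 are all hypothetical choices for this sketch.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def residual_ss(columns, y):
    """Residual sum of squares for least squares of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def backward_eliminate(data, y, f_to_remove=2.71):
    """Drop the weakest variable repeatedly until every F-to-remove is >= the cutoff."""
    kept, removed = list(data), []
    while kept:
        full = residual_ss([data[v] for v in kept], y)
        df_full = len(y) - len(kept) - 1
        f_stats = {}
        for v in kept:
            reduced = residual_ss([data[u] for u in kept if u != v], y)
            f_stats[v] = (reduced - full) / (full / df_full)
        weakest = min(f_stats, key=f_stats.get)
        if f_stats[weakest] >= f_to_remove:
            break
        kept.remove(weakest)
        removed.append(weakest)
    return kept, removed

# Synthetic example (NOT the river data): only "signal" truly affects y.
random.seed(0)
n = 42
data = {name: [random.gauss(0, 1) for _ in range(n)]
        for name in ("signal", "noise_a", "noise_b")}
y = [1.0 + 2.0 * s + random.gauss(0, 0.5) for s in data["signal"]]
kept, removed = backward_eliminate(data, y)
print("kept:", kept, "removed in order:", removed)
```

The order of the `removed` list plays the same role as the order of deletion reported in the Variables Removed column.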
5. b) The estimated regression equation for the final model determined
by the backward elimination procedure is:
μ (log NO3) = 1.065 + 0.471 * lnDensity + 0.293 * lnNPREC.
The percentage of the variation in log-NO3 explained by the
explanatory variables in the model is 78.7 %.
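To make the fitted equation concrete, the sketch below evaluates it and converts the lnDensity coefficient into a multiplicative effect on the original scale. It assumes the "ln" variables are natural logarithms. The function name `mean_log_no3` and the input values are hypothetical.

```python
import math

# Coefficients of the final backward-elimination model reported above.
B0, B_DENSITY, B_NPREC = 1.065, 0.471, 0.293

def mean_log_no3(density, nprec):
    """Estimated mean of log NO3 for given (original-scale) Density and NPREC."""
    return B0 + B_DENSITY * math.log(density) + B_NPREC * math.log(nprec)

# Doubling Density with NPREC held fixed adds B_DENSITY * ln(2) to the mean
# of log NO3, i.e. multiplies the fitted original-scale NO3 by 2 ** B_DENSITY.
doubling_factor = 2 ** B_DENSITY

# Hypothetical inputs, chosen only to show the call.
print(f"mean log NO3 at Density=10, NPREC=5: {mean_log_no3(10.0, 5.0):.3f}")
print(f"multiplier for doubling Density: {doubling_factor:.3f}")
```

The multiplier does not depend on the base of the logarithm, because the factor 2 raised to the coefficient is the same whichever base is used on both sides.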
5. c) The p-value of the F-test for overall significance of the
regression model is reported as 0.000 (that is, less than 0.0005), indicating that at
least one of the variables is associated with river nitrogen
concentration.
For lnDENSIT, the p-value for the t-test is 0.000 and for lnNPREC, the
p-value associated with the t-test is also 0.000.
Given the other variable in the model, both lnDENSIT and lnNPREC are
significant at a 0.05 level of significance, and both have positive
estimated coefficients.
6. a)
[Figure: Scatterplot; dependent variable: LNNO3. Horizontal axis: Regression Standardized Predicted Value (ticks from -2 to 2); vertical axis ticks from -3 to 2]
Yes, there is evidence that the variance of the residuals
increases with increasing fitted values. At the high predicted values,
the residuals do not appear in a horizontal band, but appear to form a
positively sloping line. This indicates inequality of variance. There are no
outliers in the data. The range for the standardized residuals is from -3
to 2, and the range for the standardized predicted values is from -2 to 2. The
spread is smaller for the predicted values than for the
residuals.
6.b)
[Figure: Normal P-P Plot of Regression Standardized Residual; dependent variable: LNNO3. Expected Cum Prob plotted against Observed Cum Prob, each from 0.00 to 1.00]
Yes, there is some evidence that the assumption of normality is
violated. At the lowest and highest values, the residuals fall close
to the best-fit line, which suggests normality. However, in the middle
range of values, the residuals show slight deviations from the best-fit
line. This is a slight and not serious departure from normality.
7. a) The R squared value of the model is 0.145.
b) The influential case is that of observation number 3, for the
Caraugh River in Ireland. The case statistics are -2.94278 for
studentized residuals, 1.00031 for Cook's distance, and 0.16386 for
leverages.
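The three case statistics reported above are linked by a standard identity: Cook's distance equals the squared studentized residual divided by the number of coefficients, times h / (1 - h), where h is the leverage. The sketch below checks this under three assumptions that are not stated in the assignment: the model is the simple regression of log NO3 on lnDischarge (2 coefficients), there are 42 cases (the N in the correlation table), and the reported leverage is a centered leverage (the usual leverage minus 1/n).

```python
def cooks_distance(studentized_residual, leverage, n_coefficients):
    """Cook's distance from an internally studentized residual and a leverage h."""
    return (studentized_residual ** 2 / n_coefficients) * leverage / (1.0 - leverage)

# Values reported for observation 3.
studentized_residual = -2.94278
centered_leverage = 0.16386

# Assumptions for this sketch (not stated in the assignment).
n_cases = 42          # N from the correlation table
n_coefficients = 2    # intercept and slope for lnDischarge

# Convert the centered leverage back to the usual hat-matrix leverage.
leverage = centered_leverage + 1.0 / n_cases

d = cooks_distance(studentized_residual, leverage, n_coefficients)
print(f"Cook's distance = {d:.4f}")
```

Under these assumptions the result agrees with the reported Cook's distance of 1.00031 to about three decimal places, which suggests the assumptions are reasonable, although it does not prove them.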
[Figure: Scatterplot; dependent variable: LNNO3. Horizontal axis: Regression Standardized Predicted Value (ticks from -3 to 3); vertical axis ticks from -3 to 2]
The estimated regression equation (forward regression) is:
μ (log NO3) = β0 - 0.363 * lnDischarge
The R squared value for this model is 0.290.
c) μ (log NO3) = β0 + β1 * lnDischarge + β2 * lnDep + β3 * lnPrec
The regression coefficients are -0.210 for lnDischarge, 0.492 for
lnDeposition, and 0.249 for lnPrec.
Hypotheses:
Ho: β2 = β3 = 0
Ha: at least one of β2 or β3 is not 0.
The value of the F-statistic is:
F-statistic = [(SSRreduced - SSRfull) / (Dfreduced - Dffull)] / (SSRfull / Dffull)
F-statistic = [(50.993 - 33.268) / (39 - 37)] / (33.268 / 37)
= 9.8566
The corresponding p-value is 0.0003. Thus, at a 0.05 level of
significance, we reject the null hypothesis and conclude that at least one of lnDep and lnPrec
has an effect on nitrogen concentration, given that lnDischarge is in the model.
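The arithmetic of this extra-sum-of-squares F-test can be reproduced with a short sketch. The function names are hypothetical. The p-value uses a closed form that holds only when the numerator has exactly 2 degrees of freedom, as it does here, so it is not a general-purpose F distribution routine.

```python
def extra_ss_f(ssr_reduced, df_reduced, ssr_full, df_full):
    """F-statistic comparing a reduced model with a full model.

    ssr_* are residual sums of squares and df_* their residual degrees of freedom.
    """
    numerator = (ssr_reduced - ssr_full) / (df_reduced - df_full)
    denominator = ssr_full / df_full
    return numerator / denominator

def f_upper_tail_2df(f_value, df_denominator):
    """P(F > f_value) for an F distribution with exactly 2 numerator df."""
    return (1.0 + 2.0 * f_value / df_denominator) ** (-df_denominator / 2.0)

# Values reported above: reduced model (lnDischarge only) vs full model.
f_stat = extra_ss_f(50.993, 39, 33.268, 37)
p_value = f_upper_tail_2df(f_stat, 37)
print(f"F = {f_stat:.4f}, p-value = {p_value:.5f}")
```

The computed values are consistent with the F-statistic of 9.8566 and the p-value of about 0.0003 stated above.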
8. μ (log NO3) = β0 + β1 * lnDischarge + β2 * lnDep + β3 * lnPrec + β4 * lnDensity
Ho: β4 = 0
Ha: β4 ≠ 0
F-statistic = [(SSRreduced - SSRfull) / (Dfreduced - Dffull)] / (SSRfull / Dffull)
F-statistic = [(33.268 - 13.003) / (37 - 36)] / (13.003 / 36)
= 56.105
Coefficients (a)

                 Unstandardized Coefficients  Standardized Coefficients
Model 1          B          Std. Error        Beta                       t       Sig.
(Constant)       2.519      .822                                         3.063   .004
LNDISCHA         -.139      .058              -.206                      -2.409  .021
LNNPREC          9.863E-02  .189              .075                       .522    .605
LNDEP            2.960E-02  .181              .023                       .163    .871
LNDENSIT         .464       .062              .734                       7.490   .000
a. Dependent Variable: LNNO3
The F-statistic for this test is 56.105, which agrees with the square of the t-statistic for
LNDENSIT in the coefficient table (t = 7.490), and its corresponding p-value is 0.000. At a 5%
level of significance, we reject the null hypothesis. Thus, population density has a
significant effect on river nitrogen concentration, given the other variables in the model.
The t-test p-value for lnDensity is also 0.000.
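When a single coefficient is tested, the extra-sum-of-squares F-statistic should equal the square of that coefficient's t-statistic, up to rounding. The short check below uses only numbers reported above. The variable names are chosen for this illustration.

```python
# Residual sums of squares and degrees of freedom reported for question 8.
ssr_reduced, df_reduced = 33.268, 37   # model without lnDensity
ssr_full, df_full = 13.003, 36         # model with lnDensity

f_stat = ((ssr_reduced - ssr_full) / (df_reduced - df_full)) / (ssr_full / df_full)

# t-statistic for LNDENSIT from the coefficient table.
t_stat = 7.490

print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")
```

The two values differ only in the third decimal place, which is what rounding in the reported sums of squares and t-statistic would produce.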