Uploaded by wilson were

E. coli Prediction Using Regression Analysis

1. MULTIPLE LINEAR REGRESSION FOR ECOLI DRY SEASON WITH ALL
INDEPENDENT VARIABLES
In this thesis, IBM SPSS version 26 was used to conduct the multiple linear
regression. The model first includes all seven explanatory variables
(Risklevel, ModeCons, SourceType, DepthWell, DistanceNPL, Geology and
Ownership) as predictors. Running the multiple linear regression generated the
statistical outputs below.
Step 1: Overall model evaluation
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .559a   .312       .186                .389
a. Predictors: (Constant), Risklevel, Ownership, DepthWell, SourceType, Geology,
DistanceNPL, ModeCons


R: the multiple correlation coefficient. The value of 0.559 in this model
indicates a moderate level of prediction, that is, a moderate correlation
between the observed Ecoli values and the values predicted from the seven
predictor variables.
R Square: the coefficient of determination, obtained by squaring the multiple
correlation coefficient. The value of 0.312 means that the independent
variables explain 31.2% of the variability of Ecoli (the dependent variable).

Adjusted R Square: R Square corrected for the number of predictors in the
model. The value of 0.186 indicates that, after this correction, 18.6% of the
variation in the outcome variable (Ecoli) is explained by the predictors kept
in the model.

Std. Error of the Estimate: a measure of the precision of the model. The value
of 0.389 is the typical size of the error made when the regression model is
used to predict Ecoli, expressed in the units of the dependent variable rather
than as a percentage.
Step 2: Examining the Statistical significance
ANOVAa
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   2.615            7    .374          2.467   .034b
Residual     5.755            38   .151
Total        8.370            45
a. Dependent Variable: Category_A
b. Predictors: (Constant), Risklevel, Ownership, DepthWell, SourceType, Geology,
DistanceNPL, ModeCons
The F-ratio in the ANOVA table above tests whether the overall regression model
is a good fit for the data. The table shows that the independent variables,
taken together, statistically significantly predict the dependent variable,
F(7, 38) = 2.467, p = .034 (p < 0.05), i.e., the regression model is a good fit
of the data.
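The quantities in the Model Summary and ANOVA tables are linked by simple arithmetic. The following is a minimal sketch, not SPSS output, that recomputes them from the sums of squares and degrees of freedom printed in the ANOVA table; the variable names are illustrative only.

```python
import math

# Values copied from the ANOVA table.
ss_regression, df_regression = 2.615, 7
ss_residual, df_residual = 5.755, 38

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual  # equals n - 1

# Mean squares and the F-ratio.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

# Model Summary quantities derived from the same sums of squares.
r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * df_total / df_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"F({df_regression}, {df_residual}) = {f_ratio:.3f}")
print(f"R Square = {r_square:.3f}")
print(f"Adjusted R Square = {adjusted_r_square:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.3f}")
```

The printed values agree with the tables to three decimals, which shows that the Std. Error of the Estimate is the square root of the residual mean square.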
Step 3: Estimated model coefficients
Coefficientsa
              Unstandardized       Standardized                    95.0% Confidence
              Coefficients         Coefficients                    Interval for B
Model 1       B       Std. Error   Beta           t        Sig.    Lower Bound   Upper Bound
(Constant)    .735    .490                        1.499    .142    -.258         1.728
SourceType    -.168   .177         -.141          -.949    .349    -.526         .190
Geology       .123    .135         .139           .912     .367    -.150         .395
DistanceNPL   -.459   .188         -.408          -2.436   .020    -.841         -.078
DepthWell     .299    .223         .236           1.337    .189    -.154         .751
ModeCons      .063    .141         .083           .449     .656    -.223         .349
Ownership     -.037   .132         -.042          -.276    .784    -.304         .231
Risklevel     -.019   .084         -.031          -.222    .826    -.188         .151
a. Dependent Variable: Category_A
1. The general form of the equation to predict Ecoli from SourceType, Geology,
DistanceNPL, DepthWell, ModeCons, Ownership and Risklevel is:
Ecoli = 0.735 – (.168 x SourceType) + (.123 x Geology) – (.459 x DistanceNPL) +
(.299 x DepthWell) + (.063 x ModeCons) – (.037 x Ownership) – (.019 x Risklevel)
2. Unstandardized coefficients indicate how much the dependent variable (Ecoli)
changes with a one-unit change in an independent variable when all other
independent variables are held constant.
Example: The unstandardized coefficient for SourceType is -.168. This means
that for each one-unit increase in SourceType, Ecoli is predicted to decrease
by 0.168.
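The interpretation of an unstandardized coefficient can be illustrated with a short sketch that evaluates the fitted equation. The B values are taken from the Coefficients table; the function name, the dictionary layout and the input codes are assumptions made for illustration, since the coding of the predictors is not given in the text.

```python
# Unstandardized coefficients (B) from the Coefficients table.
INTERCEPT = 0.735
COEFFICIENTS = {
    "SourceType": -0.168,
    "Geology": 0.123,
    "DistanceNPL": -0.459,
    "DepthWell": 0.299,
    "ModeCons": 0.063,
    "Ownership": -0.037,
    "Risklevel": -0.019,
}


def predict_ecoli(case):
    """Return the fitted Ecoli value for one case (dict of predictor codes)."""
    return INTERCEPT + sum(b * case[name] for name, b in COEFFICIENTS.items())


# Hypothetical case: every predictor coded 1 (the coding scheme is assumed).
baseline = {name: 1 for name in COEFFICIENTS}
# Same case with SourceType one unit higher, everything else held constant.
one_unit_higher = dict(baseline, SourceType=2)

change = predict_ecoli(one_unit_higher) - predict_ecoli(baseline)
print(f"Change in predicted Ecoli: {change:.3f}")
```

Raising SourceType by one unit while holding the other predictors fixed changes the prediction by exactly the SourceType coefficient.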
Step 4. Statistical significance of the independent variables
The study tested the statistical significance of each independent variable.
The test examines whether each unstandardized (or standardized) coefficient is
equal to 0 (zero) in the population. If p < .05, we can conclude that the
coefficient is statistically significantly different from 0 (zero).
Observation: From the "Sig." column of the Coefficients table, only the
coefficient for distance from the nearest pit latrine (DistanceNPL, p = .020)
is statistically significant; the coefficients of the other six independent
variables are not.
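The per-coefficient test can be sketched as follows: each t statistic is B divided by its standard error, and the 95% confidence interval is B plus or minus a critical t value times the standard error. The B and Std. Error values come from the Coefficients table; the critical value 2.024 (two-sided 5% level, 38 residual degrees of freedom) is an approximation supplied here as an assumption, and the names are illustrative.

```python
T_CRITICAL = 2.024  # assumption: approximate two-sided 5% critical t value for 38 df

# (B, Std. Error) pairs copied from the Coefficients table.
ESTIMATES = {
    "(Constant)": (0.735, 0.490),
    "SourceType": (-0.168, 0.177),
    "Geology": (0.123, 0.135),
    "DistanceNPL": (-0.459, 0.188),
    "DepthWell": (0.299, 0.223),
    "ModeCons": (0.063, 0.141),
    "Ownership": (-0.037, 0.132),
    "Risklevel": (-0.019, 0.084),
}


def t_test(b, se, t_critical=T_CRITICAL):
    """Return the t statistic and the 95% confidence interval for one coefficient."""
    t = b / se
    return t, (b - t_critical * se, b + t_critical * se)


significant = []
for name, (b, se) in ESTIMATES.items():
    t, (lower, upper) = t_test(b, se)
    if abs(t) > T_CRITICAL:  # equivalent to the interval excluding zero
        significant.append(name)
    print(f"{name:12s} t = {t:6.3f}  95% CI = ({lower:.3f}, {upper:.3f})")

print("Significant at the 5% level:", significant)
```

Because the inputs are rounded to three decimals, the recomputed values can differ from the SPSS output in the last digit.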
Step 5. Summary
A multiple regression was run to predict Ecoli from SourceType, Geology,
DistanceNPL, DepthWell, ModeCons, Ownership and Risklevel. Taken together,
these variables statistically significantly predicted Ecoli, F(7, 38) = 2.467,
p < 0.05, R2 = .312. Only one variable (DistanceNPL) added statistically
significantly to the prediction, p < .05.
2. BINARY LOGISTIC REGRESSION FOR ECOLI DRY SEASON WITH ALL
INDEPENDENT VARIABLES
In this thesis, IBM SPSS version 26 was used to conduct the logistic
regression. The model first includes all seven explanatory variables
(Risklevel, ModeCons, SourceType, DepthWell, DistanceNPL, Geology and
Ownership) as predictors. Running the binary logistic regression generated the
statistical outputs below. The "Case Processing Summary" output shows that 45
of the 51 cases were used, because six cases had missing data.
Table 4.1 Case Processing Summary
Unweighted Casesa                          N    Percent
Selected Cases     Included in Analysis    45   88.2
                   Missing Cases           6    11.8
                   Total                   51   100.0
Unselected Cases                           0    .0
Total                                      51   100.0
a. If weight is in effect, see classification table for the total number of cases.
Step 1: Overall model evaluation
Table 4.2 Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      49.485a             .013                   .019
a. Estimation terminated at iteration number 20 because maximum iterations has
been reached. Final solution cannot be found.
Table 4.2 shows that the -2 Log likelihood is 49.485. By itself this number is
not very informative. The p-value reported for the overall model is 0.001 (less
than 0.05), which means that the null hypothesis is rejected and there is
evidence that at least one of the explanatory variables contributes to the
prediction of the outcome. Note a to the table reports that estimation stopped
at the maximum of 20 iterations without reaching a final solution, so the
estimates for this model should be read with caution.
Cox & Snell R square and Nagelkerke R square are both methods of estimating the
explained variation. For this model the explained variation ranges from 1.3%
(Cox & Snell, .013) to 1.9% (Nagelkerke, .019). Nagelkerke R square is a
modification of Cox & Snell R square that rescales it to a maximum of 1, and is
usually the preferred measure.
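How the two pseudo R square values relate to the -2 Log likelihood can be sketched with the standard Cox & Snell and Nagelkerke formulas. The sketch assumes that the intercept-only (null) deviance can be rebuilt from the 11 "No" and 34 "Yes" counts among the 45 analysed cases, as reported in the Hosmer and Lemeshow contingency table for this model; the function names are illustrative.

```python
import math


def null_deviance(n_no, n_yes):
    """-2 log likelihood of an intercept-only model for a binary outcome."""
    n = n_no + n_yes
    return -2 * (n_no * math.log(n_no / n) + n_yes * math.log(n_yes / n))


def pseudo_r_squares(deviance_model, deviance_null, n):
    """Return (Cox & Snell R square, Nagelkerke R square)."""
    cox_snell = 1 - math.exp(-(deviance_null - deviance_model) / n)
    max_cox_snell = 1 - math.exp(-deviance_null / n)  # upper bound of Cox & Snell
    return cox_snell, cox_snell / max_cox_snell


# Assumption: 11 "No" and 34 "Yes" cases among the 45 cases included in the analysis.
d_null = null_deviance(11, 34)
d_model = 49.485  # -2 Log likelihood from Table 4.2
cox_snell, nagelkerke = pseudo_r_squares(d_model, d_null, 45)

print(f"Null deviance        = {d_null:.3f}")
print(f"Drop in deviance     = {d_null - d_model:.3f}")
print(f"Cox & Snell R square = {cox_snell:.3f}")
print(f"Nagelkerke R square  = {nagelkerke:.3f}")
```

Under that assumption the sketch reproduces the two values in Table 4.2, and it makes visible that both measures depend only on the drop in deviance relative to the intercept-only model.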
Step 2: Assessing the goodness of fit of the model (Hosmer-Lemeshow test)
Table 4.3 Contingency Table for Hosmer and Lemeshow Test
             Category_A = No        Category_A = Yes
             Observed   Expected    Observed   Expected    Total
Step 1   1   11         11.000      34         34.000      45
Table 4.3 shows that the observed counts match the expected counts. Note that
all 45 cases fall into a single group rather than several subgroups.
Step 3: Deciding whether the differences can be explained by chance alone
To decide whether the differences between observed and expected counts can be
explained by chance alone, the study performed the Hosmer-Lemeshow chi-square
test. Table 4.4 below reports a chi-square of .000 with 0 degrees of freedom
and no significance value, because only one group was formed (Table 4.3). The
test therefore cannot show whether the actual and predicted event rates are
similar for this model.
Table 4.4 Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      .000         0    .
Step 4: After the overall model evaluation above, we analyze how important each
variable is. The "Variables in the Equation" output (Table 4.5) shows the
contribution of each independent variable to the model and whether each
explanatory variable is significant.
Table 4.5 Variables in the Equation
                                                                        95% C.I. for EXP(B)
Step 1a                   B        S.E.    Wald    df   Sig.   Exp(B)   Lower   Upper
SourceType-Deep           1.974    1.263   2.444   1    .025   7.202    .606    85.599
Geology-recent sediment   -3.280   2.157   2.312   1    .963   .038     .001    2.581
DistanceNPL-<10           2.707    1.464   3.422   1    .005   14.991   .851    263.982
DepthWell=<20             .410     2.019   .041    1    .108   1.507    .029    78.851
ModeCons=unprotected                       4.191   2    .010
ModeCons=drilled          -2.689   2.829   .904    1    .002   .068     .000    17.376
ModeCons=hand dug         1.702    1.569   1.177   1    .012   5.485    .253    118.761
Ownership=communal        -.354    1.261   .079    1    .307   .702     .059    8.317
Risklevel=low                              1.250   2    .622
Risklevel=High            -.423    1.415   .089    1    .494   .655     .041    10.493
Risklevel=medium          1.040    1.216   .731    1    .857   2.828    .261    30.657
Constant                  -.449    2.623   .029    1    .864   .638
a. Variable(s) entered on step 1: SourceType, Geology, DistanceNPL, DepthWell,
ModeCons, Ownership, Risklevel.
Constant: the expected value of the log-odds of the dependent variable when all
of the predictor variables equal zero.

B (beta coefficients): the values for the logistic regression equation for
predicting the response variable from the explanatory variables.
For our model the prediction equation is as follows:
log(p/(1-p)) = -.449 + (1.974 x SourceType-Deep) – (3.280 x Geology-recent sediment) +
(2.707 x DistanceNPL-<10) + (.410 x DepthWell=<20) – (2.689 x ModeCons=drilled) +
(1.702 x ModeCons=hand dug) – (.354 x Ownership=communal) – (.423 x
Risklevel=High) + (1.040 x Risklevel=medium)
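The prediction equation gives log odds, which convert to a probability through the logistic function p = 1 / (1 + exp(-log odds)). The following sketch uses the B values from Table 4.5; treating each term as a 0/1 indicator, and the dictionary keys and function name, are assumptions made for illustration.

```python
import math

INTERCEPT = -0.449
# B values from Table 4.5; each key is assumed to be a 0/1 indicator variable.
B = {
    "SourceType=Deep": 1.974,
    "Geology=recent sediment": -3.280,
    "DistanceNPL<10": 2.707,
    "DepthWell<20": 0.410,
    "ModeCons=drilled": -2.689,
    "ModeCons=hand dug": 1.702,
    "Ownership=communal": -0.354,
    "Risklevel=High": -0.423,
    "Risklevel=medium": 1.040,
}


def predicted_probability(active_indicators):
    """Probability of the outcome for a case whose listed indicators equal 1."""
    log_odds = INTERCEPT + sum(B[name] for name in active_indicators)
    return 1 / (1 + math.exp(-log_odds))


# All indicators at 0: only the constant contributes.
p_reference = predicted_probability([])
# Hypothetical case: only the "DistanceNPL<10" indicator is 1.
p_near_latrine = predicted_probability(["DistanceNPL<10"])

print(f"Reference case:      p = {p_reference:.3f}")
print(f"DistanceNPL<10 only: p = {p_near_latrine:.3f}")
```

With the default cutoff of 0.5, a case is classified as "Yes" when its predicted probability exceeds 0.5, which is the same as its log odds being positive.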
From Table 4.5 above:

Beta coefficients show the change expected in the log odds for a one-unit
change in the predictor variable, holding all other predictors constant. For
independent variables that are not significant, the coefficients do not differ
significantly from 0. Because the coefficients are in log-odds units, they are
often difficult to interpret, so they are converted into odds ratios. These
values are shown in the "Exp(B)" column.

"S.E." values are the standard errors associated with the coefficients. The
standard error is used to test whether a parameter differs significantly from
0, enters the calculation of the Wald statistic, and can be used to form a
confidence interval for the parameter.

"Wald" is the test statistic for the hypothesis that a coefficient equals 0.
For the constant, this hypothesis is not rejected because the p-value listed in
the "Sig." column (.864) is greater than the critical value of 0.05. Therefore,
the study found no evidence that the constant differs from 0.

"Exp(B)" values are the exponentiated beta coefficients, which are the odds
ratios of the predictors. The odds ratio compares the odds that the outcome
occurs given a particular property with the odds of the outcome occurring in
the absence of that property. As mentioned above, the prediction equation
itself is given in log odds.

"Sig." is the p-value of the significance test of each beta. Coefficients whose
p-values are less than 0.05 are usually considered statistically significant.
Based on the output in Table 4.5, SourceType, DistanceNPL and ModeCons are
significant, while Geology, DepthWell, Ownership and Risklevel are not: the
p-values of their coefficients are greater than 0.05.
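The relationships among the columns described above can be sketched for a single row. The sketch uses the usual definitions: Wald = (B / S.E.) squared, referred to a chi-square distribution with 1 degree of freedom; Exp(B) = exp(B); and a 95% interval of exp(B ± 1.96 x S.E.). The function name and the normal critical value 1.96 are assumptions for illustration, and the input is the SourceType-Deep row of Table 4.5.

```python
import math


def wald_summary(b, se, z=1.96):
    """Return Wald statistic, its 1-df p-value, the odds ratio and its 95% CI."""
    wald = (b / se) ** 2
    # Upper tail of a chi-square distribution with 1 df, via the error function.
    p_value = math.erfc(math.sqrt(wald / 2))
    odds_ratio = math.exp(b)
    ci = (math.exp(b - z * se), math.exp(b + z * se))
    return wald, p_value, odds_ratio, ci


# B and S.E. for SourceType-Deep, copied from Table 4.5.
wald, p_value, odds_ratio, (lower, upper) = wald_summary(1.974, 1.263)

print(f"Wald = {wald:.3f}, p = {p_value:.3f}")
print(f"Exp(B) = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Running the same function over every row and comparing the results with the Wald, Sig., Exp(B) and confidence-interval columns is a quick consistency check on a table of this kind. An interval that contains 1 corresponds to a coefficient that is not significant at the 5% level.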

After evaluating the statistical significance of the individual coefficients,
the study evaluated the predictive accuracy and discrimination of the model.
Predictive accuracy is assessed from the classification table (Table 4.6).
IBM SPSS uses a cutoff value of 0.5 by default, and the classification table
for this cutoff is shown below.
Table 4.6 Classification Table
                           Predicted Category_A
Observed                   No     Yes    Percentage Correct
Category_A   No            8      3      72.7
             Yes           3      32     91.4
Overall Percentage                       87.0

Table 4.6 shows that 72.7% of the "No" (not contaminated) cases and 91.4% of
the "Yes" (contaminated) cases were correctly classified. The overall
predictive accuracy of the model is 87.0%.
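The percentages in the classification table follow directly from its four counts. A minimal sketch, with illustrative variable names and the counts copied from Table 4.6:

```python
# Counts from Table 4.6: keys are (observed, predicted).
counts = {
    ("No", "No"): 8,
    ("No", "Yes"): 3,
    ("Yes", "No"): 3,
    ("Yes", "Yes"): 32,
}

observed_no = counts[("No", "No")] + counts[("No", "Yes")]
observed_yes = counts[("Yes", "No")] + counts[("Yes", "Yes")]
total = observed_no + observed_yes

# Share of observed "No" cases predicted as "No" (specificity).
pct_no_correct = 100 * counts[("No", "No")] / observed_no
# Share of observed "Yes" cases predicted as "Yes" (sensitivity).
pct_yes_correct = 100 * counts[("Yes", "Yes")] / observed_yes
# Share of all cases on the diagonal (overall accuracy).
pct_overall = 100 * (counts[("No", "No")] + counts[("Yes", "Yes")]) / total

print(f"No correct:  {pct_no_correct:.1f}%")
print(f"Yes correct: {pct_yes_correct:.1f}%")
print(f"Overall:     {pct_overall:.1f}%")
```

Because "Yes" cases are about three times as frequent as "No" cases here, the overall percentage is driven mainly by how well the "Yes" cases are classified.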
3. LOGISTIC REGRESSION FOR ECOLI DRY SEASON WITH SELECTED
INDEPENDENT VARIABLES
Here the study eliminates the statistically insignificant variables from the
model. Based on Table 4.5, Geology, DepthWell, Ownership and Risklevel (all
levels) were not significant. The same steps as in the previous section are
repeated with these variables removed, starting with the overall model
evaluation. The "Model Summary" is as follows.
Table 4.7 Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      36.363a             .266                   .399
a. Estimation terminated at iteration number 5 because parameter estimates
changed by less than .001.
The -2 log likelihood is 36.363, compared with 49.485 in the full model with
all seven variables. The Nagelkerke R square is .399, compared with .019 in the
full model, which indicates a stronger predictive capacity than before.
Table 4.8 Contingency Table for Hosmer and Lemeshow Test
         Category_A = No        Category_A = Yes
Step 1   Observed   Expected    Observed   Expected    Total
1        4          4.146       1          .854        5
2        3          2.512       3          3.488       6
3        2          1.661       4          4.339       6
4        1          1.000       4          4.000       5
5        1          1.681       23         22.319      24
Table 4.8 shows that the observed numbers of events are rather similar to the
expected numbers in all five groups.
Table 4.8 Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      .585         3    .900
The "Hosmer and Lemeshow Test" shows that we fail to reject the null hypothesis
of adequate fit, as the p-value is 0.900, which is larger than 0.05.
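The Hosmer and Lemeshow statistic can be recomputed from the contingency table as the sum of (observed - expected) squared over expected, across both outcome columns and all groups, with degrees of freedom equal to the number of groups minus 2. The sketch below uses the counts from the Table 4.8 contingency table. The helper for the chi-square tail probability is valid only for exactly 3 degrees of freedom and is written out here to avoid external libraries; all names are illustrative.

```python
import math

# (observed No, expected No, observed Yes, expected Yes) for groups 1 to 5.
GROUPS = [
    (4, 4.146, 1, 0.854),
    (3, 2.512, 3, 3.488),
    (2, 1.661, 4, 4.339),
    (1, 1.000, 4, 4.000),
    (1, 1.681, 23, 22.319),
]

chi_square = sum(
    (obs_no - exp_no) ** 2 / exp_no + (obs_yes - exp_yes) ** 2 / exp_yes
    for obs_no, exp_no, obs_yes, exp_yes in GROUPS
)
df = len(GROUPS) - 2


def chi2_upper_tail_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_upper_tail_3df(chi_square)
print(f"Chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.3f}")
```

The result agrees with the Chi-square (.585), df (3) and Sig. (.900) values in the Hosmer and Lemeshow test table, up to rounding of the expected counts.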
Next, we evaluate the significance of independent variables. In our model we
included SourceType, DistanceNPL and ModeCons variables. Using three independent
variables in our model, the following results are obtained.
Table 4.9 Variables in the Equation
                                                               95% C.I. for EXP(B)
Step 1a          B        S.E.    Wald    df   Sig.   Exp(B)   Lower   Upper
SourceType(1)    1.478    1.091   1.836   1    .025   4.386    .517    37.233
DistanceNPL(1)   1.765    1.021   2.985   1    .005   5.839    .789    43.227
ModeCons                          4.822   2    .010
ModeCons(1)      -.857    1.377   .388    1    .002   .424     .029    6.301
ModeCons(2)      1.200    1.322   .824    1    .012   3.320    .249    44.315
Constant         -1.857   1.844   1.014   1    .314   .156
a. Variable(s) entered on step 1: SourceType, DistanceNPL, ModeCons.
Table 4.9 shows that all of the explanatory variables are statistically significant, as
p-values for all of them are less than 0.05.
The classification table below shows the predictive accuracy of the selected variable
model, when cutoff value is 0.5.
Table 4.9 Classification Table
                           Predicted Category_A
Observed                   No     Yes    Percentage Correct
Category_A   No            5      6      45.5
             Yes           1      34     97.1
Overall Percentage                       84.8
Table 4.9 shows that 45.5% of the "No" (not contaminated) cases and 97.1% of
the "Yes" (contaminated) cases were correctly classified. The overall
predictive accuracy of the model is 84.8%.
Conclusion: In this PhD thesis I used real data from 51 observations, including
both "YES" and "NO" cases. The dependent variable took two values, "0" or "1",
depending on whether the water was contaminated or not. Seven explanatory
variables were included in the model, and all were ordinal variables. The study
conducted binary logistic regression in IBM SPSS version 26, which calculated
the predicted probability of the event. The study excluded all four
non-significant variables from the model. With the final model, 84.8% of the
cases were correctly classified at a cutoff value of 0.5. Such a value is
usually considered reasonably good.