E. coli Prediction Using Regression Analysis

1. MULTIPLE LINEAR REGRESSION FOR ECOLI DRY SEASON WITH ALL INDEPENDENT VARIABLES

In this thesis, IBM SPSS software version 26 was used to conduct multiple linear regression. All seven explanatory variables (Risklevel, ModeCons, SourceType, DepthWell, DistanceNPL, Geology and Ownership) were entered as predictors in the model, and the statistical outputs below were generated.

Step 1: Overall model evaluation

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .559a   .312       .186                .389
a. Predictors: (Constant), Risklevel, Ownership, DepthWell, SourceType, Geology, DistanceNPL, ModeCons

• R: the multiple correlation coefficient. The value of 0.559 indicates a moderate correlation between Ecoli and the seven predictor variables.
• R Square: the coefficient of determination, obtained by squaring the correlation coefficient (R2). The value of 0.312 means that the independent variables explain 31.2% of the variability of Ecoli (the dependent variable).
• Adjusted R Square: R Square corrected for the number of predictors in the model. The value of 0.186 indicates that about 18.6% of the variation in the outcome variable (Ecoli) is explained once the number of predictors is taken into account.
• Std. Error of the Estimate: a measure of the precision of the model. The value of 0.389 shows how far, on average, predictions from the regression model fall from the observed values of Ecoli. It is expressed in the units of the dependent variable, not as a percentage.

Step 2: Examining the statistical significance

ANOVAa
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   2.615            7    .374          2.467   .034b
  Residual     5.755            38   .151
  Total        8.370            45
a. Dependent Variable: Category_A
b. Predictors: (Constant), Risklevel, Ownership, DepthWell, SourceType, Geology, DistanceNPL, ModeCons

The F-ratio in the ANOVA table tests whether the overall regression model is a good fit for the data. The table shows that the independent variables, taken together, statistically significantly predict the dependent variable, F(7, 38) = 2.467, p = .034 (p < 0.05).

Step 3: Estimated model coefficients

Coefficientsa
                Unstandardized       Standardized                    95.0% Confidence Interval for B
Model           B       Std. Error   Beta           t        Sig.    Lower Bound   Upper Bound
1 (Constant)    .735    .490                        1.499    .142    -.258         1.728
  SourceType    -.168   .177         -.141          -.949    .349    -.526         .190
  Geology       .123    .135         .139           .912     .367    -.150         .395
  DistanceNPL   -.459   .188         -.408          -2.436   .020    -.841         -.078
  DepthWell     .299    .223         .236           1.337    .189    -.154         .751
  ModeCons      .063    .141         .083           .449     .656    -.223         .349
  Ownership     -.037   .132         -.042          -.276    .784    -.304         .231
  Risklevel     -.019   .084         -.031          -.222    .826    -.188         .151
a. Dependent Variable: Category_A

1. The general form of the equation to predict Ecoli from SourceType, Geology, DistanceNPL, DepthWell, ModeCons, Ownership and Risklevel is:
Ecoli = 0.735 - (.168 x SourceType) + (.123 x Geology) - (.459 x DistanceNPL) + (.299 x DepthWell) + (.063 x ModeCons) - (.037 x Ownership) - (.019 x Risklevel)
2. Unstandardized coefficients indicate how much the dependent variable (Ecoli) varies with an independent variable when all other independent variables are held constant. Example: the unstandardized coefficient, B1, for SourceType is -.168. This means that for each one-unit increase in SourceType, predicted Ecoli decreases by 0.168.

Step 4: Statistical significance of the independent variables

Using the Coefficients table in Step 3, the study tested the statistical significance of each independent variable. The t-test verifies whether the unstandardized (or standardized) coefficient is equal to 0 (zero) in the population. If p < .05, we can conclude that the coefficient is statistically significantly different from 0 (zero).
Observation: From the "Sig." column, the only coefficient that is statistically significant is distance from the nearest pit latrine (DistanceNPL, p = .020). None of the other independent variables is statistically significant.

Step 5: Summary

A multiple regression was run to predict Ecoli from SourceType, Geology, DistanceNPL, DepthWell, ModeCons, Ownership and Risklevel. Taken together, these variables statistically significantly predicted Ecoli, F(7, 38) = 2.467, p < 0.05, R2 = .312. Only one variable (DistanceNPL) added statistically significantly to the prediction, p < .05.

2. BINARY LOGISTIC REGRESSION FOR ECOLI DRY SEASON WITH ALL INDEPENDENT VARIABLES

In this thesis, IBM SPSS software version 26 was used to conduct logistic regression. All seven explanatory variables (Risklevel, ModeCons, SourceType, DepthWell, DistanceNPL, Geology and Ownership) were entered as predictors in the model, and the statistical outputs below were generated. The "Case Processing Summary" output shows that 45 of the 51 cases were used, because six cases had missing data.

Table 4.1 Case Processing Summary
Unweighted Casesa                         N    Percent
Selected Cases     Included in Analysis   45   88.2
                   Missing Cases          6    11.8
                   Total                  51   100.0
Unselected Cases                          0    .0
Total                                     51   100.0
a. If weight is in effect, see classification table for the total number of cases.

Step 1: Overall model evaluation

Table 4.2 Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      49.485a             .013                   .019
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Table 4.2 shows that the -2 Log likelihood is 49.485. By itself this number is not very informative. The p-value for the overall model is 0.001 (less than 0.05), which means that the null hypothesis is rejected and there is evidence that at least one of the explanatory variables contributes to the prediction of the outcome. Cox & Snell R square and Nagelkerke R square are both methods of calculating the explained variation. For this model the explained variation ranges from .013 to .019, depending on whether we reference Cox & Snell R square or Nagelkerke R square, respectively. Nagelkerke R square is a modification of Cox & Snell R square that can reach a maximum of 1, and it is generally preferred.

Step 2: Goodness of fit (Hosmer-Lemeshow test)

This step assesses the goodness of fit of the statistical model.

Table 4.3 Contingency Table for Hosmer and Lemeshow Test
           Category_A = No        Category_A = Yes
           Observed   Expected    Observed   Expected   Total
Step 1  1  11         11.000      34         34.000     45

Table 4.3 shows that the observed counts match the expected counts in the single group reported (11 versus 11.000 for "No" and 34 versus 34.000 for "Yes").

Step 3: Deciding whether the differences can be explained by chance only

To decide whether the differences between observed and expected counts can be explained by chance only, the study performed the Hosmer-Lemeshow chi-square test. The study reports a p-value of 0.001, which is less than 0.05, and on that basis rejects the null hypothesis that actual and predicted event rates are similar across groups. Table 4.4 itself lists a chi-square of .000 with 0 degrees of freedom and no significance value, which is consistent with the single group in Table 4.3.

Table 4.4 Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      .000         0    .

Step 4: Contribution of each variable

After the overall model evaluation above, we analyze how important each of the variables is. The "Variables in the Equation" output (Table 4.5) shows the contribution of each independent variable to the model and whether each explanatory variable is significant.

Table 4.5 Variables in the Equation
                                                                         95% C.I. for EXP(B)
Step 1a                   B        S.E.    Wald    df   Sig.   Exp(B)   Lower   Upper
SourceType-Deep           1.974    1.263   2.444   1    .025   7.202    .606    85.599
Geology-recent sediment   -3.280   2.157   2.312   1    .963   .038     .001    2.581
DistanceNPL-<10           2.707    1.464   3.422   1    .005   14.991   .851    263.982
DepthWell=<20             .410     2.019   .041    1    .108   1.507    .029    78.851
ModeCons=unprotected                       4.191   2    .010
ModeCons=drilled          -2.689   2.829   .904    1    .002   .068     .000    17.376
ModeCons=hand dug         1.702    1.569   1.177   1    .012   5.485    .253    118.761
Ownership=communal        -.354    1.261   .079    1    .307   .702     .059    8.317
Risklevel=low                              1.250   2    .622
Risklevel=High            -.423    1.415   .089    1    .494   .655     .041    10.493
Risklevel=medium          1.040    1.216   .731    1    .857   2.828    .261    30.657
Constant                  -.449    2.623   .029    1    .864   .638
a. Variable(s) entered on step 1: SourceType, Geology, DistanceNPL, DepthWell, ModeCons, Ownership, Risklevel.

• Constant: the expected value of the log-odds of the dependent variable when all of the predictor variables equal zero.
• B (beta coefficients): the values for the logistic regression equation for predicting the response variable from the explanatory variables.
For our model the prediction equation is as follows:
log(p/(1 - p)) = -.449 + (1.974 x SourceType-Deep) - (3.280 x Geology-recent sediment) + (2.707 x DistanceNPL-<10) + (.410 x DepthWell=<20) - (2.689 x ModeCons=drilled) + (1.702 x ModeCons=hand dug) - (.354 x Ownership=communal) - (.423 x Risklevel=High) + (1.040 x Risklevel=medium)

From Table 4.5 above:
• Beta coefficients show the amount of change expected in the log odds when there is a one-unit change in the predictor variable, holding all other predictors constant. For the independent variables that are not significant, the coefficients do not significantly differ from 0. Because these coefficients are in log-odds units, they are often difficult to interpret, so they are converted into odds ratios. These values are shown in the "Exp(B)" column.
• "S.E." values are the standard errors associated with the coefficients. The standard error is used to test whether the parameter is significantly different from 0. Standard errors are also used in the calculation of the Wald statistic, and they can be used to form a confidence interval for the parameter.
• "Wald" tests the null hypothesis that a coefficient equals 0. For the constant, this hypothesis is not rejected, because the p-value listed in the "Sig." column (.864) is greater than the critical value (0.05). Therefore, the study cannot conclude that the constant differs from 0.
• "Exp(B)" values are the exponentiated beta coefficients, which are the odds ratios of the predictors. The odds ratio represents the odds that an outcome will occur given a particular property, compared to the odds of the outcome occurring in the absence of that property. As mentioned above, the prediction equation itself is given in log odds.
• "Sig." is the p-value of the significance test of each beta coefficient. Coefficients whose p-values are less than 0.05 are usually considered statistically significant.
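The relationships among the B, S.E., Wald and Exp(B) columns described above can be checked by hand. The following is a minimal Python sketch, assuming the usual Wald formulas (Wald = (B / S.E.)^2 and a 95% interval of exp(B ± 1.96 x S.E.)). The function names and the dictionary are illustrative and are not part of the SPSS output; the B and S.E. values are copied from Table 4.5.

```python
import math

# (B, S.E.) pairs copied from Table 4.5; the dictionary layout is illustrative only.
coefficients = {
    "SourceType-Deep": (1.974, 1.263),
    "DistanceNPL-<10": (2.707, 1.464),
    "ModeCons=hand dug": (1.702, 1.569),
}


def odds_ratio(b):
    """Exp(B): the odds ratio for a log-odds coefficient B."""
    return math.exp(b)


def wald_statistic(b, se):
    """Wald chi-square with 1 df, assuming Wald = (B / S.E.)^2."""
    return (b / se) ** 2


def odds_ratio_ci(b, se, z=1.96):
    """Approximate 95% interval for Exp(B), assuming exp(B +/- z * S.E.)."""
    return math.exp(b - z * se), math.exp(b + z * se)


for name, (b, se) in coefficients.items():
    lower, upper = odds_ratio_ci(b, se)
    print(
        f"{name}: Exp(B)={odds_ratio(b):.3f}, "
        f"Wald={wald_statistic(b, se):.3f}, "
        f"95% CI=({lower:.3f}, {upper:.3f})"
    )
```

Under these assumptions the recomputed odds ratios, Wald statistics and interval bounds for the three rows agree with Table 4.5 up to rounding of the published B and S.E. values.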
Based on the output in Table 4.5, the explanatory variables are statistically significant except Geology, DepthWell, Ownership and Risklevel, whose p-values are greater than 0.05.

After evaluating the statistical significance of the individual coefficients, the study evaluated the predictive accuracy and discrimination of the model using the classification table (Table 4.6). IBM SPSS sets the cut-off value to 0.5 by default. The classification table for a cut-off value of 0.5 is shown below.

Table 4.6 Classification Table
                        Predicted Category_A
Observed                No    Yes   Percentage Correct
Category_A   No         8     3     72.7
             Yes        3     32    91.4
Overall Percentage                  87.0

Table 4.6 shows that 72.7% of the "No" cases (not contaminated water) and 91.4% of the "Yes" cases (contaminated water) were correctly classified. The predictive accuracy of the overall model is 87.0%.

3. LOGISTIC REGRESSION FOR ECOLI DRY SEASON WITH SELECTED INDEPENDENT VARIABLES

Here the study eliminates the statistically insignificant variables from the model. Based on Table 4.5, Geology, DepthWell, Ownership=communal, Risklevel=low, Risklevel=High and Risklevel=medium were not significant. The same steps as in the previous section are repeated with these variables removed from the model. As before, the overall model is evaluated first. The "Model Summary" is as follows.

Table 4.7 Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      36.363a             .266                   .399
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

The -2 Log likelihood is 36.363, while it was 49.485 in the full model with all seven variables. The value of Nagelkerke R square is .399, while it was .019 in the full model with all seven variables.
This indicates a stronger predictive capacity than in the full model.

Table 4.8 Contingency Table for Hosmer and Lemeshow Test
           Category_A = No        Category_A = Yes
           Observed   Expected    Observed   Expected   Total
Step 1  1  4          4.146       1          .854       5
        2  3          2.512       3          3.488      6
        3  2          1.661       4          4.339      6
        4  1          1.000       4          4.000      5
        5  1          1.681       23         22.319     24

Table 4.8 shows that the observed proportions of events are rather similar to the predicted probabilities of occurrence in the five groups.

Table 4.8 Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      .585         3    .900

The "Hosmer and Lemeshow Test" shows that in this case we fail to reject the null hypothesis, as the p-value is 0.900, which is larger than 0.05. Next, we evaluate the significance of the independent variables. The model includes the SourceType, DistanceNPL and ModeCons variables. Using these three independent variables, the following results are obtained.

Table 4.9 Variables in the Equation
                                                                95% C.I. for EXP(B)
Step 1a          B        S.E.    Wald    df   Sig.   Exp(B)   Lower   Upper
SourceType(1)    1.478    1.091   1.836   1    .025   4.386    .517    37.233
DistanceNPL(1)   1.765    1.021   2.985   1    .005   5.839    .789    43.227
ModeCons                          4.822   2    .010
ModeCons(1)      -.857    1.377   .388    1    .002   .424     .029    6.301
ModeCons(2)      1.200    1.322   .824    1    .012   3.320    .249    44.315
Constant         -1.857   1.844   1.014   1    .314   .156
a. Variable(s) entered on step 1: SourceType, DistanceNPL, ModeCons.

Table 4.9 shows that all of the explanatory variables are statistically significant, as the p-values for all of them are less than 0.05. The classification table below shows the predictive accuracy of the selected-variable model when the cut-off value is 0.5.

Table 4.9 Classification Table
                        Predicted Category_A
Observed                No    Yes   Percentage Correct
Category_A   No         5     6     45.5
             Yes        1     34    97.1
Overall Percentage                  84.8

The classification table shows that 45.5% of the "No" cases (not contaminated) and 97.1% of the "Yes" cases (contaminated) were correctly classified. The predictive accuracy of the overall model is 84.8%.
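The percentages in a classification table follow directly from its four cell counts. As a minimal sketch (the function name and argument order are illustrative assumptions, not SPSS terminology), the Python snippet below recomputes the two row percentages and the overall percentage from the counts reported for the selected-variable model (5, 6, 1 and 34).

```python
def percent_correct(no_as_no, no_as_yes, yes_as_no, yes_as_yes):
    """Return (% of observed "No" correct, % of observed "Yes" correct, overall %).

    Arguments are the four cells of a 2x2 classification table, read row by row:
    observed "No" predicted No/Yes, then observed "Yes" predicted No/Yes.
    """
    observed_no = no_as_no + no_as_yes
    observed_yes = yes_as_no + yes_as_yes
    total = observed_no + observed_yes

    pct_no = 100 * no_as_no / observed_no
    pct_yes = 100 * yes_as_yes / observed_yes
    pct_overall = 100 * (no_as_no + yes_as_yes) / total
    return round(pct_no, 1), round(pct_yes, 1), round(pct_overall, 1)


# Counts from the classification table of the selected-variable model.
selected_model = percent_correct(5, 6, 1, 34)
print(selected_model)
```

The same function can be applied to the full-model counts in Table 4.6 (8, 3, 3 and 32) to compare the two models on the share of uncontaminated sources that each classifies correctly.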
Conclusion: In this PhD thesis, real data from 51 observations, covering both "YES" and "NO" outcomes, were used. The dependent variable took two values, "0" or "1", depending on whether the water was contaminated or not. Seven explanatory variables were included in the model, and all were ordinal variables. The study conducted binary logistic regression in IBM SPSS software version 26, which calculated the predicted probability of the event. The study then excluded all four non-significant variables (Geology, DepthWell, Ownership and Risklevel) from the model. With the final model, 84.8% of the cases were correctly classified at a cut-off value of 0.5. Such a value is usually considered reasonably good.
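To illustrate how a predicted probability and the 0.5 cut-off work together, the following minimal Python sketch applies the coefficients of the selected-variable model (Table 4.9) to one hypothetical water source. It assumes that SourceType(1), DistanceNPL(1), ModeCons(1) and ModeCons(2) are 0/1 indicator variables, which the output does not state explicitly; the function names and the example case are illustrative only.

```python
import math

# B values copied from Table 4.9 (selected-variable model).
CONSTANT = -1.857
B = {
    "SourceType(1)": 1.478,
    "DistanceNPL(1)": 1.765,
    "ModeCons(1)": -0.857,
    "ModeCons(2)": 1.200,
}


def predicted_probability(indicators):
    """Logistic prediction p = 1 / (1 + exp(-logit)).

    `indicators` maps each term name to 0 or 1 (assumed indicator coding);
    terms that are not supplied are treated as 0.
    """
    logit = CONSTANT + sum(B[name] * indicators.get(name, 0) for name in B)
    return 1 / (1 + math.exp(-logit))


def classify(probability, cutoff=0.5):
    """Label a case "Yes" (contaminated) when the probability reaches the cut-off."""
    return "Yes" if probability >= cutoff else "No"


# Hypothetical source with three indicators switched on.
example = {"SourceType(1)": 1, "DistanceNPL(1)": 1, "ModeCons(2)": 1}
p = predicted_probability(example)
print(f"predicted probability = {p:.3f}, classified as {classify(p)}")
```

When every indicator is 0, the prediction reduces to the constant alone, and the corresponding odds equal the Exp(B) value reported for the constant.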
