The Variability and Forecasting of Treatment Plant Influent Water Quality
Erin Towler
Class Project - CVEN 6833
December 19, 2005

INTRODUCTION

As new drinking water regulations come into effect, utilities are faced with the challenge of meeting them without jeopardizing their compliance with existing regulations. This is a formidable task, considering financial constraints coupled with the fact that water sources are becoming scarcer and often more polluted. To this end, an electronic decision tool has been proposed that will help utilities strategize simultaneous compliance methods. The decision tool will need to consider a utility’s characteristics, including its influent water quality and treatment processes. This paper is concerned with characterizing the variability in treatment plant influent water quality and forecasting future values. This is important in deciding how to simulate the influent water quality parameters that will be sent through the decision tool.

This paper compiles relevant diagnostic information on influent water quality on a national level. First, nationally averaged data were examined to determine variable relationships. Next, the spatial variability of key influent water parameters was examined. Finally, the perspective shifts to nine utilities in the Colorado Front Range. A principal component analysis was conducted to try to resolve the spatial-temporal variability and to forecast future pH and alkalinity values.

DATASET

This paper utilizes data from the United States Environmental Protection Agency’s (EPA) monitoring program, called the Information Collection Rule (ICR). The ICR’s Auxiliary 1 database contains eighteen months of historical data (July 1997 to December 1998) from over 400 utilities around the United States. The spatial distribution of the continental water utilities that participated in the ICR is shown in Figure 1.
The database contains a large amount of data on influent, intermediary, and finished water quality parameters, as well as information about the processes of each treatment plant.

Figure 1 Spatial distribution of water utilities that participated in the Information Collection Rule (ICR).

METHODOLOGY

The analyses in this paper were completed using Microsoft Access and the statistical program R. For all analyses that involved averages, only the available data were averaged; any missing values were skipped. In all of the contour plots, annual averages were taken only for utilities that had at least 8 of 12 months of complete monthly data, unless noted. In analyses where all 18 months were needed, any missing values were replaced with the average over the remaining months.

To examine parameter relationships, both the parametric linear correlation and the nonparametric mutual information were computed. To look at the spatial variability of parameters, a local polynomial was fit to the data and contoured. A further discussion of the local polynomial method can be found in Loader (1999).

The principal component analysis (PCA) was completed following Storch and Zwiers (1999). The forecasting aspect of the PCA proceeded by fitting a local polynomial (locfit) relationship between the first PC of alkalinity and the first PC of pH. The first PC of alkalinity was simulated using the k-nearest neighbor method. Only one simulation of 18 months was completed, and only the last 12 months were used (so as to simulate from where the original data left off). Then, based on the previously fit local polynomial relationship, the corresponding first PC values of pH were calculated. The other PCs were bootstrapped, and the whole PC matrix was multiplied by its respective eigenvectors to return to the original space. See Regonda et al. (2005) for more details.
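As a rough illustration of the k-nearest neighbor step, the Python sketch below resamples a single PC time series: at each step it finds the k historical values closest to the current value, draws one of them, and takes that value's observed successor as the next simulated value. The original analysis was done in R and its exact settings are not given here, so the function name knn_simulate, the 1/j rank weighting, the square-root choice of k, the decision to start from the last observed value, and the synthetic input series are all assumptions made for illustration. The local polynomial mapping from the alkalinity PC to the pH PC is not shown.

```python
import math
import random


def knn_simulate(series, n_steps, k=None, seed=0):
    """Simulate n_steps values by k-nearest-neighbor resampling.

    At each step the k historical values closest to the current value are
    found, one is drawn with rank weights proportional to 1/j, and its
    observed successor becomes the next simulated value.
    """
    x = [float(v) for v in series]
    n_cand = len(x) - 1                  # the last value has no successor
    if n_cand < 1:
        raise ValueError("need at least two observations")
    if k is None:
        k = max(1, round(math.sqrt(n_cand)))   # heuristic choice (assumption)
    k = min(k, n_cand)
    weights = [1.0 / j for j in range(1, k + 1)]   # nearest neighbor weighted most
    rng = random.Random(seed)
    current = x[-1]                      # continue from where the record ends
    simulated = []
    for _ in range(n_steps):
        # Indices of candidate states, nearest to the current value first.
        nearest = sorted(range(n_cand), key=lambda i: abs(x[i] - current))[:k]
        chosen = rng.choices(nearest, weights=weights, k=1)[0]
        current = x[chosen + 1]          # successor of the chosen neighbor
        simulated.append(current)
    return simulated


# Hypothetical 18-month first-PC series with an annual cycle (not ICR data).
pc1 = [math.sin(2 * math.pi * m / 12) for m in range(18)]
forecast = knn_simulate(pc1, n_steps=12, seed=1)
```

Because every simulated value is a resampled historical value, a scheme of this kind cannot produce values outside the observed range, which is one reason a short record limits what the forecast can show.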
RESULTS

Variable Relationships

Linear correlations and mutual information (MI) were computed for all of the variables to determine whether there were relationships between influent parameters. If the linear correlation or the MI indicated a relationship, scatterplots for that relationship were examined. Figure 2 shows the scatterplots for the relationships between alkalinity, calcium hardness, total hardness, and pH. Strong correlations were found between total hardness and calcium hardness, total hardness and alkalinity, and calcium hardness and alkalinity. This is what one would expect, since calcium hardness is a component of total hardness, and both commonly originate from the same dissolved carbonate minerals that supply alkalinity. A lower, but significant, relationship was also found between pH and alkalinity, as well as between pH and calcium hardness and between pH and total hardness. In all of the plots that include pH, a strong nonlinear relationship is observed. Of note are the very low values of alkalinity that correspond to low pH values; when providing ensembles for the decision tool, these values may need to be generated together.

Figure 3 shows the correlations between TOC, UV254, and temperature. TOC and UV254 are strongly correlated, as one would expect, since UV254 is a surrogate measure of organic carbon. Temperature also showed some correlation with TOC and with UV254, but only in the linear correlation (not in the MI). This indicates that the relationship with temperature needs to be examined further, perhaps by looking at the relationship between monthly TOC and monthly temperature (i.e., not averaged values). As a point of confirmation, the results found in this section are consistent with other findings (Zachman 2005), and these relationships will shape the way the rest of the analyses are approached. Complete correlation tables can be seen in Appendix A.

Figure 2 Scatterplots of related variables: alkalinity, calcium hardness, total hardness, and pH.
Uses the last 12 months of averaged data from the continental US.

Figure 3 Scatterplots of related variables. UV_254 and TOC are highly related, while temperature and TOC and temperature and UV_254 are less related. Uses the last 12 months of averaged data from the continental US.

Spatial Variability

Contour plots of the average influent water quality parameters, created with the local polynomial method, reveal geographic trends.

Alkalinity and pH

Figure 4 shows a contour plot of the fitted average alkalinity over the last 12 months of the study. Alkalinity often comes from the leaching of calcium carbonate from rocks and soil. High levels of calcium carbonate are found in limestone, and in general, the eastern part of the United States has higher alkalinity than the western US because there is more limestone. The upper northeast has lower alkalinity because much of the limestone in that area has been scoured away by glaciers. This is consistent with what we see in Figure 4. Contour plots of calcium hardness and total hardness would be expected to show the same pattern, based on the high correlations seen in the previous section.

The alkalinity plot can also be compared with a map of the interpolated raw data in Figure 5. That plot is less smooth, but allows one to focus on local values; the fitted plot is better for seeing general trends. Figure 6 shows the standard errors from the fit. Notice that the errors are higher on the outskirts of the original data, and that there are also some higher errors internally.

Figure 4 Average alkalinity in ppm CaCO3 over the last 12 months of the study.

Figure 5 Contour plot of the interpolated raw data for average alkalinity.

Figure 6 Plot of the standard errors from locfit for average alkalinity.

A plot of the standard deviation of annual alkalinity can be seen in Figure 7. The highest standard deviation can be seen below the Great Lakes, in Illinois, Iowa, and Missouri.
This plot shows that, although there is some seasonality, alkalinity in general does not vary much with the seasons.

Figure 7 Standard deviation of annual alkalinity.

The first eigenvector of alkalinity was also examined to see whether the space/time decomposition was consistent with the previous findings. This can be seen in Figure 8. One can see negative weights along much of the east coast and positive weights in the central United States. The western United States is mostly neutral, but with some positive weights.

Figure 8 Plot of the first eigenvector of alkalinity.

Finally, the local polynomial model was checked to see how good the fit was. The observed alkalinity is compared to the cross-validated estimate of the raw data in Figure 9. There is a fair amount of scatter, especially at the higher observed alkalinities. Figure 10 shows a 3D plot of the observed values against the cross-validated estimates, again showing that the model does not capture all of the observed alkalinities, especially in the higher range.

Figure 9 Cross-validated estimates of the raw data with a one-to-one line overlaid.

Figure 10 3D scatterplot of the observed alkalinity (filled red circles) and the cross-validated estimate of the alkalinity (empty black circles).

The contour plot of the fitted average pH over the last 12 months of the study can be seen in Figure 11. Based on the relationship that was identified in the previous section, one would expect the trends in standard errors to be similar to those for alkalinity.

Figure 11 Average pH over the last 12 months of the study.

Total Organic Carbon (TOC)

TOC averages over the last 12 months of the study were also examined for their spatial variability. TOC concentrations differ greatly between surface water and groundwater sources, warranting separate examination. Figure 12 shows the surface water contour plot of TOC concentrations.
The higher contour values correspond to the center of the US, which is largely agricultural and probably prone to erosion. Florida also has much higher TOC values than the rest of the country, which can loosely be attributed to the “swampy” nature of Florida. Figure 13 shows a previous analysis of the same ICR data, giving average TOC by state. In general, the plots show the same trends, but the local polynomial fit is not bound by arbitrary state lines and uses interpolation. The state-averaged map does do a nice job of showing where there are no data, whereas the local polynomial map creates extrapolated concentrations in the north central part of the country, where there are no data.

Figure 14 shows the groundwater TOC concentrations. Here, the values are generally lower than in the surface waters. However, Florida again shows very high TOC values. Similar information is gained from Figure 15, but it lacks interpolation capability.

Figure 12 Local polynomial contour plot of the surface water sources: last 12 months of study average TOC (ppm). Utilities contributing data are overlaid.

Figure 13 Surface water plot, last 12 months of study average TOC (ppm) from Cadmus online http://www.cadmusonline.net/twg/epaweb3/sect3/Sect3Q1/Quest1.asp?Anal=TOC&Out=Map

Figure 14 Local polynomial contour plot of the groundwater sources: last 12 months of study average TOC (ppmC). Utilities contributing data are overlaid. Utilizes all locations, regardless of the number of complete months of data. See Appendix for a map of locations that use at least 8 months of data.

Figure 15 Groundwater last 12 months of study average TOC (ppm) from Cadmus online http://www.cadmusonline.net/twg/epaweb3/sect3/Sect3Q1/Quest1.asp?Anal=TOC&Out=Map

Contour plots of standard deviations for surface water and groundwater are shown in Figure 16 and Figure 17, respectively. For surface water, the central US has higher standard deviations.
In addition, one can see that Florida has the highest standard deviations¹. Less information is gained from the groundwater contour plot, since the locations are sparse. Higher standard deviations are found in Florida, but careful examination shows that the standard deviations are lower in nearly all the places where there are data.

Figure 16 Surface water standard deviation contour map.

Figure 17 Groundwater standard deviation contour map.

¹ To show the higher standard deviations, some unequally spaced contour lines were added.

Bromide

Bromide was examined in the context of spatial variation. In general, bromide concentrations (in ppm of bromide) are relatively low, with increases in the southwest and Texas, as well as in the northeast.

Figure 18 Average bromide concentration in ppm bromide over the last 12 months of the study.

Turbidity

Turbidity was also examined in the context of spatial variation. However, this analysis did not yield results that were as conclusive. The range in turbidity was great, with a number of outliers that were not geographically related. Rather, the water sources with the highest turbidity were typically large, widely used rivers such as the Missouri, the Mississippi, the Rio Grande, and the Ohio, as well as others. Thus, the spatial analysis was not all that useful, and turbidity will likely need to be examined on a more case-by-case basis. Nevertheless, Figure 19 shows the average turbidity for all utilities whose average turbidity value was less than 25 NTU. Forty-six utilities were excluded by this criterion. To further illuminate this dataset, Figure 20 shows a histogram of the data used to generate Figure 19. Again, it should be noted that the majority of turbidities are extremely low, and that one may need to take a closer look at turbidity before making any generalizations.
Figure 19 Average turbidity in NTU over the last 12 months of the study (only utilities with average turbidity values less than 25 NTU were included).

Figure 20 Histogram of values used in Figure 19.

PCA Analysis

Nine of the Front Range utilities in Colorado were examined in a PCA that looked at alkalinity and pH. The nine utilities were broken into two groups based on differences in pH and alkalinity, shown in Figure 21 and Figure 22. Group 1 (the left group) consists of two utilities in Aurora, one Denver utility, and one Pueblo utility. Group 2 (the right group) consists of Boulder, one Denver utility, two Colorado Springs utilities, and a Fort Collins utility. This grouping was done only to aid in viewing the PCA and forecast results.

Figure 21 Left shows “Group 1” and right shows “Group 2” alkalinity time series.

Figure 22 Left shows “Group 1” and right shows “Group 2” pH time series.

Figure 23 shows the locations of the nine utilities, with the two groups distinguished by symbol. The red triangles are Group 1, and the blue circles are Group 2.

Figure 23 Blue circles (Group 2) include Fort Collins, Boulder, 1 Denver plant, and 2 Colorado Springs plants (which are on top of one another). Red triangles (Group 1) include 2 Aurora plants, Pueblo, and 1 Denver plant.

A PCA was completed that used all of the data for the nine utilities (Groups 1 and 2 were combined). The eigenvalue spectra were computed and can be seen in Appendix B. The first three PCs of pH and of alkalinity were plotted against one another, as can be seen in Figure 24. Of all the scatterplots, the one pairing the first PCs of pH and alkalinity shows a nonlinear relationship similar to the one seen earlier between pH and alkalinity. Therefore, the rest of the PC analysis is limited to the first (leading) PCs.

Figure 24 Scatterplots of the first three leading components of pH and alkalinity with locfit lines through the points.
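The leading-PC calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used for the figures: it applies power iteration to the covariance matrix rather than a full eigen-decomposition (sufficient here because only the leading PC is retained), the function name leading_pc is hypothetical, and the input values are synthetic rather than ICR data. Rows are months and columns are utilities.

```python
import math


def leading_pc(data, n_iter=200):
    """Return (means, eigenvector, PC time series, variance fraction).

    data: list of rows (one per month), each a list of utility values.
    """
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    # Anomalies: remove each utility's time mean.
    anom = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance between utilities (p x p).
    cov = [[sum(a[i] * a[j] for a in anom) / (n - 1) for j in range(p)]
           for i in range(p)]
    total_var = sum(cov[j][j] for j in range(p))
    if total_var == 0:
        raise ValueError("data have no variance")
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the leading eigenvector, provided the start vector is
    # not orthogonal to it.
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        prod = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in prod))
        vec = [c / norm for c in prod]
    eigval = sum(vec[i] * cov[i][j] * vec[j]
                 for i in range(p) for j in range(p))
    # PC time series: projection of each month's anomalies on the eigenvector.
    scores = [sum(a[j] * vec[j] for j in range(p)) for a in anom]
    return means, vec, scores, eigval / total_var


# Hypothetical data: 6 months x 3 utilities sharing one temporal signal.
signal = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]
loadings = [1.0, 2.0, 2.0]
base = [100.0, 50.0, 80.0]
data = [[base[j] + s * loadings[j] for j in range(3)] for s in signal]
means, vec, scores, frac = leading_pc(data)
```

Multiplying the PC time series by the eigenvector and adding back the means returns an approximation in the original units, which mirrors the back-transformation step described in the methodology. The sign of an eigenvector is arbitrary, so negative or positive weights are meaningful only relative to one another.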
The first eigenvector of alkalinity is plotted spatially in Figure 25 and zoomed in Figure 26. This shows that most of the locations contribute negative weights to the first PC, with contours changing in a northwest to northeast direction. The first eigenvector of pH is plotted spatially in Figure 27 and zoomed in Figure 28. Again, most of the locations contribute a negative weight to the first PC, with contours changing in a west to east direction.

Figure 25 First eigenvector of alkalinity.

Figure 26 First eigenvector of alkalinity, zoomed in (note the “uneven” scale).

Figure 27 First eigenvector of pH.

Figure 28 First eigenvector of pH, zoomed in (note the “uneven” scale).

Figure 29 and Figure 30 show the time series of the PC values and their corresponding spectra. As we would expect with such limited data, the only pattern being picked up is the annual cycle. This can be seen in the spectrum peak around 1 cycle per year.

Figure 29 First PC plot and smoothed spectrum for alkalinity.

Figure 30 First PC plot and smoothed spectrum for pH.

Using the forecast method described in the methodology section, the 12 months following the historical data were forecast for all nine utilities. Next, the two groups were separated (again, just for visual purposes), and the resulting forecasts are shown to the right of the vertical line in Figure 31 and Figure 32. The top plot in each figure is Group 1 and the bottom plot is Group 2. In each figure, the first 18 months of original data are shown, followed by the 12 forecast months. One can see that each forecast time series (coded by color) seems to follow the same subtle seasonal trend as the original data.

Figure 31 Group 1 (top) and Group 2 (bottom) historical and forecast pH. The original data are to the left of the vertical line and the forecast is to the right of the vertical line. Different point colors correspond to different utilities within the group.

Figure 32 Group 1 (top) and Group 2 (bottom) historical and forecast alkalinity.
The original data are to the left of the vertical line and the forecast is to the right of the vertical line. Different point colors correspond to different utilities within the group.

CONCLUSIONS

The field of drinking water quality has not utilized many of the advanced statistical methods that have been used in other branches of water resources. The aim of this paper was to experiment with some of these methods, such as the local polynomial method and principal component analysis, in the field of water quality. Local polynomial fitting made it possible to view spatial trends across the United States. This will be useful in pinpointing areas that might be susceptible to certain regulation violations.

The ability to forecast using the PCA was of more limited value. The benefit was being able to forecast water quality data over multiple locations at the same time. However, because the time series is short (18 months), the analysis was only able to capture the annual cycle, and was not useful in showing long-term trends or any hydroclimatic connections, such as with the El Niño-Southern Oscillation (ENSO).

This paper has shown that there is potential for the use of advanced statistical methods in the field of drinking water. However, a limiting factor is the short length of the time series. One recommendation for future study would be to utilize datasets characterizing water quality in reservoirs. In places like Colorado, where snowmelt is the dominant source for reservoirs, concurrent snowmelt records could also be obtained. Longer records would be more likely to yield meaningful conclusions in a time series analysis framework.

Appendix A

Table 1 Linear correlations of 18 months of averaged data for the continental US.
Column abbreviations: Alk = Alkalinity, CaHard = Calcium Hardness, TotHard = Total Hardness, Turb = Turbidity, Br = BROMIDE.

                       Alk  CaHard      pH    Temp     TOC TotHard  UV_254    Turb      Br   NH3_N
Alkalinity               1    0.81    0.42    0.23    0.08    0.85    0.03    0.12   -0.02    0.30
Calcium Hardness         -       1    0.35    0.23    0.15    0.89    0.08    0.15    0.03    0.24
pH                       -       -       1    0.13    0.02    0.37   -0.05    0.21   -0.06    0.05
Temp                     -       -       -       1    0.32    0.17    0.37    0.04    0.02    0.16
TOC                      -       -       -       -       1    0.08    0.92    0.14    0.04    0.21
Total Hardness           -       -       -       -       -       1    0.00    0.14    0.03    0.20
UV_254                   -       -       -       -       -       -       1    0.11    0.02    0.16
Turbidity                -       -       -       -       -       -       -       1   -0.01    0.11
BROMIDE                  -       -       -       -       -       -       -       -       1    0.03
NH3_N                    -       -       -       -       -       -       -       -       -       1

Table 2 Mutual information of 18 months of averaged data for the continental US.

                       Alk  CaHard      pH    Temp     TOC TotHard  UV_254    Turb      Br   NH3_N
Alkalinity               -    0.63    0.37    0.12    0.07    0.69    0.04   -0.04   -0.10   0.025
Calcium Hardness         -       -    0.27    0.08    0.05    0.76    0.03   -0.03   -0.08    0.01
pH                       -       -       -    0.07    0.06    0.27    0.04   -0.02   -0.10   -0.02
Temp                     -       -       -       -    0.10    0.05    0.10   -0.02   -0.09    0.01
TOC                      -       -       -       -       -    0.04    0.50    0.00   -0.11    0.01
Total Hardness           -       -       -       -       -       -    0.02   -0.03   -0.07   -0.01
UV_254                   -       -       -       -       -       -       -    0.00   -0.12   -0.01
Turbidity                -       -       -       -       -       -       -       -   -0.18   -0.09
BROMIDE                  -       -       -       -       -       -       -       -       -   -0.16
NH3_N                    -       -       -       -       -       -       -       -       -       -

Appendix B

Figure 33 Eigenvalue spectrum for alkalinity PCs.

Figure 34 Eigenvalue spectrum for pH PCs.

Figure 35 Local polynomial contour plot of the groundwater sources: last 12 months of study average TOC (ppmC). Utilities contributing data are overlaid. Utilizes locations where there were 8 or more complete months of data.

References

Loader, C. 1999. Local Regression and Likelihood. New York: Springer.

Regonda, S., B. Rajagopalan, M. Clark, and E. Zagona. 2005. Multi-model Ensemble Forecast of Spring Seasonal Flows in the Gunnison River Basin. Water Resources Research (in review).

Storch, H.V. and F.W. Zwiers. 1999. Statistical Analysis in Climate Research. New York: Cambridge University Press.

Zachman, B. 2005. Understanding and Predicting Natural Organic Matter Adsorption by Granular Activated Carbon Adsorbers. Master of Science thesis, University of Colorado, Boulder, CO.