Predicting Water Quality Impaired Stream Segments using Landscape-scale Data and a Regional Geostatistical Model

Erin Peterson
Environmental Risk Technologies
CSIRO Mathematical & Information Sciences
St Lucia, Queensland

Space-Time Aquatic Resources Modeling and Analysis Program

The work reported here was developed under STAR Research Assistance Agreement CR-829095 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This presentation has not been formally reviewed by EPA. EPA does not endorse any products or commercial services mentioned in this presentation.

This research is funded by the U.S. EPA Science To Achieve Results (STAR) Program, Cooperative Agreement # CR-829095

Collaborators
• Dr. David M. Theobald, Natural Resource Ecology Lab, Department of Recreation & Tourism, Colorado State University, USA
• Dr. N. Scott Urquhart, Department of Statistics, Colorado State University, USA
• Dr. Jay M. Ver Hoef, National Marine Mammal Laboratory, Seattle, USA
• Andrew A. Merton, Department of Statistics, Colorado State University, USA

Overview
Introduction
~ Background
~ Patterns of spatial autocorrelation in stream water chemistry
~ Predicting water quality impaired stream segments using landscape-scale data and a regional geostatistical model: A case study in Maryland, USA

Water Quality Monitoring Goals
• Create a regional water quality assessment
• Ecosystem Health Monitoring Program
• Identify water quality impaired stream segments

Probability-based Random Survey Designs
Advantages
• Statistical inference about population of streams over large area
• Reported in stream kilometers
Disadvantages
• Does not take watershed influence into account
• Does not identify spatial location of impaired stream segments

Purpose
Develop a geostatistical methodology based on coarse-scale GIS data and field surveys that can be used to predict water quality characteristics about stream segments found throughout a large geographic area (e.g., state)

[Diagram: terrestrial and aquatic factors arranged by scale (grain), from COARSE to FINE]
Scale levels shown: Landscape; River Network / Stream Network (Nested Watersheds); Segment; Reach / Riparian Zone; Microhabitat
Factors shown, in the order extracted:
• Climate, Atmospheric deposition, Geology, Topography, Soil Type
• Network Connectivity, Land Use, Vegetation Type, Topography, Basin Shape/Size, Drainage Density, Connectivity, Confluence Density, Flow Direction, Network Configuration
• Contributing Area, Tributary Size Differences, Network Geometry, Localized Disturbances, Land Use/Land Cover
• Riparian Vegetation Type & Condition, Floodplain / Valley Floor Width, Shading, Detritus Inputs, Cross Sectional Area, Channel Slope, Bed Materials, Large Woody Debris
• Substrate, Overhanging Vegetation, Biotic Condition, Substrate Type, Overlapping Vegetation, Detritus, Macrophytes

Geostatistical Modeling
Fit an autocovariance function to data
• Describes relationship between observations based on separation distance
Distances and relationships are represented differently depending on the distance measure
[Figure: semivariogram showing the nugget, sill, and range; x-axis: separation distance, y-axis: semivariance]

Distance Measures & Spatial Relationships
[Diagram: sites A, B, and C]
Straight-line Distance (SLD)
• Geostatistical models typically based on SLD

Distance Measures & Spatial Relationships
Symmetric Hydrologic Distance (SHD)
• Hydrologic connectivity: fish movement

Distance Measures & Spatial Relationships
Asymmetric Hydrologic Distance
• Longitudinal transport of material

Distance Measures & Spatial Relationships
Challenge:
• Spatial autocovariance models developed for SLD may not be valid for hydrologic distances
  – Covariance matrix is not positive definite

Asymmetric Autocovariance Models for Stream Networks
• Weighted asymmetric hydrologic distance (WAHD)
• Developed by Jay Ver Hoef
• Moving average models
• Incorporate flow volume, flow direction, and use hydrologic distance
• Positive definite covariance matrices
Ver Hoef, J.M., Peterson, E.E., and Theobald, D.M., Spatial Statistical Models that Use Flow and Stream Distance, Environmental and Ecological Statistics. In Press.

Patterns of Spatial Autocorrelation in Stream Water Chemistry

Objectives
Evaluate 8 chemical response variables
1. pH measured in the lab (PHLAB)
2. Conductivity (COND) measured in the lab, μmho/cm
3. Dissolved oxygen (DO), mg/l
4. Dissolved organic carbon (DOC), mg/l
5. Nitrate-nitrogen (NO3), mg/l
6. Sulfate (SO4), mg/l
7. Acid neutralizing capacity (ANC), μeq/l
8. Temperature (TEMP), °C
Determine which distance measure is most appropriate
• SLD
• SHD
• WAHD
• More than one?
Find the range of spatial autocorrelation

Dataset
Maryland Biological Stream Survey (MBSS) Data
• Maryland Department of Natural Resources
  – Maryland, USA
  – 1995, 1996, 1997
• Stratified probability-based random survey design
• 881 sites in 17 interbasins
[Map: study area in Maryland, USA, with Baltimore, Annapolis, Washington D.C., and Chesapeake Bay labeled]

Spatial Distribution of MBSS Data
[Map: MBSS survey sites, with north arrow]

GIS Tools
Automated tools needed to extract data about hydrologic relationships between survey sites did not exist!
Wrote Visual Basic for Applications (VBA) programs to:
1. Calculate watershed covariates for each stream segment
   • Functional Linkage of Watersheds and Streams (FLoWS)
2. Calculate separation distances between sites
   • SLD, SHD, Asymmetric hydrologic distance (AHD)
3. Calculate the spatial weights for the WAHD
4. Convert GIS data to a format compatible with statistics software
FLoWS tools will be available on the STARMAP website: http://nrel.colostate.edu/projects/starmap
[Diagrams: sites 1, 2, and 3 under SLD, SHD, and AHD]

Spatial Weights for WAHD
Proportional influence (PI): influence of each neighboring survey site on a downstream survey site
• Weighted by catchment area: surrogate for flow volume
1. Calculate the PI of each upstream segment on the segment directly downstream
   Segment PI of A = Watershed Area A / Watershed Area B
2. Calculate the PI of one survey site on another site
   • Flow-connected sites
   • Multiply the segment PIs
   Site PI = B * D * F * G
[Diagrams: watershed segments A and B; stream segments A through H with survey sites marked]

Data for Geostatistical Modeling
1. Distance matrices
   • SLD, SHD, AHD
2. Spatial weights matrix
   • Contains flow dependent weights for WAHD
3. Watershed covariates
   • Lumped watershed covariates
     – Mean elevation, % Urban
4. Observations
   • MBSS survey sites

Geostatistical Modeling Methods
Validation Set
• Unique for each chemical response variable
Initial Covariate Selection
• 5 covariates
Model Development
• Restricted model space to all possible linear models
• 4 model sets

Response          Significant Covariates
ANC (μeq/l)       PASTUR, LOWURB, WOODYWET, YR96, YR97
COND (μmho/cm)    HIGHURB, LOWURB, COALMINE, YR96, NORTHING
DOC (mg/l)        WOODYWET, CONIFER, MIXEDFOR, LOWURB, NORTHING
DO (mg/l)         DECIDFOR, HIGHURB, WOODYWET, YR96, YR97
NO3 (mg/l)        PASTUR, PROBCROP, ROWCROP, LOWURB, WATER
pH Lab            PROBCROP, DECIDFOR, WOODYWET, ACREAGE, CONIFER
SO4 (mg/l)        LOWURB, COALMINE, NORTHING, ER67, ER69
TEMP (°C)         PROBCROP, LOWURB, WATER, YR96, YR97

Geostatistical Modeling Methods
Geostatistical model parameter estimation
• Maximize the profile log-likelihood function

Log-likelihood function of the parameters $(\theta, \beta, \sigma^2)$ given the observed data $Z$ is:

\[
\ell(\theta, \beta, \sigma^2; Z) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\log|\Sigma_\theta| - \frac{1}{2\sigma^2}(Z - X\beta)'\,\Sigma_\theta^{-1}(Z - X\beta)
\]

Maximizing the log-likelihood with respect to $\beta$ and $\sigma^2$ yields:

\[
\hat\beta = (X'\Sigma_\theta^{-1}X)^{-1}X'\Sigma_\theta^{-1}Z
\quad\text{and}\quad
\hat\sigma^2 = \frac{(Z - X\hat\beta)'\,\Sigma_\theta^{-1}(Z - X\hat\beta)}{n}
\]

Both maximum likelihood estimators can be written as functions of $\theta$ alone.
Derive the profile log-likelihood function by substituting the MLEs $(\hat\beta, \hat\sigma^2)$ back into the log-likelihood function:

\[
\ell_{\text{profile}}(\theta; \hat\beta, \hat\sigma^2, Z) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat\sigma^2) - \frac{1}{2}\log|\Sigma_\theta| - \frac{n}{2}
\]

Geostatistical Modeling Methods
Covariance matrix for SLD and SHD models
• Fit exponential autocorrelation function

\[
C_1(h; \theta_1, \theta_2) =
\begin{cases}
1 & \text{if } h = 0 \\
(1 - \theta_1)\exp(-h/\theta_2) & \text{if } h > 0
\end{cases}
\]

where $C_1$ is the covariance based on the distance between two sites, $h$, given the autocorrelation parameter estimates: nugget ($\theta_0$), sill ($\theta_1$), and range ($\theta_2$).
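The estimation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the parameter names `nugget_frac` and `range_km`, the synthetic data, and the coarse grid search (standing in for a numerical optimizer) are all hypothetical. The correlation function is 1 at zero distance and decays exponentially otherwise, and the profile log-likelihood plugs in the closed-form estimates of the regression coefficients and the variance.

```python
import numpy as np


def exponential_corr(h, nugget_frac, range_km):
    """Exponential autocorrelation: 1 at h = 0, (1 - nugget_frac) * exp(-h / range_km) for h > 0."""
    h = np.asarray(h, dtype=float)
    return np.where(h == 0.0, 1.0, (1.0 - nugget_frac) * np.exp(-h / range_km))


def profile_loglik(theta, dist, X, z):
    """Profile log-likelihood for theta = (nugget_frac, range_km).

    beta and sigma^2 are replaced by their closed-form MLEs, so only the
    autocorrelation parameters remain to be searched over.
    """
    nugget_frac, range_km = theta
    n = len(z)
    corr = exponential_corr(dist, nugget_frac, range_km)
    chol = np.linalg.cholesky(corr)      # raises LinAlgError if corr is not positive definite
    Xw = np.linalg.solve(chol, X)        # whitened design matrix
    zw = np.linalg.solve(chol, z)        # whitened response
    beta_hat, *_ = np.linalg.lstsq(Xw, zw, rcond=None)  # generalized least squares estimate
    resid = zw - Xw @ beta_hat
    sigma2_hat = resid @ resid / n       # MLE of sigma^2 (divides by n)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    ll = -0.5 * (n * np.log(2.0 * np.pi) + n * np.log(sigma2_hat) + log_det + n)
    return ll, beta_hat, sigma2_hat


# Synthetic example (made-up data, not MBSS values): 40 sites in a 100 km square.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 100.0, size=(40, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # straight-line distance
X = np.column_stack([np.ones(40), rng.uniform(0.0, 50.0, size=40)])      # intercept + one covariate
z = X @ np.array([1.0, 0.05]) + rng.normal(0.0, 0.5, size=40)

# Coarse grid search over the autocorrelation parameters.
grid = [(c, r) for c in (0.1, 0.3, 0.5) for r in (5.0, 20.0, 80.0)]
best_theta = max(grid, key=lambda t: profile_loglik(t, dist, X, z)[0])
print("best (nugget_frac, range_km):", best_theta)
```

The Cholesky factorization doubles as a validity check: it fails when the correlation matrix is not positive definite, which is the problem the slides raise for hydrologic distances.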
Covariance matrix for WAHD model
• Fit exponential autocorrelation function ($C_1$)
• Hadamard (element-wise) product of $C_1$ & square root of spatial weights matrix forced into symmetry

\[
C(s_i, s_j \mid \theta) =
\begin{cases}
0 & \text{if locations are not flow connected,} \\
C_1(0) & \text{if location 1 = location 2,} \\
\left(\prod_{j \in B_D} \sqrt{w_j}\right) C_1(h) & \text{otherwise.}
\end{cases}
\]

Geostatistical Modeling Methods
Model selection within model set
• GLM: Akaike Information Corrected Criterion (AICC)
• Geostatistical models: Spatial AICC (Hoeting et al., in press)

\[
\text{AICC} = -2\,\ell_{\text{profile}}(\theta; \hat\beta, \hat\sigma^2, Z) + 2n\,\frac{p + k + 1}{n - p - k - 2}
\]

where $n$ is the number of observations, $p - 1$ is the number of covariates, and $k$ is the number of autocorrelation parameters.
http://www.stat.colostate.edu/~jah/papers/spavarsel.pdf

Model selection between model types
• 100 predictions: universal kriging algorithm
• Mean square prediction error (MSPE)
• Cannot use AICC to compare models based on different distance measures
Model comparison: r2 for observed vs. predicted values

Results
Summary statistics for distance measures
• Spatial neighborhood differs
• Affects number of neighboring sites
• Affects median, mean, and maximum separation distance

Summary statistics for distance measures in kilometers using DO (n=826).
Distance Measure                        N Pairs   Min    Median   Mean     Max
Straight Line Distance                  340725    0.05   101.02   118.16   385.53
Symmetric Hydrologic Distance           62625     0.05   156.29   187.10   611.74
Pure Asymmetric Hydrologic Distance *   1117      0.05   4.49     5.83     27.44
* Asymmetric hydrologic distance is not weighted here

Results
Range of spatial autocorrelation differs:
• Shortest for SLD
• TEMP = shortest range values
• DO = largest range values
Mean Range Values
• SLD = 28.2 km
• SHD = 88.03 km
• WAHD = 57.8 km
[Figure: bar chart of range (km, axis 0 to 100) by response variable (ANC, COND, DOC, DO, NO3, PHLAB, SO4, TEMP) for SLD, SHD, and WAHD; the values 180.79 and 301.76 are annotated on the chart]

Results
Distance Measures:
• GLM always has less predictive ability
• More than one distance measure usually performed well
  • SLD, SHD, WAHD: PHLAB & DOC
  • SLD and SHD: ANC, DO, NO3
  • WAHD & SHD: COND, TEMP
• SLD distance: SO4
[Figure: MSPE by model type (GLM, SLD, SHD, WAHD), one panel per response variable: ANC, COND, DOC, DO, NO3, PHLAB, SO4, TEMP]

Results
Predictive ability of models: r2
• Strong: ANC, COND, DOC, NO3, PHLAB
• Weak: DO, TEMP, SO4
[Figure: r2 by response variable (ANC, COND, DOC, DO, NO3, PHLAB, SO4, TEMP) for GLM, SLD, SHD, and WAHD]

Discussion
Distance measure influences how spatial relationships are represented in a stream network
• Site's relative influence on other sites
• Dictates form and size of spatial neighborhood
Important because…
• Impacts accuracy of the geostatistical model predictions
[Diagrams: SLD, SHD, WAHD]

Patterns of spatial autocorrelation found at relatively coarse scale
• Geostatistical models describe more variability than GLM
SLD, SHD, and WAHD represent spatial autocorrelation in continuous coarse-scale variables
• More than one distance measure performed well
• SLD never substantially inferior
• Do not represent movement through network
Different range of spatial autocorrelation?
• Larger SHD and WAHD range values
• Separation distance larger when restricted to network
[Diagrams: SLD, SHD]

Discussion
Probability-based random survey design negatively affected WAHD
• Maximize spatial independence of sites
• Does not represent spatial relationships in networks
• Validation sites randomly selected
Sample size = 881
• 244 sites did not have neighbors
• Number of sites with ≤1 neighbor: 393
• Mean number of neighbors per site: 2.81
[Figure: histogram of frequency by number of neighboring sites (0 to 17)]

Discussion
WAHD models explained more variability as neighboring sites increased
Not when neighbors had:
• Similar watershed conditions
• Significantly different chemical response values
[Figure: difference (O – E) by number of neighboring sites (0 to 17) for WAHD and GLM]

Discussion
GLM predictions improved as number of neighbors increased
• Clusters of sites in space have similar watershed conditions –
  Statistical regression pulled towards the cluster
• GLM contained hidden spatial information
  – Explained additional variability in data with more neighbors
[Figure: difference (O – E) by number of neighboring sites (0 to 17) for WAHD and GLM]

Predictive Ability of Geostatistical Models
[Figure: response variables (COND, SO4, ANC, PH, NO3, DOC, TEMP, DO) plotted by r2 (0 to 1.0) against the scale of influential ecological processes, from coarse to fine]

Conclusions
1) Spatial autocorrelation exists in stream chemistry data at a relatively coarse scale
2) Geostatistical models improve the accuracy of water chemistry predictions
3) Patterns of spatial autocorrelation differ between chemical response variables
   • Ecological processes acting at different spatial scales
4) SLD is the most suitable distance measure at regional scale at this time
   • Unsuitable survey designs
   • SHD: GIS processing time is prohibitive

Conclusions
5) Results are scale specific
   • Spatial patterns change with survey scale
   • Other patterns may emerge at shorter separation distances
6) Further research is needed at finer scales
   • Watershed or small stream network
7) New survey designs for stream networks
   • Capture both coarse and fine scale variation
   • Ensure that hydrologic neighborhoods are represented

Predicting Water Quality Impaired Stream Segments using Landscape-scale Data and a Regional Geostatistical Model: A Case Study in Maryland

Objective
Demonstrate how a geostatistical methodology can be used to complement regional water quality monitoring efforts
1) Predict regional water quality conditions
2) Identify the spatial location of potentially impaired stream segments

1996 MBSS DOC Data
[Map: 1996 MBSS DOC survey sites, with north arrow and scale bar in kilometers]

n     Min   1st Qu.   Median   Mean   3rd Qu.   Max    σ2
312   0.6   1.2       1.7      1.9    2.7       15.9   1.8

Methods
Potential covariates

Covariate    Description                                                               Spatial Resolution
AREA         Catchment area (ha)                                                       30 meter
URBAN        % Urban                                                                   30 meter
BARREN       % Barren                                                                  30 meter
WATER        % Open Water                                                              30 meter
CONIFER      % Conifer or evergreen forest type                                        30 meter
DECIDFOR     % Deciduous forest type                                                   30 meter
MIXEDFOR     % Mixed forest type                                                       30 meter
EMERGWET     % Emergent Herbaceous Wetlands                                            30 meter
WOODYWET     % Woody or shrubby wetlands                                               30 meter
COALMINE     % Coalmine                                                                30 meter
EASTING      Easting - Albers Equal Area Conic                                         1 foot
NORTHING     Northing - Albers Equal Area Conic                                        1 foot
ER63-ER69    Omernik's Level III Ecoregion                                             1:7,500,000
MEANELEV     Mean elevation in the watershed                                           30 meter
SLOPE        Mean slope in the watershed                                               30 meter
ARGPERC      % Argillaceous rock type in watershed                                     1:250,000
CARPERC      % Carbonic rock type in watershed                                         1:250,000
FELPERC      % Felsic rock type in watershed                                           1:250,000
MAFPERC      % Mafic rock type in watershed                                            1:250,000
SILPERC      % Siliceous rock type in watershed                                        1:250,000
MEANK        Mean soil erodibility factor in watershed (adjusted for rock fragments)   1 kilometer
MAXTEMP      Mean annual maximum temperature (°C)                                      4 kilometer
MINTEMP      Mean minimum temperature for January - April (°C)                         4 kilometer
PRECIP       Mean precipitation for January - April (mm)                               4 kilometer
ANPRECIP     Mean annual precipitation                                                 4 kilometer

Methods
Potential covariates after initial model selection (10)
[Covariate table repeated from the previous slide]

Methods
Fit geostatistical models
• Two distance measures: SLD and WAHD
• Restricted model space to all possible linear models
  • 1024 models per set
  • 9 model sets
• Parameter Estimation
  • Maximized profile log-likelihood function
Autocorrelation functions (table columns: SLD, WAHD)
• Exponential
• Spherical
• Mariah
• Hole Effect
• Linear with Sill
• Rational Quadratic

Methods
Model selection within distance measure & autocorrelation function
• Spatial AICC (Hoeting et al., in press)
Model selection between distance measure & autocorrelation function
• Cross-validation method using universal kriging algorithm
  – 312 predictions
• MSPE
Model comparison: r2 for the observed vs.
predicted values

Results
SLD models performed better than WAHD
• Exception: Spherical model
Best models:
• SLD Exponential, Mariah, and Rational Quadratic models
[Figure: MSPE (axis 0 to 1.6) by autocorrelation function (Exponential, Spherical, Mariah, Hole Effect, Linear with Sill, Rational Quadratic) for SLD and WAHD]
r2 for SLD model predictions
• Almost identical
• Further analysis restricted to SLD Mariah model

                     Exponential   Rational Quadratic   Mariah
Exponential          1
Rational Quadratic   0.997         1
Mariah               0.990         0.993                1

Results
Covariates for SLD Mariah model: WATER, EMERGWET, WOODYWET, FELPERC, & MINTEMP

Parameter   Estimate
Nugget      0.15
Sill        0.28
Range       7.02
Intercept   0.28
WATER       0.05
EMERGWET    0.04
WOODYWET    0.02
FELPERC     -0.0005
MINTEMP     0.07

Positive relationship with DOC:
• WATER, EMERGWET, WOODYWET, MINTEMP
Negative relationship with DOC:
• FELPERC

Cross-validation intervals for Mariah model regression coefficients
Cross-validation interval: 95% of regression coefficients produced by leave-one-out cross validation procedure
Narrow intervals
• Few extreme regression coefficient values
  – Not produced by common sites
  – Covariate values for the site are represented in observed data
  – Not clustered in space
Model coefficients represent change in log10 DOC per unit of X

Statistic         WATER (%)   EMERGWET (%)   WOODYWET (%)   FELPERC (%)   MINTEMP (°C)
Minimum           0.0469      0.0306         0.0156         -0.0006       0.0616
Maximum           0.0537      0.0425         0.0187         -0.0004       0.071
Mean              0.0501      0.0344         0.0176         -0.0005       0.0655
Standard Dev      0.0007      0.0009         0.0002         0.00005       0.0007
95% Lower Limit   0.0485      0.0322         0.017          -0.0006       0.0643
95% Upper Limit   0.0522      0.0366         0.0179         -0.0005       0.0669

r2 Observed vs. Predicted Values
[Figure: scatterplot of predicted DOC (mg/l) vs. observed DOC (mg/l)]
• n = 312 sites
• r2 = 0.72
• 1 influential site
• r2 without site = 0.66

Model Fit
[Map: squared prediction error (SPE)]

Discussion
• SLD models more accurate than WAHD models
• Landscape-scale covariates were not restricted to watershed boundaries
  – Geology type
  – Temperature
  – Wetlands & water

Discussion
Regression Coefficients
Narrow cross-validation intervals
• Spatial location of the sites not as important as watershed characteristics
Extreme regression coefficient values
• Not produced by common sites
• Not clustered in space
Local-scale factor may have affected stream DOC
• Point source of organic waste

Spatial Patterns in Model Fit
North and east of Chesapeake Bay - large SPE values
• Naturally acidic blackwater streams with elevated DOC
• Not well represented in observed dataset
  – 2 blackwater sites
• Geostatistical model unable to account for natural variability
  – Large squared prediction errors
  – Large prediction variances
[Map: SPE values]

Spatial Patterns in Model Fit
West of Chesapeake Bay - low SPE values
• Due to statistical and spatial distribution of observed data
  – Regression equation fit to the mean in the data
  – Most observed sites = low DOC values
• Less variation in western and central Maryland
  – Neighboring sites tend to be similar
• Separation distances shorter in the west
  – Short separation distances = stronger covariances
[Map: SPE values]

Model Performance
Unable to account for abrupt differences in DOC values between neighboring sites with similar watershed conditions
What caused abrupt differences?
• Point sources of organic pollution
  – Not represented in the model
• Non-point sources of pollution
  – Lumped watershed attributes are non-spatial
  – Differences due to spatial location of land use are not represented
  – Challenging to represent ecological processes using coarse-scale lumped attributes
  – i.e. flow path of water

Generate Model Predictions
Prediction sites
• Study area
  – 1st, 2nd, and 3rd order non-tidal streams
  – 3083 segments = 5973 stream km
• ID downstream node of each segment
  – Create prediction site
• More than one site at each confluence
Generate predictions and prediction variances
• SLD Mariah model
• Universal kriging algorithm
• Assigned predictions and prediction variances back to stream segments in GIS

[Map: DOC predictions (mg/l), with areas of weak and strong model fit marked]

Water Quality Attainment by Stream Kilometers
Threshold values for DOC
• Set by Maryland Department of Natural Resources
• High DOC values may indicate biological or ecological stress

Threshold   DOC (mg/l)   Stream Kilometers   Percent
Low         < 5.0        5387.67             90.2
Medium      5.0 - 8.0    400.19              6.7
High        > 8.0        185.16              3.1

Implications for Water Quality Monitoring
1) One geostatistical model can be used to predict DOC in stream segments throughout a large area
   • Can be used to provide an estimate of regional stream DOC values
   • Cannot identify point sources of organic pollution
2) Tradeoff between cost-efficiency and model accuracy
   Western Maryland
   • Can be described using a single geostatistical model
   Eastern and northeastern Maryland
   • Accept poor model fit
   • Collect additional survey data
   • Develop a separate geostatistical model for eastern Maryland

Implications for Water Quality Monitoring
3) Apply this methodology to other regulated indices
   • e.g. conductivity and pH
   • Categorize predictions into potentially impaired or unimpaired status
   • Report on attainment in stream miles/kilometers

Conclusions
1) Geostatistical models generated more accurate DOC predictions than previous non-spatial models based on coarse-scale landscape data
2) SLD is more appropriate than WAHD for regional geostatistical modeling of DOC at this time
   • Probability-based random survey designs
   • Maryland, USA
3) Adds value to existing water quality monitoring efforts
   • Used to evaluate/report regional water quality conditions
   • Additional field sampling is not necessary
   • Generate inferences about regional stream condition
   • ID spatial location of potentially impaired stream segments

Conclusions
4) Model predictions and prediction variances
   • Additional field efforts concentrated in
     – Areas with large amounts of uncertainty
     – Areas with a greater potential for water quality impairment
5) Model results displayed visually
   • Communicate results to a variety of audiences

Questions?