Using the Maryland Biological Stream Survey Data to Test Spatial Statistical Models
A Collaborative Approach to Analyzing Stream Network Data
Andrew A. Merton

Overview
• The material presented here is a subset of the work done by Erin Peterson for her Ph.D.
• Interested in developing geostatistical models for predicting water quality characteristics in stream segments
  • Data: Maryland Biological Stream Survey (MBSS)
• The scope and nature of the problem require interdisciplinary collaboration
  • Ecology, geoscience, statistics, others…

Stream Network Data
• The response data consist of observations within a stream network
  • What does it mean to be a “neighbor” in such a framework?
  • How does one characterize the distance between “neighbors”?
  • Should distance measures be confined to the stream network?
  • Does flow (direction) matter?

Stream Network Data
• Potential explanatory variables are not restricted to be within the stream network
  • Topography, soil type, land usage, etc.
• How does one sensibly incorporate these explanatory variables into the analysis?
  • Can we develop tools to aggregate upstream watershed covariates for subsequent downstream segments?

Competing Models
• Given a collection of competing models, how does one select the “best” model?
  • Is one subset of explanatory variables better or closer to the “true” model?
  • Should one assume correlated residuals and, if so, what form should the correlation function take?
  • How does the distance measure impact the choice of correlation function?

Functional Distances & Spatial Relationships
• Geostatistical models are based on straight-line distance
[Figure: sites A, B, C] Straight-line Distance (SLD)
• Is this an appropriate measure of distance?
• Influential continuous landscape variables: geology type or acid rain (As the crow flies…)

Functional Distances & Spatial Relationships
• Distances and relationships are represented differently depending on the distance measure
[Figure: sites A, B, C] Symmetric Hydrologic Distance (SHD)
• Hydrologic connectivity (As the fish swims…)

Functional Distances & Spatial Relationships
• Distances and relationships are represented differently depending on the distance measure
[Figure: sites A, B, C] Asymmetric Hydrologic Distance (AHD)
• Longitudinal transport of material (As the sh*t flows…)

Candidate Models
• Restrict the model space to general linear models
\[ Z \sim N(X\beta, \Sigma) = N(X\beta, \sigma^2 \Psi(\theta)) \]
  where \Psi(\theta) denotes the correlation matrix with autocorrelation parameters \theta
• Look at all possible subsets of explanatory variables X (Hoeting et al.)
• Require a correlation structure that can accommodate the various distance measures
  • Could assume that the residuals are spatially independent, i.e., \Sigma = \sigma^2 I (probably not best)
  • Ver Hoef et al. propose a better solution

Asymmetric Autocovariance Models for Stream Networks
• Weighted asymmetric hydrologic distance (WAHD)
• Developed by Jay Ver Hoef, National Marine Mammal Laboratory, Seattle
• Moving average models
• Incorporates flow and uses hydrologic distance
• Represents discontinuity at confluences
[Figure: stream network with flow direction indicated]

Exponential Correlation Structure
• The exponential correlation function can be used for both SLD and SHD
\[ \Psi = [\rho_{ij}]_{i,j=1}^{n} \quad \text{such that} \quad \rho_{ij} = \begin{cases} 1, & h_{ij} = 0 \\ (1 - \theta_1)\exp(-h_{ij}/\theta_2), & h_{ij} > 0 \end{cases} \]
  where h_{ij} is the separation distance between sites i and j
• For AHD, one must multiply (element-wise) by the weight matrix A, i.e., \rho^*_{ij} = a_{ij}\rho_{ij}, hence WAHD
• The weights represent the proportion of flow volume that the downstream location receives from the upstream location
• Estimating the a_{ij} is non-trivial; it requires special GIS tools (Theobald et al.)

GIS Tools
• Theobald et al. have created automated tools to extract data about hydrologic relationships between sample points
• Visual Basic for Applications programs that:
  1. Calculate separation distances between sites
     • SLD, SHD, asymmetric hydrologic distance (AHD)
  2. Calculate watershed covariates for each stream segment
     • Functional Linkage of Watersheds and Streams (FLoWS)
  3. Convert GIS data to a format compatible with statistics software
[Figure: three panels showing sites 1, 2, 3 under SLD, SHD, and AHD]

Spatial Weights for WAHD
• Proportional influence (PI): influence of each neighboring sample site on a downstream sample site
  • Weighted by catchment area: surrogate for flow
  1. Calculate influence of each upstream segment on segment directly downstream
  2. Calculate the proportional influence of one sample site on another
     • Multiply the edge proportional influences
  3. Output:
     • n×n weighted incidence matrix
[Figure: stream network with segments labeled A to H, marking stream confluences, stream segments, and survey sites]
• Site PI = B * D * F * G

Parameter Estimation
• Maximize the (profile) likelihood to obtain estimates for \beta, \theta, and \sigma^2
• Profile likelihood:
\[ \ell_{profile}(\theta; \hat{\beta}, \hat{\sigma}^2, Z) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat{\sigma}^2) - \frac{1}{2}\log|\Psi(\theta)| - \frac{n}{2} \]
• MLEs, with \Psi = \Psi(\theta) the correlation matrix:
\[ \hat{\beta}(\theta) = (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}Z \]
\[ \hat{\sigma}^2(\theta) = \frac{(Z - X\hat{\beta})'\Psi^{-1}(Z - X\hat{\beta})}{n} \]

Model Selection
• Hoeting et al. adapted the corrected Akaike Information Criterion (AICC) for spatial models
• AICC estimates the difference between the candidate model and the “true” model
• Select models with small AICC
\[ AICC = -2\,\ell_{profile}(\hat{\theta}; \hat{\beta}, \hat{\sigma}^2, Z) + 2n\,\frac{p + k + 1}{n - p - k - 2} \]
  where n is the number of observations, p-1 is the number of covariates, and k is the number of autocorrelation parameters

Spatial Distribution of MBSS Data
[Figure: map of MBSS sample sites]

Summary Statistics for Distance Measures
• Distance measure greatly impacts the number of neighboring sites as well as the median, mean, and maximum separation distance between sites
Summary statistics for distance measures in kilometers using DO (n=826).
Distance Measure                      N Pairs   Min (km)  Median (km)  Mean (km)  Max (km)
Straight Line Distance                 340725     0.05      101.02      118.16     385.53
Symmetric Hydrologic Distance           62625     0.05      156.29      187.10     611.74
Pure Asymmetric Hydrologic Distance*     1117     0.05        4.49        5.83      27.44
* Asymmetric hydrologic distance is not weighted here

Comparing Distance Measures
• The “selected” models (one for each distance measure) were compared by computing the mean square prediction error (MSPE)
  • GLM: Assumed independent errors
• Withheld the same 100 (randomly) selected records from each model fit
• Want MSPE to be small
\[ MSPE = \frac{\sum_{i=1}^{n_p} (Z_i - \hat{Z}_i)^2}{n_p} \]

Comparing Distance Measures
Prediction Performance for Various Responses
[Figure: bar charts of MSPE by model (GLM, SLD, SHD, WAHD), one panel per response: ANC, DOC, COND, DO, PHLAB, NO3, SO4, TEMP]

Maps of the Relative Weights
• Generated maps by kriging (interpolation)
• Predicted values are linear combinations of the “observed” data, i.e.,
\[ E(Z_2 \mid Z_1) = \left( X_2 (X_1^T \Sigma_{11}^{-1} X_1)^{-1} X_1^T \Sigma_{11}^{-1} + \Sigma_{21}\Sigma_{11}^{-1}\left( I - X_1 (X_1^T \Sigma_{11}^{-1} X_1)^{-1} X_1^T \Sigma_{11}^{-1} \right) \right) Z_1 = M Z_1 \]
• Z_1 is the observed data, Z_2 is the predicted value, \Sigma_{11} is the correlation matrix for the observed sites, and \Sigma_{21} is the correlation matrix between the prediction site and the observed sites

Relative Weights Used to Make Prediction at Site 465
[Figure: four map panels: General Linear Model, Straight-line, Symmetric Hydrologic, Weighted Asymmetric Hydrologic]

Residual Correlations for Site 465
[Figure: four map panels: General Linear Model, Straight-line, Symmetric Hydrologic, Weighted Asymmetric Hydrologic]

Some Comments on the Sampling Design
• Probability-based random survey design
  • Designed to maximize spatial independence of survey sites
  • Does not adequately represent spatial relationships in stream networks using hydrologic distance measures
[Figure: histogram of frequency by number of neighboring sites (0 to 17)]
• Sample size = 881
• 244 sites did not have neighbors
• Number of sites with ≥ 1 neighbor: 393
• Mean number of neighbors per site: 2.81

Conclusions
• A collaborative effort enabled the analysis of a complicated problem
  • Ecology: posed the problem of interest, provides insight into variable (model) selection
  • Geoscience: development of powerful tools based on GIS
  • Statistics: development of valid covariance structures, model selection techniques
  • Others: e.g., very understanding (and sympathetic) spouses…
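
Supplementary Sketch: Exponential Correlation and WAHD Weights
The exponential correlation structure and the element-wise weighting used for WAHD can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the GIS tools of Theobald et al. or the models of Ver Hoef et al.: the function names, the three-site network, the catchment areas, the distances, and the parameter values (theta1, theta2) are all hypothetical. The edge influence is assumed here to be a segment's share of the catchment area entering a confluence.

```python
import math


def exponential_correlation(h, theta1, theta2):
    """Correlation matrix: 1 where h_ij = 0, else (1 - theta1) * exp(-h_ij / theta2)."""
    n = len(h)
    return [
        [1.0 if h[i][j] == 0 else (1.0 - theta1) * math.exp(-h[i][j] / theta2)
         for j in range(n)]
        for i in range(n)
    ]


def edge_influence(area, areas_at_confluence):
    """Assumed edge PI: a segment's share of the catchment area entering a confluence."""
    return area / sum(areas_at_confluence)


def site_influence(edge_pis):
    """Site PI: product of the edge PIs along the path between two sites."""
    return math.prod(edge_pis)


def weight_correlation(rho, a):
    """Element-wise product rho*_ij = a_ij * rho_ij."""
    n = len(rho)
    return [[a[i][j] * rho[i][j] for j in range(n)] for i in range(n)]


# Hypothetical network: sites 0 and 1 sit on two tributaries that join
# upstream of site 2. Catchment areas (arbitrary units) are made up.
pi_02 = site_influence([edge_influence(70.0, [70.0, 30.0])])
pi_12 = site_influence([edge_influence(30.0, [70.0, 30.0])])

# Weights: 1 on the diagonal, 0 for the pair that is not flow-connected.
a = [
    [1.0, 0.0, pi_02],
    [0.0, 1.0, pi_12],
    [pi_02, pi_12, 1.0],
]

# Hypothetical hydrologic separation distances in km.
h = [
    [0.0, 10.0, 4.0],
    [10.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
]

rho = exponential_correlation(h, theta1=0.1, theta2=5.0)
rho_wahd = weight_correlation(rho, a)

for row in rho_wahd:
    print([round(v, 3) for v in row])
```

Sites 0 and 1 are not flow-connected, so their weighted correlation is zero even though their symmetric hydrologic distance is finite. The sketch does not check that the weighted matrix is a valid (positive definite) correlation matrix.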
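
Supplementary Sketch: AICC and MSPE
The two comparison criteria used in this talk, AICC for model selection and MSPE for prediction performance, are simple to compute once a model has been fit. The sketch below assumes the maximized profile log-likelihood is already available from some fitting routine; the function names and all numeric inputs are hypothetical and are not results from the MBSS analysis.

```python
def aicc(profile_loglik, n, p, k):
    """AICC = -2 * loglik + 2n(p + k + 1) / (n - p - k - 2).

    n: number of observations; p - 1: number of covariates;
    k: number of autocorrelation parameters.
    """
    denom = n - p - k - 2
    if denom <= 0:
        raise ValueError("need n > p + k + 2 for AICC to be defined")
    return -2.0 * profile_loglik + 2.0 * n * (p + k + 1) / denom


def mspe(observed, predicted):
    """Mean square prediction error over the withheld records."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal length")
    return sum((z - z_hat) ** 2 for z, z_hat in zip(observed, predicted)) / len(observed)


# Hypothetical candidate models: (name, profile log-likelihood, p, k).
candidates = [
    ("independent errors", -130.0, 3, 0),
    ("exponential, SLD", -120.0, 3, 2),
]
n_obs = 50
scores = {name: aicc(ll, n_obs, p, k) for name, ll, p, k in candidates}
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))

# Hypothetical withheld observations and their predictions.
print(mspe([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

The penalty term grows with both the number of covariates and the number of autocorrelation parameters, so a spatial model is preferred only when its likelihood gain outweighs the extra parameters.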