Local Enhancement of Global Estimation
Molly Leecaster, Ph.D.
Kerry Ritter, Ph.D.
DAMARS and STARMAP 2nd Annual Conference
Oregon State University, Corvallis, OR
August 11, 2003

Acknowledgement
PROJECT FUNDING
• The work reported here was developed under the STAR Research Assistance Agreement CR-829095 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This presentation has not been formally reviewed by EPA. The views expressed here are solely those of the presenter and STARMAP, the Program they represent. EPA does not endorse any products or commercial services mentioned in this presentation.

Outline of Presentation
• Introduction
• Two-stage sample design
• Spatial modeling of binary EMAP data
  – Indicator kriging
  – Conditional autoregressive model
• Simulation Example
• Future work

Introduction
• EMAP developed for estimation of areal extent of resources
  – Sample locations are spatially separated
• EMAP participants are interested in global estimation but also have local concerns
  – Spatial modeling
• EMAP data does not provide information on the local spatial structure required for good spatial models
• Therefore …
Augment EMAP design to improve spatial modeling

Goals
• Present enhancement to EMAP design
• Use of enhanced sample in spatial models of indicator data
  – Indicator kriging
  – Conditional autoregressive model

Two-stage: Systematic Grid Plus Star Cluster Sample Design
• Two-stage because two goals
  – Systematic (EMAP) grid for global structure
  – Star cluster sample for variogram estimation
• Enhance EMAP design with additional sample locations
  – Ideal for areal extent and prediction
  – Ideal for variogram estimation

Two-Stage Design
[Figure: StarData1F, simulated presence/absence map with sample sites, x and y from 0 to 80. Legend: pink = absence, blue = presence, black = systematic, green = star clusters 1, orange = star clusters 2]

Stage One: Systematic Component (EMAP)
• Based on global estimation requirements
  – e.g. 30 spatially separated locations per stratum

Stage Two: Star Cluster Component
• Star clusters of sample sites around stage-one locations
• Star clusters provide an estimate of small-scale pairwise variance
• Star clusters also provide many added pairs of samples at various distance lags
• Star clusters provide directional information at small scale
• How to specify star clusters?

Stage Two: Star Cluster Component
• Location of star clusters
  – Adaptive, locate at specified observed response
    • Does this bias the variogram estimation?
  – Random stage-one locations
  – Systematic subset of stage-one locations
• Size of star clusters
  – Diameter of star = variogram range
  – Diameter of star > variogram range
• Number of star clusters
  – At least two, but how many more?
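The way star clusters add short-lag sample pairs can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the presenters' design: a square (rather than hexagonal) stage-one grid of 6 x 5 = 30 sites on an 80 x 80 region, stars with six arms and two sites per arm (12 added sites per cluster), arm distances of 2 and 5 units, and arbitrarily chosen cluster centers. The function names are hypothetical.

```python
import numpy as np

def systematic_grid(nx=6, ny=5, extent=80.0):
    """Stage one (assumed square grid): nx*ny regularly spaced sites on [0, extent]^2."""
    xs = (np.arange(nx) + 0.5) * extent / nx
    ys = (np.arange(ny) + 0.5) * extent / ny
    gx, gy = np.meshgrid(xs, ys)
    return np.column_stack([gx.ravel(), gy.ravel()])

def star_cluster(center, n_arms=6, lags=(2.0, 5.0)):
    """Stage two (assumed geometry): sites along n_arms rays at the given distances."""
    angles = np.arange(n_arms) * 2.0 * np.pi / n_arms
    pts = [center + r * np.array([np.cos(a), np.sin(a)])
           for a in angles for r in lags]
    return np.array(pts)

def pair_counts(sites, bin_edges):
    """Number of distinct site pairs falling in each distance-lag bin."""
    diff = sites[:, None, :] - sites[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(sites), k=1)  # count each pair once
    counts, _ = np.histogram(dist[upper], bins=bin_edges)
    return counts

grid = systematic_grid()
centers = grid[[7, 22]]  # two stage-one sites, picked arbitrarily for the example
stars = np.vstack([star_cluster(c) for c in centers])
design = np.vstack([grid, stars])

edges = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 30.0])
print("sites:", len(grid), "->", len(design))
print("pairs per lag bin, systematic only:", pair_counts(grid, edges))
print("pairs per lag bin, with 2 stars:   ", pair_counts(design, edges))
```

Under these assumptions the systematic grid alone has no pairs closer than its spacing, so the shortest lag bins of an empirical variogram are empty; the star sites fill those bins, which is the small-scale information the stage-two component is meant to supply.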
Spatial Models for Binary Data
• Indicator kriging for geo-referenced data
• Conditional autoregressive model for binary lattice data

Indicator Kriging
• Binary geo-referenced data
• Spatial correlation structure modeled from data
• Precision of predictions depends on sample spacing and variogram parameters

Ordinary Indicator Kriging
• Estimate local indicator mean, $m^*_{OK}(u; z_k)$, at each location $u$
• Apply simple IK estimator using estimated mean

$$[F(u; z_k)]^*_{oIK} = \sum_{\alpha=1}^{n(u)} \lambda^{SK}_{\alpha}(u; z_k)\, I(u_\alpha; z_k) + \left[1 - \sum_{\alpha=1}^{n(u)} \lambda^{SK}_{\alpha}(u; z_k)\right] m^*_{OK}(u; z_k)$$

Conditional Autoregressive Model for Binary Data
• Binary lattice data
• Spatial correlation structure assumed: locally (neighborhood) dependent Markov random field
• Neighborhood defined as fixed pattern of surrounding grid points
• Precision of predictions depends on neighborhood structure, grid size, and variance of response

Conditional Autoregressive Model for Binary Data

$$y_i = x_i a_i$$
$$x_i \sim \text{Bernoulli}(p_i)$$
$$p_i = \frac{\exp(\beta_0 + \beta_1 s_i)}{1 + \exp(\beta_0 + \beta_1 s_i)}$$

where
  $y_i$ = observed presence/absence
  $x_i$ = true presence/absence
  $a_i$ = sample indicator
  $s_i$ = sum of neighborhood presence/absence

Comparison of Models
• Ordinary Indicator Kriging
  – Advantages
    • Knowledge of spatial relationship improves prediction
    • Assumed spatial relationship based on data
  – Disadvantages
    • Not robust to variogram mis-specification
    • Requires strong stationarity assumption
• Conditional autoregressive
  – Advantages
    • No need to estimate or model variogram
    • Can be used without geo-referenced data
  – Disadvantages
    • Assumed spatial relationship based on a grid size that could be inaccurate

Simulation Example
• Used simulation so spatial structure was known
• Simulated response from specific variogram model onto 50x50 hexagon grid
of points
• Specified presence/absence cutoff
• Applied two-stage sample design (2 realizations)
• Estimated and modeled variogram from sample data
  – For some, did two manual and one automatic fit
• Predicted probability of presence using indicator kriging and conditional autoregressive model

Simulation Methods
• Simulated data from Gaussian random field (S-Plus)
  – Spherical variogram, range = 22, sill = 0.4, nugget = 0
  – Simulated value > 2 => presence
• Sample Designs
  – Systematic sample (n=30)
  – Systematic sample plus 2 star clusters (n=54)
  – Systematic sample plus 4 star clusters (n=78)
• Models
  – Indicator kriging
  – Conditional autoregressive model

Data Simulation with Sample Sites
[Figure: StarData1F, simulated presence/absence map with sample sites, x and y from 0 to 80. Legend: pink = absence, blue = presence, black = systematic, green = star clusters 1, orange = star clusters 2]

Variogram for Sample Designs
[Figure: empirical variograms with fitted models, gamma vs. distance (0 to 50), panels: Systematic, Systematic + 2 Stars, Systematic + 4 Stars]

Design     | Range | Sill | Nugget
-----------|-------|------|-------
Systematic | 17    | 0.17 | 0
Sys. + 2   | 20    | 0.4  | 0
Sys. + 4   | 14    | 0.4  | 0

Systematic Sample Results
[Figure: Data 1F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData1F; Ordinary Indicator Kriging Predictions From Systematic Sample; Conditional Autoregressive Model Predictions From Systematic Sample]

Systematic Sample with 2 Stars
[Figure: Data 1F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData1F; Ordinary Indicator Kriging Predictions From Systematic + 2 Star Sample; Conditional Autoregressive Model Predictions From Systematic + 2 Star Sample]

Systematic Sample with 4 Stars
[Figure: Data 1F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData1F; Ordinary Indicator Kriging Predictions From Systematic + 4 Star Sample; Conditional Autoregressive Model Predictions From Systematic + 4 Star Sample]

Three Fits: Systematic + 2 Stars
[Figure: variogram fits, gamma vs. distance (0 to 50), panels: Automatic Fit, Manual Fit #1, Manual Fit #2; objective values shown on the panels: 0.1467, 0.2307, 0.197]
Fitted parameters listed on the slide (Range, Sill, Nugget): (17, 0.3, 0); (20, 0.4, 0); (11, 0.27, 0)
• All use correct model

Predictions from 3 Variogram Fits
[Figure: Ordinary Indicator Kriging Predictions From Systematic + 2 Star Sample on Data 1F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData1F, Automatic Fit, Manual Fit #1, Manual Fit #2]

Comparison of Prediction Errors
• Sensitivity – Proportion of presence sites predicted to be present
• Specificity – Proportion of absence sites predicted to be absent
• True Positive Rate – Proportion of predicted presence sites that truly are present (the positive predictive value)
• True Negative Rate – Proportion of predicted absence sites that truly are absent (the negative predictive value)

Comparison of Predictions (Data1F)
(positive if probability > 0.5; values in parentheses are for the Auto and Manual #2 variogram fits)

Model             | Sample                                  | Sensitivity | Specificity | True Positive Rate | True Negative Rate
------------------|-----------------------------------------|-------------|-------------|--------------------|-------------------
Indicator Kriging | Systematic                              | 28%         | 98%         | 85%                | 74%
Indicator Kriging | Systematic + 2 Stars                    | 41%         | 94%         | 77%                | 77%
Indicator Kriging | Systematic + 2 Stars (Auto, Manual #2)  | (36%, 27%)  | (96%, 99%)  | (80%, 76%)         | (90%, 74%)
Indicator Kriging | Systematic + 4 Stars                    | 32%         | 97%         | 85%                | 75%
Conditional Auto. | Systematic                              | 15%         | 96%         | 63%                | 70%
Conditional Auto. | Systematic + 2 Stars                    | 56%         | 85%         | 64%                | 80%
Conditional Auto. | Systematic + 4 Stars                    | 54%         | 86%         | 65%                | 80%

Comparison of Predictions (Data1F)
(positive if probability > 0.3; values in parentheses are for the Auto and Manual #2 variogram fits)

Model             | Sample                                  | Sensitivity | Specificity | True Positive Rate | True Negative Rate
------------------|-----------------------------------------|-------------|-------------|--------------------|-------------------
Indicator Kriging | Systematic                              | 48%         | 91%         | 71%                | 78%
Indicator Kriging | Systematic + 2 Stars                    | 59%         | 85%         | 65%                | 81%
Indicator Kriging | Systematic + 2 Stars (Auto, Manual #2)  | (56%, 44%)  | (87%, 93%)  | (67%, 76%)         | (80%, 78%)
Indicator Kriging | Systematic + 4 Stars                    | 49%         | 91%         | 73%                | 79%
Conditional Auto. | Systematic                              | 48%         | 80%         | 53%                | 76%
Conditional Auto. | Systematic + 2 Stars                    | 80%         | 46%         | 42%                | 83%
Conditional Auto. | Systematic + 4 Stars                    | 80%         | 49%         | 43%                | 83%

Data Simulation with Sample Sites
[Figure: StarData3F, simulated presence/absence map with sample sites, x and y from 0 to 80. Legend: pink = absence, blue = presence, black = systematic, green = star clusters 1, orange = star clusters 2]

Variograms for Sample Designs
[Figure: empirical variograms with fitted models, gamma vs. distance (0 to 40), panels: Systematic, Systematic + 2 Stars, Systematic + 4 Stars]

Design     | Range | Sill | Nugget
-----------|-------|------|-------
Systematic | 15    | 0.27 | 0
Sys. + 2   | 12    | 0.30 | 0.05
Sys. + 4   | 13    | 0.30 | 0.03

Systematic Sample Results
[Figure: Data 3F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData3F; Ordinary Indicator Kriging Predictions From Systematic Sample; Conditional Autoregressive Model Predictions From Systematic Sample]

Systematic Sample with 2 Stars
[Figure: Data 3F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData3F; Ordinary Indicator Kriging Predictions From Systematic + 2 Star Sample; Conditional Autoregressive Model Predictions From Systematic + 2 Star Sample]

Systematic Sample with 4 Stars
[Figure: Data 3F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData3F; Ordinary Indicator Kriging Predictions From Systematic + 4 Star Sample; Conditional Autoregressive Model Predictions From Systematic + 4 Star Sample]

Three Fits: Systematic
[Figure: variogram fits, gamma vs. distance (0 to 40), panels: Automatic Fit, Manual Fit #1, Manual Fit #2; objective values shown on the panels: 0.0356, 0.0519, 0.0333]
Fitted parameters listed on the slide (Range, Sill, Nugget): (15, .27, 0); (8, .22, 0); and a third fit with range 30 and the values .21 and .25
• All use correct model

Predictions from 3 Variogram Fits
[Figure: Ordinary Indicator Kriging Predictions From Systematic Sample on Data 3F, x and y from 0 to 80, probability scale 0 to 1. Panels: StarData3F, Automatic Fit, Manual Fit #1, Manual Fit #2]

Comparison of Predictions (Data3F)
(positive if probability > 0.5; values in parentheses are for the Auto and Manual #2 variogram fits)

Model             | Sample                        | Sensitivity | Specificity | True Positive Rate | True Negative Rate
------------------|-------------------------------|-------------|-------------|--------------------|-------------------
Indicator Kriging | Systematic                    | 31%         | 92%         | 65%                | 73%
Indicator Kriging | Systematic (Auto, Manual #2)  | (1%, 15%)   | (99%, 97%)  | (88%, 69%)         | (68%, 70%)
Indicator Kriging | Systematic + 2 Stars          | 21%         | 96%         | 75%                | 72%
Indicator Kriging | Systematic + 4 Stars          | 24%         | 97%         | 81%                | 72%
Conditional Auto. | Systematic                    | 7%          | 98%         | 65%                | 69%
Conditional Auto. | Systematic + 2 Stars          | 17%         | 97%         | 71%                | 71%
Conditional Auto. | Systematic + 4 Stars          | 18%         | 99%         | 88%                | 71%

Comparison of Predictions (Data3F)
(positive if probability > 0.3; values in parentheses are for the Auto and Manual #2 variogram fits)

Model             | Sample                        | Sensitivity | Specificity | True Positive Rate | True Negative Rate
------------------|-------------------------------|-------------|-------------|--------------------|-------------------
Indicator Kriging | Systematic                    | 62%         | 80%         | 60%                | 81%
Indicator Kriging | Systematic (Auto, Manual #2)  | (72%, 37%)  | (69%, 89%)  | (53%, 63%)         | (84%, 75%)
Indicator Kriging | Systematic + 2 Stars          | 43%         | 90%         | 68%                | 77%
Indicator Kriging | Systematic + 4 Stars          | 44%         | 91%         | 71%                | 77%
Conditional Auto. | Systematic                    | 68%         | 57%         | 41%                | 77%
Conditional Auto. | Systematic + 2 Stars          | 78%         | 58%         | 47%                | 84%
Conditional Auto. | Systematic + 4 Stars          | 80%         | 56%         | 47%                | 85%

Simulation Conclusions - Design
• Two star clusters improved small-scale features of variogram
• Two star clusters improved prediction accuracy
• Four star clusters offered little improvement over two stars

Simulation Conclusions - Models
• Variogram model affects predictions
• Kriging tends toward overall mean probability of presence, i.e. it smooths
• Kriging builds patches whose diameter is approximately the range of the variogram
• Conditional autoregressive model attempts to connect observed presence
• Neither model had consistently higher sensitivity or specificity

Future Work
• Further simulation studies on two-stage design
  – Effect of sample size
  – Number of star clusters necessary to improve variogram estimation
  – Effect of size of star clusters
  – Bias from adaptive second-stage sampling
  – Advantages of indicator kriging and conditional autoregressive model
  – Sensitivity of conditional autoregressive model to initial values, prior distributions, and grid size
  – Sensitivity of kriging to variogram model specification

Future Work
• Apply two-stage sample design to real data
  – DDT data from Santa Monica Bay, CA
  – EMAP data and local monitoring data
• Freely distribute functions for applying the
conditional autoregressive model on a hexagon lattice
  – Functions in R to produce hexagon lattice input for WinBUGS
  – File in WinBUGS to apply model
• Investigate optimal grid size to achieve EMAP and spatial modeling goals

Systematic (EMAP) Grid Based on Variogram Model
• Kriging variance

$$\sigma^2_{OK}(u) = C(0) - \sum_{\alpha=1}^{n(u)} \lambda^{OK}_{\alpha}(u)\, C(u_\alpha - u) - \mu_{OK}(u)$$

where
  $C(0)$ is the covariance at distance 0
  $\lambda^{OK}_{\alpha}(u)$ is the kriging weight
  $C(u_\alpha - u)$ is the distance-dependent covariance term
  $\mu_{OK}(u)$ is the Lagrange parameter of the ordinary kriging system

• Analog for conditional autoregressive model

$$\lambda^{AL}_{\alpha}(u) = \begin{cases} \dfrac{1}{n(u)} & u_\alpha \text{ in neighborhood of } u \\ 0 & u_\alpha \text{ not in neighborhood of } u \end{cases}$$

Systematic (EMAP) Grid Based on Variogram Model
• Prediction variance is minimized by large covariance between prediction location and sample locations
• For kriging, grid refers to sample locations
• For conditional autoregressive, grid refers to sample locations and prediction locations
• Want sample locations “close” together
  – Samples too far apart =>
    • Kriging -> correctly uses no spatial relationship
    • Conditional autoregressive -> incorrectly uses assumed spatial relationship
  – Samples too close together => waste of resources
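
The trade-off between grid spacing and kriging variance can be sketched numerically. The example below is a minimal illustration, not the presenters' procedure: it assumes a square 5 x 5 block of sample sites (the presentation uses a hexagon lattice), the spherical model from the simulation (range 22, sill 0.4, nugget 0) expressed as a covariance, and a prediction location at the center of a grid cell. The function names are hypothetical. It solves the ordinary kriging system in covariance form and evaluates C(0) minus the weighted covariances minus the Lagrange parameter.

```python
import numpy as np

def spherical_cov(h, sill=0.4, rng=22.0):
    """Spherical covariance with zero nugget: C(h) = sill - gamma(h), zero beyond the range."""
    h = np.asarray(h, dtype=float)
    c = sill * (1.0 - 1.5 * h / rng + 0.5 * (h / rng) ** 3)
    return np.where(h < rng, c, 0.0)

def ok_variance(samples, target, cov=spherical_cov):
    """Ordinary kriging variance at `target` from sample coordinates `samples`."""
    n = len(samples)
    d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    # Kriging system: covariances between samples, bordered by the unbiasedness constraint
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = cov(d)
    A[n, n] = 0.0
    c0 = cov(np.linalg.norm(samples - target, axis=1))
    sol = np.linalg.solve(A, np.append(c0, 1.0))
    lam, mu = sol[:n], sol[n]  # kriging weights and Lagrange parameter
    return float(cov(0.0) - lam @ c0 - mu)

def cell_center_variance(spacing, half_width=2):
    """Kriging variance at the center of a grid cell for a square grid (assumed layout)."""
    k = np.arange(-half_width, half_width + 1) * spacing
    gx, gy = np.meshgrid(k, k)
    samples = np.column_stack([gx.ravel(), gy.ravel()])
    target = np.array([spacing / 2.0, spacing / 2.0])
    return ok_variance(samples, target)

for spacing in (5, 10, 15, 20, 30):
    print(spacing, round(cell_center_variance(spacing), 3))
```

Under these assumptions the variance grows with spacing and approaches the sill once neighboring sites are farther apart than the variogram range, which corresponds to the "samples too far apart" case; very small spacings keep the variance low at the cost of many more sites.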