Predicting Median Substrate for Oregon and Washington EMAP Sites Utilizing GIS Data
Julia J. Smith
December 12, 2005

Why Predict Median Substrate?
Indicator of overall stream health
• Bed load transport
• Stream power
• Microinvertebrate habitat
• Fish habitat
• How human development is affecting a stream

What is LD50?
LD50 is a measure of median substrate.
• Geometric mean of class boundaries
• Log10 of the geometric means
• Several samples at each site
• LD50 is the median value of log10(geometric mean of class)

Substrate Classifications
Substrate Class    Size (mm)    Geometric mean    Log10 of geom. mean
Bedrock            8000-4000    5656.85            3.7527
Boulders           4000-250     1000.00            3.0000
Cobbles            250-64       126.49             2.1020
Gravel (coarse)    64-16        32.00              1.5052
Gravel (fine)      16-2         5.66               0.7526
Sand               2-.06        0.35              -0.4604
Fines              .06-.001     0.00775           -2.1109

Washington EPA Sites for LD50 Study
Map: site locations colored by LD50. LD50 key: -2.11, -0.46, 0.15, 0.75, 1.13, 1.51, 1.80, 2.10, 2.55, 3, 3.75

Oregon EPA Sites for LD50 Study
Map: site locations colored by LD50. LD50 key: -2.11, -1.29, -0.46, 0.15, 0.75, 1.13, 1.51, 1.80, 2.10, 3, 3.75

Geomorphic Metrics
$$D_{50} = \frac{\tau}{(\rho_s - \rho)\, g\, \tau^*_c} = \frac{\rho h S}{(\rho_s - \rho)\, \tau^*_c}$$
• $\tau$ is the total bank-full shear stress
• $\rho_s$ is the density of sediment
• $\rho$ is fluid density
• $g$ is gravitational acceleration
• $h$ is bank-full depth
• $S$ is channel slope
• $\tau^*_c$ is critical shear stress
Figure: Distance-weighted stream power versus LD50 (r = 0.327, p-value = 2.63 × 10^-12)
Figure: Outlet link mean slope versus LD50 (r = 0.214, p-value = 3.78 × 10^-6)

Geologic Metrics
Figure: Percent unconsolidated geologic type versus LD50 (r = -0.246, p-value = 1.18 × 10^-7)

Climatic Metrics
Figure: Annual average precipitation
versus LD50 (r = 0.199, p-value = 1.56 × 10^-6)
Figure: Average annual potential evapotranspiration (mm) versus LD50 (r = -0.046, p-value = 0.342)

Land Cover Metrics
1. Developed
2. Barren
3. Forest
4. Grasses
5. Agriculture
6. Wetlands
7. Open water/perennial ice and snow
8. Shrubland
Figure: Percentage of watershed that is forest versus LD50 (r = 0.19, p-value = 3.516 × 10^-5)

Distance-Weighted Metrics
$$\text{Weighted Area}_j = \frac{A_j\, e^{-\alpha d_j}}{\sum_{i=1}^{n} A_i\, e^{-\alpha d_i}}$$
• $j$ represents the land cover type of concern
• $A_j$ represents the total area for land cover type $j$ in the watershed
• $\alpha$ represents the coefficient of exponential decay
• $d_j$ represents average distance from outlet for land cover of type $j$
• $n$ represents the total number of land cover types

Additional Land Cover Metrics
• Buffered metrics: computed within a fixed distance of the stream (30 meters, 100 meters, 300 meters)
• Buffered and distance-weighted metrics

Goals
• Predict LD50 without visiting sites
• Small number of predictors for a scientifically sensible model

Methods: Stepwise Variable Selection
Multiple linear regression
• Top-in-tier models
• Top geomorphic models plus one from each of the remaining tiers

Akaike's Information Criterion
$$\mathrm{AIC} = N \log\left(\frac{\mathrm{RSS}}{N}\right) + 2(p + 2)$$
• $N$ observations
• $p$ predictors
• RSS is the sum of squared residuals

AIC in Stepwise Variable Selection
Forward stepwise selection: method for choosing the top predictor from each tier
1. Start with the intercept model.
2. Choose the variable that reduces AIC the most and include it in the model.
Stepwise selection in both directions: method chosen for choosing all top geomorphic predictors
1. Start with the full model.
2.
Add and subtract variables until the model with minimum AIC is found or iteration stops.

Methods: CART
Classification and Regression Trees
Figure: example regression tree with 15 terminal nodes (predicted values from -1.66 to 2.01). Root split: DWSP2 < 0.03129. Other splits shown: snow_jan < 190.6, prcp_may < 46.6, link_sa4 < 0.08306, MENTR >= 20.35, b30_l11 < 0.003034, r8_l80_A >= 0.0917, b100_l51 < 0.004057, prcp_sep < 19.05, prcp_jan < 47.49, min_elev >= 1025, avgt_jun >= 12.58, mint_apr >= 2.647, b30_r7_l30 >= 0.01239
Predicted response:
$$\hat{y}(x_i) = \sum_{j=1}^{q} \hat{a}_j \, \mathbf{1}\{x_i \in N_j\}$$
where $N_1, \dots, N_q$ are the terminal nodes of the tree and $\hat{a}_j$ is the fitted value in node $N_j$.

Hybrid of Multiple Linear Regression and CART
• Utilize CART on the residuals of the multiple linear regression
• Add node indicator variables to the multiple linear regression equation, one for each terminal node of the tree except one (number of terminal nodes minus one)
• Create a new multiple regression model with the original variables and the indicator variables

Predictive-ability Statistics
$$\mathrm{PRESS}_p = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_{i(i)}\right)^2$$
$$R^2_{\text{prediction}} = 1 - \frac{\mathrm{PRESS}_p}{\mathrm{SSTO}}$$
where $\hat{Y}_{i(i)}$ is the prediction for observation $i$ from the model fitted without observation $i$, and SSTO is the total sum of squares.

Analysis Comparison: Top 4-Tier Models
Problems with top 4-tier models
• Low adjusted R²
• Low predictive ability
• Over-prediction and under-prediction of fine and bedrock substrate
• Non-normal residuals
Benefit of top 4-tier models
• Small number of predictors

Example of Non-normality of Residuals: Top 4-Tier Model
Figure: Normal Q-Q plot of residuals (sample quantiles versus theoretical quantiles)

Analysis Comparison: Geomorphic plus Top 3-Tier Models
Problems with top geomorphic plus top 3-tier model
• Increase in number of variables
• Predictive ability still low
• Over-prediction and under-prediction of fine and bedrock substrate
• Some collinearity between variables
Benefits of top geomorphic plus top 3-tier model
• Improved predictions
• Improved normality of residuals

Comparison of Analysis: CART
Problems with CART
• Low predictive ability
• Predicts several observed substrate sizes in one node
• Over-prediction and under-prediction of fines and bedrock substrate
• Omitting one site creates a different tree
Benefits of
CART
• Simple analysis
• Missing variables not an issue

CART Predictions
Figure: LD50 CART predictions versus observed LD50 values

Comparison of Analysis: Hybrids
Problems with hybrid models
• Increased number of variables
• Collinearity with introduction of node indicator variables
• Non-normal residuals
Benefits of hybrid models
• Residuals closer to normal
• Increased predictive ability
• Explains some of the variation created by fitting a linear model to ordinal data

One example: Residual Tree for Hybrid Geomorphic plus Top 3-Tier Model
Most promising multiple regression prediction model: Geomorphic plus top 3-tier
Response    Adjusted R²    PRESSp     MSPR     R² prediction
LD50        0.362          504.802    1.274    0.319
Figure: residual tree with 18 terminal nodes (predicted residuals from -1.1 to 0.8367). Root split: slp_elon < 0.3566. Other splits shown: out_sa < 0.008686, link_slope >= 0.002764, CVENTR >= 0.1489, topo_wet >= 8.152, out_sa >= 0.004734, shed_slp >= 14.97, link_sa < 0.0431, link_sa >= 0.08093, b30_r5_l42 >= 0.929, b30_r5_l42 < 0.5441, CVCON >= 0.4208, avgt_jun < 12.32, CVCON >= 0.4342, b30_r5_l42 >= 0.759, slp_elon < 0.5467, MENTB >= 15.63

One example: Observed vs.
Predicted for Hybrid Geomorphic plus Top 3-Tier Model
Figure: Cross-validation LD50 predictions plotted against observed LD50 values

QQ-Plot of Residuals for Hybrid Model
Figure: Normal Q-Q plot of residuals (sample quantiles versus theoretical quantiles)

Coast Range Ecoregion
• Less skewed distribution of LD50
• No measurements are outliers
• Similar ecosystem throughout region

Ecoregion Distributions
Figure: Distribution of LD50 by level 3 ecoregion: Blue Mountains, Cascades, Coast Range, Colorado Plateau, Columbia Plateau, Eastern Cascades Slopes and Foothills, Klamath Mountains, North Cascades, Northern Basin and Range, Northern Rockies, Puget Lowland, Snake River Plain, Willamette Valley

Coast Range EMAP Sites
Map: site locations colored by LD50. LD50 key: -2.11, -1.29, -0.46, 0.75, 1.13, 1.51, 1.80, 2.10, 3, 3.75

Top 4-Tier Coast Range Model Predictors
• Average aspect (climatic)
• Average watershed elevation (geomorphic)
• % watershed as volcanic geologic type (geologic)
• % wetlands (distance weighted and buffered)

QQ-Plot: Top 4-Tier Coast Range
Figure: Normal Q-Q plot of residuals (sample quantiles versus theoretical quantiles)

Observed versus Predicted: Top 4-Tier Coast Range Model
Figure: Cross-validated LD50 predictions versus observed LD50

Coast Range Model Top Geomorphic Variables
1. Average watershed elevation (m)
2. Drainage density
3. Mean slope within a 300-meter buffer
4. Ratio of width of stream to width of floodplain
5. Coefficient of average hill connectivity
6. Distance to the first tributary (m)
7. Percent of landscape with less than 4% slope
8. Percent of landscape with less than 7% slope
9. Measure of size and complexity of river
10. Percent of stream as cascade
11. Distance-weighted stream power
12. Watershed relief divided by its length

QQ-Plot: Coast Range Geomorphic plus Top 3-Tier Model
Figure: Normal Q-Q plot of residuals (sample quantiles versus theoretical quantiles)

Observed versus Predicted: Coast Range Geomorphic plus Top 3-Tier
Figure: Cross-validation LD50 predictions versus observed LD50

CART: Coast Range Ecoregion
Figure: CART predicted LD50 values versus observed LD50 values

Coast Range: Hybrid Models
Benefits of hybrid
• Improved prediction
• Improved fit
• Improved normality of residuals
Problems with hybrid
• Increased number of predictors
• Collinearity with node indicator variables

QQ-Plot: Coast Range Hybrid Top 4-Tier
Figure: Normal Q-Q plot of residuals (sample quantiles versus theoretical quantiles)

Observed versus Predicted: Coast Range Hybrid Top 4-Tier
Figure: Cross-validation LD50 predictions versus observed LD50 values

QQ-Plot: Coast Range Hybrid Geomorphic plus Top 3-Tier
Figure: Normal Q-Q plot of residuals (sample quantiles versus theoretical quantiles)

Observed versus Predicted: Coast Range Hybrid Geomorphic plus Top 3-Tier
Figure: Cross-validation LD50 predictions versus observed LD50

Comparison of Coast Models
Model                           Adjusted R²    R² prediction
Top 4-tier                      0.384          0.362
Geomorphic plus top-3           0.548          0.495
CART                            NA             0.087
Top 4-tier hybrid               0.552          0.503
Geomorphic plus top-3 hybrid    0.700          0.614

Conclusions
• LD50 is difficult to predict
• Additional geomorphic predictors increase prediction ability
• Hybrid models increase prediction ability
• More success in Coast Range Ecoregion

Future Work
Logistic Regression
• Ordinal data were treated as continuous in this study
• 12 categories might require more sophisticated methods
Spatial Analysis
• There appears to be spatial correlation in the distribution of LD50
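
Supplementary sketch: LD50 computation
The LD50 definition from the "What is LD50?" slide (median of the log10 geometric means of the substrate classes sampled at a site) can be illustrated in a few lines of Python. The class boundaries come from the Substrate Classifications table; the function names, the dictionary layout, and the example site are hypothetical and are not the author's code.

```python
import math
from statistics import median

# Class boundaries in mm, copied from the Substrate Classifications table.
SUBSTRATE_CLASSES = {
    "bedrock": (4000.0, 8000.0),
    "boulders": (250.0, 4000.0),
    "cobbles": (64.0, 250.0),
    "coarse gravel": (16.0, 64.0),
    "fine gravel": (2.0, 16.0),
    "sand": (0.06, 2.0),
    "fines": (0.001, 0.06),
}


def log10_geometric_mean(lower, upper):
    """Log10 of the geometric mean of the two class boundaries."""
    return math.log10(math.sqrt(lower * upper))


def ld50(samples):
    """Median of the per-sample log10 geometric means at one site."""
    return median(log10_geometric_mean(*SUBSTRATE_CLASSES[s]) for s in samples)


# Hypothetical site with five substrate samples.
site = ["sand", "cobbles", "cobbles", "fines", "boulders"]
print(f"{ld50(site):.3f}")  # prints 2.102
```

With an even number of samples the median averages the two middle class values; for example, a site with one sand and one fine gravel sample gives about 0.146, which appears consistent with intermediate key values such as 0.15 on the site maps.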
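
Supplementary sketch: AIC and choosing the top predictor in a tier
The AIC formula on the "Akaike's Information Criterion" slide and the forward step used to pick the top predictor from each tier can be sketched as follows. This is a minimal illustration under stated assumptions: the logarithm is taken to be the natural log, each candidate is fitted as a one-predictor least-squares line, and the function names, variable names, and toy data are all hypothetical.

```python
import math


def aic(rss, n, p):
    """AIC = N log(RSS / N) + 2 (p + 2), natural log assumed."""
    return n * math.log(rss / n) + 2 * (p + 2)


def simple_ols_rss(x, y):
    """Residual sum of squares of a least-squares line y = b0 + b1 * x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def top_predictor(candidates, y):
    """One forward step: start from the intercept model and return the
    candidate that lowers AIC the most (None if no candidate lowers it)."""
    n = len(y)
    mean_y = sum(y) / n
    best_name = None
    best_aic = aic(sum((yi - mean_y) ** 2 for yi in y), n, 0)
    for name, x in candidates.items():
        score = aic(simple_ols_rss(x, y), n, 1)
        if score < best_aic:
            best_name, best_aic = name, score
    return best_name, best_aic


# Toy data only; not values from the study.
response = [1.0, 2.0, 3.0, 4.0, 5.5]
tier = {
    "slope": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
name, score = top_predictor(tier, response)
print(name)  # prints slope
```

The bidirectional search used for the geomorphic predictors would repeat this kind of comparison while both adding and dropping variables, which needs a multiple-regression fit and is not shown here.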
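
Supplementary sketch: PRESS and R² prediction
The predictive-ability statistics from the "Predictive-ability Statistics" slide can be computed by refitting the model once per omitted observation. The sketch below assumes a single predictor and plain Python lists to stay self-contained; the study's models have several predictors, and the function names and data here are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(x, y):
    """Sum of squared leave-one-out prediction errors."""
    total = 0.0
    for i in range(len(y)):
        # Refit without observation i, then predict observation i.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


def r2_prediction(x, y):
    """R² prediction = 1 - PRESS / SSTO."""
    mean_y = sum(y) / len(y)
    ssto = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press(x, y) / ssto


# Toy data only; not values from the study.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [1.1, 1.9, 3.2, 3.8, 5.1]
print(round(r2_prediction(x_obs, y_obs), 3))
```

Because each prediction comes from a model that never saw the observation being predicted, R² prediction is typically lower than the ordinary or adjusted R², as in the comparison tables in this deck.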