Predicting Median Substrate for Oregon and Washington EMAP sites Utilizing GIS data

Julia J. Smith
December 12, 2005
Why Predict Median Substrate?
Indicator of overall stream health
• Bed load transport
• Stream Power
• Microinvertebrate habitat
• Fish habitat
• How human development is affecting a stream
What is LD50?
LD50 is a measure of median substrate.
• Geometric mean of class boundaries
• Log10 of the geometric means
• Several samples at each site
• LD50 is the median value of log10(geometric mean of class)
Substrate Classifications
Substrate class    Size (mm)    Geometric mean    Log10 of geom. mean
Bedrock            8000-4000    5656.85            3.7527
Boulders           4000-250     1000.00            3.0000
Cobbles            250-64       126.49             2.1020
Gravel (coarse)    64-16        32.00              1.5052
Gravel (fine)      16-2         5.66               0.7526
Sand               2-.06        0.35              -0.4604
Fines              .06-.001     0.00775           -2.1109
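The LD50 calculation described above can be sketched in a few lines of Python. This is only an illustration, assuming the class boundaries from the table; the dictionary keys, function names, and the five-sample site are hypothetical and not data from the study.

```python
import math
from statistics import median

# Class boundaries in mm (lower, upper), taken from the substrate table.
SUBSTRATE_CLASSES = {
    "bedrock": (4000, 8000),
    "boulders": (250, 4000),
    "cobbles": (64, 250),
    "gravel_coarse": (16, 64),
    "gravel_fine": (2, 16),
    "sand": (0.06, 2),
    "fines": (0.001, 0.06),
}


def log10_geometric_mean(lower_mm, upper_mm):
    """Log10 of the geometric mean of the two class boundaries."""
    return math.log10(math.sqrt(lower_mm * upper_mm))


def ld50(observed_classes):
    """Median of log10(geometric mean of class) over the samples at one site."""
    scores = [log10_geometric_mean(*SUBSTRATE_CLASSES[c]) for c in observed_classes]
    return median(scores)


# Hypothetical site with five substrate samples (illustrative only).
site_samples = ["sand", "gravel_fine", "gravel_fine", "cobbles", "boulders"]
site_ld50 = ld50(site_samples)
```

With an even number of samples, `statistics.median` averages the two middle scores, which would be consistent with in-between values such as 0.146 or 1.129 appearing in the LD50 keys.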
Washington EPA Sites for LD50 Study
[Map: Washington sites, color-coded by LD50. Key values: -2.11, -0.46, 0.15, 0.75, 1.13, 1.51, 1.80, 2.10, 2.55, 3, 3.75]
Oregon EPA Sites for LD50 Study
[Map: Oregon sites, color-coded by LD50. Key values: -2.11, -1.29, -0.46, 0.15, 0.75, 1.13, 1.51, 1.80, 2.10, 3, 3.75]
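Geomorphic Metrics
$$D_{50} = \frac{\tau}{(\rho_s - \rho)\, g\, \tau^*_c} = \frac{\rho h S}{(\rho_s - \rho)\, \tau^*_c}$$
(the second form uses $\tau = \rho g h S$)
$\tau$ is the total bank-full shear stress
$\rho_s$ is the density of sediment
$\rho$ is fluid density
$g$ is gravitational acceleration
$h$ is bank-full depth
$S$ is channel slope
$\tau^*_c$ is critical shear stress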
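As a rough worked example of the D50 relation, the sketch below evaluates both forms and checks that they agree. The default sediment density (2650 kg/m^3), water density (1000 kg/m^3), and critical shear stress value (0.047) are assumptions chosen for illustration, not values stated in the presentation, and the function names are hypothetical.

```python
import math


def d50_from_depth_slope(depth_m, slope, rho_s=2650.0, rho=1000.0, tau_star_c=0.047):
    """D50 in meters from bank-full depth h and channel slope S.

    Implements D50 = rho * h * S / ((rho_s - rho) * tau*_c).
    Default densities and tau*_c are assumed illustrative values.
    """
    return rho * depth_m * slope / ((rho_s - rho) * tau_star_c)


def d50_from_shear_stress(tau, g=9.81, rho_s=2650.0, rho=1000.0, tau_star_c=0.047):
    """D50 in meters from total bank-full shear stress tau (Pa).

    Implements D50 = tau / ((rho_s - rho) * g * tau*_c).
    """
    return tau / ((rho_s - rho) * g * tau_star_c)


# Hypothetical reach: 1 m bank-full depth, 1% slope.
depth_m, slope = 1.0, 0.01
tau = 1000.0 * 9.81 * depth_m * slope  # tau = rho * g * h * S
d50_a = d50_from_depth_slope(depth_m, slope)
d50_b = d50_from_shear_stress(tau)
```

Both routes give the same grain size because the gravitational term cancels once the shear stress is written as the product of fluid density, gravity, depth, and slope.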
Geomorphic Metrics
[Figure: Distance-weighted stream power versus LD50 (x-axis: LD50; y-axis: distance-weighted stream power)]
r = 0.327, p-value = 2.63 × 10^-12
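Each scatterplot in this part of the talk reports a correlation r between a GIS metric and LD50. A minimal sketch of how such a coefficient could be computed is shown below, assuming r is the Pearson correlation; the presentation does not say which software or test produced the p-values. The data are hypothetical and the t statistic is the usual one for testing zero correlation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing zero correlation.

    Only defined for |r| < 1.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical metric values and LD50 scores for five sites (not study data).
stream_power = [0.01, 0.05, 0.10, 0.20, 0.25]
ld50_values = [-2.111, -0.46, 0.753, 1.505, 2.102]
r = pearson_r(stream_power, ld50_values)
t = t_statistic(r, len(ld50_values))
```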
Geomorphic Metrics
[Figure: Outlet link mean slope versus LD50 (x-axis: LD50; y-axis: slope)]
r = 0.214, p-value = 3.78 × 10^-6
Geologic Metrics
[Figure: Percent unconsolidated geologic type versus LD50 (x-axis: LD50; y-axis: percent unconsolidated)]
r = -0.246, p-value = 1.18 × 10^-7
Climatic Metrics
[Figure: Annual average precipitation versus LD50 (x-axis: LD50; y-axis: average annual precipitation)]
r = 0.199, p-value = 1.56 × 10^-6
Climatic Metrics
[Figure: Average annual potential evapotranspiration (mm) versus LD50 (x-axis: LD50; y-axis: average annual potential evapotranspiration)]
r = -0.046, p-value = 0.342
Land Cover Metrics
1. Developed
2. Barren
3. Forest
4. Grasses
5. Agriculture
6. Wetlands
7. Open water/perennial ice and snow
8. Shrubland
Land Cover Metrics
[Figure: Percentage of watershed that is forest versus LD50 (x-axis: LD50; y-axis: percent forest)]
r = 0.19, p-value = 3.516 × 10^-5
Distance-Weighted Metrics
$$\text{Weighted Area}_j = \frac{A_j\, e^{-\lambda d_j}}{\sum_{i=1}^{n} A_i\, e^{-\lambda d_i}}$$
$j$ represents the land cover type of concern,
$A_j$ represents the total area for land cover type $j$ in the watershed,
$\lambda$ represents the coefficient of exponential decay,
$d_j$ represents average distance from outlet for land cover of type $j$,
$n$ represents the total number of the land cover types
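The distance-weighted area can be illustrated with a short sketch. It assumes the weight for each land cover type is its area times a negative exponential of its average distance from the outlet, normalized over all types; the negative sign is inferred from the phrase "coefficient of exponential decay". The areas, distances, decay value, and function name are all hypothetical.

```python
import math


def distance_weighted_fractions(areas, distances, decay):
    """Distance-weighted area share for each land cover type.

    areas[j]     : total area of land cover type j in the watershed
    distances[j] : average distance from the outlet for type j
    decay        : coefficient of exponential decay (assumed non-negative)
    """
    weights = {j: areas[j] * math.exp(-decay * distances[j]) for j in areas}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


# Hypothetical watershed: areas in km^2, distances in km (illustrative only).
areas = {"forest": 60.0, "agriculture": 30.0, "developed": 10.0}
distances = {"forest": 12.0, "agriculture": 5.0, "developed": 1.0}
weighted = distance_weighted_fractions(areas, distances, decay=0.1)
unweighted = distance_weighted_fractions(areas, distances, decay=0.0)
```

Setting the decay coefficient to zero recovers plain area fractions, while a positive value gives more weight to land cover close to the outlet, here the developed land.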
Additional Land Cover Metrics
• Buffered metrics: computed within a fixed distance of the stream (30 meters, 100 meters, 300 meters)
• Buffered and distance-weighted metrics
Goals
• Predict LD50 without visiting sites
• Small number of predictors for scientifically sensible model
Methods: Stepwise Variable Selection
• Multiple Linear Regression
  • Top-in-tier models
  • Top geomorphic models plus one from each of the remaining tiers
Akaike’s Information Criterion
$$\mathrm{AIC} = N \log\left(\frac{\mathrm{RSS}}{N}\right) + 2(p + 2)$$
$N$ is the number of observations
$p$ is the number of predictors
RSS is the sum of squared residuals
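A small sketch of the criterion follows, assuming the logarithm is the natural log and that the penalty counts p predictors plus two further parameters. The RSS values and model sizes in the comparison are made up purely to show how the penalty works.

```python
import math


def aic(rss, n_obs, n_predictors):
    """AIC = N * log(RSS / N) + 2 * (p + 2), with a natural logarithm (assumed)."""
    return n_obs * math.log(rss / n_obs) + 2 * (n_predictors + 2)


# Hypothetical comparison on N = 400 sites (values are illustrative only):
# a 4-predictor model versus a 9-predictor model with slightly lower RSS.
aic_small = aic(rss=520.0, n_obs=400, n_predictors=4)
aic_large = aic(rss=515.0, n_obs=400, n_predictors=9)
prefer_small = aic_small < aic_large
```

In this made-up case the drop in RSS is too small to offset the penalty for five extra predictors, so the smaller model has the lower AIC.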
AIC in Stepwise Variable Selection
Forward stepwise selection (used to choose the top predictor from each tier):
1. Start with the intercept-only model.
2. Choose the variable that reduces AIC the most and include it in the model.
Stepwise selection in both directions (used to choose all top geomorphic predictors):
1. Start with the full model.
2. Add and remove variables until the model with minimum AIC is found or iteration stops.
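The forward step for picking the top predictor in a tier could look roughly like the sketch below: fit the intercept-only model, try each candidate on its own, and keep the one with the lowest AIC if it beats the intercept model. This is a simplified pure-Python illustration and not the software used in the study; the predictor names and data are hypothetical.

```python
import math


def aic(rss, n_obs, n_predictors):
    """AIC = N * log(RSS / N) + 2 * (p + 2)."""
    return n_obs * math.log(rss / n_obs) + 2 * (n_predictors + 2)


def rss_intercept_only(y):
    """Residual sum of squares when predicting every site by the mean."""
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)


def rss_simple_regression(x, y):
    """Residual sum of squares of the least-squares line y ~ x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    return syy - sxy ** 2 / sxx


def best_first_predictor(candidates, y):
    """Return (name, AIC) of the single predictor that lowers AIC the most.

    The name is None when no candidate improves on the intercept-only model.
    """
    n = len(y)
    best_name, best_aic = None, aic(rss_intercept_only(y), n, 0)
    for name, x in candidates.items():
        candidate_aic = aic(rss_simple_regression(x, y), n, 1)
        if candidate_aic < best_aic:
            best_name, best_aic = name, candidate_aic
    return best_name, best_aic


# Hypothetical tier with two candidate predictors for six sites.
ld50_values = [-2.1, -0.5, 0.8, 1.5, 2.1, 3.0]
tier = {
    "stream_power": [0.01, 0.04, 0.09, 0.15, 0.20, 0.28],
    "pct_forest": [0.9, 0.2, 0.6, 0.4, 0.8, 0.5],
}
chosen, chosen_aic = best_first_predictor(tier, ld50_values)
```

Selection in both directions repeats the same comparison at every iteration, also testing whether dropping a variable already in the model lowers AIC.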
Methods: CART
Classification and Regression Trees
[Figure: regression tree for LD50. Root split: DWSP2 < 0.03129. Other splits: snow_jan < 190.6, prcp_may < 46.6, link_sa4 < 0.08306, MENTR >= 20.35, b30_l11 < 0.003034, r8_l80_A >= 0.0917, b100_l51 < 0.004057, prcp_sep < 19.05, prcp_jan < 47.49, min_elev >= 1025, avgt_jun >= 12.58, mint_apr >= 2.647, b30_r7_l30 >= 0.01239. Terminal-node predictions range from -1.66 to 2.01.]
Methods: CART
Classification and Regression Trees
Predicted Response:
$$\hat{y}(x_i) = \sum_{j=1}^{q} \hat{a}_j\, \mathbf{1}\{x_i \in N_j\}$$
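The predicted response of a regression tree is a constant per terminal node: each site falls in exactly one node, and the node's fitted value is the prediction. The sketch below illustrates this with a single split, assuming the fitted value for a node is the mean response of the training sites in it. The threshold reuses the root split shown in the tree figure (DWSP2 < 0.03129); the node labels, data, and function names are hypothetical.

```python
from collections import defaultdict


def fit_node_means(node_ids, responses):
    """Fitted value for each terminal node: mean response of its sites."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for node, y in zip(node_ids, responses):
        sums[node] += y
        counts[node] += 1
    return {node: sums[node] / counts[node] for node in sums}


def assign_node(dwsp2, threshold=0.03129):
    """One-split tree: sites below the threshold go to N1, the rest to N2."""
    return "N1" if dwsp2 < threshold else "N2"


def predict(dwsp2, node_means):
    """Prediction is the fitted value of the single node containing the site."""
    return node_means[assign_node(dwsp2)]


# Hypothetical training sites (not study data).
dwsp2_values = [0.01, 0.02, 0.05, 0.10]
ld50_values = [-0.46, 0.75, 1.51, 2.10]
node_means = fit_node_means([assign_node(v) for v in dwsp2_values], ld50_values)
prediction_low = predict(0.015, node_means)
prediction_high = predict(0.20, node_means)
```

Because every site in a node receives the same prediction, sites with quite different observed substrate sizes can end up with one predicted value.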
Hybrid of Multiple Linear Regression and CART
• Utilize CART on the residuals
• Add indicator variables to the multiple linear regression equation, one for each terminal node of the tree except one (number of terminal nodes minus one)
• Create new multiple regression model with the original variables and the indicator variables
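The indicator-variable step can be sketched as follows: given the terminal node each site falls in when CART is run on the regression residuals, build one 0/1 column per node while leaving one node out as the baseline. Leaving one out avoids perfect collinearity with the intercept. The node labels and helper name are hypothetical, and which node serves as baseline is an arbitrary choice here.

```python
def node_indicator_columns(node_ids):
    """Build 0/1 indicator columns for all terminal nodes except one.

    Returns (kept_labels, rows), where rows[i][k] is 1 if site i falls in
    node kept_labels[k]. The first label in sorted order is the baseline
    and gets no column.
    """
    labels = sorted(set(node_ids))
    kept_labels = labels[1:]
    rows = [[1 if node == label else 0 for label in kept_labels] for node in node_ids]
    return kept_labels, rows


# Hypothetical residual-tree node for each of five sites.
site_nodes = ["A", "C", "B", "A", "C"]
indicator_names, indicator_rows = node_indicator_columns(site_nodes)
```

These columns would then be appended to the original predictors before refitting the multiple regression.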
Predictive-ability Statistics
$$\mathrm{PRESS}_p = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_{i(i)}\right)^2$$
$$R^2_{\text{prediction}} = 1 - \frac{\mathrm{PRESS}_p}{\mathrm{SSTO}}$$
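PRESS is a leave-one-out statistic: each site is predicted from a model fitted without that site, and the squared prediction errors are summed. The sketch below shows this for a one-predictor regression in pure Python, taking SSTO as the total sum of squares of the response about its mean. The data and function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press_statistic(x, y):
    """Sum of squared leave-one-out prediction errors."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model with observation i held out.
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (intercept + slope * x[i])) ** 2
    return press


def r2_prediction(x, y):
    """R^2_prediction = 1 - PRESS / SSTO."""
    mean_y = sum(y) / len(y)
    ssto = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press_statistic(x, y) / ssto


# Hypothetical predictor and LD50 values for six sites.
stream_power = [0.01, 0.04, 0.09, 0.15, 0.20, 0.28]
ld50_values = [-2.1, -0.5, 0.8, 1.5, 2.1, 3.0]
press_value = press_statistic(stream_power, ld50_values)
r2_pred = r2_prediction(stream_power, ld50_values)
```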
Analysis Comparison – Top 4-tier Models
• Problems with top 4-tier models
  • Low adjusted R²
  • Low predictive ability
  • Over-prediction and under-prediction of fine and bedrock substrate
  • Non-normal residuals
• Benefit of top 4-tier models
  • Small number of predictors
Example of Non-normality of Residuals
Top 4-Tier Model
[Figure: Normal Q-Q plot of residuals (x-axis: theoretical quantiles; y-axis: sample quantiles)]
Analysis Comparison – Geomorphic plus Top 3-Tier Models
• Problems with top geomorphic plus top 3-tier model
  • Increase in number of variables
  • Predictive ability still low
  • Over-prediction and under-prediction of fine and bedrock substrate
  • Some collinearity between variables
• Benefits of top geomorphic plus top 3-tier model
  • Improved predictions
  • Improved normality of residuals
Comparison of Analysis – CART
• Problems with CART
  • Low predictive ability
  • Predicts several observed substrate sizes in one node
  • Over-prediction and under-prediction of fines and bedrock substrate
  • Omitting one site creates a different tree
• Benefits of CART
  • Simple analysis
  • Missing variables not an issue
CART Predictions
[Figure: LD50 CART predictions versus observed LD50 values (x-axis: observed LD50; y-axis: CART predictions)]
Comparison of Analysis – Hybrids
• Problems with hybrid models
  • Increased number of variables
  • Collinearity with introduction of node indicator variables
  • Non-normal residuals
• Benefits of hybrid models
  • Residuals closer to normal
  • Increased predictive ability
  • Explains some of the variation created by fitting a linear model to ordinal data
Most promising multiple regression prediction model: Geomorphic plus top 3-tier
Response    Adjusted R²    PRESSp for LD50    MSPR     R² prediction
LD50        0.362          504.802            1.274    0.319
One example: Residual Tree for Hybrid Geomorphic plus Top 3-Tier Model
[Figure: regression tree fitted to the residuals. Root split: slp_elon < 0.3566. Other splits: out_sa < 0.008686, link_slope >= 0.002764, CVENTR >= 0.1489, topo_wet >= 8.152, out_sa >= 0.004734, shed_slp >= 14.97, link_sa < 0.0431, link_sa >= 0.08093, b30_r5_l42 >= 0.929, b30_r5_l42 < 0.5441, CVCON >= 0.4208, avgt_jun < 12.32, CVCON >= 0.4342, b30_r5_l42 >= 0.759, slp_elon < 0.5467, MENTB >= 15.63. Terminal-node values range from -1.1 to 0.8367.]
One example: Observed vs. Predicted for Hybrid Geomorphic plus Top 3-Tier Model
[Figure: Plot of predictions against observed LD50 (x-axis: observed LD50 values; y-axis: cross-validation LD50 predictions)]
QQ-Plot of Residuals for Hybrid Model
[Figure: Normal Q-Q plot (x-axis: theoretical quantiles; y-axis: sample quantiles)]
Coast Range Ecoregion
• Less skewed distribution of LD50
• No measurements are outliers
• Similar ecosystem throughout region
Ecoregion Distributions
[Figure: distribution of LD50 (x-axis, -3 to 3) by level 3 ecoregion: Willamette Valley, Snake River Plain, Puget Lowland, Northern Rockies, Northern Basin and Range, North Cascades, Klamath Mountains, Eastern Cascades Slopes and Foothills, Columbia Plateau, Colorado Plateau, Coast Range, Cascades, Blue Mountains]
Coast Range EMAP Sites
[Map: Coast Range sites, color-coded by LD50. Key values: -2.11, -1.29, -0.46, 0.75, 1.13, 1.51, 1.80, 2.10, 3, 3.75]
Top 4-Tier Coast Range Model
• Predictors
  • Average aspect (climatic)
  • Average watershed elevation (geomorphic)
  • % watershed as volcanic geologic type (geologic)
  • % wetlands (distance weighted and buffered)
QQ-Plot: Top 4-Tier Coast Range
[Figure: Normal Q-Q plot (x-axis: theoretical quantiles; y-axis: sample quantiles)]
Observed versus Predicted: Top 4-Tier Coast Range Model
[Figure: x-axis: observed LD50; y-axis: cross-validated LD50 predictions]
Coast Range Model
Top Geomorphic Variables
1. Average watershed elevation (m)
2. Drainage density
3. Mean slope within a 300-meter buffer
4. Ratio of width of stream to width of floodplain
5. Coefficient of average hill connectivity
6. Distance to the first tributary (m)
7. Percent of landscape with less than 4% slope
8. Percent of landscape with less than 7% slope
9. Measure of size and complexity of river
10. Percent of stream as cascade
11. Distance-weighted stream power
12. Watershed relief divided by its length
QQ-Plot: Coast Range Geomorphic plus Top 3-Tier Model
[Figure: Normal Q-Q plot (x-axis: theoretical quantiles; y-axis: sample quantiles)]
Observed versus Predicted: Coast Range Geomorphic + Top 3-Tier
[Figure: x-axis: observed LD50; y-axis: cross-validation LD50 predictions]
CART - Coast Range Ecoregion
[Figure: Predictions versus observed LD50 (x-axis: observed LD50 values; y-axis: CART predicted LD50 values)]
Coast Range: Hybrid Models
• Benefits of hybrid
  • Improved prediction
  • Improved fit
  • Improved normality of residuals
• Problems with hybrid
  • Increased number of predictors
  • Collinearity with node indicator variables
QQ-Plot: Coast Range Hybrid Top 4-Tier
[Figure: Normal Q-Q plot (x-axis: theoretical quantiles; y-axis: sample quantiles)]
Observed versus Predicted: Coast Range Hybrid Top 4-Tier
[Figure: x-axis: observed LD50 values; y-axis: cross-validation LD50 predictions]
QQ-Plot: Coast Range Hybrid Geomorphic plus Top 3-Tier
[Figure: Normal Q-Q plot (x-axis: theoretical quantiles; y-axis: sample quantiles)]
Observed versus Predicted: Coast Hybrid Geomorphic plus Top 3-Tier
[Figure: x-axis: observed LD50; y-axis: cross-validation LD50 predictions]
Comparison of Coast Models
Model                           Adjusted R²    R² prediction
Top 4-tier                      0.384          0.362
Geomorphic plus top-3           0.548          0.495
CART                            NA             0.087
Top 4-tier hybrid               0.552          0.503
Geomorphic plus top-3 hybrid    0.700          0.614
Conclusions
• LD50 is difficult to predict
• Additional geomorphic predictors increase prediction ability
• Hybrid models increase prediction ability
• More success in Coast Range Ecoregion
Future Work
• Logistic Regression
  • Ordinal data treated as continuous in this study
  • 12 categories might require more sophisticated methods
• Spatial Analysis
  • Appears to be spatial correlation in distribution of LD50