EXPLORING SPATIAL CORRELATION IN RIVERS by Joshua French

Introduction
A city is required to extend its sewage pipelines farther into
its bay to meet EPA requirements.
How far should the pipelines be extended?
The city doesn’t want to spend more money than necessary on
the extension, so it needs a way to predict the waste levels
at different sites in the bay.
Usually we might try to interpolate the data using a
linear model, which assumes that observations
are independent.
For spatial data, however, we intuitively know that
response values at points close together should
be more similar than those at points separated by
a great distance.
We can use the correlation between sampling sites
to make better predictions with our model.
The Road Ahead
- Methods
  - Introduction to the Variogram
  - Exploratory Analysis
  - Sample Variogram
  - Modeling the Variogram
- Analysis
  - 3 types of results
- Conclusions
- Future Work
Introduction to the Variogram
Spatial data is often viewed as a stochastic
process.
For each point x, a specific property Z(x) is
viewed as a random variable with mean µ,
variance σ², higher-order moments, and a
cumulative distribution function.
Each individual Z(xi) is assumed to have its
own distribution, and the set
{Z(x1),Z(x2),…} is a stochastic process.
The data values in a given data set are
simply a realization of the stochastic
process.
For a spatial process, second-order
stationarity is often assumed.
Second-order stationarity implies that the
mean is the same everywhere: i.e.
E[Z(xj)]=µ for all points xj.
It also implies that Cov(Z(xj),Z(xk)) depends only on
the separation between xj and xk, not on where the points lie.
Thus,
Cov(Z(xj),Z(xk)) = Cov(Z(x),Z(x+h))
= Cov(h)
where h measures the distance between
two points.
Looking at the variance of differences:
Var[Z(x)-Z(x+h)] = E[ (Z(x)-Z(x+h))² ]
= 2 γ(h)
Assuming second-order stationarity,
γ(h)=Cov(0)-Cov(h).
γ(h) is known as the semi-variogram.
The plot of γ(h) on h is known as the
variogram.
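The identity γ(h) = Cov(0) - Cov(h) follows in a few lines from the definitions above. A short derivation, assuming only second-order stationarity (constant mean µ, covariance depending only on h), so that the difference Z(x)-Z(x+h) has mean zero and its variance equals E[(Z(x)-Z(x+h))²]:

```latex
\begin{align*}
\mathrm{Var}[Z(x) - Z(x+h)]
  &= \mathrm{Var}[Z(x)] + \mathrm{Var}[Z(x+h)] - 2\,\mathrm{Cov}(Z(x), Z(x+h)) \\
  &= \mathrm{Cov}(0) + \mathrm{Cov}(0) - 2\,\mathrm{Cov}(h) \\
  &= 2\,[\mathrm{Cov}(0) - \mathrm{Cov}(h)],
\end{align*}
so that $\gamma(h) = \tfrac{1}{2}\,\mathrm{Var}[Z(x) - Z(x+h)] = \mathrm{Cov}(0) - \mathrm{Cov}(h)$.
```

Here Cov(0) is simply the variance σ² of the process.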
Things to know about variograms:
1. γ(h)= γ(-h). Because it is an even
function, usually only positive lag
distances are shown.
2. Nugget effect – by definition, γ(0) = 0. In
practice, however, sample variograms
often have a positive value at lag 0. This
is called the “nugget effect”.
3. Monotonicity – variograms tend to increase
monotonically with lag distance.
4. Sill – the value at which the variogram
levels off (its maximum).
5. Range – the lag distance at which the sill
is reached. Observations separated by more
than this distance are not correlated.
The following figure shows these features.
[Figure: Variogram Example. Variance versus Lag Distance, with the nugget, sill, and range marked.]
Exploratory Analysis
The data studied is the longitudinal profile of
the Ohio River.
Instead of worrying about the river network
with streams, tributaries, and other factors,
we simply look at the Ohio River as a one-dimensional object.
The Ohio River
[Figure: Longitudinal Profile of the Ohio River Sampling Sites. Latitude (NAD27) versus Longitude (NAD27), with Pittsburgh, PA; Cincinnati, OH; Louisville, KY; and Cairo, IL labeled.]
Before we model variograms, we should explore
the data.
We need to make sure that the data analyzed
satisfies second-order stationarity.
If there is an obvious trend in the data, we should
remove it and analyze the residuals.
If the variance of the response increases or decreases
along the river, then we should transform the variable
to correct this.
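Removing a trend can be sketched as follows. This is a minimal illustration, assuming the trend is linear in position along the river and is fit by ordinary least squares; the function names and the data values are hypothetical, not taken from the ORSANCO data.

```python
def fit_linear_trend(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def remove_trend(x, y):
    """Return the residuals y - (a + b*x) from the fitted linear trend."""
    a, b = fit_linear_trend(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical positions along the river and responses (illustrative only).
river_mile = [0.0, 100.0, 200.0, 300.0, 400.0]
response = [10.0, 9.0, 8.5, 7.0, 6.5]
residuals = remove_trend(river_mile, response)
```

The residuals, rather than the raw responses, would then be carried into the variogram analysis.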
It is fairly easy to check for stationarity of this data
set using a scatter plot.
[Figure: scatter plot of Square Root of Percent Invertivore versus RMI.]
If the data contains outliers, we should do the
analysis both with and without the outliers
present.
If the skewness coefficient G1 > 1, then we should
transform the data to approximate normality if possible.
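A skewness check might look like the sketch below. It assumes G1 is the adjusted Fisher-Pearson sample skewness coefficient (the slides do not give the formula), and the counts are hypothetical values chosen only to show the effect of a log transformation.

```python
import math


def skewness_g1(values):
    """Adjusted Fisher-Pearson sample skewness (assumed definition of G1)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


# Hypothetical right-skewed counts and their natural logs.
counts = [1, 2, 2, 3, 50]
logged = [math.log(c) for c in counts]

skew_raw = skewness_g1(counts)
skew_log = skewness_g1(logged)  # smaller than skew_raw for these values
```

Comparing the coefficient before and after a candidate transformation is one simple way to judge whether the transformation helped.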
The Sample Variogram
One of the previous definitions of semivariance is:
$$\gamma(h) = \frac{1}{2} E\left[ (Z(x) - Z(x+h))^2 \right].$$
The logical estimator is:
$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{j=1}^{N(h)} \left[ z(x_j) - z(x_j + h) \right]^2$$
where N(h) is the number of pairs of observations
associated with that lag.
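The estimator can be sketched for sites along a one-dimensional river. Because sampling sites are rarely spaced at exact multiples of h, this sketch groups pairs into lag bins of equal width; the binning scheme, the function name, and the data are assumptions for illustration, not the procedure used in the analysis.

```python
def sample_variogram(positions, values, bin_width, n_bins):
    """Return a list of (mean_lag, gamma_hat, n_pairs) for each non-empty lag bin."""
    sq_diff_sums = [0.0] * n_bins
    lag_sums = [0.0] * n_bins
    pair_counts = [0] * n_bins
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            lag = abs(positions[i] - positions[j])
            k = int(lag // bin_width)  # index of the bin this pair falls in
            if k >= n_bins:
                continue  # pair is farther apart than the largest lag considered
            sq_diff_sums[k] += (values[i] - values[j]) ** 2
            lag_sums[k] += lag
            pair_counts[k] += 1
    result = []
    for k in range(n_bins):
        if pair_counts[k] == 0:
            continue
        mean_lag = lag_sums[k] / pair_counts[k]
        gamma_hat = sq_diff_sums[k] / (2 * pair_counts[k])  # 1 / (2 N(h)) * sum
        result.append((mean_lag, gamma_hat, pair_counts[k]))
    return result


# Hypothetical sites one unit apart with made-up responses.
positions = [0.0, 1.0, 2.0, 3.0]
values = [1.0, 2.0, 4.0, 7.0]
variogram = sample_variogram(positions, values, bin_width=1.0, n_bins=4)
```

Bins with few pairs give noisy estimates, so the pair count N(h) is returned alongside each value.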
[Figure: Sample Variogram Example. Variance versus Lag Distance.]
Modeling the Variogram
Our goal is to estimate the true variogram of
the data.
We fit four variogram models to the
sample variogram: the spherical,
Gaussian, exponential, and
Matern models.
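The model shapes can be sketched in a few lines. The parameterizations below are common textbook forms and are assumptions, since the slides do not state which forms were used: `nugget` is the jump at the origin, `psill` is the partial sill (sill minus nugget), and `a` is a range parameter. Only the spherical model reaches its sill exactly at `a`; the others approach it asymptotically. The general Matern model needs a modified Bessel function, so only its closed-form special case with smoothness 3/2 is shown.

```python
import math


def spherical(h, nugget, psill, a):
    """Spherical model: rises to the sill at h = a and is flat beyond it."""
    if h == 0:
        return 0.0  # gamma(0) = 0 by definition
    if h >= a:
        return nugget + psill
    r = h / a
    return nugget + psill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, psill, a):
    """Exponential model: approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-h / a))


def gaussian(h, nugget, psill, a):
    """Gaussian model: very smooth (parabolic) near the origin."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-((h / a) ** 2)))


def matern_three_halves(h, nugget, psill, a):
    """Matern model, special case with smoothness parameter 3/2."""
    if h == 0:
        return 0.0
    s = math.sqrt(3.0) * h / a
    return nugget + psill * (1.0 - (1.0 + s) * math.exp(-s))


# Evaluate each model on a few lags with illustrative parameters.
lags = [0.0, 0.5, 1.0, 2.0, 4.0]
curves = {
    model.__name__: [model(h, 0.1, 0.9, 2.0) for h in lags]
    for model in (spherical, exponential, gaussian, matern_three_halves)
}
```

With these parameters every curve starts at 0, jumps to about the nugget (0.1) for small positive lags, and levels off near the sill (1.0).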
[Figure: Variogram Models. Variance versus Lag Distance for the Exponential, Spherical, Gaussian, and Matern models.]
Analysis
The data analyzed is a set of particle size
and biological variables for the Ohio River.
The data was collected by the Ohio River
Valley Sanitation Commission, better
known as ORSANCO.
There were between 190 and 235 unique
sampling sites, depending on the variable.
ORSANCO data collection
The results of the analysis fell into three
main groups:
- Able to fit the sample variogram well
- Not able to fit the sample variogram well
- Analysis not reasonable
Good Results: Number of
Individuals at a site
After correcting for skewness by doing a log
transformation, there are a number of
outliers. We analyze the data both with
and without the outliers.
[Figure: log(Num Individuals) Sample Variogram with outliers. Variance versus Lag Distance (Mi).]
[Figure: log(Num Individuals) Sample Variogram without outliers. Variance versus Lag Distance (Mi).]
We were not able to model the sample
variogram perfectly, but we were able to
detect some amount of spatial correlation
in the data, especially when the outliers
were removed.
We are able to obtain reasonable estimates
of the nugget, sill, and range.
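Estimates like these come from fitting a variogram model to the sample variogram. The slides do not say which fitting method was used; the sketch below assumes an exponential model and an unweighted least-squares fit over a coarse grid of candidate parameters, with synthetic data standing in for a real sample variogram.

```python
import math
from itertools import product


def exponential_model(h, nugget, psill, a):
    """Exponential variogram for h > 0 (psill = sill minus nugget)."""
    return nugget + psill * (1.0 - math.exp(-h / a))


def fit_by_grid_search(lags, gammas, nuggets, psills, ranges):
    """Return (sse, nugget, psill, a) minimizing the sum of squared errors."""
    best = None
    for nugget, psill, a in product(nuggets, psills, ranges):
        sse = sum(
            (g - exponential_model(h, nugget, psill, a)) ** 2
            for h, g in zip(lags, gammas)
        )
        if best is None or sse < best[0]:
            best = (sse, nugget, psill, a)
    return best


# Synthetic "sample variogram" generated from known, made-up parameters.
lags = [10.0, 20.0, 40.0, 80.0, 160.0]
gammas = [exponential_model(h, 0.20, 0.07, 40.0) for h in lags]

best = fit_by_grid_search(
    lags,
    gammas,
    nuggets=[0.10, 0.20, 0.30],
    psills=[0.05, 0.07, 0.09],
    ranges=[20.0, 40.0, 60.0],
)
```

In practice a numerical optimizer, often with weights proportional to the number of pairs at each lag, would replace the grid search.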
Poor Results: Percent Sand
After doing exploratory spatial analysis and
removing a trend, we fit the sample
variogram of the percent sand residuals.
[Figure: Sample Variogram of percent sand residuals. Variance versus Lag Distance (Mi).]
The sample variogram does not really
increase monotonically with distance.
Our variogram models cannot fit this very
well.
Though we can obtain estimates of the
nugget, sill, and range, the estimates
cannot be trusted.
No Results: Percent Hardpan
This variable was so badly skewed that
analysis was not reasonable.
The skewness coefficient is 12.38. This is
extremely high.
[Figure: QQplot of Percent Hardpan. Percent Hardpan versus Quantiles of Standard Normal.]
[Figure: Scatter plot of Percent Hardpan. Percent Hardpan versus RMI.]
The data is nearly all zeros!
There is also an erroneous data value. A
percentage cannot be greater than 100%.
Data analysis does not seem reasonable.
Our data does not meet the conditions
necessary to use the spatial methods
discussed.
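Both problems can be caught with a quick screening step before any spatial analysis. A minimal sketch, with a hypothetical function name and made-up values that only mimic the pattern described above:

```python
def screen_percentages(values):
    """Report the share of zeros and any values outside the range 0 to 100."""
    n = len(values)
    n_zero = sum(1 for v in values if v == 0)
    out_of_range = [v for v in values if v < 0 or v > 100]
    return {"fraction_zero": n_zero / n, "out_of_range": out_of_range}


# Hypothetical percent values: mostly zeros plus one impossible entry.
hardpan_like = [0, 0, 0, 0, 0, 0, 0, 5, 0, 250]
report = screen_percentages(hardpan_like)
```

A high fraction of zeros or any out-of-range value is a signal to stop and examine the variable before modeling it.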
Conclusions
Able to fit sample variogram reasonably well
– percent gravel, number of individuals,
number of species
Not able to fit sample variogram well
– percent sand, percent detritivore, percent
simple lithophilic individuals, percent
invertivore
No results – remaining variables
Summary of Results
| Response | Transformation | Trend Removed | Model | Nugget | Sill | Range |
|---|---|---|---|---|---|---|
| Percent Gravel | | | Exponential | 286.09 | 335.53 | 72.9 miles |
| Percent Sand | | 38.1082 + 0.0330x | Gaussian | 520.88 | 658.32 | 71.67 miles |
| Percent Cobble | | | | | | |
| Percent Hardpan | | | | | | |
| Percent Fines | | | | | | |
| Percent Boulder | | | | | | |
| Number of Individuals | Natural Log | | Gaussian | 0.29 | 0.39 | 44.19 miles |
| Number of Individuals | Natural Log (no outliers) | | Exponential | 0.2 | 0.27 | 37.69 miles |
| Number of Native Species | | 17.7849 - 0.0042x | Gaussian | 10.1 | 11.87 | 39.93 miles |
| Percent Tolerant Individuals | | | | | | |
| Percent Lithophilic Individuals | Square Root | 15.5364 - 0.0023x | Matern | 0.92 | 2.76 | 44.02 miles |
| Percent Nonnative Individuals | | | | | | |
| Percent Detritivore | Square Root | | Exponential | 1.09 | 1.57 | 24.08 miles |
| Percent Detritivore | Square Root (no outliers) | | Exponential | 0.94 | 1.4 | 19.17 miles |
| Percent Invertivore | Square Root | 6.5207 - 0.0039x | Exponential | 1.4 | 2.97 | 13.43 miles |
| Percent Piscivore | | | | | | |
Future Work
Things to consider in future analysis:
- The water flows in only one direction. A
point downstream cannot affect a point
upstream.
- Natural features such as tributaries may
impact spatial correlation
- Manmade features such as dams may
impact spatial correlation
Concluding Thought
Before you criticize someone, you should
walk a mile in their shoes. That way, when
you criticize them, you’re a mile away and
you have their shoes.
- Jack Handey