EXPLORING SPATIAL CORRELATION IN RIVERS by Joshua French

Introduction
A city is required to extend its sewage pipelines farther into
its bay to meet EPA requirements.
How far should the pipelines be extended?
The city doesn't want to spend any more money than it
needs to extend the pipelines. It needs to find a way to
make predictions for the waste levels at different sites in
the bay.
With the passage of the Clean Water Act in the 1970s,
spatial analysis of aquatic data has become even more
important.
Section 305(b) requires state governments to prepare "a
description of the water quality of all navigable waters in
such State. . ."
It is not physically or financially possible to make
measurements at all sites. Some sort of spatial
interpolation will need to be used.
Usually we might try to fit some sort of linear
model to the data to make predictions, and usually
we assume the observations are independent.
For spatial data however, we intuitively know that
two sampling sites close together will probably
be similar.
We would expect that two sites in close proximity
would be more similar than two sites separated
by a great distance.
We can use the correlation between sampling sites
to make better predictions with our model.
The Ohio River
The Road Ahead
- Methods
  - Introduction to the Variogram
  - Exploratory Analysis
  - Sample Variogram
  - Modeling the Variogram
- Analysis
  - 3 types of results
- Conclusions
- Future Work
Introduction to the Variogram
Spatial data is often viewed as a stochastic
process.
For each point x, a specific property Z(x) is
viewed as a random variable with mean µ,
variance σ², higher-order moments, and a
cumulative distribution function.
Each individual Z(xi) is assumed to have its
own distribution, and the set
{Z(x1),Z(x2),…} is a stochastic process.
The data values in a given data set are
simply a realization of the stochastic
process.
We want to measure the relationship
between different points. Define the
covariance for Z(xj) and Z(xk) to be:
Cov(Z(xj),Z(xk))=E[{Z(xj)-µ(xj)} {Z(xk)-µ(xk)}]
where µ(xj) and µ(xk) are the means of Z at
the respective locations.
However, we have a problem. We don’t
know the means at each point because we
only have one realization.
To solve this, we must assume some sort of
stationarity: certain features of the
distribution are identical everywhere.
We will work with data that satisfies second-order stationarity.
Second-order stationarity means that the
mean is the same everywhere: i.e.
E[Z(xj)]=µ for all points xj.
It also implies that Cov(Z(xj),Z(xk)) becomes
a function only of the separation between xj and xk.
Thus,
Cov(Z(xj),Z(xk)) = Cov(Z(x),Z(x+h))
= Cov(h)
where h measures the distance between
two points.
We can then derive that
Cov(Z(x),Z(x+h)) = E[(Z(x) - µ)(Z(x+h) - µ)]
= E[Z(x)Z(x+h)] - µ²
Sometimes it is clear that our data is not
second-order stationary.
Georges Matheron solved this problem in
1965 by establishing his "intrinsic
hypothesis".
For small distances h, Matheron held that
E[Z(x) - Z(x+h)] = 0.
Looking at the variance of differences, and using the fact
that this expected difference is zero, this leads to
Var[Z(x) - Z(x+h)] = E[(Z(x) - Z(x+h))²]
= 2γ(h)
Intrinsic stationarity is useful because
analysis may be conducted even when
second-order stationarity is violated.
Unfortunately, the covariance function is
not defined under intrinsic stationarity alone.
For this reason, we will work with data that is
second-order stationary. If second-order
stationarity is violated by the original data,
then we will perform additional procedures
to obtain data that is second-order
stationary.
Note that second-order stationarity implies
intrinsic stationarity, so the variogram
equation is still defined.
Under second-order stationarity,
γ(h) = Cov(0) - Cov(h),
which follows because Var[Z(x) - Z(x+h)]
= Var[Z(x)] + Var[Z(x+h)] - 2Cov(h)
= 2Cov(0) - 2Cov(h).
γ(h) is known as the semi-variogram. In
practice however, it is usually referred to
as the variogram.
Things to know about variograms:
1. γ(h) = γ(-h). Because the variogram is an even
function, usually only positive lag
distances are shown.
2. Nugget effect: by definition, γ(0) = 0. In
practice, however, sample variograms
often have a positive value at lag 0. This
is called the "nugget effect".
3. Variograms tend to increase monotonically.
4. Sill: the maximum variance of the
variogram.
5. Range: the lag distance at which the sill
is reached.
The following figure shows these features:
[Figure: Variogram Example, showing the nugget, sill, and range on a plot of variance vs. lag distance.]
Exploratory Analysis
Before we model variograms, we should explore
the data.
We need to make sure that the data analyzed
satisfies second-order stationarity.
We need to check for outliers.
We need to make sure that the data is not too
badly skewed (a skewness coefficient G1 > 1 signals trouble).
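To make the skewness check concrete, here is a minimal Python sketch (the original analysis used S-Plus; the function and data below are hypothetical) that computes G1 directly from its definition:

```python
import numpy as np

def g1_skewness(x):
    """Adjusted Fisher-Pearson skewness coefficient G1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = x - x.mean()
    m2 = np.mean(dev**2)               # second central moment
    m3 = np.mean(dev**3)               # third central moment
    g1 = m3 / m2**1.5                  # biased coefficient g1
    return g1 * np.sqrt(n * (n - 1)) / (n - 2)   # small-sample adjustment

# Example: heavily right-skewed counts, tamed by a log transform
counts = np.random.default_rng(0).lognormal(mean=5.0, sigma=1.0, size=200)
print(g1_skewness(counts))             # well above 1: transformation needed
print(g1_skewness(np.log(counts)))     # near 0 after the transform
```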
We can look at the river data as a one-dimensional
linear system. It is fairly easy to check for
stationarity using a scatter plot.
[Figure: Square root of percent invertivore plotted against river mile index (RMI).]
If there is an obvious trend in the data, we
should remove it and analyze the
residuals.
If the variance increases or decreases with
lag distance, then we should transform the
variable to correct this.
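Trend removal in this one-dimensional setting can be sketched in a few lines of Python; the arrays and the trend coefficients below are synthetic, purely for illustration:

```python
import numpy as np

def detrend_linear(rmi, values):
    """Fit a linear trend in river mile and return the residuals."""
    slope, intercept = np.polyfit(rmi, values, deg=1)   # highest power first
    return values - (intercept + slope * rmi)

# Synthetic example: a variable with a downstream linear trend
rng = np.random.default_rng(0)
rmi = np.linspace(0, 1000, 200)                 # river mile index
values = 38.1 + 0.033 * rmi + rng.normal(scale=10, size=rmi.size)
resid = detrend_linear(rmi, values)
# Variogram analysis then proceeds on `resid` rather than the raw values.
```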
To check for outliers, we may use a typical boxplot.
If the data contains outliers, we should do analysis
both with and without outliers present.
If G1 > 1, then we should transform the data to
approximate normality if possible. To check
approximate normality, the standard QQ plot can
be used.
[Figure: QQ plot of observed values against quantiles of the standard normal.]
The Sample Variogram
One of the previous definitions of semivariance is:
γ(h) = (1/2) E[(Z(x) - Z(x+h))²].
The logical estimator is:
γ̂(h) = (1/(2N(h))) Σ_{j=1}^{N(h)} [z(xj) - z(xj + h)]²
where N(h) is the number of pairs of observations
associated with that lag.
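The estimator translates directly into code. Here is a minimal Python sketch (numpy only; everything named below is hypothetical) of Matheron's classical estimator for one-dimensional locations, with tolerance-based lag binning as an assumption, since irregularly spaced sites rarely share exact lags:

```python
import numpy as np

def sample_variogram(x, z, lags, tol):
    """Matheron's classical variogram estimator for 1-D locations x."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    h = np.abs(x[:, None] - x[None, :])        # pairwise separations
    d2 = (z[:, None] - z[None, :]) ** 2        # pairwise squared differences
    iu = np.triu_indices(len(x), k=1)          # count each pair once
    h, d2 = h[iu], d2[iu]
    gamma = np.full(len(lags), np.nan)
    for i, lag in enumerate(lags):
        pairs = np.abs(h - lag) <= tol         # pairs binned near this lag
        n = pairs.sum()                        # this is N(h)
        if n > 0:
            gamma[i] = d2[pairs].sum() / (2 * n)
    return gamma

# Example call (rmi and resid would be site locations and a stationary variable):
# gamma_hat = sample_variogram(rmi, resid, lags=np.arange(10, 260, 10), tol=5)
```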
[Figure: Sample Variogram Example, variance vs. lag distance.]
Modeling the Variogram
Our goal is to estimate the true variogram of
the data.
There were four variogram models used to
model the sample variogram: the
spherical, Gaussian, exponential, and
Matern models.
[Figure: Variogram Models, showing the exponential, spherical, Gaussian, and Matern models as variance vs. lag distance.]
The algorithm used to fit the spherical model
uses least squares.
The algorithm used to fit the exponential,
Gaussian, and Matern models is maximum
likelihood.
The spherical model is fit first to get estimates
of the sill, nugget, and range.
These estimates will then be used to fit the other
three models.
The “best model” will be the model that
minimizes the AICC statistic.
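To illustrate the least-squares step, here is a minimal Python sketch fitting a spherical model with scipy's curve_fit; this is one way to do it, not the actual S-Plus implementation, and all data values below are synthetic. The maximum-likelihood fits and the AICC comparison are not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

def spherical(h, nugget, psill, rng):
    """Spherical variogram: climbs to sill = nugget + psill at lag rng, flat beyond."""
    s = np.clip(np.asarray(h, float) / rng, 0.0, 1.0)
    return nugget + psill * (1.5 * s - 0.5 * s**3)

# `lags` and `gamma_hat` would come from the sample variogram step;
# synthetic values are used here purely for illustration.
lags = np.arange(10.0, 260.0, 10.0)
noise = np.random.default_rng(1).normal(scale=0.005, size=lags.size)
gamma_hat = spherical(lags, 0.20, 0.07, 38.0) + noise

# Least-squares fit; p0 gives rough starting values (nugget, partial sill, range).
(nugget, psill, rng), _ = curve_fit(spherical, lags, gamma_hat, p0=[0.1, 0.1, 100.0])
print(f"nugget={nugget:.3f}, sill={nugget + psill:.3f}, range={rng:.1f} miles")
# These estimates would then seed the maximum-likelihood fits of the
# exponential, Gaussian, and Matern models, compared via AICC.
```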
Analysis
The data analyzed is a set of particle size
and biological variables for the Ohio River.
The data was collected by the Ohio River
Valley Water Sanitation Commission. This is
better known as ORSANCO.
ORSANCO data collection
There were between 190 and 235 unique
sampling sites, depending on the variable.
Some sites had more than one observation.
In these situations, the average value for
the site was used for analysis.
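Collapsing repeated site visits to their average is straightforward; a minimal sketch with pandas, using hypothetical columns and values:

```python
import pandas as pd

# Hypothetical records: one row per observation, some sites visited repeatedly
obs = pd.DataFrame({
    "site":  ["A", "A", "B", "C", "C", "C"],
    "rmi":   [12.0, 12.0, 47.5, 88.2, 88.2, 88.2],
    "value": [3.1, 3.5, 2.2, 4.0, 4.4, 4.2],
})

# Collapse to one record per unique site, averaging the observed values
per_site = obs.groupby(["site", "rmi"], as_index=False)["value"].mean()
print(per_site)
```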
[Figure: Ohio River Sampling Sites, plotted by longitude and latitude (NAD27), spanning Pittsburgh, PA; Cincinnati, OH; Louisville, KY; and Cairo, IL.]
There were two main types of data: particle
size data and biological levels.
The particle size data measured percent
gravel, percent sand, percent fines,
percent hardpan, percent boulder, and
percent cobble.
The biological data measured
- Number of individuals at a site
- Number of species at a site
- Percent tolerant fish
- Percent simple lithophilic fish (fish that lay eggs
on rocks)
- Percent non-native fish
- Percent detritivore fish (fish that eat mostly
decomposed plants or animals)
- Percent invertivore (fish that eat mostly
invertebrate animals)
- Percent piscivore (fish that eat mostly other fish)
The results of the analysis fell into three
main groups:
- Sample variogram fit well
- Sample variogram did not fit well
- Analysis not reasonable
Good Results: Number of
Individuals at a site
Skewness coefficient of data is 8.16. This is
much too high.
The data is transformed using the natural
logarithm.
The new skewness coefficient is reduced to 0.56.
Not perfect, but much less skewed.
[Figure: QQ plot checking normality of log(Number of Individuals).]
[Figure: log(Number of Individuals) plotted against RMI, checking second-order stationarity.]
[Figure: Boxplot of log(Num Individuals), checking for outliers.]
There are a number of outliers for the
transformed variable.
We should do the analysis with and without the
outliers present.
[Figure: Sample variogram of log(Num Individuals) with outliers, variance vs. lag distance (mi).]
[Figure: QQ plot of log(Num Individuals) without outliers, checking normality.]
[Figure: Sample variogram of log(Num Individuals) without outliers, variance vs. lag distance (mi).]
We were not able to model the sample
variogram perfectly, but we were able to
detect some amount of spatial correlation
in the data, especially when the outliers
were removed.
For the transformed variable without outliers,
the exponential model estimated the
nugget to be 0.20, the sill to be 0.2709, and
the range to be 37.7 miles.
Poor Results: Percent Sand
The skewness coefficient is only 0.18, so skewness
is not a major factor.
We check second-order stationarity using a
scatter plot.
[Figure: Percent sand plotted against RMI, checking stationarity.]
There appears to be a trend in the data.
After removing the trend, the data appears
to be second-order stationary.
The residuals are also approximately
normal.
[Figure: Percent sand residuals plotted against RMI, checking stationarity after trend removal.]
[Figure: QQ plot of percent sand residuals, checking normality.]
[Figure: Sample variogram of percent sand residuals, variance vs. lag distance (mi).]
The sample variogram does not really
increase monotonically with distance.
Our variogram models cannot fit this very
well.
Though we can obtain estimates of the
nugget, sill, and range, the estimates
cannot be trusted.
No Results: Percent Hardpan
This variable was so badly skewed that
analysis was not reasonable.
The skewness coefficient is 12.38. This is
extremely high.
[Figure: QQ plot of percent hardpan.]
[Figure: Percent hardpan plotted against RMI.]
The data is nearly all zeros!
There is also an erroneous data value. A
percentage cannot be greater than 100%.
Data analysis does not seem reasonable.
Our data does not meet the conditions
necessary to use the spatial methods
discussed.
Conclusions
Able to fit sample variogram reasonably well
– percent gravel, number of individuals,
number of species
Not able to fit sample variogram well –
percent sand, percent detritivore, percent
simple lithophilic individuals, percent
invertivore
No results – remaining variables
Summary of Results

Response | Transformation | Trend Removed | Model | Nugget | Sill | Range
Percent Gravel | | | Exponential | 286.09 | 335.53 | 72.9 miles
Percent Sand | | 38.1082 + 0.0330x | Gaussian | 520.88 | 658.32 | 71.67 miles
Percent Cobble | | | | | |
Percent Hardpan | | | | | |
Percent Fines | | | | | |
Percent Boulder | | | | | |
Number of Individuals | Natural Log | | Gaussian | 0.29 | 0.39 | 44.19 miles
Number of Individuals | Natural Log (no outliers) | | Exponential | 0.2 | 0.27 | 37.69 miles
Number of Native Species | | 17.7849 - 0.0042x | Gaussian | 10.1 | 11.87 | 39.93 miles
Percent Tolerant Individuals | | | | | |
Percent Lithophilic Individuals | Square Root | 15.5364 - 0.0023x | Matern | 0.92 | 2.76 | 44.02 miles
Percent Nonnative Individuals | | | | | |
Percent Detritivore | Square Root | | Exponential | 1.09 | 1.57 | 24.08 miles
Percent Detritivore | Square Root (no outliers) | | Exponential | 0.94 | 1.4 | 19.17 miles
Percent Invertivore | Square Root | 6.5207 - 0.0039x | Exponential | 1.4 | 2.97 | 13.43 miles
Percent Piscivore | | | | | |
Future Work
Data set involving three streams in Norfolk,
Virginia. Each stream has 25 observations.
Collected by researchers at Old Dominion
University.
Difficulties to overcome
- What is the best way to measure distance
between points?
- Few observations
- Overlapping points after coordinate conversion
Problem: What is the best way to measure
distance between points?
There is some aspect of two-dimensionality
to the data, but it is still really a one-dimensional problem.
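One plausible approach is to measure distance along the channel rather than straight-line distance. A minimal sketch, assuming the sites are ordered downstream and connected by straight segments (a simplification of the true channel geometry; the coordinates are hypothetical):

```python
import numpy as np

def along_stream_distance(xy):
    """Cumulative arc length along an ordered sequence of (x, y) points."""
    xy = np.asarray(xy, float)
    seg = np.sqrt(np.sum(np.diff(xy, axis=0) ** 2, axis=1))  # segment lengths
    return np.concatenate([[0.0], np.cumsum(seg)])

# Hypothetical sites ordered from the stream's head, in UTM meters
sites = [(920200.0, 4082800.0), (920400.0, 4083000.0), (920650.0, 4083150.0)]
s = along_stream_distance(sites)
# |s[i] - s[j]| is then the one-dimensional separation used for the variogram.
```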
[Figure: Paradise Creek region of interest and sampling sites, plotted in UTM coordinates.]
Problem: 25 observations per stream is
considered the minimum number of points
to create a variogram
- the sample variogram will be very rough
- our variogram model estimates will
probably be bad
To correct this, we will explore the possibility
of combining the data from the three
streams
Problem: Overlapping points after
conversion
- Original data in longitude/latitude
coordinates
- Convert to UTM coordinates so that
Euclidean distance makes sense (see the sketch below)
- Converted UTM coordinates often result
in overlapping sites (and even fewer
unique sampling sites)
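The conversion itself can be sketched with pyproj; the EPSG codes here are illustrative assumptions (WGS84 geographic in, UTM zone 18N out for the Norfolk area), while the original data are referenced to NAD27:

```python
from pyproj import Transformer

# Geographic lon/lat in, UTM zone 18N out; always_xy keeps (lon, lat) ordering
to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32618", always_xy=True)

lon, lat = -76.286, 36.804          # a hypothetical Paradise Creek site
utm_x, utm_y = to_utm.transform(lon, lat)
print(utm_x, utm_y)
# Nearby sites can collapse to effectively overlapping UTM coordinates,
# so duplicates should be checked for before computing distances.
```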
[Figure: Stream sampling sites plotted in latitude/longitude (NAD27) and again in UTM coordinates.]
Acknowledgments
- My committee: Dr. Urquhart, Dr. Wang,
and Dr. Theobald
- Dr. Davis and Dr. Reich for answering my
spatial questions and letting me use their
S-Plus spatial library
Concluding Thought
Before you criticize someone, you should
walk a mile in their shoes. That way, when
you criticize them, you’re a mile away and
you have their shoes.
- Jack Handey