Lecture #1: a general overview of spatial autocorrelation in georeferenced data
Spatial statistics in practice
Center for Tropical Ecology and Biodiversity,
Tunghai University & Fushan Botanical Garden
ADMINISTRATION
Who knows what linear regression is?
Who knows what nonlinear regression is?
Who knows what generalized linear modelling is?
Who knows what spatial autocorrelation is?
Does it mean redundant information?
Does it mean variables missing from a model?
Does it mean spatial dependency?
Does it mean a nuisance parameter is present?
Topics for today’s lecture
• The nature of data and its information content.
• What is spatial autocorrelation?
• Visualizing spatial autocorrelation: Moran
scatterplots, semivariogram plots, and maps.
• Defining and articulating spatial structure:
topology and distance perspectives; contagion
and hierarchy concepts.
• Necessary concepts from multivariate statistics.
• An example of the elusive negative spatial
autocorrelation.
• Some comments about spatial sampling.
• Implications about space-time data structure.
High Peak district biomass index:
ratio of remotely sensed data spectral
bands B3 and B4
Spatially autocorrelated
Geographically random
The permutation perspective
The magic box is a physical model
of spatial autocorrelation
Slider
puzzles
The nature of data and its
information content
• Data – symbols (e.g., numbers), signs,
impulses, and the like, which in and of
themselves have no inherent meaning
• Information – [from inform (to tell facts)] data
to which meaning has been attached,
converting numbers, for example, to
messages (i.e., facts) about circumstances,
situations, events, and so on
• Evidence – information analytically converted
to knowledge about whether or not some
proposition is true or valid
Berry’s geographic matrix
[Figure: Berry’s geographic matrix, drawn as a stack of tables, one per time period. Rows are locations (areal unit 1, areal unit 2, …, areal unit n); columns are attributes (Variable 1, Variable 2, …, Variable P). Labeled elements: geographic fact, geographic distribution, geographic associations.]
The nature of georeferenced data
and its information content
• Geocoded data – data that are tagged to
specific points on a two-dimensional surface
(e.g., latitude and longitude for the Earth)
• Geographic information – data to which
meaning has been attached in part through their
absolute and/or relative geographic contexts
• Geographic evidence – information analytically
converted to knowledge in part by accounting
for the presence of latent spatial
autocorrelation
What is needed for spatial statistics:
1. An attribute data set
2. A map
3. Locational tags linking the attribute
data set to the map
4. A topological structure matrix
depicting the arrangements of
locations on a map
Spatial autocorrelation can be
interpreted in different ways
As a spatial process mechanism – spatial diffusion
As a diagnostic tool – the Cliff-Ord Eire example (the model specification should be nonlinear)
As a nuisance parameter – eliminating spatial dependency to avoid statistical complications
As a spatial spillover effect – georeferencing of pediatric lead poisoning cases in Syracuse, NY
As an outcome of areal unit demarcation – the modifiable areal unit problem (MAUP)
As redundant information – spatial sampling; map interpolation
As map pattern – spatial filtering (to be discussed in this course)
As a missing variables indicator/surrogate – a possible implication of spatial filtering
As self-correlation – what is discussed next
Defining spatial
autocorrelation
Auto: self
Correlation: degree of
relative correspondence
Positive: similar values
cluster together on a map
Negative: dissimilar values
cluster together on a map
USA: understanding spatial autocorrelation
POLLUTION MONITORING
HOUSEHOLD SAMPLING
SATELLITE IMAGE
AGRICULTURAL EXPERIMENT
The SASIM
game
http://www.nku.edu/~longa/cgi-bin/cgi-tcl-examples/generic/SA/SA.cgi
The nature of data and its
information content: analysis tools
• Scatterplot: a two-dimensional visual
portrayal of the relationship between two
variables that aids in interpreting a correlation
coefficient or linear regression trend line
• Correlation: a global summary measure
(ranging between -1 and 1) that indexes the
degree to which a regression trend line
characterizes its corresponding scatterplot
• Linear regression: a technique used to
determine the straight line trend exhibited by
a scatterplot
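The three tools above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the function names `pearson_r` and `trend_line` and the sample data are hypothetical, and only the standard library is used.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient: indexes how well a straight line fits the scatterplot."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def trend_line(x, y):
    """Least-squares straight line through the scatterplot: (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: a perfect positive relationship (High Y with High X, etc.)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r = pearson_r(x, y)
intercept, slope = trend_line(x, y)
```

Reversing the order of `y` turns the same data into a perfect negative relationship (r = -1).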
Graphic examples
[Figure: eight scatterplots, each with its regression trend line in blue.]
r = 1: perfect positive
r = 0.95: marked positive
r = 0.72: strong positive
r = 0.51: moderate positive
r = 0.26: weak positive
r = 0.01: trace positive
r = -0.71: strong negative
r = -1: perfect negative
Describing a scatterplot trend
positive relationship:
High Y with High X
& Medium Y with Medium X
& Low Y with Low X
negative relationship:
High Y with Low X
& Medium Y with Medium X
& Low Y with High X
Georeferenced data scatterplots
• The horizontal axis is the measurement
scale for some attribute variable
• The vertical axis is the measurement scale
for neighboring values (topological
distance-based) of the same attribute
variable
OR
• The horizontal axis is (usually) Euclidean
distance between geocoded locations
• The vertical axis is the measurement scale
for geographic variability
Description of the Moran scatterplot
Positive spatial autocorrelation (2002 population density)
- high values tend to be surrounded by nearby high values
- intermediate values tend to be surrounded by nearby intermediate values
- low values tend to be surrounded by nearby low values
MC = 0.49
GR = 0.58
Description of the Moran scatterplot
Negative spatial autocorrelation (competition for space)
- high values tend to be surrounded by nearby low values
- intermediate values tend to be surrounded by nearby intermediate values
- low values tend to be surrounded by nearby high values
MC = -0.16
sMC = 0.075
GR = 1.04
Tasseled cap SBI (soil brightness
index) data for the High Peak District
$$\mathrm{SBI} = 0.332\,B_1 + 0.331\,B_2 + 0.552\,B_3 + 0.452\,B_4 + 0.481\,B_5 + 0.252\,B_6$$
$$\mathrm{TSBI} = \left[\mathrm{SBI} - 79 + 168\,(r_{\mathrm{SBI}}/900)^{3}\right]^{0.6}$$
DEM data for Puerto Rico
Elevation mean
by municipio
Elevation standard deviation
by municipio
Population density across the Cusco
Department of Peru, by district
raw
population
densities
LN+
population
densities
As distribution across a polluted
geographic landscape
Murray f(As) - continued
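Spatial autocorrelation: from r to MC
$$r = \frac{\sum_{i=1}^{n}(y_i-\bar{y})(x_i-\bar{x})/n}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}/n}\;\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}/n}}$$
$$MC = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}c_{ij}(y_i-\bar{y})(y_j-\bar{y})\Big/\sum_{i=1}^{n}\sum_{j=1}^{n}c_{ij}}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}/n}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}/n}}$$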
Measures of spatial autocorrelation
MC: Moran Coefficient; GR: Geary Ratio; semivariogram
$$MC = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n}c_{ij}}\;\frac{\sum_{i=1}^{n}(y_i-\bar{y})\sum_{j=1}^{n}c_{ij}(y_j-\bar{y})}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}$$
range: -1, -1/(n-1), 1
$$GR = \frac{n-1}{2\sum_{i=1}^{n}\sum_{j=1}^{n}c_{ij}}\;\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}c_{ij}(y_i-y_j)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}$$
range: 0, 1, 2
$$\gamma(d_k) = \frac{\sum_{i=1}^{n_k}(y_i-y_j)^{2}}{2\,n_k} = f(d_k),\quad k = 1, 2, \ldots, K$$
Semivariogram models:
Spherical
Exponential
Bessel function (1st order, 2nd kind)
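As a worked illustration of the MC and GR definitions, the sketch below computes both for a tiny map. The function name `moran_geary`, the four-unit connectivity matrix, and the attribute values are hypothetical choices made for this example only.

```python
def moran_geary(y, c):
    """Return (MC, GR) for attribute values y and binary connectivity matrix c."""
    n = len(y)
    ybar = sum(y) / n
    z = [v - ybar for v in y]                      # deviations from the mean
    s0 = sum(sum(row) for row in c)                # sum of all c_ij
    ss = sum(v * v for v in z)                     # sum of squared deviations
    cross = sum(c[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    sqdiff = sum(c[i][j] * (y[i] - y[j]) ** 2 for i in range(n) for j in range(n))
    mc = (n / s0) * cross / ss
    gr = ((n - 1) / (2 * s0)) * sqdiff / ss
    return mc, gr

# Hypothetical 2-by-2 map of four areal units with shared-edge adjacency
C_ROOK = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

checkerboard = [1, -1, -1, 1]   # every unit differs from all of its neighbors
mc_neg, gr_neg = moran_geary(checkerboard, C_ROOK)   # MC = -1.0, GR = 1.5

halves = [1, 1, -1, -1]         # top row high, bottom row low
mc_mix, gr_mix = moran_geary(halves, C_ROOK)         # MC = 0.0, GR = 0.75
```

The checkerboard pattern shows the two indices moving in opposite directions: negative spatial autocorrelation pushes MC below -1/(n-1) and GR above 1.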
Graphical
portrayals of spatial
autocorrelation
latent in the
transformed As
data
Adair County, MO, 1990 population density. Upper left:
geographic connectivity matrix. Upper right: geographic
distribution of density. Lower left: Moran scatterplot. Lower
right: semivariogram plot & Bessel function.
Salem County, VA, 1990 population density. Upper left:
geographic connectivity matrix. Upper right: geographic
distribution of density. Lower left: Moran scatterplot.
Lower right: semivariogram plot & wave-hold function.
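Tessellations (i.e., surface partitionings)
[Figure: square tessellation with rook and queen adjacencies, and hexagon tessellation.]
[Figure: spatial autocorrelation field extents, 1st order and 2nd order, for rook, hexagon, and queen adjacencies.]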
Geographic structure from surface
partitioning topology
• Create an n-by-n table whose rows and
columns are labeled with the same
sequence of n locations
[Figure: n-by-n table with rows and columns labeled 1, 2, …, n and 0s on the diagonal.]
• In response to the question “Is the location
labeling a row adjacent to the location labeling
a column?”, code the (row, col) table cell 0 if
the answer is “No” and 1 if the answer is
“Yes.” This is matrix C.
• Divide each cell entry of matrix C by its
corresponding row sum to create matrix W.
The binary and row-standardized
geographic connectivity matrices
[Figure: surface partitioning into four areal units, A, B, C, and D, with markers denoting adjacency.]
matrix C (binary geographic connectivity matrix)
     A    B    C    D
A    0    1    1    0
B    1    0    0    1
C    1    0    0    1
D    0    1    1    0
matrix W (row-standardized geographic connectivity/weights matrix)
     A    B    C    D
A    0    0.5  0.5  0
B    0.5  0    0    0.5
C    0.5  0    0    0.5
D    0    0.5  0.5  0
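The construction of matrices C and W for the four units A, B, C, and D can be sketched as follows. The helper names and the neighbor lists (read off the binary matrix on the slide) are illustrative assumptions, not a prescribed implementation.

```python
def connectivity_from_neighbors(labels, neighbors):
    """Binary matrix C: cell (row, col) is 1 if the two locations are adjacent, else 0."""
    index = {name: k for k, name in enumerate(labels)}
    n = len(labels)
    c = [[0] * n for _ in range(n)]
    for name, adjacent in neighbors.items():
        for other in adjacent:
            c[index[name]][index[other]] = 1
    return c

def row_standardize(c):
    """Matrix W: each entry of C divided by its row sum (rows without neighbors stay 0)."""
    w = []
    for row in c:
        total = sum(row)
        w.append([v / total if total else 0.0 for v in row])
    return w

labels = ["A", "B", "C", "D"]
# Adjacencies taken from the slide's matrix C
neighbors = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}

C = connectivity_from_neighbors(labels, neighbors)
W = row_standardize(C)
```

Every row of W sums to 1, so WY is the average of the neighboring values, whereas CY is their sum.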
Geographic structure from spatial flows
[Figure: geographic hierarchy and the two-dimensional manifestation of geographic hierarchy.]
$$c_{ij} = I_{ij} = \kappa\,\frac{P_i^{\alpha}\,P_j^{\beta}}{e^{\gamma d_{ij}}}$$
$$w_{ij} = \frac{P_j^{\beta}\,e^{-\gamma d_{ij}}}{\sum_{j=1}^{n}P_j^{\beta}\,e^{-\gamma d_{ij}}}$$
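A small sketch of flow-based weights, assuming the row-standardized form in which each weight is proportional to the destination size raised to β, discounted by exp(-γ × distance). The function name, the populations, the distances, and the parameter values are hypothetical. Setting the diagonal to zero (no self-interaction) is an additional assumption of this example.

```python
from math import exp

def flow_weights(pop, dist, beta, gamma):
    """Row-standardized weights w_ij proportional to P_j**beta * exp(-gamma * d_ij)."""
    n = len(pop)
    w = []
    for i in range(n):
        # Assumption: a location does not interact with itself, so w_ii = 0
        raw = [0.0 if i == j else pop[j] ** beta * exp(-gamma * dist[i][j])
               for j in range(n)]
        total = sum(raw)
        w.append([v / total for v in raw])
    return w

# Hypothetical three-location system
pop = [100.0, 50.0, 10.0]
dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
W_flow = flow_weights(pop, dist, beta=1.0, gamma=1.0)
```

Because of the row standardization, the constant κ and the origin term cancel, so only β and γ need to be supplied.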
Geographic structure from
interpoint distances
distance
lags
anisotropy
The semivariogram plot is a
scatterplot of average squared
paired comparisons versus
average interpoint distance for
grouped georeferenced data
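The grouping behind a semivariogram plot can be sketched as follows: pairs of locations are binned by interpoint distance, and each bin contributes its average distance and half its average squared difference. The function name, the fixed lag width, and the transect data are hypothetical.

```python
from math import hypot

def empirical_semivariogram(coords, values, lag_width):
    """Return [(average distance, semivariance)] for each occupied distance class."""
    bins = {}  # lag index -> [sum of distances, sum of squared differences, pair count]
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):          # each unordered pair is used once
            d = hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            k = int(d // lag_width)
            acc = bins.setdefault(k, [0.0, 0.0, 0])
            acc[0] += d
            acc[1] += (values[i] - values[j]) ** 2
            acc[2] += 1
    # semivariance = sum of squared differences / (2 * number of pairs)
    return [(acc[0] / acc[2], acc[1] / (2 * acc[2])) for _, acc in sorted(bins.items())]

# Hypothetical transect: four equally spaced points with a linear trend in the attribute
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [0.0, 1.0, 2.0, 3.0]
plot_points = empirical_semivariogram(coords, values, lag_width=1.0)
# plot_points == [(1.0, 0.5), (2.0, 2.0), (3.0, 4.5)]
```

The semivariance in this toy transect never levels off, which is the signature of a strong trend rather than of a bounded spatial dependency.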
Quadrants of a Moran scatterplot
[Figure: Moran scatterplot with z on the horizontal axis and CZ on the vertical axis, divided at 0 into quadrants Q1 to Q4.]
Q1 (values [+], sum of neighboring values [+]): H-H
Q3 (values [-], sum of neighboring values [-]): L-L
Locations of positive spatial association
(“I’m similar to my neighbors”).
Q2 (values [+], sum of neighboring values [-]): H-L
Q4 (values [-], sum of neighboring values [+]): L-H
Locations of negative spatial association
(“I’m different from my neighbors”).
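The four-way classification can be sketched by standardizing the attribute (z), summing the neighboring z values (CZ), and reading off the signs. The function name, the chain-shaped connectivity matrix, and the data are hypothetical; treating an exact zero as positive is an arbitrary tie-breaking assumption.

```python
from math import sqrt

def moran_quadrants(y, c):
    """Label each location H-H, L-L, H-L, or L-H from the signs of z and CZ."""
    n = len(y)
    ybar = sum(y) / n
    sd = sqrt(sum((v - ybar) ** 2 for v in y) / n)
    z = [(v - ybar) / sd for v in y]
    cz = [sum(c[i][j] * z[j] for j in range(n)) for i in range(n)]
    labels = []
    for zi, czi in zip(z, cz):
        own = "H" if zi >= 0 else "L"          # assumption: zero counts as high
        nearby = "H" if czi >= 0 else "L"
        labels.append(own + "-" + nearby)
    return labels

# Hypothetical chain of four areal units: 1-2, 2-3, 3-4 are adjacent
C_CHAIN = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
quadrants = moran_quadrants([10.0, 8.0, 2.0, 1.0], C_CHAIN)
# quadrants == ["H-H", "H-H", "L-L", "L-L"]
```

All four locations fall in the H-H or L-L quadrants, which is the pattern expected under positive spatial autocorrelation.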
Syracuse population density
[Figure: locator map (ON, VT, NY, MA, CT, PA, NJ) and map of Syracuse population density in five classes (very high, high, medium, low, very low), showing interstates (I-81, I-90, I-481, I-690), state boundaries, and the Canadian border.]
MC = 0.70
GR = 0.29
Moran scatterplots
MC = 0.28
GR = 0.68
MC = 0.64
GR = 0.37
MC = 0.70
GR = 0.29
population density; % widowed % male; % with univ. degree
MC = 0.41
GR = 0.65
MC = 0.23
GR = 0.82
black/white ratio
MC = 0.48
GR = 0.63
Houston population density
MC = 0.53
GR = 0.43
Moran scatterplots
MC = 0.53
GR = 0.43
MC = 0.64
GR = 0.37
MC = 0.36
GR = 0.64
population density; % widowed % male; % with univ. degree
black/white ratio
MC = 0.53
GR = 0.46
MC = 0.57
GR = 0.43
MC = 0.71
GR = 0.29
What is the spatial
autocorrelation here?
raw
births/deaths
transformed
MC = 0.65
GR = 0.32
What is the spatial
autocorrelation here?
raw
% 100+ years old
transformed
MC = 0.42
GR = 0.56
What is the spatial
autocorrelation here?
raw
births/females15-44
transformed
MC = 0.63
GR = 0.28
Geostatistical terms
• Valid semivariogram model: resulting
covariance matrix is positive-definite
• (Effective) range: the distance at which spatial
autocorrelation becomes (effectively) zero
• Sill: the semivariance value that a semivariogram
tends to when distance becomes very large
• Nugget: a discontinuity at distance = 0
• Isotropy: spatial dependency changes only
with inter-point distance, not direction
• Anisotropy: spatial dependency changes with
inter-point distance and direction
Semivariogram modeling for
population density across China
5,659,641
distance pairs
Too much global structure!!!
30% accounted for by a
linear gradient;
+ 10% accounted for by a
quadratic gradient.
Semivariogram model
descriptions: poor trend line fits!
Semivariogram model
descriptions after geographic detrending
good
description
Redundant information
• Much analysis is devoted to analyzing
redundant (i.e., duplicate) information
• Conventional analysis: the degree to which
a scatter of points, for example, aligns
along a straight line (multicollinearity)
• Spatial statistical analysis: the degree to
which attribute values in one location
replicate the information content of attribute
values in neighboring locations
Review: multivariate analysis
http://obelia.jde.aca.mmu.ac.uk/multivar/intro.htm
Multivariate regression & general linear model
N-way ANOVA
Principal Components & Factor Analysis
Cluster analysis
Generalized linear model
Multivariate analysis: assumptions
1. Linearity
2. Constant variance
3. Normality (bell-shaped curve)
4. Independence (e.g., zero spatial
autocorrelation)
5. Random sampling (design based)/
stochastic model (model based) error
6. No measurement error
7. No multicollinearity
ANOVA
Goal: divide observations into groups,
and assess the within- and
between-groups variation.
Use: evaluating differences of group
means
Accompanying tests for homogeneity
of variance:
Levene – if attribute is continuous
Bartlett – if attribute is normally
distributed within groups
Regression analysis
• The workhorse of conventional statistics:
almost any classical statistical technique can
be expressed as a linear regression problem
• Y = Xβ + ε, with most
assumptions attached to the error term, most
notably the Gaussian distribution
• Parameter estimates often OLS or MLE
• Has been extended to nonlinear and
generalized linear model specifications
Nonlinear & generalized linear
models
• Nonlinear regression often employs
a normally distributed error term; its
parameters usually are estimated
with MLE
• Generalized linear models refer, in
part, to Poisson and Binomial
regression models; their
parameters often are estimated with
iteratively reweighted least squares
Some guidelines
• General interval/ratio data – linear regression
• Counts data – Poisson regression
• Percentage/binary data – binomial/logistic
regression
• Quick approximations for counts/percentage
data – transformed response for linear
regression
• Complicated multiplicative specifications –
nonlinear regression
• Rankings – nonparametric techniques
Principal components analysis (PCA)
• A critical concept in PCA is the
eigenfunction
• Orthogonal components of a matrix
are calculated, and then used to
determine uncorrelated linear
combinations of original variables
det(R − λI) = 0
(R − λI)E = 0
• λ are the eigenvalues
• E are the eigenvectors
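For a 2-by-2 correlation matrix R with off-diagonal entry r, the two equations can be solved by hand: det(R − λI) = (1 − λ)² − r² = 0 gives λ = 1 + r and λ = 1 − r. The sketch below checks this numerically. The function names and the value r = 0.6 are hypothetical.

```python
from math import sqrt

def eigen_corr2(r):
    """Eigenvalue/eigenvector pairs of the correlation matrix [[1, r], [r, 1]]."""
    s = 1 / sqrt(2)                      # scales each eigenvector to unit length
    return [(1 + r, (s, s)), (1 - r, (s, -s))]

def residual(r, lam, e):
    """Compute (R - lam*I) e, which should be the zero vector for an eigenpair."""
    return ((1 - lam) * e[0] + r * e[1], r * e[0] + (1 - lam) * e[1])

r = 0.6
pairs = eigen_corr2(r)
lams = [lam for lam, _ in pairs]

# Each pair satisfies (R - lam*I)E = 0 up to rounding error
max_residual = max(abs(v) for lam, e in pairs for v in residual(r, lam, e))

# The two eigenvectors are uncorrelated (orthogonal) linear combinations
dot = pairs[0][1][0] * pairs[1][1][0] + pairs[0][1][1] * pairs[1][1][1]

trace_R = 1 + 1              # sum of the diagonal of R
det_R = 1 - r * r            # determinant of R
```

Here the eigenvalues sum to the trace of R and multiply to its determinant, and the first component (equal weights on both variables) carries the share (1 + r)/2 of the total variance.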
Theoretical eigenfunction properties
P1: all of the eigenvalues and eigenvectors of a real symmetric matrix
consist of real (rather than complex or imaginary) numbers
P2: the eigenvalues of a matrix and its transpose are the same
P3: the sum of the eigenvalues of a matrix equals the sum of that matrix's
principal diagonal elements (i.e., its trace)
P4: if some constant b is added to each element in the diagonal of a matrix,
then b is added to each of the eigenvalues of that matrix, but the
matrix's eigenvectors remain unchanged
P5: if a matrix is multiplied by some scalar constant, then its eigenvalues
also are multiplied by this constant, but its eigenvectors remain
unchanged
P6: the product of a matrix's eigenvalues equals the value of that matrix's
determinant
P7: the eigenvalues of a matrix and its inverse are inverses of each other,
while the eigenvectors are the same
P8: if a matrix is powered by some positive integer value, each of its
eigenvalues is powered by this same positive integer value, but its
eigenvectors remain unchanged
Eigenfunction properties - continued
P9: for a symmetric matrix, two eigenvectors associated with two distinct
eigenvalues are mutually orthogonal (i.e., E_h^T E_k = 0, h ≠ k)
P10: for a real symmetric matrix, the transpose of the eigenvector matrix
extracted from it equals the inverse of this eigenvector matrix (i.e., E^T
= E^{-1})
P11: the eigenvalues of a triangular or diagonal matrix are the elements in
its principal diagonal
P12: the principal (i.e., largest or dominant) eigenvalue of a matrix is
contained in that interval defined by the largest and smallest row sums
for this matrix, where these sums are of the absolute values of the row
cell entries
P13: the principal eigenvector of a nonnegative, symmetric matrix has all
nonnegative values.
P14: the principal eigenvalue of a nonnegative symmetric matrix with at least
one nonzero entry is positive, and no other eigenvalue of this matrix is
greater in absolute value
P15: the sum of the squared eigenvalues of a matrix is less than or equal to
the sum of all of the elements of this matrix, where these sums are of
the absolute values of the cell entries, with exact equality achieved for
binary matrices
Cluster analysis
Goal: to uncover latent natural groups
of areal units
Popular methods:
single linkage
complete linkage
average linkage
Ward’s algorithm (ANOVA-based)
centroid
Spatial analysis usually wants to
include a contiguity constraint.
A surprise! Hidden negative SA
Dark red:
very high
Light red:
high
Gray:
medium
Light green:
low
Dark green:
very low
Standard spatial statistical model
specifications fail to account for all of
the positive SA displayed in the China
population density map because of
hidden negative SA!
map of
hidden
negative SA
pop density
SF residuals
SAR residuals
hidden negative SA
SPATIAL SAMPLING
• Increasing domain sampling
• Infill sampling
Increasing
domain
sampling design
Infill sampling design to estimate a
single regional mean: Murray, Utah
The two extreme hexagonal tessellations
employed in the sampling design experiment:
left, n = 104; right, n = 2008
The Syracuse, NY,
pediatric lead poisoning study
The two extreme hexagonal tessellations
employed in the sampling design experiment:
left, n = 70; right, n = 2001
1
E( σ̂ )  σ 1 V 1/n ,
2
y
2 T
ε
2
V  I  E( σ̂ )  σ /n ,
2
y
2
ε
SAR : V  (I  ρW )(I  ρW )
T
1
TR( V ) 2
σε
2
n
E( σ̂ y ) 
1
TR( V )
n
T
1
1 V 1
1
TR( V )
 n*  T 1 n
1 V 1
VIF
Effective
sample size:
the # of
equivalent iid
observations
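A general formula via the SAR model
$$n^{*} \approx n\left[1 - \frac{1}{1 - e^{-1.92369}}\;\frac{n-1}{n}\left(1 - e^{-2.12373\hat{\rho} + 0.20024\sqrt{\hat{\rho}}}\right)\right]$$
$\hat{\rho} = 0 \;\Rightarrow\; n^{*} = n$
$\hat{\rho} = 1 \;\Rightarrow\; n^{*} \approx 1$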
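The SAR-based formula can be evaluated directly, as in the sketch below. The function name is hypothetical, and the exponent is read here as −2.12373ρ̂ + 0.20024√ρ̂, which is an assumption about the slide's typesetting. The constants are those shown on the slide.

```python
from math import exp, sqrt

def effective_n_sar(n, rho):
    """Approximate effective sample size n* for n observations with SAR parameter rho."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch covers 0 <= rho <= 1 only")
    # Assumed reading of the exponent: -2.12373*rho + 0.20024*sqrt(rho)
    shrink = (1 - exp(-2.12373 * rho + 0.20024 * sqrt(rho))) / (1 - exp(-1.92369))
    return n * (1 - shrink * (n - 1) / n)

n_obs = 100
n_star_none = effective_n_sar(n_obs, 0.0)   # no spatial autocorrelation: n* = n
n_star_half = effective_n_sar(n_obs, 0.5)   # moderate spatial autocorrelation
n_star_full = effective_n_sar(n_obs, 1.0)   # extreme spatial autocorrelation: n* near 1
```

With n = 100, a value of ρ̂ = 0.5 already reduces the information content to roughly that of 30 independent observations under this reading of the formula.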
A general formula via semivariogram
models
spherical:
$$n^{*} \approx 1 + \frac{n-1}{1 + 251.5132\,(r/d_{max})^{1.9324}}$$
exponential:
$$n^{*} \approx 1 + \frac{n-1}{(1 + 51.4879\,r/d_{max})^{1.7576}}$$
K-Bessel (1st order, 2nd kind):
$$n^{*} \approx 1 + \frac{n-1}{(1 + 69.6698\,r/d_{max})^{1.8601}}$$
n* is a function of the range/range parameter of a
semivariogram model, as well as the model itself.
A general formula via spatial filter
models
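Based on the variance inflation factor notion,
where $R^{2}$ is from regressing Y on a spatial
filter [i.e., a linear combination of eigenvectors
extracted from $(I - \mathbf{1}\mathbf{1}^{T}/n)\,C\,(I - \mathbf{1}\mathbf{1}^{T}/n)$]:
$$E(\hat{\sigma}_{\bar{y}}^{2}) = \frac{\sigma_{\varepsilon}^{2}/(1 - R^{2})}{n} = \frac{\sigma_{\varepsilon}^{2}}{(1 - R^{2})\,n} \;\Rightarrow\; n^{*} = (1 - R^{2})\,n$$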
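sample size, n, determination
SAR:
$$n = \frac{\dfrac{(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma_{e^{*}}^{2}}{\Delta^{2}} - \dfrac{1 - e^{-2.12373\hat{\rho} + 0.20024\sqrt{\hat{\rho}}}}{1 - e^{-1.92349}}}{1 - \dfrac{1 - e^{-2.12373\hat{\rho} + 0.20024\sqrt{\hat{\rho}}}}{1 - e^{-1.92349}}}$$
Bessel:
$$n = 1 + \left[\frac{(z_{1-\alpha/2} + z_{1-\beta})^{2}\,(C_0 + C_1)}{\Delta^{2}} - 1\right]\left(1 + b\,\frac{r}{d_{max}}\right)^{c}$$
SF:
$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma_{e^{*}}^{2}}{\Delta^{2}\,(1 - R^{2})}$$
[Figure annotations: power; precision.]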
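A hedged sketch of the spatial filter (SF) case: the usual independent-observations sample size, (z for significance + z for power)² σ² / Δ², is inflated by 1/(1 − R²), which follows from n* = (1 − R²)n. The function name and the input values are hypothetical; `statistics.NormalDist` from the Python standard library supplies the normal quantiles.

```python
from math import ceil
from statistics import NormalDist

def sf_sample_size(alpha, power, sigma2, delta, r2):
    """Sample size n whose effective size (1 - r2) * n meets the iid requirement."""
    z = NormalDist().inv_cdf                      # standard normal quantile function
    n_iid = (z(1 - alpha / 2) + z(power)) ** 2 * sigma2 / delta ** 2
    return n_iid / (1 - r2)

# Hypothetical design: 5% two-sided significance, 80% power, unit variance,
# and a detectable difference of half a standard deviation
n_independent = ceil(sf_sample_size(0.05, 0.80, 1.0, 0.5, r2=0.0))   # 32
n_with_sa = ceil(sf_sample_size(0.05, 0.80, 1.0, 0.5, r2=0.5))       # 63
```

When a spatial filter accounts for half of the variance in Y, about twice as many georeferenced observations are needed to match the independent-observations design.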
A peek at space-time
autocorrelation
$Y_t = f(WY_{t-1})$
[Figure: diagram with labels U, V, and “location i at time t”.]
Space-time treatments are beyond
the scope of this course
Some observations:
1. The temporal component tends to be
the stronger one [a one-directional (i.e.,
single), one-dimensional influence]
2. As the average number of neighbors
increases, maximum spatial
autocorrelation tends to decrease.
3. Analysis of a single map implies that
spatial effects occur instantaneously.