Lecture #2: quantitative regionalization and cluster detection, with special reference to local statistics
Spatial statistics in practice
Center for Tropical Ecology and Biodiversity,
Tunghai University & Fushan Botanical Garden
Topics for today’s lecture
• Multivariate grouping, and location-allocation
modeling.
• Going from the global to the local: variability and
heterogeneity.
• Impacts of spatial autocorrelation on histograms.
• The LISA and Getis-Ord statistics.
• Cluster analysis: multivariate analysis, cluster
detection, and spider diagrams.
– An overview of geographic and space-time
clusters.
• Regression diagnostics and geographic clusters
Multivariate grouping goals
• If groups are unknown, to identify the
latent natural groups of areal units
• If groups are known, to assess similarities
and differences among the groups
• To determine the group centroids and
groups of geographical points that result
from minimizing some function of standard
distance
Conventional cluster analysis
distances to minimize
• Single linkage – distances are measured
between pairs of closest (nearest
neighbor) areal units, one from each of
two clusters, in attribute space
• Complete linkage – distances are
measured between pairs of most distant
(furthest neighbor) areal units, one from
each of two clusters, in attribute space
– This criterion often gives the best grouping
results
• Average linkage – distances are
measured between all possible pairs of
areal units, one from each of two clusters,
in attribute space, and then averaged
• Centroid method – squared distances are
measured between each areal unit and all
cluster means, in attribute space
• Ward’s algorithm – based upon ANOVA,
areal units are allocated to clusters in
order to minimize within cluster variances,
and maximize between cluster variance
– This criterion relates to location-allocation
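The linkage criteria above differ only in how the distance between two clusters is summarized. The following is a minimal Python sketch, assuming Euclidean distance in attribute space and two small hypothetical clusters `a` and `b`; all function names are illustrative and not from the lecture. The centroid criterion is written here in its usual form (squared distance between cluster means), and Ward's criterion as the increase in within-cluster sum of squares caused by a merge.

```python
from itertools import product
from math import dist


def centroid(points):
    """Mean of each attribute across the points of one cluster."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def single_linkage(a, b):
    """Distance between the closest pair, one point from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Distance between the most distant pair, one point from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Average distance over all between-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    """Squared distance between the two cluster means."""
    return dist(centroid(a), centroid(b)) ** 2


def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * dist(centroid(a), centroid(b)) ** 2


# Hypothetical clusters of areal units in a two-attribute space.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
for name, f in [("single", single_linkage), ("complete", complete_linkage),
                ("average", average_linkage), ("centroid", centroid_distance),
                ("ward", ward_increase)]:
    print(name, f(a, b))
```

An agglomerative procedure would merge, at each step, the pair of clusters with the smallest value of the chosen criterion.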
Contemporary cluster analysis criteria
• One- or two-stage density – areal unit groupings are based
upon nonparametric probability density estimation (kth nearest
neighbor, uniform kernel, Wong’s hybrid); utilizes single linkage
• EML (equal variance maximum likelihood) – areal unit
groupings are based upon maximizing the likelihood of mixtures
of identical spherical multivariate normal distributions, possibly
with unequal mixing proportions (i.e., sampling probabilities)
• Flexible-beta – areal unit groupings are based upon a
weighting involving scalar beta, which usually falls between 0
and -1 (a common default value is -0.25, with -0.5 appearing to
be more suitable for data with many outliers)
• McQuitty’s method – areal unit groupings are based upon
weighted average linkage, the weighted pair-group arithmetic
averages
• Gower's median method – areal unit groupings are based
upon weighted pair-group centroids, where distance may or
may not be squared
Ward’s algorithm and location-allocation
$$\text{LA MIN}: \sum_{j=1}^{P} \sum_{i=1}^{n} \lambda_{ij}\, w_i \left[(u_i - U_j)^2 + (v_i - V_j)^2\right]$$
where $w_i$ is the weight for location i,
$(u_i, v_i)$ is the coordinate for location i,
$(U_j, V_j)$ is the coordinate for centroid j, and
$\lambda_{ij}$ = 0/1, depending upon whether or not
location i is allocated to cluster j
Ward's algorithm:
cluster the u and v variables (standard distance criterion)
with a _FREQ_ variable (weights)
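The location-allocation criterion can be evaluated directly for any candidate allocation. The sketch below assumes the squared-distance reading of the LA objective, with weights w_i, coordinates (u_i, v_i), centroids (U_j, V_j), and a 0/1 allocation stored as a list of cluster indices. The function names and the three-point data set are hypothetical, and every cluster is assumed to be non-empty.

```python
def weighted_centroid(points, weights):
    """Weighted mean coordinate (U_j, V_j) of the locations in one cluster."""
    total = sum(weights)
    u = sum(w * p[0] for p, w in zip(points, weights)) / total
    v = sum(w * p[1] for p, w in zip(points, weights)) / total
    return (u, v)


def update_centroids(points, weights, allocation, p):
    """Recompute each of the p centroids from its allocated locations."""
    centroids = []
    for j in range(p):
        pts = [pt for pt, k in zip(points, allocation) if k == j]
        wts = [w for w, k in zip(weights, allocation) if k == j]
        centroids.append(weighted_centroid(pts, wts))
    return centroids


def la_objective(points, weights, centroids, allocation):
    """Sum over locations of w_i times squared distance to the allocated centroid."""
    total = 0.0
    for (u, v), w, j in zip(points, weights, allocation):
        U, V = centroids[j]
        total += w * ((u - U) ** 2 + (v - V) ** 2)
    return total


# Hypothetical data: three locations, two clusters.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
weights = [1.0, 3.0, 2.0]
allocation = [0, 0, 1]          # lambda_ij = 1 for (i, allocation[i])

centroids = update_centroids(points, weights, allocation, p=2)
print(centroids, la_objective(points, weights, centroids, allocation))
```

For a fixed allocation, the weighted mean minimizes the weighted sum of squared distances, which is the link to Ward's within-cluster variance criterion when the weights are supplied as frequencies.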
Clustering with PCA/FA
• Although PCA & FA are used most
frequently to deal with multicollinearity
across attribute variables (R-mode), these
techniques also can be used to handle
redundant information across areal units
(Q-mode; e.g., the eigenfunctions of
geographic weight matrix C)
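• Linear combinations extracted from matrix
$(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)\,\mathbf{C}\,(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)$ or $(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)\,\mathbf{D}^{*}\,(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)$
identify the range of possible distinct map
patterns (i.e., uncorrelated and orthogonal)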
Legendre et al. method
• A comparison of the two procedures is in
• Links directly to the semivariogram plot
• D* is a truncated distance-based matrix,
where the truncation is determined by the
length of a minimum spanning tree
articulating the set of locations
$$MC_j = \frac{n}{\mathbf{1}^{T}\mathbf{C}\mathbf{1}}\,\lambda_j$$
where $E_j$ is the map pattern with spatial autocorrelation level $MC_j$
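The eigenvector map patterns can be illustrated numerically. The sketch below assumes a binary rook-adjacency matrix C on a small hypothetical 4-by-4 grid, extracts the eigenvectors of (I - 11^T/n) C (I - 11^T/n) with NumPy, and takes the relation between eigenvalues and the Moran coefficient as MC_j = n λ_j / (1^T C 1). The helper names are illustrative only.

```python
import numpy as np


def rook_adjacency(rows, cols):
    """Binary connectivity matrix C for a regular grid (shared edges only)."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                 # neighbor to the right
                C[i, i + 1] = C[i + 1, i] = 1
            if r + 1 < rows:                 # neighbor below
                C[i, i + cols] = C[i + cols, i] = 1
    return C


def eigenvector_map_patterns(C):
    """Eigenvectors of MCM, sorted from largest to smallest eigenvalue."""
    n = C.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n      # centering projector
    lam, E = np.linalg.eigh(M @ C @ M)       # symmetric eigenproblem
    order = np.argsort(lam)[::-1]
    lam, E = lam[order], E[:, order]
    mc = n / C.sum() * lam                   # MC_j = n * lambda_j / (1'C1)
    return mc, E


def moran_coefficient(y, C):
    """Moran coefficient computed directly from a map pattern y."""
    d = y - y.mean()
    return len(y) / C.sum() * (d @ C @ d) / (d @ d)


C = rook_adjacency(4, 4)
mc, E = eigenvector_map_patterns(C)
print("MCmax:", mc[0], "MCmin:", mc[-1])
print("direct MC of first pattern:", moran_coefficient(E[:, 0], C))
```

The first and last entries of `mc` play the role of MCmax and MCmin for this surface partitioning, and computing the Moran coefficient directly from an eigenvector reproduces the corresponding entry.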
Properties
• The extreme eigenvalues define MCmax
and MCmin (not necessarily 1, -1)
• As eigenvalues go from the largest
positive to the largest negative value, map
patterns become more fragmented
• Positive eigenvalues denote:
– Global trends with relatively large values
– Regional trends with intermediate values
– Local trends with relatively small values
Selected ideal map patterns
[Figure: selected eigenvector map patterns; panel labels in order of appearance: global (MC ~ 1), regional (MC = 0.9), regional (MC = 0.5), MC = 0.7, local (MC = 0.25), MC = -0.6]
SA impacts on Gaussian RVs
Principal impact: variance inflation
[Figure: histograms of Gaussian RVs against standard normal curves for SA map patterns with MC = 0.00, GR = 1.00; MC = 0.28, GR = 0.77; MC = 1.00, GR = 0.18; and MC = 1.12, GR = 0.08 (MCmax = 1.18); annotations: increased kurtosis, heavier tails]
Unstandardized normal curve
• Autoregressive generated: kurtosis increases from 0.01
(roughly 0) to 0.73. The variance of kurtosis is 24/n.
Therefore, here spatial autocorrelation has induced
increased relative peakedness (from the sign of the
kurtosis statistic), with z = 7.3.
• Map pattern generated: kurtosis increases from 0.04
(roughly 0) to 2.79. The variance of kurtosis is 24/n.
Therefore, here spatial autocorrelation has induced
increased relative peakedness (from the sign of the
kurtosis statistic), with z = 27.8.
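The z-scores quoted for kurtosis follow from dividing the kurtosis statistic by the square root of its variance, 24/n. A minimal sketch of that arithmetic follows. The sample size is not stated on this slide, so n = 2400 is an assumption chosen only because the reported pairs (0.73, 7.3) and (2.79, 27.8) imply a standard error of roughly 0.1.

```python
from math import sqrt


def kurtosis_z(excess_kurtosis, n):
    """z-score of an excess-kurtosis statistic whose variance is 24/n."""
    return excess_kurtosis / sqrt(24.0 / n)


n_assumed = 2400   # assumption: not given on the slide
print(kurtosis_z(0.73, n_assumed))
print(kurtosis_z(2.79, n_assumed))
```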
Typical case: MC/MCmax = 0.6
[Figure: map patterns with MC = 0.61, GR = 0.50 and MC = 0.80, GR = 0.34; attribute correlations plot involving X, E3, and E4 (axis ticks 0, 0.002, 0.004); E(MC) = -0.00042, E(GR) = 1]
Transformations to normal
approximations
Torturing the data – conforming to a bell-shaped curve
1. Box-Cox power transformations
$$Y^{*} = (Y + \delta)^{\gamma},\ \gamma \neq 0; \qquad Y^{*} = LN(Y + \delta),\ \gamma = 0$$
2. Manly’s exponential transformation
$$Y^{*} = e^{\gamma Y}$$
3. Percentage adjustments (also arcsine)
$$LN\left[\frac{(Y + a)/(T + b)}{1 - (Y + a)/(T + b)} + \delta\right]$$
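China data example: births/females
$Y^{*} = (B/F_{15\text{-}44} + \delta)^{0.43}$, with $|\delta| = 0.04$
China data example: pop/area
$Y^{*} = LN(P/A + \delta)$, with $|\delta| = 279$
China data example: births/deaths
$Y^{*} = e^{\gamma B/D}$, with $|\gamma| = 0.24$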
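The three transformation families can be written as small functions. The sketch below follows the slide's unscaled Box-Cox form, a power of (Y + δ), rather than the textbook (Y^γ - 1)/γ version. It also assumes that δ is added inside the logarithm of the percentage adjustment. Function and argument names are illustrative, and the numeric inputs are hypothetical rather than the China estimates.

```python
from math import exp, log


def box_cox_shifted(y, delta, gamma):
    """Shifted power transformation; the log is the gamma = 0 case."""
    if gamma == 0:
        return log(y + delta)
    return (y + delta) ** gamma


def manly(y, gamma):
    """Manly's exponential transformation."""
    return exp(gamma * y)


def adjusted_logit(y, t, a, b, delta):
    """Log-odds of the adjusted proportion (y + a)/(t + b), shifted by delta."""
    p = (y + a) / (t + b)
    return log(p / (1.0 - p) + delta)


print(box_cox_shifted(3.0, 1.0, 0.5))          # square root of 4
print(box_cox_shifted(1.0, 0.0, 0))            # log of 1
print(manly(0.0, 0.24))                        # exp(0)
print(adjusted_logit(1.0, 3.0, 0.0, 1.0, 0.0))
```

In practice the constants (δ, γ, a, b) would be estimated so that the transformed variable is as close to bell-shaped as possible, for example by maximizing a normality diagnostic.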
A China example: % F15-44
$$LN\left[\frac{(Y + a)/(T + b)}{1 - (Y + a)/(T + b)} + \delta\right]$$
empirical probability    mean    min     median  max
F/P                      0.247   0.193   0.247   0.416
(F+a)/(P+b)              0.270   0.216   0.257   0.407
(1-c)(F+a)/(P+b)+c       0.168   0.107   0.153   0.324
Constant variance
• Attribute: variable transformations often
stabilize the variance of a variable across
its measurement range
• Mean/median split gives a heuristic
assessment of constant variance (equal
variability of high and low values)
Constant variance
• Geographic: variable transformations often
stabilize the variance of a variable across the
geographic landscape over which it is distributed
• Quadrants of the plane/established areal unit
groupings give a heuristic assessment of
constant variance across a geographic
landscape
[Figure panels: plane quadrants; provinces]
Non-normal random variables (RVs)
• Poisson: the mean equals the variance
(built-in heterogeneity)
– overdispersion: the variance is greater than
the mean
– assuming a gamma-distributed mean results
in a negative binomial random variable
• binomial: variance equals (1-p) times the
mean [i.e., Np(1-p)]
– overdispersion: the variance is greater than
Np(1-p)
– employ quasi-likelihood estimation
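A quick heuristic check for overdispersion compares the sample variance with the variance implied by the assumed distribution. The sketch below uses only the Python standard library. The shape parameter `k` of the gamma-distributed mean is a hypothetical name, and the negative binomial variance μ + μ²/k is the standard gamma-Poisson mixture result rather than a formula shown on the slide.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 for a Poisson RV."""
    return variance(counts) / mean(counts)


def binomial_variance(n_trials, p):
    """Binomial variance Np(1 - p), i.e., (1 - p) times the mean Np."""
    return n_trials * p * (1 - p)


def negative_binomial_variance(mu, k):
    """Variance when a Poisson mean is gamma distributed with shape k."""
    return mu + mu ** 2 / k


counts = [2, 4, 6]                        # hypothetical counts
print(dispersion_ratio(counts))           # variance equals mean here
print(binomial_variance(10, 0.5))
print(negative_binomial_variance(5.0, 2.0))
```

A ratio well above 1 for counts, or a sample variance well above Np(1-p) for binomial data, signals overdispersion.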
Spatial autocorrelation impacts on Poisson RVs
Overdispersion occurs when var(Y) > μ
μ = 5, σ = √5 = 2.2361
[Figure: histograms of Poisson RVs: iid (x̄ = 4.9560, s = 2.2512); weak positive spatial autocorrelation (x̄ = 4.9930, s = 2.5874); strong positive spatial autocorrelation (x̄ = 5.0045, s = 6.9475)]
Impacts of typical spatial autocorrelation levels
[Figure: Poissonness plots for hexagonal and irregular tessellations; reported values in order of appearance: x̄ = 4.9914, s = 3.1007; x̄ = 4.9875, s = 4.0098]
Spatial autocorrelation impacts on binomial RVs
• variance increases
• shape goes to uniform, then to sinusoidal
[Figure: histogram panels labeled global; autoregressive; global & regional; global & regional & local]
Going from the global to the local
Paralleling statistics concerning data outliers, and
leverage and influential points, spatial heterogeneity
in georeferenced data is addressed by focusing on
individual areal units. The emphasis shift is from
global trends to local exceptions, to better
understand local deviations from global model
descriptions by exploiting tensions between global
trends and informative local details latent in empirical
data:
• adaptation of conventional diagnostic statistics
(e.g., Unwin and Wrigley, 1987)
• spatializing existing statistical techniques (e.g.,
Fotheringham et al., 2002)
• Anselin’s (1995) seminal paper about indices of
spatial association (i.e., LISA statistics)
• Getis and Ord’s (1992, 1995) Gi and Gi* statistics
Goals of global versus local analysis
• Identify clustering
• Identify particular clusters (significant
local clusters in the absence of global
autocorrelation)
• Distinguish between homogeneity and
heterogeneity (e.g., spatial outliers: highs
surrounded by lows, and vice versa)
• Identify hot/cold spots
• Analyze local instability (local deviations
from the global pattern of spatial autocorrelation)
LISA: local indicators of spatial
autocorrelation
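$$MC = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{1}{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}} \sum_{i=1}^{n} \frac{z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j}}{(n-1)/n}$$
Moran scatterplot: $\sum_{j=1}^{n} c_{ij} z_{Y,j}$ versus $z_{Y,i}$
$$n_i = \sum_{j=1}^{n} c_{ij}$$
$$LISA_i = \frac{z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j}}{(n-1)/n}$$
$$E(LISA_i) = -\frac{n_i}{n-1}; \quad VAR(LISA_i) \text{ from randomization}$$
[Figure labels: "selevation", "area competition"]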
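The decomposition of the Moran coefficient into local statistics can be checked numerically. The sketch below assumes a binary connectivity matrix C, z-scores based on the sample standard deviation (divisor n - 1), and a local statistic of the form z_i times the sum of neighboring z-scores, divided by (n - 1)/n. The four-unit chain and its values are hypothetical, and the function names are illustrative.

```python
import numpy as np


def local_moran(y, C):
    """One LISA value per areal unit."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = (y - y.mean()) / y.std(ddof=1)       # z-scores with divisor n - 1
    return z * (C @ z) / ((n - 1) / n)


def moran_coefficient(y, C):
    """Global MC as the sum of the local statistics over the sum of c_ij."""
    return local_moran(y, C).sum() / C.sum()


def expected_lisa(C):
    """Expected value -n_i/(n - 1), with n_i the number of neighbors of i."""
    n = C.shape[0]
    return -C.sum(axis=1) / (n - 1)


# Hypothetical chain of four areal units.
C = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 2.0, 4.0, 8.0])

print(local_moran(y, C))
print(moran_coefficient(y, C))
print(expected_lisa(C))
```

Summing the expected local values and dividing by the sum of the c_ij gives -1/(n - 1), the usual expected value of the global Moran coefficient.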
Goal: to assess spatial correlation
heterogeneity
ANOVA: F = 2845; Pr(>X2) = 0.4
color          # counties   LISA z-score range
dark green     16           -3.6 to -1.2
light green    2016         -1.2 to 3.25
gray           200          3.25 to 6.4
light red      145          6.4 to 9.6
dark red       14           9.6+
The randomization perspective
Conditional randomization
Step 1: hold $z_{Y,i}$ constant
Step 2: randomly select $n_i$ of the remaining (n - 1) z-scores, R times
Step 3: compute $I_r = z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j}$ for each replication, and then the variance ($s_I^2$) for the R replications
Step 4: compute $z_i = \dfrac{z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j} + n_i/(n-1)}{s_I}$
Step 5: compare the probability of $z_i$ with Bonferroni adjusted ($\alpha/n$) and/or
Sidak adjusted [$1 - (1 - \alpha)^{1/n}$] multiple test probabilities
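The five steps can be sketched for a single areal unit i. The sketch assumes a binary connectivity matrix, scales the local statistic by (n - 1)/n, and takes the centering term in Step 4 as n_i/(n - 1). The function names, the seed, the number of replications R, and the six-unit chain are all illustrative choices, not values from the lecture.

```python
import numpy as np


def lisa_conditional_z(y, C, i, R=999, seed=0):
    """z-score of the LISA for unit i under conditional randomization."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = (y - y.mean()) / y.std(ddof=1)
    scale = (n - 1) / n
    n_i = int(C[i].sum())                     # number of neighbors (binary C)
    observed = z[i] * (C[i] @ z) / scale
    others = np.delete(z, i)                  # Step 1: z_i is held constant
    rng = np.random.default_rng(seed)
    sims = np.empty(R)
    for r in range(R):
        draw = rng.choice(others, size=n_i, replace=False)   # Step 2
        sims[r] = z[i] * draw.sum() / scale                  # Step 3
    s_I = sims.std(ddof=1)
    return (observed + n_i / (n - 1)) / s_I                  # Step 4


def bonferroni(alpha, n):
    """Bonferroni adjusted significance level for n tests."""
    return alpha / n


def sidak(alpha, n):
    """Sidak adjusted significance level for n tests."""
    return 1 - (1 - alpha) ** (1 / n)


# Hypothetical chain of six areal units with a low end and a high end.
n_units = 6
C_chain = np.zeros((n_units, n_units))
for k in range(n_units - 1):
    C_chain[k, k + 1] = C_chain[k + 1, k] = 1
y_demo = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

z_i = lisa_conditional_z(y_demo, C_chain, i=1, R=499, seed=1)
print(z_i, bonferroni(0.05, n_units), sidak(0.05, n_units))   # Step 5 inputs
```

Step 5 then compares the tail probability of each z_i with one of the adjusted levels; the Sidak level is always slightly larger than the Bonferroni level.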
LISA for PR
[Figure: LISA results for LN(elevation + 17.5): maps of Pr(LISA) with Bonferroni and Sidak adjustments, and a Moran scatterplot whose slope is the unstandardized MC; MC = 0.51; GR = 0.49]
LISA maps
[Figure: map of significant LISA values]
• Cannot distinguish between H-H and L-L clusters
• Conventional clustering fails to preserve contiguity
… with contiguity proclivity
Clustering geocoding coordinate pair coupled with zLISA values
Clustering geocoding coordinate pair with frequencies proportional to zLISA values
Getis-Ord Gi [ Gi* includes i (i.e., j = i)]
• contiguity based upon distance band defined by dr
• dr may be obtained from a semivariogram plot
• one statistic for each areal unit
$$G_i(d_r) = \frac{\sum_{j=1}^{n} c_{ij}(d_r)\, y_j - \bar{y}(i)\sum_{j=1}^{n} c_{ij}(d_r)}{s(i)\sqrt{\dfrac{(n-1)\sum_{j=1}^{n} c_{ij}^{2}(d_r) - \left[\sum_{j=1}^{n} c_{ij}(d_r)\right]^{2}}{n-2}}}, \quad j \neq i$$
• Gi(dr) > 0 signifies clustering of high values
• Gi(dr) < 0 signifies clustering of low values
• LISA fails to make this particular distinction
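The sign behavior of Gi can be seen in a small example. The sketch below implements the standardized statistic for one areal unit. It assumes that ȳ(i) and s(i) are the mean and the population-style standard deviation of the values excluding unit i, following the usual Ord and Getis convention, since the slide does not define them. The five-unit chain, its values, and the function name are hypothetical.

```python
import numpy as np


def getis_ord_gi(y, C, i):
    """Standardized Gi for unit i (unit i itself is excluded, j != i)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mask = np.arange(n) != i
    yj, cj = y[mask], C[i, mask]
    ybar_i = yj.mean()                               # mean without unit i
    s_i = np.sqrt((yj ** 2).mean() - ybar_i ** 2)    # sd without unit i
    W = cj.sum()
    S1 = (cj ** 2).sum()
    numerator = cj @ yj - ybar_i * W
    denominator = s_i * np.sqrt(((n - 1) * S1 - W ** 2) / (n - 2))
    return float(numerator / denominator)


# Hypothetical chain of five units: high values at one end, low at the other.
n_units = 5
C_band = np.zeros((n_units, n_units))
for k in range(n_units - 1):
    C_band[k, k + 1] = C_band[k + 1, k] = 1   # neighbors within distance band
y_vals = [10.0, 9.0, 8.0, 1.0, 0.0]

print(getis_ord_gi(y_vals, C_band, 0))   # neighbor is a high value
print(getis_ord_gi(y_vals, C_band, 4))   # neighbor is a low value
```

The first unit, whose neighbor holds a high value, yields a positive Gi, while the last unit, whose neighbor holds a low value, yields a negative Gi. A cross-product statistic such as LISA would be positive in both cases.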
A Gi-based analysis: complete linkage
[Figure: maps of Gi values and Gi clusters]
A relationship between LISA and Gi for the
same geographic connectivity matrix C
The quadratic trend is why LISA cannot
distinguish between HH and LL clusters,
while Gi can.
Geographic & space-time clusters:
an overview
• Global cluster tests search for spatial clusters
anywhere in a study area but do not necessarily
identify where the clusters occur, and are used to
identify departures from spatial randomness
when overall spatial pattern is considered.
• Local cluster tests identify locations at which
there is some excess/deficit—a hot/cold spot—
anywhere within a study area.
• Focused cluster tests determine whether there
is an excess near a pre-specified location, called
a focus, and are used to detect clustering near,
say, putative hazards (e.g., a toxic waste dump).
Cluster detection techniques
Spider diagrams
• allocation to AAR centroids
• allocation to cluster (U, V, z-LISA) centroids
Regression diagnostics: each observation’s influence
on parameter estimates and predicted values
• PRESS – global measure that should roughly equal
the mean squared error (MSE) for a trend line
(equivalent to cross-validation)
• Leverage – measures degree of influence of areal
unit CzY,i value on an MC trend line (marked: > 2/n)
• Studentized residual – measures whether ith areal
unit causes a significant shift in its corresponding
regression intercept (i.e., is an outlier; marked: > 2)
• Cook’s D – measures influence of ith areal unit on
an MC estimate (analogous to DFFITS;
marked: > $2\sqrt{1/n}$)
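These diagnostics can be computed for a Moran scatterplot trend line, that is, a simple regression of the neighbor sums on the z-scores. The sketch below uses the standard textbook formulas for leverage, PRESS, studentized residuals, Cook's D, and DFFITS with NumPy. The six data pairs are hypothetical stand-ins for (z, Cz) values, and the function name is illustrative.

```python
import numpy as np


def regression_diagnostics(x, y):
    """Case diagnostics for the simple regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = len(x), 2                          # intercept and slope
    X = np.column_stack([np.ones(n), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
    h = np.diag(H)                            # leverage values
    e = y - H @ y                             # residuals
    sse = e @ e
    mse = sse / (n - p)
    press = np.sum((e / (1 - h)) ** 2)        # leave-one-out prediction SS
    r_internal = e / np.sqrt(mse * (1 - h))
    mse_deleted = (sse - e ** 2 / (1 - h)) / (n - p - 1)
    rstudent = e / np.sqrt(mse_deleted * (1 - h))
    cooks_d = r_internal ** 2 * h / (p * (1 - h))
    dffits = rstudent * np.sqrt(h / (1 - h))
    return {"leverage": h, "mse": mse, "press": press,
            "rstudent": rstudent, "cooks_d": cooks_d, "dffits": dffits}


# Hypothetical Moran scatterplot coordinates: z-scores and neighbor sums.
x_demo = np.array([-1.5, -0.5, 0.0, 0.4, 1.6, 2.0])
y_demo = np.array([-1.0, -0.7, 0.2, 0.1, 1.1, 0.9])

diag = regression_diagnostics(x_demo, y_demo)
print(diag["press"] / len(x_demo), diag["mse"])
print(diag["leverage"], diag["rstudent"], diag["dffits"])
```

Comparing PRESS/n with the MSE gives the global check described above, while the per-unit arrays can be screened against whatever marking thresholds are adopted.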
Moran scatterplot for LN(elevation + 17.5)
RMSE = 1.94759
PRESS/n = 1.97055
[Figure: Moran scatterplot with marked values; annotation: mean of C1]
Barranquitas is a spatial
outlier, again!
Spatial autocorrelation in diagnostic
statistics: eigenvector covariates
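MC(E2) = 1.04926
Map legend: dark red = very high; light red = high; gray = medium; light green = low; dark green = very low
α = 0.01; * α adj = 0.0014
[Table residue, layout not recoverable: column labels zLISA, H, rstudent, DFFITS; values in order of appearance: 2, 2, 2*, 2, 4*, 6, 13, 23, 25, 25, 69; R2 values: 0.439, 0.274, 0.303, 0.109]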