This file was created by scanning the printed publication.
Errors identified by the software have been corrected;
however, some errors may remain.
Further Explorations of Relationships
between Semi-Variogram and Spatial
Autoregressive Models¹
Daniel A. Griffith, Larry J. Layne, and Philip G. Doyle²
Abstract. This paper continues the forging of a common foundation for geo- and
spatial statistics. To date partial conceptual correspondence has been established
between the conditional autoregressive (CAR) and exponential semi-variogram
models, the simultaneous autoregressive (SAR) and Bessel function semi-variogram
models, and the moving average (MA) and linear semi-variogram models, when
directional bias is not present. The exploratory numerical work summarized in this
paper addresses the issue of whether or not these articulations are preserved when
directional bias is present in the latent spatial autocorrelation, given symmetric spatial
dependency. In doing so, impacts of edge effects are examined.
INTRODUCTION
Generally speaking, spatial statistics is concerned with the statistical analysis of georeferenced (or locationally tagged) data, and seeks to exploit the redundant information
represented by spatial autocorrelation latent in such data. Two principal thrusts of spatial
analysis have been (1) interpolation supporting visualization of surfaces, and (2) data
analysis supporting statistical description and inference. One of the seminal visualization
efforts was the famous SYMAP computer program, which sought to furnish low-resolution, coarse visualization of geographic surface representations (contour maps or perspective surfaces) based upon measurements made at a relatively small number of dispersed
locations. Paralleling the development and dissemination of SYMAP was a body of work
that focused on exploiting more localized spatial dependence latent in geo-referenced data,
which has become known as the theory of regionalized variables, or geo-statistics.
Matheron initiated this pioneering work in the 1960s, and Cressie (1991) recently published a state-of-the-art overview of geostatistics. Meanwhile, spatial statistics research,
especially characterized by Cliff and Ord's (e.g., 1981) work, focused attention on the need
to employ spatially adjusted linear statistical models. Thus, while both geo- and spatial
statistics support the visualization of geo-referenced data, spatial statistics goes beyond this
task by enabling the drawing of sound statistical inferences from geo-referenced data.
While Davis and McCullagh (1975) began forging a common foundation for these two subdisciplines, such a connection has remained elusive. The principal objective of
the research reported in this paper is to pursue this end. Its practical importance
partly arises within the U.S. E.P.A. Environmental Monitoring and Assessment Program
(EMAP) project, a major national endeavor in which statistics is being applied to environmental issues as well as monitoring and analysis undertakings, and partly in remediation
work dealing with environmental contaminants. Concepts from both geo- and spatial
statistics are necessary for successful engagement in these two scientific pursuits. Articulation of a common foundation for geo- and spatial statistics will better enable consistent
statistical application and analysis of the massive amounts of geo-referenced environmental
data becoming available. By doing so, geographic visualization and analysis will be able
to go hand in hand more closely, and interpolated maps can be analyzed
utilizing consistent spatial statistical model specifications.
¹ This research was supported by the National Science Foundation, research grant # SBR-9507855.
² Professor, Research Associate, and undergraduate research assistant (through an NSF REU supplement
grant), respectively, Department of Geography, Syracuse University, 144 Eggers Hall, Syracuse,
NY 13244-1090.
Communalities Between Geo- and Spatial Statistics
Historically, traditional statistical analyses have been concerned with data reduction and
summarization. In contrast, time series analysis frequently leads to forecasting, whereas
spatial series analysis often leads to interpolation, two data explosion operations. Geo- and
spatial statistics similarly fixate on this interpolation issue through the "missing data"
problem. Martin (1984) has outlined the exact maximum likelihood estimates of missing
values for incomplete geographic data sets from a spatial statistics, autoregressive modelling perspective. The resulting equations, as demonstrated in Griffith (1993), are exactly
those best linear unbiased estimates obtained with kriging (see Christensen, 1991), but in
terms of the inverse covariance matrix rather than the covariance matrix itself. This link
initially provided a conspicuous piece of evidence suggesting close relationships between
semi-variogram and spatial autoregressive models, and furnished the motivation for
Griffith and Csillag (1993) to numerically explore this relationship.
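The equivalence between the inverse-covariance and covariance forms can be checked with a short block-matrix argument. The sketch below assumes a zero-mean Gaussian field (the simple kriging setting); the partition into observed sites o and missing sites m, and the symbols Σ and Q, are notation introduced here for illustration rather than taken from the cited papers.

```latex
% Assumptions: zero-mean Gaussian field; o = observed sites, m = missing sites.
\Sigma = \begin{pmatrix} \Sigma_{oo} & \Sigma_{om} \\ \Sigma_{mo} & \Sigma_{mm} \end{pmatrix},
\qquad
Q = \Sigma^{-1} = \begin{pmatrix} Q_{oo} & Q_{om} \\ Q_{mo} & Q_{mm} \end{pmatrix}

% Kriging predictor (covariance form) and missing-data estimate (inverse-covariance form):
\hat{x}_m^{\mathrm{krig}} = \Sigma_{mo}\,\Sigma_{oo}^{-1}\,x_o,
\qquad
\hat{x}_m^{\mathrm{miss}} = -\,Q_{mm}^{-1}\,Q_{mo}\,x_o

% The (m,o) block of Q\Sigma = I gives
Q_{mo}\,\Sigma_{oo} + Q_{mm}\,\Sigma_{mo} = 0
\;\Longrightarrow\;
\Sigma_{mo}\,\Sigma_{oo}^{-1} = -\,Q_{mm}^{-1}\,Q_{mo}
```

Under these assumptions the two estimates coincide term by term, which is the sense in which an autoregressive (inverse-covariance) specification and a semi-variogram (covariance) specification yield the same predictor.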
Algebraically speaking, then, estimation of missing data in spatial statistics (in order to
avoid maps with holes) and kriging as an interpolator/predictor are exactly equivalent,
which connects kriging to the E-M algorithm literature of statistics. Both numerical and
graphical evidence exist to indicate that the spatial statistical conditional autoregressive
(CAR) and geo-statistical exponential semi-variogram models are linked. Graphical
evidence exists that couples the spatial statistical moving average (MA) and geo-statistical
linear semi-variogram models. And, both numerical evidence and theoretical arguments
signal a relationship between the spatial statistical simultaneous autoregressive (SAR) and
the geo-statistical Bessel function (first order, second kind) semi-variogram models. The
spectral density function of the SAR model may be written as:
Christensen (1991) summarizes the status of this model by noting that Whittle (1954) has
shown that the Bessel function can arise naturally in two-dimensional covariations, and the
exponential semi-variogram model is a special case of it (Ripley 1981). The relevant
functional form, where K1 is a modified Bessel function, is given by:
When considering the appropriateness of this latter function as a descriptor of geographic
covariation, numerical evidence (computed with Heuvelink's weighted least squares
software) based on an isotropic dependency structure indicates that the two critical features
leading to a Bessel function semi-variogram model specification appear to be intensity and
degree of articulation of a geographic structure (see Table 1).
Table 1. Bessel function semi-variogram fit of autoregressive models for geographic structures.

Geographic      Conditional autoregressive model      Simultaneous autoregressive model
articulation    C0      C1      range   SSD/SST       C0      C1      range   SSD/SST
square          0.350   0.617   2.56    0.007         0.005   0.994   8.27    0.000
hexagon         0.443   0.489   2.19    0.041         -----   -----   -----   -----
queen           0.442   0.520   2.88    0.004         0.011   0.988   10.1    0.000

Of note is that the exponential semi-variogram model (which actually is a Bessel function
of order ½) furnishes a better description of the CAR cases than does the particular
modified Bessel function estimated here. An SAR structure, which involves second-order
dependencies, is better characterized by the popular Bessel function than is the CAR
structure, which involves only first-order dependencies. Furthermore, relying upon an
analogy with chess, the spatial linkage structure for the queen's set of neighbors (those
areal units sharing any common boundary point) is better characterized by this Bessel
function than is the rook's set of neighbors (only areal units sharing a common boundary
of non-zero length). Moreover, as the spatial random field increases in scope (i.e., moving
from a rook's to a queen's move dependency field), the SAR model more strongly relates
to the Bessel function semi-variogram model. That is, the discrete setting more closely
resembles the continuous setting. This tendency holds important implications for the U.S.
E.P.A. EMAP database currently being constructed, which is being geo-referenced in
terms of a hexagonal tessellation (comprised of approximately 12,600 hexagons) superimposed upon the continental U.S.
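The remark that the exponential model is a Bessel function of order ½ can be checked numerically. The sketch below assumes the Whittle-Matern parameterization of the Bessel family, with a range parameter r and an order nu; this parameterization, and the function names, are assumptions made for illustration and need not match the specification fitted in Table 1. The modified Bessel function is evaluated from its standard integral representation so that only the Python standard library is needed.

```python
import math

def bessel_k(nu, x, t_max=12.0, steps=4000):
    """Modified Bessel function of the second kind, K_nu(x), for x > 0.

    Uses K_nu(x) = integral over t >= 0 of exp(-x*cosh(t)) * cosh(nu*t),
    evaluated with the trapezoid rule on [0, t_max]. The truncation is
    adequate for roughly x >= 0.001 and small orders such as 0.5 or 1.
    """
    h = t_max / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return total * h

def bessel_correlation(d, r, nu):
    """Whittle-Matern correlation at distance d (assumed parameterization)."""
    if d == 0.0:
        return 1.0
    u = d / r
    return (2.0 ** (1.0 - nu) / math.gamma(nu)) * u ** nu * bessel_k(nu, u)

def bessel_semivariogram(d, c0, c1, r, nu=1.0):
    """gamma(d) = c0 + c1 * (1 - correlation(d)); gamma(0) = 0 by convention."""
    if d == 0.0:
        return 0.0
    return c0 + c1 * (1.0 - bessel_correlation(d, r, nu))

# Order 1/2 reproduces the exponential correlation exp(-d/r);
# order 1 is the first-order Bessel form.
for d in (0.5, 1.0, 2.0, 4.0):
    print(d,
          round(bessel_correlation(d, 3.0, 0.5), 6),
          round(math.exp(-d / 3.0), 6),
          round(bessel_correlation(d, 3.0, 1.0), 6))
```

The second and third printed columns agree, while the order-1 curve in the last column decays more slowly near the origin, which is one way to see why the two models can describe first-order and second-order dependence structures differently.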
Numerical evidence presented in Table 1 was taken from exploratory computer experiments involving the following three general steps:
Step 1: calculation of theoretical spatial correlations at various lags, for a regular square
tessellation, based upon spatial autoregressive model specifications
Step 2: fitting of selected geo-statistical semi-variogram models to the two-dimensional
correlograms obtained in Step 1 using non-linear least squares
Step 3: graphical superimposition of the estimated weighted least squares curves on the
correlograms obtained in Step 1 using Heuvelink's computer software or customized SAS code.
The findings are restricted to translations from spatial statistical to geo-statistical model
specifications, spotlighting the three overwhelmingly most popular spatial autoregressive
models (i.e., the MA, CAR and SAR models).
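Step 1 of the procedure can be sketched for a small lattice as follows. This is an illustration under stated assumptions, not the authors' code: it uses a binary rook's-case weights matrix W on a finite n-by-n lattice, takes the CAR covariance as proportional to (I - rho*W)^(-1) and the SAR covariance as proportional to its product with its own transpose, and obtains the inverse from a power series rather than a library routine. The lattice size, rho value, and function names are hypothetical.

```python
def rook_neighbors(n):
    """Adjacency lists for the binary rook's case on an n-by-n square lattice."""
    nbrs = []
    for r in range(n):
        for c in range(n):
            cell = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    cell.append(rr * n + cc)
            nbrs.append(cell)
    return nbrs

def series_inverse(nbrs, rho, tol=1e-14, max_terms=5000):
    """B = (I - rho*W)^(-1) from the series I + rho*W + (rho*W)^2 + ...

    Converges when |rho| times the largest eigenvalue of W is below 1,
    which |rho| < 0.25 guarantees here. W is symmetric, so B is as well.
    """
    m = len(nbrs)
    rows = []
    for i in range(m):
        term = [0.0] * m
        term[i] = 1.0
        acc = term[:]
        for _ in range(max_terms):
            term = [rho * sum(term[k] for k in nbrs[j]) for j in range(m)]
            acc = [a + t for a, t in zip(acc, term)]
            if max(abs(t) for t in term) < tol:
                break
        rows.append(acc)
    return rows

def times_transpose(b):
    """B times B-transpose: the SAR covariance up to a variance scale."""
    m = len(b)
    return [[sum(b[i][k] * b[j][k] for k in range(m)) for j in range(m)]
            for i in range(m)]

def lag_correlations(cov, n, max_lag):
    """Correlations between the central cell and cells to its right."""
    c = (n // 2) * n + n // 2
    return [cov[c][c + lag] / (cov[c][c] * cov[c + lag][c + lag]) ** 0.5
            for lag in range(max_lag + 1)]

n, rho = 7, 0.2                      # hypothetical lattice size and parameter
b = series_inverse(rook_neighbors(n), rho)
car = lag_correlations(b, n, 3)
sar = lag_correlations(times_transpose(b), n, 3)
for lag in range(4):
    print(lag, round(car[lag], 4), round(sar[lag], 4))
```

The resulting lag correlations are the quantities to which a semi-variogram model would then be fitted in Step 2. For the same rho, the SAR correlations decay more slowly than the CAR ones, in line with the second-order dependencies noted above.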
Fortifying the Autoregressive-geo-statistical Linkages
Research findings summarized here pertain to a further articulation of a common foundation for geo- and spatial statistics. One motivation for pursuing this work is the need for
establishing a common umbrella for a growing body of seemingly disparate work (e.g.,
Rey, Getis and Bortman, 1994; Griffith, 1992; Little and Rubin, 1987). The extension
outlined here builds on results reported by Griffith (1993) and Griffith and Csillag (1993).
Two principal themes have been examined. One concerns non-stationary, multi-parameter
spatial autoregressive models. Martin (1990) notes that the use of even bi-parametric
spatial autoregressive inverse-covariance matrix specifications is extremely rare in applications, although well-established for very special cases. This avoidance of the two-parameter model in part is due to numerical difficulties affiliated with its estimation. The critical
simplicity feature of a specification is that the pair of geographic weights matrices either
are commutative or can exploit the matrix algebra notion of similarity. In contrast, specification and estimation of directional semi-variogram functions is quite common.
The second theme concerns boundary, or edge, effects. Griffith (1983) addresses the
general problem of edge effects in quantitative geographical analyses, underscoring their
seriousness. Rathbun (1994) provides an excellent example of the type of distortion
overlooked edge effects can create and showed that properly handling them allowed better
interpolation of salinity in Chesapeake Bay. In addition, results reported in Griffith and
Csillag (1993) exhibit an affinity between the magnitude of prevailing spatial dependencies
and the dominance of regional edge effects. This factor emerges because semi-variogram
models directly deal with the covariance matrix for a set of observations, whereas spatial
autoregressive models deal with the inverse of this matrix, and the procedure of matrix
inversion is what amplifies and propagates edge effects. To date these effects have defied
the formulation of general correction techniques. Insight into this edge effects phenomenon is gained here by holding the degree of spatial dependence constant, and changing the
size of the region within which estimation is conducted (see Table 2).
Table 2. Bessel function parameter estimates for queen's case, isotropic SAR specification,

Maximum lag    C0       C1      range   relative SS
5              0.0093   0.906   9.35    0.000

Theoretically one would expect C0 = 0 and C1 = 1. Notice that as the region used for
parameter estimation purposes increases in size, the fit of the Bessel function remains near-perfect, while the parameter estimates themselves change. The range of the spatial
autocorrelation field increases to some upper limit and C1 approaches 1. Somewhat
surprising is that C0 (the nugget effect) does not move toward zero, which may be indicative of a liability associated with using a non-linear model specification and/or weighted
least squares estimation.
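One way to see how inverting a fixed autoregressive specification produces edge effects is to compare the variance it implies for a boundary cell with that for an interior cell while the region grows. The sketch below is an illustration only: it assumes a CAR covariance proportional to (I - rho*W)^(-1) with binary rook's-case W, a hypothetical rho of 0.2, and a power-series evaluation of a single diagonal entry.

```python
def car_variance(n, rho, row, col, tol=1e-14, max_terms=5000):
    """Diagonal entry of (I - rho*W)^(-1) for cell (row, col) on an
    n-by-n lattice with binary rook's-case W, via the power series."""
    start = row * n + col
    term = [0.0] * (n * n)
    term[start] = 1.0
    total = 1.0
    for _ in range(max_terms):
        new = [0.0] * (n * n)
        for idx, val in enumerate(term):
            if val == 0.0:
                continue
            r, c = divmod(idx, n)
            if r > 0:
                new[idx - n] += rho * val
            if r < n - 1:
                new[idx + n] += rho * val
            if c > 0:
                new[idx - 1] += rho * val
            if c < n - 1:
                new[idx + 1] += rho * val
        term = new
        total += term[start]
        if max(abs(v) for v in term) < tol:
            break
    return total

rho = 0.2                            # hypothetical dependence level, held constant
for n in (5, 9, 13):
    mid = n // 2
    centre = car_variance(n, rho, mid, mid)
    corner = car_variance(n, rho, 0, 0)
    print(n, round(centre, 5), round(corner, 5), round(corner / centre, 4))
```

With rho held constant, the corner variance stays below the centre variance at every lattice size, and the centre value itself shifts as the region grows. This is a simplified analogue of the experiment summarized in Table 2, not a reproduction of it.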
Two-parameter Autoregressive Functions and Directional Semi-variogram
Models
Theoretical correlograms can be constructed for the bi-parametric spatial autoregressive
model matched with a regular square tessellation for the rook's set of neighbors spatial
linkages since the closed form spectral density function is known. As noted previously,
specification and estimation of directional semi-variogram functions is common. The
theoretical two-dimensional correlograms generated by these equations do not need to be
rotated, as their major and minor axes coincide with the corresponding horizontal and
vertical axes.
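A theoretical two-dimensional correlogram of this kind can be sketched from a spectral density by numerical integration. The block below assumes the form commonly given for an infinite regular square lattice with rook's neighbors, namely a density proportional to 1 / (1 - 2*rho_h*cos(w1) - 2*rho_v*cos(w2)) raised to the power 1 for a CAR and 2 for a SAR specification. That form, the names rho_h and rho_v, and the pairing of the values 0.05 and 0.25 with the horizontal and vertical directions are assumptions for illustration; the paper's own expression is not reproduced here.

```python
import math

def lattice_correlation(rho_h, rho_v, lag_h, lag_v, power=1, grid=64):
    """Correlation at lag (lag_h, lag_v) on an infinite square lattice.

    Integrates cos(lag_h*w1 + lag_v*w2) times the assumed spectral density
    over one period using a uniform grid, then normalizes by the variance.
    Requires |rho_h| + |rho_v| < 0.5 so the density stays positive.
    """
    num = 0.0
    den = 0.0
    for i in range(grid):
        w1 = 2.0 * math.pi * i / grid
        for j in range(grid):
            w2 = 2.0 * math.pi * j / grid
            f = (1.0 - 2.0 * rho_h * math.cos(w1)
                 - 2.0 * rho_v * math.cos(w2)) ** (-power)
            num += f * math.cos(lag_h * w1 + lag_v * w2)
            den += f
    return num / den

rho_h, rho_v = 0.05, 0.25            # illustrative directional parameters
for lag in (1, 2, 3):
    print(lag,
          round(lattice_correlation(rho_h, rho_v, lag, 0), 4),           # CAR, horizontal
          round(lattice_correlation(rho_h, rho_v, 0, lag), 4),           # CAR, vertical
          round(lattice_correlation(rho_h, rho_v, lag, 0, power=2), 4),  # SAR, horizontal
          round(lattice_correlation(rho_h, rho_v, 0, lag, power=2), 4))  # SAR, vertical
```

Because the two parameters act along the lattice axes, the slow and fast directions of decay line up with the vertical and horizontal axes, so no rotation is involved.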
Results of bi-parametric semi-variogram model fits are reported in Table 3. As has been
found for the isotropic case, the exponential semi-variogram model relates almost perfectly
to the spatial statistical CAR model case, and the Bessel function semi-variogram model
relates almost perfectly to the spatial statistical SAR model case. For all cases explored
here, the relative residual sums of squares essentially equal zero. This indicates that the
linkages already established between the geo-statistical and spatial statistical models
persist in the presence of anisotropy. Differences in the computed directional ranges are
consistent with differences between the directional autoregressive parameters ρ1 and ρ2.
Furthermore, an inspection of the change in semi-variogram parameter estimates as the
size of the region under study increases reveals the presence of edge effects, as it did in the
isotropic situation. These edge effects become more pronounced as the two levels of
spatial autocorrelation increase.
Table 3. Parameter estimates and relative sums of squares (error sums of squares/corrected total
sums of squares, or SSE/CTSS) for the CAR and SAR models for 4 different lattice sizes.

                        Conditional autoregressive model       Simultaneous autoregressive model
lattice                 ρ1 = 0.05   ρ1 = 0.175  ρ1 = 0.050     ρ1 = 0.05   ρ1 = 0.175  ρ1 = 0.050
size     parameter      ρ2 = 0.25   ρ2 = 0.320  ρ2 = 0.445     ρ2 = 0.25   ρ2 = 0.320  ρ2 = 0.445
4 x 4    C0             0.00000     0.13062     0.05397        0.02864     0.00922     0.00645
         C1             1.00045     0.79249     0.89910        0.97118     0.93289     0.92101
         b1             0.35341     1.52411     0.95340        0.35638     3.84913     1.93552
         b2             0.76714     2.13413     3.08842        0.84330     5.25147     6.14246
         SSE/CTSS       0.00328     0.00338     0.00328        0.00225     0.00013     0.00059
6 x 6    C0             0.00000     0.18844     0.09685        0.02906     0.01169     0.01215
         C1             1.00020     0.76710     0.87825        0.97090     0.96549     0.96173
         b1             0.35265     1.80853     1.07463        0.35662     4.00823     2.07320

[The remaining rows (6 x 6: b2 and SSE/CTSS; 8 x 8; 10 x 10) are not legible in the scanned source,
apart from one row of SSE/CTSS values whose lattice size cannot be determined:
CAR 0.00298, 0.00594, 0.00804; SAR 0.00193, 0.00003, 0.00027.]

Theoretically, the parameter estimates for C0 and C1 should equal 0 and
1, respectively. These are approximately the values estimated involving low levels of
spatial autocorrelation, for both the exponential and Bessel function semi-variogram
models. A tendency for C1 to move toward 1 is apparent for the intermediate and high
levels of spatial autocorrelation in both model cases. Likewise a movement toward 0 is
detectable for C0 in the exponential case but not in the Bessel function case.
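The fitting step and the relative sum of squares can be sketched as follows. This is a simplified, unweighted stand-in for the weighted and non-linear least squares fits reported above, not the software used in the paper. It relies on the fact that, for a fixed range r, the exponential model gamma(h) = C0 + C1 * (1 - exp(-h/r)) is linear in C0 and C1, so a grid search over r combined with ordinary least squares suffices. The function name, the candidate grid, and the synthetic inputs are hypothetical.

```python
import math

def fit_exponential(lags, gammas, candidate_ranges):
    """Fit gamma(h) = c0 + c1 * (1 - exp(-h / r)) by separable least squares.

    For each candidate r, c0 and c1 come from simple linear regression of
    gamma on x = 1 - exp(-h / r); the r with the smallest SSE is kept.
    Returns (c0, c1, r, SSE/CTSS).
    """
    n = len(lags)
    mean_g = sum(gammas) / n
    ctss = sum((g - mean_g) ** 2 for g in gammas)   # corrected total SS
    best = None
    for r in candidate_ranges:
        x = [1.0 - math.exp(-h / r) for h in lags]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (g - mean_g) for xi, g in zip(x, gammas))
        c1 = sxy / sxx
        c0 = mean_g - c1 * mean_x
        sse = sum((g - c0 - c1 * xi) ** 2 for xi, g in zip(x, gammas))
        if best is None or sse < best[0]:
            best = (sse, c0, c1, r)
    sse, c0, c1, r = best
    return c0, c1, r, sse / ctss

# Synthetic check: ten lags generated from known (hypothetical) parameters.
lags = [float(h) for h in range(1, 11)]
truth = [0.1 + 0.9 * (1.0 - math.exp(-h / 2.5)) for h in lags]
grid = [k / 20 for k in range(10, 201)]             # candidate ranges 0.5 to 10
c0, c1, r, rel_ss = fit_exponential(lags, truth, grid)
print(round(c0, 4), round(c1, 4), r, rel_ss)
```

A relative sum of squares near zero, as in this noise-free check, corresponds to the near-perfect fits described for Table 3; a weighted variant would multiply each squared residual by a lag-specific weight.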
Empirical Case Studies: The Mercer-Hall Data and Pediatric Blood Lead Levels
in Syracuse
Two empirical data sets are investigated here. The first is the Mercer-Hall agricultural
wheat-yield field plot data. These data were collected for 500 plots that formed a regular
square 25-by-20 tessellation. Cressie (1991, p. 454) summarizes these data and their
geographic landscape features. The second is pediatric blood lead levels in the City of
Syracuse measured during the two-year period 4/1/92-3/31/94. This data set is comprised of 7,158 measurements (out of 15,277 cases for the County), some of which are
replicates, spread across 3,841 locations. Twelve children had measures of zero, and
for convenience were dropped from the analysis. Locational tags for these measures
were obtained by address matching with ARC/INFO using TIGER files for Onondaga
County, NY (with about an 80% success rate).
The Mercer-Hall Data
These data appear to be approximately normally distributed [P(Wilk-Shapiro statistic
< 0.98041) = 0.0858]. Besag (1974) fit a bi-parametric constant mean CAR model to
these data, using a normalizing approximation proposed by Whittle (1954), and obtained
the parameter estimates ρ1 = 0.368 and ρ2 = 0.107. Exact maximum likelihood estimation (using IMSL subroutine UMINF) yields the estimates ρ1 = 0.36445 and ρ2 =
0.11388, while using an extension of the Jacobian approximation outlined by Griffith and
Sone (1995) implemented in SAS yields the estimates ρ1 = 0.37634 and ρ2 = 0.12480.
Both of these latter solutions began iterating from the OLS values of ρ1 = 0.34045 and
ρ2 = 0.13746. A mapping of the two-dimensional correlogram corroborated the need
for an anisotropic spatial statistical model specification, one that is symmetric and bi-parametric but not in need of an axis rotation. Restricting estimation of the exponential
semi-variogram to 10 lags, and using weighted least squares and a model specification
version following Haining (1990, p. 285), rendered the parameter estimates

C0      C1      range #1   range #2   SSD/SST
0.350   0.617   3.37411    0.74487    0.36600

The semi-variogram plot for these data, upon which the exponential model predictions are
superimposed, is indicative of both a CAR and a bi-parametric model specification.
The Syracuse Pediatric Blood Lead Level Data
A log-transformed version of these data is closer to being approximately normally
distributed [P(Wilk-Shapiro statistic < 0.9895) < 0.01] than are the raw data themselves,
and they also deviate far less from the homogeneous variance case. Because of the
massive number of pairs of directional distances that can be computed for this dataset (i.e.,
14,753,281) attention has been restricted here to distances less than or equal to 0.2 km. A
two-dimensional correlogram for this range of distances indicates the need for a bi-parametric model specification involving an axis rotation. Results of such an exercise are
contained in Table 4. The rotation angle, θ, corroborates the need for an axis rotation in
this empirical case. The range estimates corroborate the need for a bi-parametric model
specification. Also, the SAR model furnishes a marginally better semi-variogram model
fit.

Table 4. Semi-variogram estimation for Syracuse blood lead levels.
Column headings: Model, C0, C1, range #1, range #2, relative SS, θ
(values for each model are listed in scan order; the column alignment did not survive)
CAR: 0.20396, 0.58339, 0.00192, 0.37078, 0.01154; θ = 0.04518π
SAR: 0.20832, 0.57789, 0.00070, 0.04549, 0.36326; θ = 0.03979π
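A directional semi-variogram with an axis rotation can be sketched as below. The form shown, an exponential model with geometric anisotropy in which the separation vector is rotated by θ onto the principal axes and each component is scaled by its own range, is an assumption for illustration; the specification actually used follows Haining (1990) and may differ in detail. All numeric values other than the rotation angle are hypothetical.

```python
import math

def anisotropic_exponential(dx, dy, c0, c1, range1, range2, theta):
    """Exponential semi-variogram with geometric anisotropy (assumed form).

    (dx, dy) is the separation vector; theta is the angle, in radians,
    between the x-axis and the first principal axis.
    """
    u = dx * math.cos(theta) + dy * math.sin(theta)    # along principal axis 1
    v = -dx * math.sin(theta) + dy * math.cos(theta)   # along principal axis 2
    scaled = math.hypot(u / range1, v / range2)
    if scaled == 0.0:
        return 0.0
    return c0 + c1 * (1.0 - math.exp(-scaled))

theta = 0.04518 * math.pi            # rotation angle of the size listed in Table 4
for dx, dy in ((0.05, 0.0), (0.0, 0.05), (0.05, 0.05)):
    # c0, c1, and the two ranges below are placeholders, not estimates.
    print(dx, dy, round(anisotropic_exponential(dx, dy, 0.2, 0.6, 0.10, 0.02, theta), 5))
```

When the two ranges are equal the rotation has no effect, so a clearly non-zero θ matters only alongside unequal directional ranges, which is why the two findings are read together.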
CONCLUSIONS
Additional numerical and empirical evidence is presented in this paper supporting the
establishment of links between the CAR and exponential semi-variogram models, and the
SAR and Bessel function semi-variogram models. This new evidence reveals that these
links persist in the presence of anisotropy. Moreover, these articulations are preserved
when directional bias is present in latent spatial autocorrelation, given directionally
symmetric spatial dependency.
REFERENCES
Besag, J. 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of
the Royal Statistical Society, 36: 192-225.
Christensen, R. 1991. Linear Models for Multivariate, Time Series, and Spatial Data. NY:
Springer-Verlag.
Cliff, A., and J. Ord. 1981. Spatial Processes. London: Pion.
Cressie, N. 1991. Statistics for Spatial Data. NY: Wiley.
Davis, J., and M. McCullagh (eds.). 1975. Display and Analysis of Spatial Data. NY:
Wiley.
Griffith, D. 1983. The boundary value problem in spatial statistical analysis. Journal of
Regional Science, 23: 377-387.
Griffith, D. 1992. Estimating missing values in spatial urban census data. The Operational
Geographer, 10 (2): 23-26.
Griffith, D. 1993. Advanced spatial statistics for analyzing and visualizing geo-referenced
data. International Journal of Geographical Information Systems, 7: 107-123.
Griffith, D., and F. Csillag. 1993. Exploring relationships between semi-variogram and
spatial autoregressive models. Papers in Regional Science, 72: 283-296.
Griffith, D., and A. Sone. 1995. Trade-offs associated with normalizing constant computational
simplifications for estimating spatial statistical models. Journal of Statistical
Computation and Simulation, 51: 165-183.
Haining, R. 1990. Spatial Data Analysis in the Social and Environmental Sciences.
Cambridge: Cambridge University Press.
Heuvelink, G. 1993. Error Propagation in Quantitative Spatial Modelling. Utrecht, The
Netherlands: Faculteit Ruimtelijke Wetenschappen Universiteit Utrecht.
Little, R., and D. Rubin. 1987. Statistical Analysis with Missing Data. NY: Wiley.
Martin, R. 1984. Exact maximum likelihood for incomplete data from a correlated
Gaussian process. Communications in Statistics: Theory and Methods, 13: 1275-1288.
Martin, R. 1990. Spatial statistical processes in geographic modelling, in Spatial Statistics:
Past, Present, and Future, edited by D. Griffith, pp. 109-127. Ann Arbor, MI: IMaGe.
Rathbun, S. 1994. Spatial modelling in irregularly shaped regions: kriging estuaries. Paper
presented at the annual joint statistical meetings, August 13-18, Toronto, Ontario,
Canada.
Rey, S., A. Getis, and A. Bortman. 1994. Spatial modeling approaches for the estimation
of suppressed geo-referenced data. Paper presented at the 41st North American meetings
of the Regional Science Association International, Niagara Falls, Canada, November 17-20.
Ripley, B. 1981. Spatial Statistics. NY: Wiley.
Whittle, P. 1954. On stationary processes in the plane. Biometrika, 41: 434-449.
BIOGRAPHICAL SKETCH
Dr. Daniel A. Griffith is currently Chair, Department of Geography at Syracuse University. He holds a Ph.D. in Geography from the University of Toronto (1978), an M.S. in
Statistics from The Pennsylvania State University (1985), an M.A. in Geography (1972)
and a B.S. in Mathematics, both from Indiana University of Pennsylvania.
Larry J. Layne is a Ph.D. candidate at the College of Environmental Sciences and
Forestry, S.U.N.Y. He holds an M.S. in Wildlife and Fisheries from South Dakota State
University (1988) and a B.S. in Wildlife from Humboldt State University (1983).
Philip G. Doyle is a senior in the Department of Geography at Syracuse University and
a G.I.S. Intern for Niagara Mohawk Power Corporation.