This file was created by scanning the printed publication. Errors identified by the software have been corrected; however, some errors may remain. Further Explorations of Relationships between Semi-Variogram and Spatial Autoregressive Models Daniel A. Griffith, Larry J. Layne, and Philip G. Doyle2 Abstract.-This paper continues the forging of a common foundation for geo- and spatial statistics. To date partial conceptual correspondence has been established between the conditional autoregressive (CAR) and exponential semi-variogram models, the simultaneous autoregressive (SAR) and Bessel hnction semi-variogram models, and the moving average (MA) and linear semi-variogram models, when directional bias is not present. The exploratory numerical work summarized in this paper addresses the issue of whether or not these articulations are preserved when directional bias is present in the latent spatial autocorrelation, given symmetric spatial dependency. In doing so, impacts of edge effects are examined. INTRODUCTION Generally speaking, spatial statistics is concerned with the statistical analysis of georeferenced (or locationally tagged) data, and seeks to exploit the redundant information represented by spatial autocorrelation latent in sum data. Two principal thrusts of spatial analysis have been (1) interpolation supporting visualization of surfaces, and (2) data analysis supporting statistical description and inference. One of the seminal visualization efforts was the famous SYMAP computer program, which sought to furnish low-resolution, coarse visualization of geographic surface representations (contour maps or perspective surfaces) based upon measurements made at a relatively small number of dispersed locations. Paralleling the development and dissemination of SYMAP was a h d y of work that focused on exploiting more localized spatial dependence latent in geo-referenced data, which has become known as the theory of regionalized variables, or geo-statistics. Matheron initiated this pioneering work in the 1960s, and Cressie (1991) recently published a state-of-the-art overview of geostatistics. Meanwhile, spatial statistics research, especially characterized by Cliff and Ord's (e.g., 198 1) work, focusedattention on the need to employ spatially adjusted linear statistical models. Thus, while both geo- and spatial ' ~ h i sresearch was supported by the National Science Foundation, research grant # SBR-9507855. '~rofessor,Research Associate, and undergraduate research assistant (through an NSF REU supplement grant), respectively, Department of Geography, Syracuse University, 144 Eggers Hall, Syracuse, NY 13244-1090. statistics support the visualization of geo-referenced data, spatial statistics goes beyond this task by enabling the drawing of sound statistical inferences from geo-referenced data. While Davis and McCullagh (1975) began forging a common foundation for these two subdisciplines, such a connection has continued to remain elusive. The prncipal objective of the research findings reported in this paper is to pursue this end. its practical importance partly arises within the U.S. E.P.A. Environmental Monitoring and Assessment Program (EMAP) project, a major national endeavor in which statistics is being appliedto environmental issues as well as monitoring and analysis undertakings, and partly in remediation work dealing with environmental contaminants. Concepts from both geo- and spatial statistics are necessary for successfbl engagement in these two scimtific pursuits. Articulation of a common foundation for geo- and spatial statistics will better enable consistent statistical application and analysis of the massiveamounts of geo-referenced environmental data becoming available. By doing so, geographic visualization and analysis will be able to more closely go hand in hand, and interpolated maps will be better able to be analyzed utilizing consistent spatial statistical model specifications. Communalities Between Geo- and Spatial Statistics Historically, traditional statistical analyses have been concerned with datareduction and summarization. In contrast, time series analysis frequently leads to forecasting, whereas spatial series analysis often leads to interpolation, two data explosion operations. Geo- and spatial statistics similarly fixate on this interpolation issue through the "missing data" problem. Martin (1984) has outlined the exzt maximum likelihood estimates of missing values for incomplete geographic data sets from a spatial statistics, autoregressive modelling perspective. The resulting equations, as demonstrated in Griffith (1993), are exactly those best linear unbiased estimates obtained with kriging (see Christensen, 1991), but in terms of the inverse covariance matrix rather than the covariance matrix itself. This link initially provided a conspicuous piece of evidence suggestingclose relationships between semi-variogram and spatial autoregressive models, and furnished the motivation for Griflith and Csillag (1993) to numerically explore this relationship. Algebraically speaking, then, estimation of missing data in spatial statistics (in order to avoid maps with holes) and kriging as an interpolatorlpredictor are exactly equivalent, which connects kriging to the E-M algorithm literature of statistics. Both numerical and graphical evidence exist to indicate that the spatial statistical conditional autoregressive (CAR) and geo-statistical exponential semi-variogram models are linked. Graphical evidence exists that couples the spatial statistical moving average (MA) and geo-statistical linear semi-variogram models. And, both numerical evidence and theoretical arguments signal a relationship between the spatial statistical simultaneous autoregressive ( S A R ) and the geo-statistical Bessel function (first order, second kind) semi-variogram models. The spectral density function of the SAR model may be written as: Christensen (1991) summarizes the status of thk model by noting that Whittle (1954) has shown that the Bessel function can arise naturally in two-dimensional covariations,and the exponential semi-variogram model is a special case of it (Ripley 1981). The relevant functional form, where K, is a modified Bessel function, is given by: When considering the appropriateness of this latter function as a descriptor of geographic covariation, numerical evidence (computed with Heuvelink's weighted least squares software) based on an isotropic dependency structure indicates that the two critical features leading to a Bessel function semi-variogram model specification appear to be intensity and degree of articulation of a geographic structure (see Table 1). Table 1. Bessel function semi-variogram fit of autoregressive models for geographic structures. Geographic Autoregressive model articulation conditional simultaneous CO c, range SSDISST CO c, range SSDISST square 0.350 0.617 2.56 0.007 0.005 0.994 8.27 0.000 --------0.041 ----- ----0.443 0.489 2.19 hexagon 0.011 0.988 queen 0.442 0.520 2.88 0.004 10.1 0.000 Of note is that the exponential semi-variogram model (which actually is a Bessel function of order %) furnishes a better description of the CAR cases than does the particular modified Bessel function estimated here. An SAR structure, which involves secondarder dependencies, is better characterized by the popular Bessel function than is the CAR structure, which involves only first-order dependencies. Furthermore, relying upon an analogy with chess, the spatial linkage structure for the queen's set of neighbors (those areal units sharing any common boundary point) is better characterized by this Bessel fbnction than is the rook's set of neighbors (only areal units sharing a common boundary of non-zero length). Moreover, as the spatial random field increases in scope (i.e., moving fiom a rook's to a queen's move dependency field), the SAR model more strongly relates to the Bessel function semi-variogram model. That is, the discrete setting more closely resembles the continuous setting. This tendency holds important implications 6 r the U.S. E.P.A. EMAP database currently being constructed, which is being geo-referenced in terms of a hexagonal tessellation (comprised of approximately 12,600 hexagons) superimposed upon the continental U.S. Numerical evidence presented in Table 1 was taken from exploratory computer experiments involving the following three general steps: calculation of theoretical spatial correlations at various lags, for a regular square tessellation, based upon spatial autoregressive model specifications Step 2: fitting of selected geo-statistical semi-variogram models to the two-dimensional correlograms obtained in Step 1 using non-linear least squares Step 3: graphical superimposition of the estimated weighted least squares curves on the correlograms obtained in Step 1 using Heuvelink's computer s o h a r e or customized SAS code. Step I : The findings are restricted to translations fiom spatial statistical to geo-statistical model specifications, spotlighting the three overwhelmingly most popular spatial autoregressive models (i.e., the MA, CAR and SAR models). Fortifying the Autoregressive-geo-statistical Linkages Research findings summarized here pertain to a fbrther articulationof a common foundation for geo- and spatial statistics. One motivation for pursuing this work is the need for establishing a common umbrella for a growing body of seemingly disparate work (e.g., Rey, Getis and Bortman, 1994; Griffith, 1992; Little and Rubin, 1987). The extension outlined here builds on results reported by Grifith (1993) and Griffith andcsillag (1993). Two principal themes have been examined. One concerns non-stationary, multi-parameter spatial autoregressive models. Martin (1990) notes that the use of even bi-parametric spatial autoregressive inverse-covariancematrix specificationsis extremely rare in applications, although well-established for very special cases. This avoidance of the two-parameter model in part is due to numerical difficulties affiliated with its estimation. The critical simplicity feature of a specification is that the pair of geographic weights matrices either are commutative or can exploit the matrix algebra notion of similarity. In contrast, specification and estimation of directional semi-variogram functions is quite common. The second theme concerns boundary, or edge, effects. Griffith (1983) addresses the general problem of edge effects in quantitative geographical analyses, underscoring their seriousness. Rathbun (1994) provides an excellent example of the type of distortion overlooked edge effects can create and showed that properly handling them allowed better interpolation of salinity in Chesapeake Bay. In addition, results reported in Griffith and Csillag (1993) exhibit an affinity between the magnitudeof prevailing spatial dependencies and the dominance of regional edge effects. This factor emerges baause semi-variogram models directly deal with the covariance matrix for a set of observations, whereas spatial autoregressive models deal with the inverse of this matrix, and the procedure of matrix inversion is what amplifies and propagates edge effects. To date these effats have defied the formulation of general correction techniques. Insight into this edge effects phenomenon is gained here by holding the degree of spatial dependence constant, and changingthe size of the region within which estimation is conducted (see Table 2). Table 2. Bessel function parameter estimates for queen's case, isotropic SAR specification, Maximum lag 5 Co CI 0.0093 0.906 Theoretically one would expect c, = 0 and c, range 9.35 = relative SS 0.000 1. Notice that as the region used for parameter estimation purposes increases in size, the fit of the Bessel function remains nearperfect, while the parameter estimates themselves change. The range of the spatial autocorrelation field increases to some upper limit and c, approaches 1. Somewhat surprising is that c, (the nugget effect) does not move toward zero, which may be indicative of a liability associated with using a non-linear model specification and/or weighted least squares estimation. Two-parameter Autoregressive Functions and Directional Semi-variogram Models Theoretical correlograms can be constructed for the bi-parametric spatial autoregressive model matched with a regular square tessellation for the rook's set of neighbors spatial linkages since the closed form spectral density hnction is known. As noted previously, specification and estimation of directional semi-variogram functions is common. The theoretical two-dimensional correlograms generated by these equations do not need to be rotated, as their major and minor axes coincide with the corresponding horizontal and vertical axes. Results of bi-parametric semi-variogram model fits are reported in Table 3. As has been found for the isotropic case, the exponential semi-variogrammodel relates almost perfectly to the spatial statistical CAR model case, and the Bessel function semi-variogram model relates almost perfectly to the spatial statistical SARmodel case. For all cases explored here, the relative residual sums of squares essentially equals zero. This indicates that the linkages already established between the geo-statistical and spatial statistical models persist in the presence of anisotropy. Differences in the computed directional ranges are consistent with differences between the directional autoregressive parameters p, and p,. Furthermore, an inspection of the change in semi-variogram parameter estimates as the size of the region under study increases reveals the presence of edge effects,as it did in the isotropic situation. These edge effects become more pronounced as the two levels of spatial autocorrelation increase. Table 3. Parameter estimates and relative sums of squares (error sums of squares/corrected total sums of squares, or SSEICTSS) for the CAR and SAR models for 4 different lattice sizes. Autoregressive model conditional simultaneous lattice size parameter 4 x 4 C, C1 bI b2 6x6 SSEICTSS C, CI b1 PI = 0.05 p,=0.175 pl=0.050 p, = 0.25 p, = 0.320 p, = 0.445 0,00000 0.13062 0.05397 1.00045 0.35341 0.76714 0.00328 0.00000 1.00020 0.35265 0.79249 1.52411 2.13413 0.00338 0.18844 0.76710 I.80853 0.89910 0.95340 3.08842 0.00328 0.09685 0.87825 1.07463 pl=0.05 p, = 0.25 0.02864 0.97118 0.35638 0.84330 0.00225 0.02906 0.97090 0.35662 p,=0.175 p,=0.050 p, = 0.320 p, = 0.445 0.00922 0.00645 0.93289 0.92101 3.84913 1.93552 5.25147 6.14246 0.00013 0.00059 0.01169 0.01215 0.96549 0.96173 4.00823 2.07320 8x8 b2 SSEICTSS C, c1 b, b2 SSEICTSS 10 x 10 C, cI b, b2 SSEICTSS 0.00298 0.00594 0.00804 1 0.00193 0.00003 0.00027 Theoretically the parameter estimates for C, and C, theoretically should equal 0 and 1, respectively. These are approximately the values estimated involving low levels of spatial autocorrelation, for both the exponential and Bessel function semi-variogram models. A tendency for C,to move toward 1 is apparent for the intermediate and high levels of spatial autocorrelation in both model cases. Likewise a movement toward 0 is detectable for C, in the exponential case but not in the Bessel function case. Empirical Case Studies: The Mercer-Hall Data and Pediatric Blood Lead Levels in Syracuse Two empirical data sets are investigated here. The first is the Mercer-Hall agricultural wheat-yield field plot data. These data were collected for 500 plots that formed a regular square 25-by-20 tessellation. Cressie (199 1, p . 454) summarizes these data and their geographic landscape features. The second is pediatric blood lead levels in the City of Syracuse measured during the two-year period 411192-313 1/94. This data set is comprised of 7,158 measurements (out of 15,277 cases for the County), some of which are replicates, spread across 3,841 locations. Twelve children had measures of zero, and for convenience were dropped from the analysis. Locational tags for these measures were obtained by address matching with ARCIINFO using TIGER files for Onondaga County, NY (with about an 80% success rate). The Mercer-Hall Data These data appear to be approximately normally distributed [P(Wilk-Shapiro statistic < 0.98041) = 0.08581. Besag (1974) fit a bi-parametric constant mean CAR model to these data, using a normalizing approximation proposed by Whittle (1954), and obtained the parameter estimates p, = 0.368 and p, = 0.107. Exact maximum likelihood estimation (using IMSL subroutine UMINF) yields the estimates p , = 0.36445 and p , = 0.1 1388, while using an extension of the Jacobian approximation outlined by Griffith and Sone (1995) implemented in SAS yields the estimates p, = 0.37634 and p , = 0.12480. Both of these latter solutions began iterating from the OLS values of p , = 0.34045 and p, = 0.13746. A mapping of the two-dimensional correlogram corroborated the need for an anisotropic spatial statistical model specification, one that is symmetric and biparametric but not in need of an axis rotation. Restricting estimation of the exponential semi-variogram to 10 lags, and using weighted least squares and a model specification version following Haining (1990, p. 285), rendered the parameter estimates CO 0.350 CB 0.617 range #1 3.37411 range # 2 0.74487 SSD/SST 0.36600 The semi-variogram plot for these data, upon which the exponential model predictiom are superimposed, is indicative of both a CAR and a bi-parametric model specification. The Syracuse Pediatric Blood Lead Level Data A log-transformed version of these data is closer to being approximately normally distributed [P(Wilk-Shapiro statistic < 0.9895) < 0.011 than are the raw data themselves, and they also deviate far less fiom the homogeneous variance case. Because of the massive number of pairs of directional distances that can be computed for this dataset (i.e., 14,753,281) attention has been restricted here to distances less han or equal to 0.2 krn. A two-dimensional correlogram for this range of distances indicates the need for a bi-parametric model specification involving an axis rotation. Results of such an exercise are contained in Table 4. The rotation angle, 8, corroborates the need for an axis rotation in Table 4. Semi-variogram estimation for Syracuse blood lead levels. CO C, range #1 range #2 relative SS Model 0.20396 0.58339 0.00192 0.37078 CAR 0.01 154 0.20832 0.57789 0.00070 SAR 0.04549 0.36326 8 0.04518 x 0.03979 n; this empirical case. The range estimates corroborate the need for a bi-parametric model specification. Also, the SAR model furnishes a marginally better semi-variogram model fit. CONCLUSIONS Additional numerical and empirical evidence is presented in this paper supporting the establishment of links between the CAR and exponential semi-variogam models, and the SAR and Bessel function semi-variogram models. This new evidence reveals that these links persist in the presence of anisotropy. Moreover, these articulations are preserved when directional bias is present in latent spatial autocorrelation, given directionally symmetric spatial dependency. REFERENCES Besag, J. 1974. Spatial interaction and the statistical analysis of lattke systems. Journal of the Royal Statistical Society, 36: 192-225. Christensen, R. 199 1. Linear Models for Multivariate, Time Series, and Spatial Data. NY: Springer-Verlag. Cliff, A., and J. Ord. 1981. Spatial Processes. London: Pion. Cressie, N. 1991. Statistics for Spatial Data. NY: Wiley. Davis, J., and M. McCullagh (eds.). 1975. Display and Analysis of Spatial Data. NY: Wiley . Griffith, D. 1983. The boundary value problem in spatial statistical analysis. Journal of Regional Science, 23:3 77-387. Griffith, D. 1992. Estimating missing values in spatial urban census data. Tie Operational Geographer, 10 (2): 23-26. Griffith, D. 1993. Advanced spatial statistics for analyzing and visualizinggeo-referenced Data. International Journal of Geographical Information Systems, 7: 107-123. Griffith, D. and F. Csillag. 1993. Exploring relationships between semi-variogram and spatial autoregressive models. Papers in Regional Science, 72: 283-296. Griffith, D., and A. Sone. 1995 Trade-offs associated with normalizing constant complta tional simplifications for estimating spatial statistical models. Journal of Statistical Computation and Simulation, 5 1: 165-183. Haining, R. 1990. Spatial Data Analysis in the Social and Environmental Sciences. Cambridge: Cambridge University Press. Heuvelink, G. 1993. Error Propagation in Quantitative Spatial Modelling. Utrecht, The Netherlands: Faculteit Ruimtelijke Wetenschappen Universiteit Utrecht. Little, R., and D. Rubin. 1987. Statistical Analysis with Missing Data. NY: Wiley. Martin, R. 1984. Exact maximum likelihood for incomplete data fiom a correlated Gaussian process. Communications in Statistics: Theov and Methods, 13 : 1275-1288. Martin, R. 1990. Spatial statistical processes in geographic modellingin Spatial Statistics: Past, Present, and Future, edited by D. Griffith, pp. 109-127. Ann Arbor, MI: IMaGe. Rathbun, S. 1994. Spatial modelling in irregularly shaped regions: kriging estuaries. paper presented at the annual joint statistical meetings,August 13-18, Toronto, Ontario, Canada. Rey, S., A. Getis, and A. Bcrtman. 1994. Spatial modeling approaches for the estimation of suppressed geo-referenced data. paper presentedat the 4 1st North American meetings of the Regional Science Association International, Niagara Falls, Canada, November 1720. Ripley, B. 1981. Spatial Statistics. NY: Wiley. Whittle, P. 1954. On stationary process in the plane. Biometrika, 41: 434-449. BIOGRAPHICAL SKETCH Dr. Daniel A. Grifith is currently Chair, Department of Geogramy at Syracuse University. He holds a Ph.D. in Geography from the University of Toronto (1978), an M.S. in Statistics from The Pennsylvania State University (1985), an M.A. in Geography (1972) and a B.S. in Mathematics, both fiom Indiana University of Pennsylvania. Lany J. Layne is a Ph.D. candidate at the College of Environmental Sciences and Forestry, S.U.N.Y. He holds an M.S. in Wildlife and Fisheries from South Dakota State University (1988) and a B.S. in Wildlife from Humboldt State University (1983). Philip G. Doyle is a senior in the Department of Geogramy at Syracuse University and a G.I.S. Intern for Niagara Mohawk Power Corporation.