Lecture #1: a general overview of spatial autocorrelation in georeferenced data
Spatial statistics in practice
Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden

ADMINISTRATION
Who knows what linear regression is?
Who knows what nonlinear regression is?
Who knows what generalized linear modelling is?
Who knows what spatial autocorrelation is?
Does it mean redundant information?
Does it mean variables missing from a model?
Does it mean spatial dependency?
Does it mean a nuisance parameter is present?

Topics for today’s lecture
• The nature of data and its information content.
• What is spatial autocorrelation?
• Visualizing spatial autocorrelation: Moran scatterplots, semivariogram plots, and maps.
• Defining and articulating spatial structure: topology and distance perspectives; contagion and hierarchy concepts.
• Necessary concepts from multivariate statistics.
• An example of the elusive negative spatial autocorrelation.
• Some comments about spatial sampling.
• Implications about space-time data structure.

[Figure: High Peak district biomass index (ratio of remotely sensed data spectral bands B3 and B4). Panels: spatially autocorrelated; geographically random.]

The permutation perspective

The magic box is a physical model of spatial autocorrelation

Slider puzzles

The nature of data and its information content
• Data – symbols (e.g., numbers), signs, impulses, and the like, which in and of themselves have no inherent meaning
• Information – [from inform (to tell facts)] data to which meaning has been attached, converting numbers, for example, to messages (i.e., facts) about circumstances, situations, events, and so on
• Evidence – information analytically converted to knowledge about whether or not some proposition is true or valid

Berry’s geographic matrix
[Figure: three-way data matrix with axes location (areal unit 1, areal unit 2, …, areal unit n), attributes (Variable 1, Variable 2, …, Variable P), and time. Annotations: geographic associations; geographic distribution; geographic fact.]

The nature of georeferenced data and its information content
• Geocoded data – data that are tagged to specific points on a two-dimensional surface (e.g., latitude and longitude for the Earth)
• Geographic information – data to which meaning has been attached in part through their absolute and/or relative geographic contexts
• Geographic evidence – information analytically converted to knowledge in part by accounting for the presence of latent spatial autocorrelation

What is needed for spatial statistics:
1. An attribute data set
2. A map
3. Locational tags linking the attribute data set to the map
4. A topological structure matrix depicting the arrangements of locations on a map

Spatial autocorrelation can be interpreted in different ways
• As a spatial process mechanism – spatial diffusion
• As a diagnostic tool – the Cliff-Ord Eire example (the model specification should be nonlinear)
• As a nuisance parameter – eliminating spatial dependency to avoid statistical complications
• As a spatial spillover effect – georeferencing of pediatric lead poisoning cases in Syracuse, NY
• As an outcome of areal unit demarcation – the modifiable areal unit problem (MAUP)
• As redundant information – spatial sampling; map interpolation
• As map pattern – spatial filtering (to be discussed in this course)
• As a missing variables indicator/surrogate – a possible implication of spatial filtering
• As self-correlation – what is discussed next

Defining spatial autocorrelation
Auto: self
Correlation: degree of relative correspondence
Positive: similar values cluster together on a map
Negative: dissimilar values cluster together on a map

USA: understanding spatial autocorrelation
[Figure panels: POLLUTION MONITORING; HOUSEHOLD SAMPLING; SATELLITE IMAGE; AGRICULTURAL EXPERIMENT]

The SASIM game
http://www.nku.edu/~longa/cgi-bin/cgi-tcl-examples/generic/SA/SA.cgi

The nature of data and its information content: analysis tools
• Scatterplot: a two-dimensional visual portrayal of the relationship between two variables that aids in interpreting a correlation coefficient or linear regression trend line
• Correlation: a global summary measure (ranging between -1 and 1) that indexes the degree to which a regression trend line characterizes its corresponding scatterplot
• Linear regression: a technique used to determine the straight line trend exhibited by a scatterplot

Graphic examples (regression trend line in blue)
r = 1: perfect positive
r = 0.95: marked positive
r = 0.72: strong positive
r = 0.51: moderate positive
r = 0.26: weak positive
r = 0.01: trace positive
r = -0.71: strong negative
r = -1: perfect negative

Describing a scatterplot trend
positive relationship: High Y with High X & Medium Y with Medium X & Low Y with Low X
negative relationship: High Y with Low X & Medium Y with Medium X & Low Y with High X

Georeferenced data scatterplots
• The horizontal axis is the measurement scale for some attribute variable
• The vertical axis is the measurement scale for neighboring values (topological distance-based) of the same attribute variable
OR
• The horizontal axis is (usually) Euclidean distance between geocoded locations
• The vertical axis is the measurement scale for geographic variability

Description of the Moran scatterplot
Positive spatial autocorrelation (example: 2002 population density)
- high values tend to be surrounded by nearby high values
- intermediate values tend to be surrounded by nearby intermediate values
- low values tend to be surrounded by nearby low values
MC = 0.49, GR = 0.58

Description of the Moran scatterplot
Negative spatial autocorrelation (example: competition for space)
- high values tend to be surrounded by nearby low values
- intermediate values tend to be surrounded by nearby intermediate values
- low values tend to be surrounded by nearby high values
MC = -0.16, sMC = 0.075, GR = 1.04

Tasseled cap SBI (soil brightness index) data for the High Peak District
$$\mathrm{SBI} = 0.332\,B_1 + 0.331\,B_2 + 0.552\,B_3 + 0.452\,B_4 + 0.481\,B_5 + 0.252\,B_6$$
$$\mathrm{TSBI} = \left[\mathrm{SBI} - 79 + 168\,(r_{\mathrm{SBI}}/900)^{3}\right]^{0.6}$$

DEM data for Puerto Rico
[Figure panels: elevation mean by municipio; elevation standard deviation by municipio]

Population density across the Cusco Department of Peru, by district
[Figure panels: raw population densities; LN+ population densities]

As distribution across a polluted geographic landscape
[Figure: point maps of sampled locations; second slide titled "Murray f(As) - continued"]

Spatial autocorrelation: from r to MC
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})/n}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}/n}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}/n}}$$
$$MC = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}(y_i-\bar{y})(y_j-\bar{y})\Big/\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}/n}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}/n}}$$

Measures of spatial autocorrelation
MC: Moran Coefficient; GR: Geary Ratio; semivariogram
$$MC = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}}\;\frac{\sum_{i=1}^{n}(y_i-\bar{y})\sum_{j=1}^{n} c_{ij}(y_j-\bar{y})}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}$$
range: -1, -1/(n-1), 1
$$GR = \frac{n-1}{2\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}}\;\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}(y_i-y_j)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}$$
range: 0, 1, 2
$$\gamma(d_k) = \frac{\sum_{i=1}^{n_k}(y_i-y_j)^{2}}{2\,n_k} = f(d_k),\quad k = 1, 2, \ldots, K$$
Semivariogram models for f: spherical; exponential; Bessel function (1st order, 2nd kind)

Graphical portrayals of spatial autocorrelation latent in the transformed As data

Adair County, MO, 1990 population density.
Upper left: geographic connectivity matrix. Upper right: geographic distribution of density. Lower left: Moran scatterplot. Lower right: semivariogram plot & Bessel function.

Salem County, VA, 1990 population density.
Upper left: geographic connectivity matrix. Upper right: geographic distribution of density. Lower left: Moran scatterplot. Lower right: semivariogram plot & wave-hole function.

Tessellations (i.e., surface partitionings)
[Figure panels: square (rook); hexagon; square (queen). Spatial autocorrelation field extents: 1st order and 2nd order, for rook, hexagon, and queen.]

Geographic structure from surface partitioning topology
• Create an n-by-n table whose rows and columns are labeled with the same sequence of n locations
• In response to the question “Is the location labeling a row adjacent to the location labeling a column?”, code the (row, col) table cell 0 if the answer is “No” and 1 if the answer is “Yes.” This is matrix C.
• Divide each cell entry of matrix C by its corresponding row sum to create matrix W.
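The construction of matrices C and W described in the bullets, together with the MC and GR formulas from the measures slide, can be sketched in a few lines. This is a minimal illustration rather than the course's own software: the function names, the four-unit rook adjacency, and the attribute values are hypothetical, and the sketch assumes every location has at least one neighbor.

```python
import numpy as np

def connectivity_matrices(adjacency):
    """Build binary matrix C and row-standardized matrix W.

    `adjacency` maps each location label to the labels adjacent to it
    (hypothetical input format chosen for this sketch).
    """
    labels = sorted(adjacency)
    index = {label: k for k, label in enumerate(labels)}
    n = len(labels)
    C = np.zeros((n, n))
    for row_label, neighbors in adjacency.items():
        for col_label in neighbors:
            C[index[row_label], index[col_label]] = 1.0  # 1 = "Yes, adjacent"
    # Divide each cell by its row sum (assumes no location is isolated).
    W = C / C.sum(axis=1, keepdims=True)
    return labels, C, W

def moran_coefficient(y, C):
    """MC = (n / sum c_ij) * sum_i z_i sum_j c_ij z_j / sum_i z_i^2."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    return (y.size / C.sum()) * (z @ C @ z) / (z @ z)

def geary_ratio(y, C):
    """GR = ((n-1) / (2 sum c_ij)) * sum c_ij (y_i - y_j)^2 / sum_i z_i^2."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    squared_diffs = (y[:, None] - y[None, :]) ** 2
    return ((y.size - 1) / (2.0 * C.sum())) * (C * squared_diffs).sum() / (z @ z)

# Hypothetical 2-by-2 layout of four areal units with rook adjacency:
#   A B
#   C D
adjacency = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
labels, C, W = connectivity_matrices(adjacency)

# Checkerboard-like values: every high value is adjacent only to low values.
y = [10.0, 1.0, 1.0, 10.0]
mc = moran_coefficient(y, C)
gr = geary_ratio(y, C)
print(labels, mc, gr)
```

With this alternating pattern MC falls at the negative end of its range and GR exceeds 1, which is the signature of negative spatial autocorrelation described earlier.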
The binary and row-standardized geographic connectivity matrices
[Figure: surface partitioning into four areal units A, B, C, D, with links denoting adjacency]

matrix C (binary geographic connectivity matrix)
     A    B    C    D
A    0    1    1    0
B    1    0    0    1
C    1    0    0    1
D    0    1    1    0

matrix W (row-standardized geographic weights matrix)
     A    B    C    D
A    0    0.5  0.5  0
B    0.5  0    0    0.5
C    0.5  0    0    0.5
D    0    0.5  0.5  0

Geographic structure from spatial flows
[Figure panels: geographic hierarchy; two-dimensional manifestation of geographic hierarchy]
$$c_{ij} = I_{ij} = \kappa\,P_i^{\alpha}\,P_j^{\beta}\,e^{-\gamma d_{ij}}$$
$$w_{ij} = \frac{P_j^{\beta}\,e^{-\gamma d_{ij}}}{\sum_{j=1}^{n} P_j^{\beta}\,e^{-\gamma d_{ij}}}$$

Geographic structure from interpoint distances
[Figure panels: distance lags; anisotropy]
The semivariogram plot is a scatterplot of average squared paired comparisons versus average interpoint distance for grouped georeferenced data

Quadrants of a Moran scatterplot
[Figure: Moran scatterplot with horizontal axis z and vertical axis CZ, divided at 0 into quadrants Q1 to Q4]
Q1 (values [+], sum of neighboring values [+]): H-H
Q3 (values [-], sum of surrounding values [-]): L-L
Q1 and Q3 are locations of positive spatial association (“I’m similar to my neighbors”).
Q2 (values [+], sum of neighboring values [-]): H-L
Q4 (values [-], sum of neighboring values [+]): L-H
Q2 and Q4 are locations of negative spatial association (“I’m different from my neighbors”).

Syracuse population density
[Map: location of Syracuse, NY, relative to ON, VT, MA, CT, PA, and NJ. Legend: population density (Very High, High, Medium, Low, Very Low); Interstate (I-81, I-90, I-481, I-690); State Boundary; Canadian Border.]
MC = 0.70, GR = 0.29

Moran scatterplots
Panel variables: population density; % widowed; % male; % with univ. degree; black/white ratio
Values, in slide order: MC = 0.28, GR = 0.68; MC = 0.64, GR = 0.37; MC = 0.70, GR = 0.29; MC = 0.41, GR = 0.65; MC = 0.23, GR = 0.82; MC = 0.48, GR = 0.63

Houston population density
MC = 0.53, GR = 0.43

Moran scatterplots
Panel variables: population density; % widowed; % male; % with univ. degree; black/white ratio
Values, in slide order: MC = 0.53, GR = 0.43; MC = 0.64, GR = 0.37; MC = 0.36, GR = 0.64; MC = 0.53, GR = 0.46; MC = 0.57, GR = 0.43; MC = 0.71, GR = 0.29

What is the spatial autocorrelation here?
Variable: births/deaths (panels: raw; transformed)
MC = 0.65, GR = 0.32

What is the spatial autocorrelation here?
Variable: % 100+ years old (panels: raw; transformed)
MC = 0.42, GR = 0.56

What is the spatial autocorrelation here?
Variable: births/females 15-44 (panels: raw; transformed)
MC = 0.63, GR = 0.28

Geostatistical terms
• Valid semivariogram model: resulting covariation matrix is positive-definite
• (Effective) range: the distance at which spatial autocorrelation becomes (effectively) zero
• Sill: the covariation that a semivariogram tends to when distance becomes very large
• Nugget: a discontinuity at distance = 0
• Isotropy: spatial dependency changes only with inter-point distance, not direction
• Anisotropy: spatial dependency changes with inter-point distance and direction

Semivariogram modeling for population density across China
5,659,641 distance pairs
Too much global structure!!!
30% accounted for by a linear gradient; + 10% accounted for by a quadratic gradient.

Semivariogram model descriptions: poor trend line fits!

Semivariogram model descriptions after geographic detrending: good description

Redundant information
• Much analysis is devoted to analyzing redundant (i.e., duplicate) information
• Conventional analysis: the degree to which a scatter of points, for example, aligns along a straight line (multicollinearity)
• Spatial statistical analysis: the degree to which attribute values in one location replicate the information content of attribute values in neighboring locations

Review: multivariate analysis
http://obelia.jde.aca.mmu.ac.uk/multivar/intro.htm
• Multivariate regression & general linear model
• N-way ANOVA
• Principal Components & Factor Analysis
• Cluster analysis
• Generalized linear model

Multivariate analysis: assumptions
1. Linearity
2. Constant variance
3. Normality (bell-shaped curve)
4. Independence (e.g., zero spatial autocorrelation)
5. Random sampling (design based)/stochastic model (model based) error
6. No measurement error
7. No multicollinearity

ANOVA
Goal: divide observations into groups, and assess the within- and between-groups variation.
Use: evaluating differences of group means
Accompanying tests for homogeneity of variance:
Levene – if attribute is continuous
Bartlett – if attribute is normally distributed within groups

Regression analysis
• The workhorse of conventional statistics: almost any classical statistical technique can be expressed as a linear regression problem
• $Y = X\beta + \varepsilon$, with most assumptions attached to the error term, most notably the Gaussian distribution
• Parameter estimates are often obtained by OLS or MLE
• Has been extended to nonlinear and generalized linear model specifications

Nonlinear & generalized linear models
• Nonlinear regression often employs a normally distributed error term; its parameters usually are estimated with MLE
• Generalized linear models refer, in part, to Poisson and Binomial regression models; their parameters often are estimated with weighted least squares

Some guidelines
• General interval/ratio data – linear regression
• Counts data – Poisson regression
• Percentage/binary data – binomial/logistic regression
• Quick approximations for counts/percentage data – transformed response for linear regression
• Complicated multiplicative specifications – nonlinear regression
• Rankings – nonparametric techniques

Principal components analysis (PCA)
• A critical concept in PCA is the eigenfunction
• Orthogonal components of a matrix are calculated, and then used to determine uncorrelated linear combinations of original variables
$$\det(R - \lambda I) = 0$$
$$(R - \lambda I)E = 0$$
• λ are the eigenvalues
• E are the eigenvectors

Theoretical eigenfunction properties
P1: all of the eigenvalues and eigenvectors of a real symmetric matrix consist of real (rather than complex or imaginary) numbers
P2: the eigenvalues of a matrix and its transpose are the same
P3: the sum of the eigenvalues of a matrix equals the sum of that matrix's principal diagonal elements (i.e., its trace)
P4: if some constant b is added to each element in the diagonal of a matrix, then b is added to each of the eigenvalues of that matrix, but the matrix's eigenvectors remain unchanged
P5: if a matrix is multiplied by some scalar constant, then its eigenvalues also are multiplied by this constant, but its eigenvectors remain unchanged
P6: the product of a matrix's eigenvalues equals the value of that matrix's determinant
P7: the eigenvalues of a matrix and its inverse are inverses of each other, while the eigenvectors are the same
P8: if a matrix is powered by some positive integer value, each of its eigenvalues is powered by this same positive integer value, but its eigenvectors remain unchanged

Eigenfunction properties - continued
P9: for a symmetric matrix, two eigenvectors associated with two distinct eigenvalues are mutually orthogonal (i.e., $E_h^{T}E_k = 0$, $h \neq k$)
P10: for a real symmetric matrix, the transpose of the eigenvector matrix extracted from it equals the inverse of this eigenvector matrix (i.e., $E^{T} = E^{-1}$)
P11: the eigenvalues of a triangular or diagonal matrix are the elements in its principal diagonal
P12: the principal (i.e., largest or dominant) eigenvalue of a matrix is contained in that interval defined by the largest and smallest row sums for this matrix, where these sums are of the absolute values of the row cell entries
P13: the principal eigenvector of a nonnegative, symmetric matrix has all nonnegative values
P14: the principal eigenvalue of a matrix is positive, and no other eigenvalue of this matrix is greater in absolute value
P15: the sum of the squared eigenvalues of a matrix is less than or equal to the sum of all of the elements of this matrix, where these sums are of the absolute values of the cell entries, with exact equality achieved for binary matrices

Cluster analysis
Goal: to uncover latent natural groups of areal units
Popular methods:
• single linkage
• complete linkage
• average linkage
• Ward’s algorithm (ANOVA-based)
• centroid
Spatial analysis usually wants to include a contiguity constraint.

A surprise!
Hidden negative SA
Map legend: dark red, very high; light red, high; gray, medium; light green, low; dark green, very low.
Standard spatial statistical model specifications fail to account for all of the positive SA displayed in the China population density map because of hidden negative SA!
[Figure panels: pop density; SF residuals; SAR residuals; map of hidden negative SA]

SPATIAL SAMPLING
• Increasing domain sampling
• Infill sampling

Increasing domain sampling design

Infill sampling design to estimate a single regional mean: Murray, Utah
The two extreme hexagonal tessellations employed in the sampling design experiment: left, n = 104; right, n = 2008

The Syracuse, NY, pediatric lead poisoning study
The two extreme hexagonal tessellations employed in the sampling design experiment: left, n = 70; right, n = 2001
[Figure: point maps of sampled locations]

Effective sample size: the number of equivalent iid observations
$$E(\hat{\sigma}^{2}_{\bar{y}}) = \sigma^{2}_{\varepsilon}\,\mathbf{1}^{T}V^{-1}\mathbf{1}/n^{2}$$
$$V = I:\quad E(\hat{\sigma}^{2}_{\bar{y}}) = \sigma^{2}_{\varepsilon}/n$$
$$\text{SAR}:\quad V = (I - \rho W)^{T}(I - \rho W)$$
$$E(\hat{\sigma}^{2}_{y}) = \sigma^{2}_{\varepsilon}\,\frac{TR(V^{-1})}{n}$$
$$n^{*} = \frac{TR(V^{-1})}{\mathbf{1}^{T}V^{-1}\mathbf{1}}\;n$$
Here $TR(V^{-1})/n$ plays the role of a variance inflation factor (VIF).

A general formula via the SAR model
$$n^{*} = n\left[1 - \frac{1}{1 - e^{-1.92369}}\;\frac{n-1}{n}\left(1 - e^{-2.12373\,\hat{\rho} + 0.20024\sqrt{\hat{\rho}}}\right)\right]$$
$\hat{\rho} = 0 \Rightarrow n^{*} = n$; $\hat{\rho} = 1 \Rightarrow n^{*} = 1$

A general formula via semivariogram models
[Equations: n* expressed in terms of n and r/d_max for the spherical, exponential, and K-Bessel (1st order, 2nd kind) models. Constants shown on the slide, in order of appearance: 251.5132 with exponent 1.9324; 51.4879 with exponent 1.7576; 69.6698 with exponent 1.8601.]
n* is a function of the range/range parameter of a semivariogram model, as well as the model itself.

A general formula via spatial filter models
Based on the variance inflation factor notion, where $R^{2}$ is from regressing Y on a spatial filter [i.e., a linear combination of eigenvectors extracted from $(I - \mathbf{1}\mathbf{1}^{T}/n)\,C\,(I - \mathbf{1}\mathbf{1}^{T}/n)$]:
$$\sigma^{2} = \frac{\sigma^{2}_{\varepsilon}}{1 - R^{2}}$$
$$E(\hat{\sigma}^{2}_{\bar{y}}) = \frac{\sigma^{2}}{n} = \frac{\sigma^{2}_{\varepsilon}}{(1 - R^{2})\,n}$$
$$n^{*} = (1 - R^{2})\,n$$

Sample size, n, determination
[Equations: required n under the SAR, Bessel function, and SF (spatial filter) specifications. Each combines the term $(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}_{e^{*}}/\Delta^{2}$, annotated "power" and "precision", with the corresponding effective sample size adjustment: the constants 2.12373, 0.20024, and 1.92349 with $\hat{\rho}$ for SAR; $C_0$, $C_1$, b, and $r/d_{max}$ for the Bessel function; $1 - R^{2}$ for SF.]

A peek at space-time autocorrelation
$$Y_t = f(WY_{t-1})$$
[Figure: location i at time t, with labels U and V]
Space-time treatments are beyond the scope of this course

Some observations:
1. The temporal component tends to be the stronger one [a one-directional (i.e., single), one-dimensional influence]
2. As the average number of neighbors increases, maximum spatial autocorrelation tends to decrease.
3. Analysis of a single map implies that spatial effects occur instantaneously.
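
Returning to the effective sample size slides, the matrix expression n* = n TR(V^-1) / (1^T V^-1 1) and the spatial filter shortcut n* = (1 - R^2) n can be checked numerically. The sketch below is illustrative only: the function names and the four-unit weights matrix are hypothetical, and it assumes the SAR form V = (I - rho W)^T (I - rho W), with the transpose placement inferred from the SAR covariance structure rather than read directly from the slide.

```python
import numpy as np

def effective_sample_size_sar(W, rho):
    """n* = n * TR(V^-1) / (1' V^-1 1), assuming V = (I - rho W)'(I - rho W)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    A = np.eye(n) - rho * W
    V_inv = np.linalg.inv(A.T @ A)   # assumed SAR structure (see lead-in)
    ones = np.ones(n)
    return n * np.trace(V_inv) / (ones @ V_inv @ ones)

def effective_sample_size_sf(n, r_squared):
    """Spatial filter version: n* = (1 - R^2) n."""
    return (1.0 - r_squared) * n

# Hypothetical row-standardized W for four areal units in a 2-by-2 rook layout.
W = 0.5 * np.array([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]], dtype=float)

n_star_independent = effective_sample_size_sar(W, 0.0)  # no autocorrelation
n_star_dependent = effective_sample_size_sar(W, 0.5)    # positive autocorrelation
n_star_filter = effective_sample_size_sf(100, 0.25)
print(n_star_independent, n_star_dependent, n_star_filter)
```

With rho = 0 the function returns n, matching the V = I case; positive rho shrinks n* toward 1, in line with the limiting cases stated for the general SAR formula.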