Lecture #2: quantitative regionalization and cluster detection, with special reference to local statistics
Spatial statistics in practice
Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden

Topics for today’s lecture
• Multivariate grouping, and location-allocation modeling.
• Going from the global to the local: variability and heterogeneity.
• Impacts of spatial autocorrelation on histograms.
• The LISA and Getis-Ord statistics.
• Cluster analysis: multivariate analysis, cluster detection, and spider diagrams.
  – An overview of geographic and space-time clusters.
• Regression diagnostics and geographic clusters

Multivariate grouping goals
• If groups are unknown, to identify the latent natural groups of areal units
• If groups are known, to assess similarities and differences among the groups
• To determine the group centroids and groups of geographical points that result from minimizing some function of standard distance

Conventional cluster analysis distances to minimize
• Single linkage – distances are measured between pairs of closest (nearest neighbor) areal units, one from each of two clusters, in attribute space
• Complete linkage – distances are measured between pairs of most distant (furthest neighbor) areal units, one from each of two clusters, in attribute space
  – This criterion often gives the best grouping results
• Average linkage – distances are measured between all possible pairs of areal units, one from each of two clusters, in attribute space, and then averaged
• Centroid method – squared distances are measured between each areal unit and all cluster means, in attribute space
• Ward’s algorithm – based upon ANOVA, areal units are allocated to clusters in order to minimize within cluster variances, and maximize between cluster variance
  – This criterion relates to location-allocation

Contemporary cluster analysis criteria
• One- or two-stage density – areal unit groupings are based upon nonparametric
probability density estimation (kth nearest neighbor, uniform kernel, Wong’s hybrid); utilizes single linkage
• EML (equal variance maximum likelihood) – areal unit groupings are based upon maximizing the likelihood of mixtures of identical spherical multivariate normal distributions, possibly with unequal mixing proportions (i.e., sampling probabilities)
• Flexible-beta – areal unit groupings are based upon a weighting involving scalar beta, which usually falls between 0 and -1 (a common default value is -0.25, with -0.5 appearing to be more suitable for data with many outliers)
• McQuitty’s method – areal unit groupings are based upon weighted average linkage, the weighted pair-group arithmetic averages
• Gower's median method – areal unit groupings are based upon weighted pair-group centroids, where distance may or may not be squared

Ward’s algorithm and location-allocation
LA:
$$MIN: \sum_{j=1}^{P}\sum_{i=1}^{n} \lambda_{ij}\, w_i \sqrt{(u_i - U_j)^2 + (v_i - V_j)^2}$$
where
• $w_i$ is the weight for location i
• $(u_i, v_i)$ is the coordinate for location i
• $(U_j, V_j)$ is the coordinate for centroid j
• $\lambda_{ij}$ = 0/1, depending upon whether or not location i is allocated to cluster j
Ward's algorithm: cluster the u and v variables (standard distance criterion) with a _FREQ_ variable (weights)

Clustering with PCA/FA
• Although PCA & FA are used most frequently to deal with multicollinearity across attribute variables (R-mode), these techniques also can be used to handle redundant information across areal units (Q-mode; e.g., the eigenfunctions of geographic weight matrix C)
• Linear combinations extracted from matrix $(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)\mathbf{C}(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)$ or $(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)\mathbf{D}^{*}(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)$ identify the range of possible distinct map patterns (i.e., uncorrelated and orthogonal)

Legendre et al.
method
• A comparison of the two procedures is in
• Links directly to the semivariogram plot
• D* is a truncated distance-based matrix, where the truncation is determined by the length of a minimum spanning tree articulating the set of locations

$E_j$ is the map pattern with spatial autocorrelation level
$$MC_j = \frac{n}{\mathbf{1}^{T}\mathbf{C}\mathbf{1}}\,\lambda_j$$

Properties
• The extreme eigenvalues define MCmax and MCmin (not necessarily 1, -1)
• As eigenvalues go from the largest positive to the largest negative value, map patterns become more fragmented
• Positive eigenvalues denote:
  – Global trends with relatively large values
  – Regional trends with intermediate values
  – Local trends with relatively small values

Selected ideal map patterns
[Figure: map pattern panels, labeled in source order: global, MC ~ 1; regional, MC = 0.9; regional, MC = 0.5; MC = 0.7; local, MC = 0.25; MC = -0.6]

SA impacts on Gaussian RVs
Principal impact: variance inflation
[Figure: SA map pattern (MC = 1.12, GR = 0.08; MCmax = 1.18) and standard normal curves for MC = 1.00, GR = 0.18; MC = 0.00, GR = 1.00; MC = 0.28, GR = 0.77. Annotations: increased kurtosis, heavier tails]

Unstandardized normal curve
• Autoregressive generated: kurtosis increases from 0.01 (roughly 0) to 0.73. The variance of kurtosis is 24/n. Therefore, here spatial autocorrelation has induced increased relative peakedness (from the sign of the kurtosis statistic) whose z = 7.3.
• Map pattern generated: kurtosis increases from 0.04 (roughly 0) to 2.79. The variance of kurtosis is 24/n. Therefore, here spatial autocorrelation has induced increased relative peakedness (from the sign of the kurtosis statistic) whose z = 27.8.

Typical case: MC/MCmax = 0.6
[Figure: map pattern with MC = 0.61, GR = 0.50; map pattern with MC = 0.80, GR = 0.34]
Attribute correlations: r(X, E3) = 0.004; r(X, E4) = 0.002; r(E3, E4) = 0
E(MC) = -0.00042; E(GR) = 1

Transformations to normal approximations
Torturing the data – conforming to a bell-shaped curve
1. Box-Cox power transformations
$$Y^{*} = (Y + \delta)^{\gamma},\ \gamma \neq 0; \qquad Y^{*} = LN(Y + \delta),\ \gamma = 0$$
2. Manly’s exponential transformation
$$Y^{*} = e^{\gamma Y}$$
3.
Percentage adjustments (also arcsine)
$$Y^{*} = LN\left[\frac{(Y + a)/(T + b)}{1 - (Y + a)/(T + b)} + \delta\right]$$

China data example: births/females
$Y^{*} = (B/F_{15\text{-}44} + \delta)^{0.43}$, with $|\delta| = 0.04$

China data example: pop/area
$Y^{*} = LN(P/A + \delta)$, with $|\delta| = 279$

China data example: births/deaths
$Y^{*} = e^{\gamma\, B/D}$, with $|\gamma| = 0.24$

A China example: % F15-44
$$Y^{*} = LN\left[\frac{(Y + a)/(T + b)}{1 - (Y + a)/(T + b)} + \delta\right]$$

empirical probability     mean    min     median  max
F/P                       0.247   0.193   0.247   0.416
(F+a)/(P+b)               0.270   0.216   0.257   0.407
(1-c)(F+a)/(P+b)+c        0.168   0.107   0.153   0.324

Constant variance
• Attribute: variable transformations often stabilize the variance of a variable across its measurement range
• Mean/median split gives a heuristic assessment of constant variance (equal variability of high and low values)

Constant variance
• Geographic: variable transformations often stabilize the variance of a variable across the geographic landscape over which it is distributed
• Quadrants of the plane/established areal unit groupings give a heuristic assessment of constant variance across a geographic landscape
[Figure panels: plane quadrants; provinces]

Non-normal random variables (RVs)
• Poisson: the mean equals the variance (built-in heterogeneity)
  – overdispersion: the variance is greater than the mean
  – assuming a gamma-distributed mean results in a negative binomial random variable
• Binomial: variance equals (1-p) times the mean [i.e., Np(1-p)]
  – overdispersion: the variance is greater than Np(1-p)
  – employ a quasi-likelihood estimation

Spatial autocorrelation impacts on Poisson RVs
Overdispersion occurs when: var(Y) > μ
μ = 5, σ = √5 = 2.2361

case                                      mean     s
iid                                       4.9560   2.2512
weak positive spatial autocorrelation     4.9930   2.5874
strong positive spatial autocorrelation   5.0045   6.9475

Impacts of typical spatial autocorrelation levels
[Figure: Poissonness plots]

tessellation   mean     s
hexagonal      4.9914   3.1007
irregular      4.9875   4.0098

Spatial autocorrelation impacts on binomial RVs
• variance increases
• shape goes to uniform, then to sinusoidal
[Figure panels: global; autoregressive; global & regional; global & regional & local]

Going from the global to the local
Paralleling
statistics concerning data outliers, and leverage and influential points, spatial heterogeneity in georeferenced data is addressed by focusing on individual areal units. The emphasis shifts from global trends to local exceptions, in order to better understand local deviations from global model descriptions by exploiting tensions between global trends and informative local details latent in empirical data:
• adaptation of conventional diagnostic statistics (e.g., Unwin and Wrigley, 1987)
• spatializing existing statistical techniques (e.g., Fotheringham et al., 2002)
• Anselin’s (1995) seminal paper about indices of spatial association (i.e., LISA statistics)
• Getis and Ord’s (1992, 1995) Gi and Gi* statistics

Goals of global versus local analysis
• Identify clustering
• Identify particular clusters (significant local clusters in the absence of global autocorrelation)
• Distinguish between homogeneity and heterogeneity (e.g., spatial outliers: highs surrounded by lows, and vice versa)
• Identify hot/cold spots
• Analyze local instability (local deviations from the global pattern of spatial autocorrelation)

LISA: local indicators of spatial autocorrelation
$$MC = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}}\; \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{1}{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}}\; \frac{\sum_{i=1}^{n} z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j}}{(n-1)/n}$$
Moran scatterplot: $\sum_{j=1}^{n} c_{ij} z_{Y,j}$ versus $z_{Y,i}$
$$n_i = \sum_{j=1}^{n} c_{ij}$$
$$LISA_i = \frac{z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j}}{(n-1)/n}$$
$$E(LISA_i) = -\frac{n_i}{n-1}; \quad VAR(LISA_i) \text{ from randomization}$$
[Figure labels: selevation; area competition]

Goal: to assess spatial correlation heterogeneity
ANOVA F = 2845; Pr(>X2) = 0.4

color         # counties   LISA z-score range
dark green    16           -3.6 to -1.2
light green   2016         -1.2 to 3.25
gray          200          3.25 to 6.4
light red     145          6.4 to 9.6
dark red      14           9.6+

The randomization perspective
Conditional randomization
Step 1: hold $z_{Y,i}$ constant
Step 2: randomly select $n_i$ of the remaining (n - 1) z-scores, R times
Step 3: compute $I_r = z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j}$ for each replication, and then the variance ($s_I^2$) for the R replications
Step 4: compute $z_i = \left[ z_{Y,i}\sum_{j=1}^{n} c_{ij} z_{Y,j} + n_i/(n-1) \right] / s_I$
Step 5: compare the probability of $z_i$ with Bonferroni adjusted ($\alpha/n$) and/or Sidak adjusted [$1 - (1 - \alpha)^{1/n}$] multiple test probabilities

LISA for PR: LN(elevation + 17.5)
[Figure: map panels Pr(LISA), Bonferroni, Sidak; Moran scatterplot whose slope is the unstandardized MC]
MC = 0.51; GR = 0.49

LISA maps
[Figure: significant LISA]
• Cannot distinguish between H-H and L-L clusters
• Conventional clustering fails to preserve contiguity

… with contiguity proclivity
• Clustering geocoding coordinate pair coupled with zLISA values
• Clustering geocoding coordinate pair with frequencies proportional to zLISA values

Getis-Ord Gi [Gi* includes i (i.e., j = i)]
• contiguity based upon a distance band defined by $d_r$
• $d_r$ may be obtained from a semivariogram plot
• one statistic for each areal unit
$$G_i(d_r) = \frac{\sum_{j=1}^{n} c_{ij}(d_r)\, y_j - \bar{y}(i)\sum_{j=1}^{n} c_{ij}(d_r)}{s(i)\sqrt{\dfrac{(n-1)\sum_{j=1}^{n} c_{ij}^2(d_r) - \left[\sum_{j=1}^{n} c_{ij}(d_r)\right]^2}{n-2}}}, \quad j \neq i$$
• $G_i(d_r) > 0$ signifies clustering of high values
• $G_i(d_r) < 0$ signifies clustering of low values
• LISA fails to make this particular distinction

A Gi-based analysis: complete linkage
[Figure panels: Gi; Gi clusters]

A relationship between LISA and Gi for the same geographic connectivity matrix C
The quadratic trend is why LISA cannot distinguish between HH and LL clusters, while Gi can.

Geographic & space-time clusters: an overview
• Global cluster tests search for spatial clusters anywhere in a study area but do not necessarily identify where the clusters occur, and are used to identify departures from spatial randomness when overall spatial pattern is considered.
• Local cluster tests identify locations at which there is some excess/deficit (a hot/cold spot) anywhere within a study area.
• Focused cluster tests determine whether there is an excess near a pre-specified location, called a focus, and are used to detect clustering near, say, putative hazards (e.g., a toxic waste dump).
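As one illustration of a local cluster test, the conditional randomization steps for LISA (hold the focal z-score constant, draw n_i of the remaining n - 1 z-scores R times, form a z-score, then compare against Bonferroni and Sidak levels) can be sketched in Python. This is a minimal sketch under stated assumptions, not the lecture's own code: the function names (`rook_grid`, `local_moran`, `lisa_z_scores`, `two_sided_p`, `adjusted_alphas`) are hypothetical, the 6-by-6 rook lattice and its planted block of high values are invented toy data, C is assumed binary with no isolated units, LISA is scaled by (n - 1)/n with expectation -n_i/(n - 1), and the two-sided normal p-value is an assumed approximation.

```python
import math
import numpy as np

def rook_grid(rows, cols):
    """Binary rook-contiguity matrix C for a rows x cols lattice (toy helper)."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                      # east neighbor
                C[i, i + 1] = C[i + 1, i] = 1
            if r + 1 < rows:                      # south neighbor
                C[i, i + cols] = C[i + cols, i] = 1
    return C

def local_moran(y, C):
    """Return (LISA_i, z), with LISA_i = z_i * sum_j c_ij z_j / ((n - 1) / n)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    z = (y - y.mean()) / y.std(ddof=1)            # z-scores from the sample std
    return z * (C @ z) / ((n - 1) / n), z

def lisa_z_scores(y, C, R=999, seed=0):
    """z-score of each LISA_i under conditional randomization (Steps 1-4)."""
    rng = np.random.default_rng(seed)
    lisa, z = local_moran(y, C)
    n = z.size
    scale = (n - 1) / n
    n_i = C.sum(axis=1).astype(int)               # number of neighbors of unit i
    out = np.empty(n)
    for i in range(n):
        others = np.delete(z, i)                  # Step 1: z_i is held constant
        sims = np.array([                         # Steps 2-3: R random neighbor sets
            z[i] * rng.choice(others, size=n_i[i], replace=False).sum() / scale
            for _ in range(R)
        ])
        s_I = sims.std(ddof=1)
        out[i] = (lisa[i] + n_i[i] / (n - 1)) / s_I   # Step 4
    return out

def two_sided_p(z_value):
    """Two-sided p-value under an assumed normal approximation."""
    return math.erfc(abs(z_value) / math.sqrt(2.0))

def adjusted_alphas(alpha, n):
    """Step 5: Bonferroni (alpha/n) and Sidak (1 - (1 - alpha)^(1/n)) levels."""
    return alpha / n, 1.0 - (1.0 - alpha) ** (1.0 / n)

# Toy data: a 3 x 3 block of high values in one corner of a 6 x 6 lattice.
C = rook_grid(6, 6)
y = np.zeros((6, 6))
y[:3, :3] = 10.0
y = y.ravel()

lisa, z = local_moran(y, C)
z_lisa = lisa_z_scores(y, C)
bonferroni, sidak = adjusted_alphas(0.05, y.size)
for label, i in [("center of high block", 7), ("far low corner", 35)]:
    print(label, round(lisa[i], 3), round(z_lisa[i], 2),
          two_sided_p(z_lisa[i]) < bonferroni)
```

In this toy example both the center of the high block and the far low corner have positive LISA values, which is the sense in which LISA alone cannot separate H-H from L-L clusters; the sign of the focal z-score, or a Gi statistic, is needed for that.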
Cluster detection techniques

Spider diagrams
• allocation to AAR centroids
• allocation to cluster (U, V, z-LISA) centroids

Regression diagnostics: each observation’s influence on parameter estimates and predicted values
• PRESS – global measure that should roughly equal the mean squared error (MSE) for a trend line (equivalent to cross-validation)
• Leverage – measures degree of influence of an areal unit's $Cz_{Y,i}$ value on an MC trend line (marked: > 2/n)
• Studentized residual – measures whether the ith areal unit causes a significant shift in its corresponding regression intercept (i.e., is an outlier; marked: > 2)
• Cook’s D – measures influence of the ith areal unit on an MC estimate (analogous to DFFITS; marked: > $2\sqrt{1/n}$)

Moran scatterplot for LN(elevation + 17.5)
RMSE = 1.94759; PRESS/n = 1.97055
[Figure: Moran scatterplot with marked values and the mean of $\mathbf{C1}$]
Barranquitas is a spatial outlier, again!

Spatial autocorrelation in diagnostic statistics: eigenvector covariates
MC(E2) = 1.04926
Map legend: dark red: very high; light red: high; gray: medium; light green: low; dark green: very low
α = 0.01; * α adj = 0.0014
[Table residue, layout not recoverable. Columns: zLISA, H, rstudent, DFFITS. Entries in source order: 2, 2*, 2, 4*, 6, 13, 23, 25, 25, 69. R2 in source order: 0.439, 0.274, 0.303, 0.109]
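Returning to the regression diagnostics for the Moran scatterplot (PRESS, leverage, studentized residuals, Cook's D, DFFITS), the following Python sketch shows how they can be computed from an ordinary least squares fit of the spatial lag on the z-scores. It is a minimal sketch under stated assumptions, not the analysis behind the Puerto Rico figures: the function names (`rook_grid`, `moran_scatterplot_diagnostics`) are hypothetical, the 5-by-5 lattice with a planted spatial outlier is invented toy data, the regression includes an intercept (p = 2), the leverage and studentized residual cutoffs are those quoted in the slide, and the DFFITS cutoff of 2*sqrt(p/n) is a common rule of thumb assumed here.

```python
import numpy as np

def rook_grid(rows, cols):
    """Binary rook-contiguity matrix C for a rows x cols lattice (toy helper)."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                      # east neighbor
                C[i, i + 1] = C[i + 1, i] = 1
            if r + 1 < rows:                      # south neighbor
                C[i, i + cols] = C[i + cols, i] = 1
    return C

def moran_scatterplot_diagnostics(y, C):
    """OLS of the spatial lag C z on z, plus case-deletion diagnostics."""
    y = np.asarray(y, dtype=float)
    n = y.size
    z = (y - y.mean()) / y.std(ddof=1)
    lag = C @ z                                   # vertical axis of the scatterplot
    X = np.column_stack([np.ones(n), z])          # intercept + z-score
    p = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ lag
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage: diagonal of the hat matrix
    e = lag - X @ beta                            # ordinary residuals
    mse = e @ e / (n - p)
    # MSE with observation i deleted, then externally studentized residuals
    mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
    rstudent = e / np.sqrt(mse_del * (1 - h))
    return {
        "slope": beta[1],                         # slope of the MC trend line
        "mse": mse,
        "residual": e,
        "leverage": h,
        "rstudent": rstudent,
        "dffits": rstudent * np.sqrt(h / (1 - h)),
        "cooks_d": e**2 * h / (p * mse * (1 - h) ** 2),
        "press": np.sum((e / (1 - h)) ** 2),      # sum of squared deleted residuals
    }

# Toy data: a linear trend surface with one planted high value in the middle.
C = rook_grid(5, 5)
y = np.add.outer(np.arange(5.0), np.arange(5.0))
y[2, 2] = 20.0
d = moran_scatterplot_diagnostics(y.ravel(), C)
n, p = y.size, 2

print("slope:", round(d["slope"], 3))
print("MSE:", round(d["mse"], 3), "PRESS/n:", round(d["press"] / n, 3))
print("leverage > 2/n:", np.flatnonzero(d["leverage"] > 2 / n))
print("|rstudent| > 2:", np.flatnonzero(np.abs(d["rstudent"]) > 2))
print("|DFFITS| > 2*sqrt(p/n):",
      np.flatnonzero(np.abs(d["dffits"]) > 2 * np.sqrt(p / n)))
```

PRESS is computed here from the closed-form deleted residual e_i/(1 - h_i), so no refitting is needed; comparing PRESS/n with the MSE gives the cross-validation check described in the PRESS bullet.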